Spark - I come from a long line of Forty Niner fans, so my family forced me to write some queries to see exactly how great the Niners are. Here are a few Spark queries I wrote against a really cool dataset, Detailed NFL Play-by-Play Data 2009-2018, posted to Kaggle by Max Horowitz.


PigQL - I used to collect data at baseball games when I was a kid. They used these large paper templates that let you write down every action of every play, taken from the methods pro scouts use. Needless to say, there is a lot of data available on baseball games, which makes it perfect for practicing queries against large datasets. Here are some queries I wrote in Apache Pig on a large-ish dataset.


Hive - For good measure, here are more queries on that same baseball dataset, this time written in Hive.


Python Threading - Java would be more appropriate for true threading, since CPython's global interpreter lock keeps threads from running Python code in parallel. But since the awesome threading module exists in Python, here is a script that could be run on a cluster behind a web service to analyze some big unstructured datasets. This simple program iterates over the .txt files in a directory and returns a word index for each.
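
To give a feel for the idea, here is a minimal sketch, not the actual script. It assumes a "word index" means a mapping from each word to the line numbers where it appears, and it starts one thread per file. The names build_word_index and index_directory are made up for this example.

```python
import re
import threading
from collections import defaultdict
from pathlib import Path

# Assumption: a "word" is a run of lowercase letters, digits, or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def build_word_index(path):
    """Map each word in one file to the sorted line numbers where it appears."""
    index = defaultdict(set)
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line_number, line in enumerate(handle, start=1):
            for word in WORD_RE.findall(line.lower()):
                index[word].add(line_number)
    return {word: sorted(lines) for word, lines in index.items()}


def index_directory(directory):
    """Index every .txt file in `directory`, using one thread per file."""
    results = {}
    lock = threading.Lock()  # guards writes to the shared results dict

    def worker(path):
        file_index = build_word_index(path)
        with lock:
            results[path.name] = file_index

    threads = [
        threading.Thread(target=worker, args=(path,))
        for path in sorted(Path(directory).glob("*.txt"))
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for every file to finish before returning
    return results
```

Calling index_directory("some_folder") returns a dict keyed by file name, with each value being that file's word index. Threads help here mostly because reading files is I/O-bound; for a very large number of files, a bounded pool such as concurrent.futures.ThreadPoolExecutor would probably be a safer choice than one thread per file.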