This project was my final assignment for Distributed Computing for Big Data, a postgraduate course in the LSE statistics department.
The goal was to demonstrate the use of PySpark and Spark SQL to query and analyze the Yelp Open Dataset. Specifically, I investigated the Yelp Reviews dataset, which consists of 6.7 million user-generated reviews of businesses on Yelp. I also performed JOIN operations with the Yelp Business and Yelp User datasets to describe relationships between review ratings and characteristics of the business, such as geographic location. Some of these queries rely on user-defined functions (UDFs) called from within Spark SQL. Lastly, I briefly examined how partitioning the underlying data abstraction affects computation speed.
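
To give a flavor of the approach, here is a minimal sketch of a UDF used inside a Spark SQL query that joins reviews to businesses. It is illustrative only and not the project's actual code: the rows are made up, the column names (`review_id`, `business_id`, `stars`, `state`) are assumed to follow the Yelp JSON files, and `star_bucket` and `run_demo` are hypothetical names introduced for this example.

```python
def star_bucket(stars):
    """Map a numeric star rating to a coarse label; None stays None."""
    if stars is None:
        return None
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


def run_demo():
    # Imported lazily so star_bucket can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("yelp-udf-sketch")
        .getOrCreate()
    )

    # Tiny in-memory stand-ins for the review and business tables (made-up rows).
    reviews = spark.createDataFrame(
        [("r1", "b1", 5.0), ("r2", "b1", 1.0), ("r3", "b2", 3.0)],
        ["review_id", "business_id", "stars"],
    )
    businesses = spark.createDataFrame(
        [("b1", "NV"), ("b2", "AZ")],
        ["business_id", "state"],
    )
    reviews.createOrReplaceTempView("review")
    businesses.createOrReplaceTempView("business")

    # Register the Python function so it can be called by name from SQL.
    spark.udf.register("star_bucket", star_bucket, StringType())

    # Apply the UDF in a subquery, then aggregate on its result.
    result = spark.sql("""
        SELECT state, bucket, COUNT(*) AS n
        FROM (
            SELECT b.state AS state, star_bucket(r.stars) AS bucket
            FROM review r
            JOIN business b ON r.business_id = b.business_id
        ) AS joined
        GROUP BY state, bucket
        ORDER BY state, bucket
    """)
    result.show()

    # Repartitioning by the join key changes how rows are spread across tasks.
    repartitioned = reviews.repartition(4, "business_id")
    print(repartitioned.rdd.getNumPartitions())

    spark.stop()
```

Calling `run_demo()` in an environment with PySpark installed starts a local session, prints the per-state rating buckets, and reports the partition count after repartitioning. On real data, timing the same query under different partition counts is one simple way to see the effect of partitioning on speed.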