GitHub - mdilmanian/PySpark-Yelp: Final assignment for LSE course "Distributed Computing for Big Data"

Investigating the Yelp Open Dataset using PySpark and Spark SQL

This project was my final assignment for Distributed Computing for Big Data, a postgraduate course in the LSE statistics department.

The goal of this project was to demonstrate the use of PySpark and Spark SQL to query and analyze the Yelp Open Dataset. Specifically, I investigated the Yelp Reviews dataset, which consists of 6.7 million user-generated reviews of businesses on Yelp. I also performed JOIN operations with the Yelp Business and Yelp User datasets to relate review ratings to characteristics of the business, such as geographic location. Some of these queries use user-defined functions (UDFs) inside Spark SQL. Lastly, I briefly examined how partitioning of the underlying data abstraction changes computational speed.
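As a rough illustration of the workflow described above (a JOIN between reviews and businesses, a UDF called from Spark SQL, and a look at partitioning), here is a minimal sketch. It is not taken from `project.ipynb`: the helper `star_bucket`, the function `run_demo`, the in-memory sample rows, and the choice of columns (`review_id`, `business_id`, `stars`, `state`) are all assumptions made for illustration.

```python
def star_bucket(stars):
    """Map a star rating to a coarse label (hypothetical helper, not from the project)."""
    if stars is None:
        return None
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


def run_demo():
    """Run a small join + UDF + repartition example on made-up data."""
    # Imported here so star_bucket stays usable without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("yelp-sketch")
        .getOrCreate()
    )

    # Tiny made-up rows standing in for the Yelp Reviews and Business data.
    reviews = spark.createDataFrame(
        [("r1", "b1", 5.0), ("r2", "b1", 2.0), ("r3", "b2", 3.0)],
        ["review_id", "business_id", "stars"],
    )
    businesses = spark.createDataFrame(
        [("b1", "NV"), ("b2", "AZ")],
        ["business_id", "state"],
    )
    reviews.createOrReplaceTempView("reviews")
    businesses.createOrReplaceTempView("businesses")

    # Register the Python function so it can be called by name in Spark SQL.
    spark.udf.register("star_bucket", star_bucket, StringType())

    # Apply the UDF in an inner query, then aggregate on its result.
    result = spark.sql("""
        SELECT state, bucket, COUNT(*) AS n
        FROM (
            SELECT b.state AS state, star_bucket(r.stars) AS bucket
            FROM reviews r
            JOIN businesses b ON r.business_id = b.business_id
        ) AS labelled
        GROUP BY state, bucket
        ORDER BY state, bucket
    """)
    result.show()

    # Repartitioning by the join key changes how rows are spread across tasks;
    # timing the same query before and after is one way to compare speed.
    print("partitions before:", reviews.rdd.getNumPartitions())
    repartitioned = reviews.repartition(4, "business_id")
    print("partitions after:", repartitioned.rdd.getNumPartitions())

    spark.stop()
```

Calling `run_demo()` requires a local PySpark installation (and a Java runtime). With the real dataset, the two `createDataFrame` calls would presumably be replaced by reads of the Yelp JSON files, for example via `spark.read.json(...)`, with paths that depend on where the data was downloaded.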

Repository contents (8 commits):
.ipynb_checkpoints
README.md
project.ipynb
spark-sql-interfaces-60.PNG
