
Final assignment for LSE course "Distributed Computing for Big Data"


mdilmanian/PySpark-Yelp

Investigating the Yelp Open Dataset using PySpark and Spark SQL

This project was my final assignment for Distributed Computing for Big Data, a postgraduate course in the LSE statistics department.

The goal of this project was to demonstrate the use of PySpark and Spark SQL to query and analyze the Yelp Open Dataset. Specifically, I investigated the Yelp Reviews dataset, which consists of 6.7 million user-generated reviews of businesses on Yelp. I also performed JOIN operations with the Yelp Business and Yelp User datasets to describe the relationships between review ratings and characteristics of the business, such as geographic location. Some of these queries rely on user-defined functions (UDFs) called from within Spark SQL. Lastly, I briefly examined how the partitioning of the underlying data abstraction affects computation speed.
