This project was implemented for the undergraduate course Advanced Topics of Database Systems @ECE, NTUA, GR. Given a large dataset of movies and information about them, the purpose of the exercise was to:
- Use the Spark framework to implement a set of queries in both the RDD API and Spark SQL.
- Support both .csv and .parquet input files for the SQL queries.
- Compare the response time of each query across all setups: RDD API, Spark SQL on .csv, and Spark SQL on .parquet.
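
A minimal sketch of how such response times could be collected, not necessarily how this project measured them. The helper name `time_query` and its `repeats` parameter are hypothetical. It takes any zero-argument callable; with Spark, that callable would need to end in an action (for example `collect()`), since Spark evaluates transformations lazily and nothing runs until an action is called.

```python
import time


def time_query(run_query, repeats=3):
    """Run a zero-argument callable `repeats` times.

    Returns (result of the last run, best wall-clock time in seconds).
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = run_query()  # must trigger the actual computation
        best = min(best, time.perf_counter() - start)
    return result, best
```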
=============================================================================================
- Create a function that implements a repartition join.
- Create a function that implements a broadcast join.
- Compare the running time of the two joins on the given data.
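
The two join strategies can be illustrated with a small pure-Python sketch. This is a single-process simulation for explanation only, not the Spark code used in this project; the function names, the `num_partitions` parameter, and the `(key, value)` record layout are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product


def repartition_join(left, right, num_partitions=2):
    """Reduce-side join of two lists of (key, value) records."""
    # Map/shuffle phase: tag each record with its source table (0 or 1)
    # and route it to a partition chosen by the hash of its key.
    partitions = [defaultdict(lambda: ([], [])) for _ in range(num_partitions)]
    for tag, records in enumerate((left, right)):
        for key, value in records:
            partitions[hash(key) % num_partitions][key][tag].append(value)
    # Reduce phase: inside each partition, pair left and right values per key.
    joined = []
    for part in partitions:
        for key, (left_vals, right_vals) in part.items():
            joined.extend((key, pair) for pair in product(left_vals, right_vals))
    return joined


def broadcast_join(small, large):
    """Map-side join: the small table becomes an in-memory lookup table."""
    # In a cluster, this lookup table would be copied to every worker.
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    # Each record of the large table is joined locally, with no shuffle.
    return [(key, (s, l)) for key, l in large for s in lookup.get(key, [])]
```

The repartition join moves both tables across the network so that matching keys meet in the same partition, while the broadcast join ships only the small table and scans the large one in place. The broadcast variant is therefore only an option when the small table fits in the memory of each worker.
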
All queries were run on a cluster of two nodes (master/slave), each with 2GB RAM. The VMs were provided by the Okeanos project @NTUA.
The query descriptions will be uploaded in English and in Greek :)