This is the ETL process built for the Sparkify project to analyze the data they have been collecting on songs and user activity on their new music streaming app.
The fact table 'songplays' and the dimension tables 'users', 'songs', 'artists', and 'time' are created in sql_queries.py.
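As a rough sketch of what sql_queries.py contains (the exact column names and types here are assumptions; the file itself is authoritative), the fact table could be declared like this:

```python
# Sketch of a CREATE TABLE statement as sql_queries.py might define it.
# Column names and types are assumptions, not the project's exact schema.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```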
The notebook 'etl.ipynb' implements the ETL process for a single JSON file from each of the song and log data folders.
'etl.py' runs the same ETL process over all the files in the song and log data folders.
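A minimal sketch of how etl.py can walk a data folder and hand each file to a processing function; pandas is assumed for reading the JSON records, and `process_data` and `process_song_file` are illustrative names rather than the project's confirmed API:

```python
import glob
import os

import pandas as pd


def process_data(cur, conn, filepath, func):
    # Gather every JSON file under the given directory tree.
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))

    # Hand each file to the per-file processor and commit after each one.
    for datafile in all_files:
        func(cur, datafile)
        conn.commit()


def process_song_file(cur, filepath):
    """Illustrative per-file handler: read one song JSON record."""
    df = pd.read_json(filepath, lines=True)
    # Inserts into the songs and artists tables would go here.
```

Called as, for example, `process_data(cur, conn, 'data/song_data', process_song_file)`, with `cur` and `conn` coming from a psycopg2 connection.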
The fact table 'songplays' combines song and artist information from the song and log files. The 'song_id' and 'artist_id' values loaded into the songs and artists tables from the song dataset are used to populate the fact table: for each log record, the song title, artist name, and song length are matched against the song data to fetch the corresponding IDs, as sketched below.
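A hedged sketch of that lookup, reusing the assumed column names from the schema sketch above; `song_select` and `lookup_song` are illustrative names:

```python
# Illustrative lookup of song_id and artist_id for one log record.
# Table and column names are assumptions consistent with the sketch above.
song_select = """
    SELECT s.song_id, a.artist_id
    FROM songs s
    JOIN artists a ON s.artist_id = a.artist_id
    WHERE s.title = %s AND a.name = %s AND s.duration = %s;
"""


def lookup_song(cur, title, artist_name, duration):
    """Return (song_id, artist_id) for a log record, or (None, None)."""
    cur.execute(song_select, (title, artist_name, duration))
    result = cur.fetchone()
    return result if result else (None, None)
```

Matching on title, artist name, and duration together avoids false positives when different artists record songs with the same title.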
Perform the following steps to load data into the tables:
- python create_tables.py: drops any existing tables and creates the fact and dimension tables
- run etl.ipynb for a single JSON file: inserts records from a single song and log JSON file into all the fact and dimension tables; run test.ipynb to check that the data was inserted
- python etl.py: processes all the JSON files in the song and log data folders and inserts the records into the fact and dimension tables; run test.ipynb again to check that all the data was inserted (a sample check is sketched after this list)
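As an example of the kind of sanity check test.ipynb performs (the notebook may well use SQL magics instead; the connection parameters here are assumptions for a local database):

```python
import psycopg2

# Connection parameters are assumptions for a local Sparkify database.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Count the rows in each table to confirm the inserts landed.
for table in ("songplays", "users", "songs", "artists", "time"):
    cur.execute(f"SELECT COUNT(*) FROM {table};")
    print(table, cur.fetchone()[0])

conn.close()
```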