Tutorial Prepared by: Joi Anderson (@joicodes) • Updated: Sept. 27, 2020
Finding an interesting data set:
A data set is a collection of data.
Data sets are created in many different ways. Some are based on human observations or surveys, like the U.S. Census. Others may be machine-generated, like satellite forecast data.
The most common format for data sets is a spreadsheet or CSV. Let's aim to find a data set that is formatted as a CSV.
Here is a list of sources for interesting data sets to explore:
👉🏽 For this workshop, we will be using: Hot 100 singles (1/1/2000 to 12/28/2019)
Understanding Your Data:
Before starting the analysis, let's take the time to understand the data we are working with. Here is a look at the data:
Observations:
- About: New Hot 100 singles from January 1, 2000 to December 28, 2019
- Data Source: Web scraped from Billboard.com
- Size: 7,850 rows of data (i.e. 7,850 songs)
- The first row of my data contains column names.
- Columns:
  - Week: The week the song entered the Billboard Hot 100
  - EnterPosition: The position at which the song entered the Billboard Hot 100
  - Song: Name of the song
  - Performer: Name of the performer and featured artists on the song
Download Data
Export the Google Sheets file as a CSV and move it to our repository:
File > Download > Comma Separated Values (.csv, current sheet)
Rename the file to hot100.csv and add it to your repository.
👀 Here is how your data looks as a raw CSV file: Preview
Meet Pandas (Python Data Analysis Library)
pandas is a Python library that gives you a set of tools for data analysis.
If you want to work with big data sets, then pandas is going to be your best friend. 👯♀️
To install pandas, write in your Terminal:
pip3 install pandas
After it installs, we can import it into our main.py file:
import pandas as pd
Loading our data from a CSV file
Now that we've imported pandas, we are ready to read the CSV file into Python using read_csv() from pandas:
data_frame = pd.read_csv("hot100.csv")
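To try read_csv() without the real file, here is a minimal sketch that reads a tiny in-memory sample with the same four column names. The rows are made up, not taken from hot100.csv, and the parse_dates option assumes the Week column holds month/day/year text.

```python
import io

import pandas as pd

# Hypothetical sample rows that mimic the layout of hot100.csv.
sample_csv = io.StringIO(
    "Week,EnterPosition,Song,Performer\n"
    "1/1/2000,90,Song A,Artist X\n"
    "1/8/2000,75,Song B,Artist Y\n"
    "1/15/2000,60,Song C,Artist X\n"
)

# parse_dates converts the Week text into real dates,
# so later sorting is chronological rather than alphabetical.
sample = pd.read_csv(sample_csv, parse_dates=["Week"])

print(sample.dtypes)
print(sample)
```

Whether date parsing is needed depends on how the weeks are written in your file; if they already sort correctly as text, the plain call above is enough.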
To see if it worked, we can print the first few rows of the data by adding the following to our code:
print(data_frame.head())
head() gives us a snapshot of our data by displaying its first five rows (the default).
You should see a small table printed to the terminal.
We can also see the last rows of the data by using tail():
print(data_frame.tail())
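Beyond head() and tail(), a few other quick checks help confirm the file loaded as expected. The sketch below uses a small made-up table as a stand-in for data_frame; the song and artist names are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for data_frame; the real one comes from hot100.csv.
sample = pd.DataFrame({
    "Week": ["1/1/2000", "1/8/2000", "1/15/2000", "1/22/2000"],
    "EnterPosition": [90, 75, 60, 88],
    "Song": ["Song A", "Song B", "Song C", "Song D"],
    "Performer": ["Artist X", "Artist Y", "Artist X", "Artist Z"],
})

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column names taken from the header row
print(sample.head(2))        # first 2 rows instead of the default 5
print(sample.tail(1))        # last row only
```

On the real data, shape should report the 7,850 rows and 4 columns described earlier.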
Meet Bar Chart Race
bar_chart_race is an open source Python library for creating animated bar and line chart races. It is built on top of two popular Python plotting libraries: matplotlib and plotly. This library simplifies creating racing graph animations!
👉🏽 See repo
To install bar_chart_race, write in your Terminal:
pip3 install bar_chart_race
After it installs, we can import it into our main.py file:
import bar_chart_race as bcr
Install Dependencies
bar_chart_race needs ffmpeg to save the animation as a video. On macOS with Homebrew:
brew install ffmpeg
If you decide that you want to create a GIF animation, also install ImageMagick and Ghostscript:
brew install imagemagick
brew install ghostscript
Transform data into 'wide' data
In order to create a racing bar chart, our data set must be in 'wide' form where:
- Each row represents a single period of time
- Each column holds the value for a particular category
- The index contains the time component
To transform our data set into wide form, we need:
- The index to be the week, using Week
- Each column to be named after an artist who had a Hot 100 hit, using Performer
- Each row to hold each artist's cumulative count of songs up to that week
Here is a rough sketch of how it would look:
We can transform the data to be 'wide' by creating a pivot table with pandas:
wide_data = data_frame.pivot_table(index='Week', columns='Performer', aggfunc='count', fill_value=0).cumsum()
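To see what this call does on a small scale, here is a sketch that runs the same pivot on four made-up rows. The songs, artists, and dates are placeholders; only the column names match the workshop data.

```python
import pandas as pd

# Hypothetical 'long' data: one row per song, as in hot100.csv.
long_data = pd.DataFrame({
    "Week": pd.to_datetime(
        ["2000-01-01", "2000-01-01", "2000-01-08", "2000-01-15"]
    ),
    "EnterPosition": [90, 75, 60, 88],
    "Song": ["Song A", "Song B", "Song C", "Song D"],
    "Performer": ["Artist X", "Artist Y", "Artist X", "Artist Y"],
})

# Count entries per week and performer, then accumulate down the weeks.
wide = long_data.pivot_table(
    index="Week", columns="Performer", aggfunc="count", fill_value=0
).cumsum()

print(wide)
```

Because no values column is named, pandas counts every remaining column (here EnterPosition and Song), so the result has a two-level header with one block of performer columns per counted column.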
Here is what wide_data.head() will print:
If you want to see the full output, check it out here.
Remove header
The pivot table generated an extra header level above the performer names that is not useful to us.
We can remove this header by using droplevel():
wide_data.columns = wide_data.columns.droplevel(0)
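As an illustration of what droplevel(0) changes, the following sketch builds a small table with a two-level header by hand and removes the top level. The table contents are made up.

```python
import pandas as pd

# Hypothetical two-level header: (counted column, performer).
header = pd.MultiIndex.from_tuples([
    ("EnterPosition", "Artist X"), ("EnterPosition", "Artist Y"),
    ("Song", "Artist X"), ("Song", "Artist Y"),
])
wide = pd.DataFrame([[1, 1, 1, 1], [2, 1, 2, 1]], columns=header)

print(wide.columns.nlevels)  # number of header levels before dropping

# Drop the outer level so only the performer names remain.
wide.columns = wide.columns.droplevel(0)
print(list(wide.columns))
```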
Remove duplicate columns
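If you look at the results, there are some duplicated columns: the pivot table counted each remaining column separately, so every performer appears once per counted column. We can keep only the first copy of each: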
wide_data = wide_data.loc[:,~wide_data.columns.duplicated()]
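The mask used in that line can be inspected on its own. This sketch, on a made-up table with repeated column names, shows which columns duplicated() flags and what remains after filtering.

```python
import pandas as pd

# Hypothetical table whose column names repeat.
wide = pd.DataFrame(
    [[1, 1, 1, 1], [2, 1, 2, 1]],
    columns=["Artist X", "Artist Y", "Artist X", "Artist Y"],
)

# duplicated() marks every repeat of a name after its first appearance.
is_repeat = wide.columns.duplicated()
print(is_repeat)

# ~ flips the mask, so .loc keeps only the first copy of each column.
deduped = wide.loc[:, ~is_repeat]
print(list(deduped.columns))
```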
Create a subset
There are hundreds of artists with Billboard Hot 100 hits. Our graph would be wayyyy too big if we decided to make all artists race. Let's shorten our table to 5 columns to compare.
Rather than deleting the columns we are currently not using, we can create a subset with the columns we need with pandas:
Let's choose 5 Performers (i.e. 5 columns of data) to race and store them in a list:
columns = [ "Mariah Carey", "Michael Jackson", "Drake", "Rihanna", "Lady Gaga"]
Using that list of column names, we can create a sub-dataset from the wide table (the performer names are columns of wide_data, not of data_frame) by doing the following:
sub_dataset = wide_data[columns]
Let's print the first few rows of sub_dataset to see what data it contains:
print(sub_dataset.head())
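Selecting columns with a list raises a KeyError if any name is not an exact column name, which can happen when a performer is only listed together with featured artists. A minimal sketch of checking the list first, using made-up names:

```python
import pandas as pd

# Hypothetical wide table with three performers.
wide = pd.DataFrame(
    {"Artist X": [1, 2], "Artist Y": [1, 1], "Artist Z": [0, 1]}
)

wanted = ["Artist X", "Artist Z", "Artist Q"]  # "Artist Q" is not a column

# Split the wish list into names that exist as columns and names that do not.
found = [name for name in wanted if name in wide.columns]
missing = [name for name in wanted if name not in wide.columns]

subset = wide[found]
print("missing:", missing)
print(subset)
```

If a name lands in missing, check the exact spelling in the table's column names before adding it to the list.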
Now that we have our data ready... let the games begin!
Create .mp4 with Racing Bar Chart Animation
bcr.bar_chart_race(sub_dataset, filename='hot100.mp4')
Check out your video
Once your program has finished, check your repo for hot100.mp4 and watch your 5 artists race!
Which artists did you choose? Were you surprised about who won?
Here is mine (watch at 5x speed):
Check the docs for Bar Chart Race to customize your animation!
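As a starting point for customizing, here is a sketch of a call with a few non-default options. The parameter names are ones I believe bar_chart_race accepts (n_bars, steps_per_period, period_length, title), but check them against the docs for your installed version. The make_race helper, the option values, and the title text are illustrative choices, not part of the tutorial.

```python
def make_race(wide_table, filename="hot100.mp4"):
    """Render a bar chart race with a few non-default options (illustrative)."""
    # Imported inside the function so this file still loads
    # before bar_chart_race is installed.
    import bar_chart_race as bcr

    return bcr.bar_chart_race(
        df=wide_table,
        filename=filename,
        n_bars=5,            # show at most 5 bars at a time
        steps_per_period=5,  # animation frames between consecutive rows
        period_length=200,   # milliseconds spent on each row
        title="New Hot 100 entries per artist",  # hypothetical title
    )
```

Calling make_race(sub_dataset) would write the video; a smaller period_length shortens the running time, which may help when the table has one row per week across many years.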