Purpose is to answer three questions based on data from a postgresql database called news There are three tables:
- articles
- Column:Type -> author:integer, title:text, slug:text, lead:text, body:text, time:timestamptz, id:integer
- 8 rows
- authors
- Column:Type -> name:text, bio:text, id:integer
- 4 rows
- log
- Column:Type -> path:text, ip:inet, method:text, status:text, time:timestamptz, id:integer
- 1,677,735 rows
- What are the most popular three articles of all time? Which articles have been accessed the most? Present this information as a sorted list with the most popular article at the top.
- Who are the most popular article authors of all time? That is, when you sum up all of the articles each author has written, which authors get the most page views? Present this as a sorted list with the most popular author at the top.
- On which days did more than 1% of requests lead to errors? The log table includes a column status that indicates the HTTP status code that the news site sent to the user's browser.
- Install VirtualBox from this link (Install package for your OS)
- Ubuntu users: If you are running Ubuntu 14.04, install VirtualBox using the Ubuntu Software Center instead. Due to a reported bug, installing VirtualBox from the site may uninstall other software you need.
- Install Vagrant
- Download it from this link
- Windows users: The Installer may ask you to grant network permissions to Vagrant or make a firewall exception. Be sure to allow this.
- Download this repository as a zip file and then unzip to wherever
- Open up a terminal and run
cd log-analysis
and thencd vagrant
- vagrant setup file will be in there
- Inside the vagrant subdirectory, run the command
vagrant up
- This will cause Vagrant to download the Linux operating system and install it
- May take a while (many minutes) depending on your internet connection speed
- When
vagrant up
finishes running and the shell prompt is back, runvagrant ssh
to log in to the Linux VM
- Files in the VM's /vagrant directory are shared with the vagrant folder on your computer. But other data inside the VM is not.
- For instance, the PostgreSQL database itself lives only inside the VM.
- If you exit the terminal or reboot the computer, you will need to run
vagrant up
to restart the VM
- Now that you are logged in, run
cd /vagrant
to gain access to the shared directory - Download the .sql file from here, unzip, put file into same directory as log-analysis.py
- Run
psql -d news -f newsdata.sql
to populate the database - Once the data is loaded, connect to the database using
psql -d news
and explore the tables using\dt
and\d table
commands andselect
statements. - Setup is complete! Now it is time to run the log analysis program I wrote.
- Make sure you are logged in to the VM
- cd into /vagrant:
cd /vagrant
- run
python log-analysis.py
- Each question will happen quickly thanks to the speed of SQL queries.
- Views were used for testing, but not in the final code.
- Subqueries were used in place of views instead.
- You can see some of the views in sql_query_notebook.txt
- sql_query_notebook.txt just contains queries that I wanted to have as notes.
- https://stackoverflow.com/questions/10660435/pythonic-way-to-create-a-long-multi-line-string
- https://stackoverflow.com/questions/120926/why-does-python-pep-8-strongly-recommend-spaces-over-tabs-for-indentation
- https://stackoverflow.com/questions/5648210/postgresql-self-join
- https://dba.stackexchange.com/questions/23778/count-and-sum-at-the-same-time-after-removing-duplicate-rows/23788#23788?newreg=fbc32b4ea43b4a108ac8722191727590
- https://stackoverflow.com/questions/5243596/python-sql-query-string-formatting
- https://stackoverflow.com/questions/3121979/how-to-sort-list-tuple-of-lists-tuples
- postgresql documentation
- https://stackoverflow.com/questions/16269394/how-to-get-count-and-percentage-comparing-two-tables-in-mysql
- https://stackoverflow.com/questions/7376072/postgresql-calculate-sum-of-a-resultset
- https://stackoverflow.com/questions/6832566/postgresql-group-by-timestamp-values