Skip to content

AssignmentFour

Joe Hellerstein edited this page Dec 8, 2015 · 1 revision

Assignment 4

Notices and corrections

  1. Do this assignment as PARTNERS!
  2. PLEASE READ INSTRUCTIONS ENTIRELY before starting.
  3. You are allowed to be 2.5 hours later than the nominal deadline.
  4. Clarifications
  • Updated 5/24 11am: various formatting issues
  1. NOTICE!!! We have run out of space on /home/tmp/* -- until this is resolved, please only use the enron-small.sql dataset!
  2. Assignment 4 due date postponed to Friday 5/8 6pm (due to disk space issues).
  3. Demo and evaluation madness during section 5/11 in Soda 275. Feel free to come to any (and all!) of the 4 sections on Monday - 10-11am, 2-3pm, 3-5pm. This is optional but highly encouraged. Optional in the sense that the TAs will reluctantly grade autograde your project submission if you can't make it. Highly encouraged in the sense that TAs will be glad to hear your spin on what you did and see you share your results with other students.
  4. Clarified submission instructions. Turn in a hw4.tar file.

Overview

Hello gumshoes! Justice needs your help! The year is 2001 and the Enron Scandal is in full swing. The authorities have taken over Enron emails to find out: who are the Enron kingpins? They have started by implementing a basic database-driven web-application for search and visualization: EnronSearch.

In this assignment, you will help authorities find the Enron kingpins by scaling up EnronSearch to handle the large email dataset, and rendering the search results in a meaningful way. You will exercise your knowledge of boolean search, database tuning and SQL. You will need to write code in Ruby using the Rails framework, against a PostgreSQL database.

What we provide

We provide a working Ruby on Rails application providing search and visualization over the Enron emails:

  • Two sets (large and small) of emails populating a PostgreSQL table called Links:

    links(id:integer, to:string, from:string, body:text)

  • A Rails ActiveRecord subclass called Link that wraps the table, and provides access to its tuples as Ruby objects

  • Using the Link ActiveRecord model and find_by_sql API, we provide a naive boolean search over the emails using the SQL like construct. This approach will not scale to the full size of the Enron emails -- which you'll fix!

  • A PostgreSQL table and corresponding ActiveRecord subclass for the inverted index table:

    terms(id:integer, term:string, email_id:integer, pos:integer)

  • A basic visualization of search results. Example: Image(http://cs186berkeley.net/export/54/code/images/graph-ThuApr2312054407002009.png)

You can see an instance of the application here: [http://sphere.cs.berkeley.edu:3000]

Objective

Your objective: scale EnronSearch. There are many existing text search libraries out there (Lucene (Java), Ferret/Sphinx/Xapian/Solr (Ruby), tsearch (PostgreSQL)). Rather than relying on any of these packages, we will instead create a simple version in SQL.

Summary of required tasks:

  1. Populate the inverted index: fill the terms table so that it serves as an inverted index for the links table
  2. Implement boolean keyword search using the terms table/ActiveRecord. We provide an example search function in the links table/ActiveRecord
  3. Determine the database index(es) to add to the terms table to optimize its performance for boolean search
  4. Improve the visualization of search results

Optional tasks:

  1. Order the search results in a relevant way. Google has PageRank -- what will you do for EnronSearch? Note that the terms table has a pos field.
  2. Fix the pagination of search results. The lazy implementer of the current system left this open bug for you.

Tools for the assignment

Your course account has been set up with access to PostgreSQL 8.2.4 and all necessary Ruby/Rails binaries. We use Ruby 1.8.6 and Rails 2.1.2. In addition we use the will_paginate and ruby-graphviz gems. These were compiled for Solaris, so you must run on one of the following machines:

  • rhombus
  • pentagon
  • cube
  • sphere
  • po
  • torus If you have modified your PostgreSQL environment variables in your course account (e.g. in ~/.cshrc) you MUST remove those modifications or you may find that PostgreSQL will not work.

You can develop on your own machine if you like. You'll need to install the Ruby/Rails stack, gems and GraphViz. We don't support custom configurations and provide no additional details; be aware that we will test your queries on the version of PostgreSQL / Rails used by the class accounts, and you will not get credit unless it runs in that version. So it's wise to double-check your code on the class accounts before turning in.

Set up the software stack

First, clear your previous PostgreSQL setup from ~ directory: the src checkout, pgsql and pgdata. Then set up the database.

> /bin/rm -rf $PGDATA/
> initdb
> pg_ctl -l logfile start
> createdb
> psql # just to make sure it works

Check out the provided Rails app

> cd ~
> mkdir hw4
> cd hw4
> svn co file:///home/ff/cs186/sp09/hw4/enron

Create the configuration file ~/hw4/enron/config/database.yml by copying ~/hw4/enron/config/database.yml.example

cp ~/hw4/enron/config/database.yml.example ~/hw4/enron/config/database.yml

Edit ~/hw4/enron/config/database.yml and put in your class account name and postgres port. You can figure out your postgres port by typing 'echo $PGPORT' at the command line.

[inside database.yml]

development:
...
  username: # <replace with your user name cs186-xx>
...
  port: # <put here the results of 'echo $PGPORT'>

Create database and load data; start with the small dataset for testing

> cd ~/hw4/enron
> rake db:create
> rake db:migrate
> psql -d enron_development < ~/hw4/enron/data/enron-small.sql 

To start Rails, just type:

> cd ~/hw4/enron
> server 

This is aliased to "script/server webrick -p $WEBPORT" for you. n(You can find the value of the $WEBPORT variable by typing echo $WEBPORT at the command line.) Pay attention to the port, you'll need it to access the app: test that the web app is alive by visiting !http://:/. For example, if you are logged onto sphere, and your $WEBPORT number is 99943 you should visit http://sphere.cs.berkeley.edu:99943", and the you should see "Welcome to Enron!", etc.


Check out the NaiveSearch

The main logic for NaiveSearch is in enron/app/controllers/links_controller.rb::nsearch

You can try out the life-cycle of a search as follows:

  1. Visit !http://:/nsearch, type in search terms, for example: 'skillings lay' to search for the two executives
  2. The routing policy in enron/config/routes.rb will route the request to the links_controller mentioned above and the function nsearch
  3. Search terms are passed via HTTP POST in the params object. The nsearch function parses the query terms and generates a SQL statement to directly search for answers in the links table via the Link ActiveRecord object.

Given the search terms of "davaid foster wallace", the generated SQL looks like this:

SELECT * FROM "links" 
WHERE ((lower(links.body) like E'%david%' or lower(links.from) like E'%david%' or lower(links.to) like E'%david%') 
and (lower(links.body) like E'%foster%' or lower(links.from) like E'%foster%' or lower(links.to) like E'%foster%') 
and (lower(links.body) like E'%wallace%' or lower(links.from) like E'%wallace%' or lower(links.to) like E'%wallace%')) 
LIMIT 30 OFFSET 0
  1. Matching emails are returned by the nsearch to the search results portion of the webpage.
  2. Emails are also fed to the build_graph function in enron/app/controllers/application.rb, which generates an image file in enron/public/graph/graph-.png
  3. This image location is accessible over the web, since it is in the public folder. The image is included in the results

Implementing EnronSearch

Your tasks (and how you'll be graded)

  1. Populate the inverted index from the links table.
  • First go to !http://:/update_index - you'll find that this routes to the action enron/app/controllers/terms_controller.rb::update_index. In this method, you will see stub code and comments for you to fill in the logic to populate the inverted index. We provide a function for you to parse terms out of an email - you must use this routine to generate terms, so that the terms you end up with are the ones we expect in the grading scripts.
  1. Add the appropriate B+-tree index to the inverted index table.
  • According to boolean search, we will be doing many self-joins of the term table to intersect email_id lists.
  • Specify the DDL statement for CREATE INDEX ... in db/terms_index.sql. Run this sql file to install the index. The file is provided for you on SVN checkout, place your answer there.
  • We expect answers will vary widely. If you feel like your answer requires explanation, write a sentence or two in SQL comments (prefaced by {{--}}) in the .sql file you turn-in. We'll look at it as needed.
  1. Implement !http://:/search.
  • The route is already set up: requests to enron/app/controllers/terms_controller.rb::search.
  • Note that this is different from nsearch and lives in a different controller.
  • See stub code and comments indicating where to implement the search function in enron/app/controllers/term_controller.rb. You'll probably want to use the find_by_sql interface, and interact with both Term and Link models. Of course, you are not limited to terms_controller.rb.
  • Assume search terms should AND together, don't worry about OR / NOT, unless you want to!
  1. Modify the build_graph function in enron/app/controllers/application.rb to create a simple visualization that best captures the answer to the question who are the Enron kingpins?
  • Graphs are created based on a particular search result set.
  • We are looking for visual effectiveness and scalability. Try to search for a term like "enron" -- how will your visualization handle many results? Be creative! The only constraint is that the graph may contain at most 100 nodes.
  • This part won't be a significant portion of the grade, but aim to impress!
  1. When ready, load in the large dataset. How well do you scale?

    psql -d enron_development < ~/hw4/data/enron-large.sql

  • Your submission must build the inverted index for enron-large.sql in under 5 min in a scalable way. We will run your application on an otherwise unladen machine. We were able to build the index on sphere in a couple minutes. Note, you'll probably have to go around ActiveRecord in order to achieve reasonable performance. Originally the target time is 5 min, but on a shared server, you may not see this type of performance. We relax this constraint to with-in an order of magnitude of 5 minutes.
  • Your search must return valid results, quickly, for various queries
  • Your visualization should arguably aid users in the task of finding the kingpins of Enron.

Getting started tips

  • If you're experienced with Rails, dive right in. If not, you'll be up to speed in no time by checking out the Technology reference section below
  • You can find your postgres / rails server port with {{{

echo $PGPORT echo $WEBPORT }}}

  • Start with the small dataset. Then migrate to the larger one, once you've fleshed out your functionality.

What to turn in

You will turn in your entire enron/ application folder as a hw4.tar file

  1. Your submission must run after we run the following commands: rake db:create, rake db:migrate, load data
  2. enron/db/ must contain terms_index.sql which specifies the index(es) to make boolean search on terms go quickly

Technology reference

Ruby Ruby is a dynamic, open-source, scripting language. Although learning ruby is not a major focus of this class, you will need a cursory understanding in order to complete this assignment. We recommend looking at some of the following resources before you get started.

Since this is a new language for some people, please don't hesitate to ask the TAs or newgroup questions about ruby basics, as we don't want you to waste too much time fighting with the language (of course, after you search the web and read other forum postings).

Rails Ruby On Rails is an open-source web development framework written in Ruby. Its goal is to make it easy to rapidly design, implement and deploy database-backed, interactive websites. We will be providing you with a partially implemented site so that you can focus on the interesting parts and not on learning the ins and outs of Ruby/Rails. However, if you are interested in learning more there are a number of great quick start tutorials and screencasts just a web search away.

Links sent out to the mailing list, regurgitated here: A few of my favorite resources for learning Ruby/Rails:

Understanding MVC Rails is based around the Model-View-Controller design pattern. As Wikipedia describes it:

It is common to split an application into separate layers that run on different computers: presentation (UI), domain logic, and data access. In MVC the presentation layer is further separated into view and controller. MVC is often seen in web applications, where the view is the actual HTML page, and the controller is the code that gathers dynamic data and generates the content within the HTML. Finally, the model is represented by the actual content, usually stored in a database, and the business rules that transform that content based on user actions.

Though MVC comes in different flavors, control flow generally works as follows:

 1. ''The user interacts with the user interface in some way (e.g. presses a button).''
 2. ''A controller handles the input event from the user interface, often via a registered handler or callback.''
 3. ''The controller notifies the model of the user action, possibly resulting in a change in the model's state. (e.g. controller updates user's Shopping cart).''
 4. ''A view uses the model (indirectly) to generate an appropriate user interface (e.g. the view produces a screen listing the shopping cart contents). The view gets its own data from the model. The model has no direct knowledge of the view.''
 5. ''The user interface waits for further user interactions, which begins then a new cycle.''

By decoupling models and views, MVC helps to reduce the complexity in architectural design, and to increase flexibility and reuse.

You will see this functionality separated into directories in app/ in the skeleton provided.

There are three controllers:

  • application.rb - This is the parent controller for all other controllers in your app. You should not need to put any code here. You can see that we put some helper functions here.
  • links_controller.rb - This controller handles interactions with the Link model and underlying table. You can test them by visiting http://:/links/. We have implemented our naive search here, http://:/nsearch
  • terms_controller.rb - This controller has a number of functions. First it handles interactions with the Term model and underlying table, and serves as our inverted index. Comments are provided describing the specs for each function. You can test them by visiting http://:/terms/.

ActiveRecord ActiveRecord is an Object-Relational Mapping library used by RubyOnRails. It's fairly full featured and makes a number of assumptions about the layout of your schema. If you are familiar with this library feel free to use any basic (no plug-ins) features you. If not, don't worry, as we have provided a list of these conventions and overrides, along with few simple example below about using ActiveRecord to run arbitrarily complex database queries over the Boats/Sailors example from lecture.

Conventions:

  1. Tables are the plural of the class name for any active record instance. (I.e. class Sailor < ActiveRecord:Base would refer to data in the Sailors table)
  2. The primary key for any entity is id. a. You can override this by putting set_primary_key :key or set_primary_keys [:key1, :key2] at the top of the class definition
  3. Foreign keys are always named _id a. You can override this by specifiying :foreign_key => [:song] after the belongs_to or has_many

Example:

Sailors(sid, sname, rating, age)
Boats(bid, bname, color)
Reserves(sid, bid, day)

class Sailor < ActiveRecord:Base
    set_primary_key :sid
    has_many :reservations, :foreign_key => :sid
end

class Boat < ActiveRecord:Base
    set_primary_key :bid
    has_many :reservations, :foreign_key => :bid
end

class Reservation < ActiveRecord:Base
    set_primary_keys :sid, :bid, :day
    belongs_to :sailor, :foreign_key => :sid
    belongs_to :boat, :foreign_key => :bid
end

Make a new sailor:

s = Sailor.new
s[:sid] = 1
s[:sname] = "Michael"
s[:rating] = 10
s[:age] = 24
s.save

Find a sailor by name:

s = Sailor.find_by_sname("Michael")

Find all red boats:

b = Boat.find_all_by_color("red")

Find all reservations by a sailor:

s = Sailor.find_by_sname("Michael")
r = s.reservations

These are the basics, but sometimes you want to be able to write more complex queries. For this you can call back on full SQL. If you wanted all sailors who have reserved a boat today you could write:

s = Sailor.find_by_sql("SELECT DISTINCT sailor.* FROM sailors JOIN reservations ON sailors.sid = reservations.sid WHERE reservation.day = <today>")

You can even add computed columns. For example to find all sailors along with their reservation counts you could write:

s = Sailor.find_by_sql("SELECT sailor.*, count(*) as reservation_count JOIN reservations ON sailors.sid = reservations.sid GROUP BY sailors.sid")
s[0] #First sailor in the list
s[0][:sname] #name of the first sailor in the list
s[0][:reservation_count] #Reservation count for the first sailor in the list