This repository has been archived by the owner on Sep 5, 2022. It is now read-only.

Parsing Tutorial 1

Caleb Tung edited this page May 24, 2017 · 3 revisions

Overview

In this example we will parse the New York Traffic Camera website found here. This is a good example of how to parse a website that uses the Google Maps API. Remember, parsing websites is difficult because no two websites are the same; every site is written by a different person using different methods, so you must be creative. The best way to learn how to parse sites is to try it yourself.

Step 1: Identifying a Parsing Method for a Site using the Google Maps API

If the site has not been added to the database, we must now see how we can extract the data from the website using BeautifulSoup4 (BS4) or Selenium. Many traffic camera websites use the Google Maps API. You can tell if a site uses the G-Maps API if the website has a map with the Google logo in the bottom left-hand corner (example below).

1-googlemapexample

Sites that use the G-Maps API will load a JSON or XML file holding the location data of the cameras. The G-Maps API uses this data to plot the camera points on the map. This file loads in the background as the page loads, and we want to find its URL. To find the URL, use the developer window in Chrome or Firefox (the example below uses Chrome). Open the developer window with the F12 key or by right-clicking on a page element and selecting "Inspect Element". After opening the dev window, go to the Network tab and then select the XHR tab (shown below). You may need to reload the page, but you should see some files show up below the load timeline.

2-jsonexample

Look through the files that load and see if any of them contain camera data similar to the preview pane in the above image. JSON files that contain camera data normally also contain an ID for each camera. We can try to use this ID to match the location information to the URL containing the image data. Often the camera ID appears somewhere in the URL associated with the webcam image data. If this is not the case, you'll have to look for some other way to match the location data to the image URL.

For this website, look at the image URL and see if it contains the camera ID from the JSON file. Notice that the image URL (http://207.251.86.238/cctv693.jpg) has "cctv693" in it. Let's check whether the number after "cctv" matches the ID from the JSON file. Open the JSON file and look for the ID "693" using Ctrl+F. Unfortunately, there doesn't seem to be a camera with an ID matching "693", so we can't use that to match the location data to the image URL.

Next we look back at the original popup window and notice the URL has "cid=745" at the end. This could be the camera ID we are looking for. Searching for "745" in the JSON file returns a camera this time, as shown below.

3-cam745

To make sure they match, check the content field. Notice that the content field of the JSON file matches the description of the image seen in the popup window.

4-cam745compaison

Now that we can match the JSON data to the URL holding the image data, we must write a script to extract this data into an output file so it can be imported into the database.

Step 2: Parsing a Site that uses Google Maps API

Before you begin writing your script, please review the Coding Standards page. There are specific requirements that must be followed when submitting code. First, fill out the header. Below is an example of a properly filled-out header.

5-filloutheader

Next you must import the Python modules you need. The modules below are good to start with; we can add more later if we need them. Define the main function and give it a descriptive name. Also, be sure to include the if __name__=='__main__': line. This guard ensures the script's main code runs only when the script is executed directly, so its functions can be imported and called by other Python scripts without side effects.
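A minimal skeleton of this setup might look like the following. This is a sketch, not the tutorial's exact code: the function name `parse_nyc_dot_cameras` is an assumption, and it uses Python 3's `urllib.request` where the original uses Python 2's `urllib2`.

```python
# Suggested starting modules; more can be added later as needed.
import json
import re
import urllib.request  # Python 3 name; the tutorial's urllib2 is the Python 2 counterpart


def parse_nyc_dot_cameras():
    """Main entry point (name is illustrative): parse the NYC DOT camera site."""
    pass  # the parsing steps are filled in over the rest of the tutorial


if __name__ == '__main__':
    # Run main only when this file is executed directly, so other scripts
    # can import these functions without triggering a full parse.
    parse_nyc_dot_cameras()
```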

6-importlib

To make sure we import every camera, use the JSON file and go through every camera ID. With this method we get all the location info from the JSON file. Below, we define two URLs: one is the URL of the JSON file, and the other is the popup URL. The latter will be used to match the camera ID to the image URL.

Open an output file for writing, then write the header in the format specified by the addCamera.py file; you may refer to the documentation for examples. The addCamera.py file is used to add the cameras to the database in the final step. Next we use urllib2 to open the JSON URL and download the JSON data, then parse it with the Python json module.

(If you are using Python 2.7 instead of Python 3+, use the codecs library's codecs.open to open the file instead of the built-in open, as Python 2.7 file objects do not handle Unicode characters on write the way Python 3's do. E.g.: f = codecs.open('nycdot_cameralist', 'w', encoding='utf8'))
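The download-and-parse step above can be sketched like this in Python 3. The two URLs and the output filename are placeholders (the real values appear in the tutorial's screenshots), and the header line is left as a comment because its exact format is defined by addCamera.py:

```python
import json
import urllib.request  # Python 3 equivalent of the tutorial's urllib2

# Placeholder URLs; substitute the real ones found via the dev tools Network tab.
JSON_URL = "http://example.com/cameras.json"          # assumed JSON file URL
POPUP_URL = "http://example.com/popup?cid={cam_id}"   # assumed popup URL pattern


def load_camera_json(raw_bytes):
    """Decode the downloaded bytes and parse them with the json module."""
    return json.loads(raw_bytes.decode("utf-8"))


def download_camera_data():
    """Open the output file, write the header, then download and parse the JSON."""
    out = open("nycdot_cameralist", "w", encoding="utf-8")
    # Write the header here, in the format addCamera.py specifies (see its docs).
    raw = urllib.request.urlopen(JSON_URL).read()
    data = load_camera_json(raw)
    return out, data
```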

7-setupoutputreaddata

Look at the JSON file in your browser. Notice that its first key is the 'markers' key and all the camera data is located inside it. In our script, navigate to the 'markers' key, then iterate through every key inside. Parse the JSON data into 5 variables. You can print all 5 variables to check that the data is correctly parsed.
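The iteration over the 'markers' key might look like the sketch below. The per-camera key names ('id', 'latitude', 'longitude', 'content') are assumptions, since the exact names aren't reproduced in the text; check the real JSON file and adjust accordingly.

```python
def extract_cameras(data):
    """Walk everything under the 'markers' key and pull out the fields we need.

    Field names here are assumed; verify them against the actual JSON file.
    """
    cameras = []
    for marker in data["markers"]:
        cam_id = marker["id"]
        latitude = marker["latitude"]
        longitude = marker["longitude"]
        content = marker["content"]
        print(cam_id, latitude, longitude, content)  # sanity-check the parse
        cameras.append({"id": cam_id, "latitude": latitude,
                        "longitude": longitude, "content": content})
    return cameras
```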

Now we have all the latitude, longitude, and description information. We still need to find the city, state, country, and snapshot_url. However, for this website, all the cameras are in New York, NY, so we already know the city, state, and country. All we need to find now is the snapshot_url!

8-loadjsonkeysintovar

To get the snapshot_url, first try to parse the page defined as 'url' in the code above. Taking a quick look at the page source for the popup window (below), we can easily identify the 'img' tag. Notice that the src field contains the URL of the webcam image. This tag will be easy to parse because it is the first occurrence of an 'img' tag on the page.

9-findingimagetag

The 'img' tag will be easy to find with BeautifulSoup, so let's open the page in our code and find the first instance of the 'img' tag. In the for loop, add code to download the source HTML of the camera image popup using the urllib2 module. Send the HTML data to BeautifulSoup to parse it. Next we use soup.find() to find the first instance of the 'img' tag in the page source. Print the 'img' tag to see if it was correctly retrieved.
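A small sketch of this step, with the HTML parsing pulled into a helper (the per-camera download itself is shown as a comment because the popup URL pattern is an assumption):

```python
from bs4 import BeautifulSoup


def first_img_tag(html):
    """Parse the popup HTML and return the first 'img' tag soup.find() sees."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("img")


# Inside the per-camera loop (URL pattern assumed), roughly:
#   html = urllib.request.urlopen(popup_url).read()
#   print(first_img_tag(html))
```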

10-readimgbs4

Run the code to check it; the output is below. Notice that the field we need, the 'src' field, is empty! The other fields of the 'img' tag match, but the 'src' field's contents are missing.

11-srccheck

The 'src' field of the 'img' tag is probably populated by JavaScript. When downloading the HTML source with urllib2 we get only the raw HTML; urllib2 does not simulate a full browser rendering of the site, which is fine in many scenarios. However, this site refreshes the image every second or so using JavaScript. When we don't fully render the page in a browser, the JavaScript doesn't run and the 'src' field is never populated.

We can fix this by simulating a full browser rendering of the page before we extract the HTML source. PhantomJS will load the whole page in a headless browser environment, meaning it loads the page exactly as a browser would without displaying anything to the user. Now the JavaScript executes and we can get the 'src' field! We will add a few modules and notes (highlighted below) to our code so that it can be better understood.

12-headercorrections

Now we have to make sure that the PhantomJS executable (Download Here) is located in the same directory as our script. We will also need Selenium to run and control PhantomJS. We will add an if statement to our code to set the browser executable for Selenium to "PhantomJS" or "PhantomJS.exe" (Windows), depending on the platform.system() value from the "platform" module we imported above.
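The platform check can be sketched as a small helper (the executable names follow the tutorial's convention of keeping the binary next to the script):

```python
import platform


def phantomjs_path(system=None):
    """Return the PhantomJS executable name for this OS.

    The binary is expected to sit in the same directory as the script.
    Accepting `system` as a parameter makes the choice easy to test.
    """
    system = system or platform.system()
    return "PhantomJS.exe" if system == "Windows" else "./PhantomJS"
```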

13-platform

We'll use Selenium to load the web page in PhantomJS and then extract the page data and send it to BeautifulSoup. Essentially, we replace the urllib2 HTML read with a Selenium page load and extract the rendered HTML source from that.
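The replacement might look like the sketch below. Note that `webdriver.PhantomJS` matches the Selenium versions current when this tutorial was written; it was removed in Selenium 4, so this is a historical sketch rather than something to copy into new projects.

```python
from bs4 import BeautifulSoup


def rendered_first_img(url, executable="./PhantomJS"):
    """Load the page in headless PhantomJS so its JavaScript runs, then hand
    the rendered HTML to BeautifulSoup (this replaces the urllib2 read)."""
    # Imported here so the rest of the script still works without Selenium.
    from selenium import webdriver

    browser = webdriver.PhantomJS(executable_path=executable)
    try:
        browser.get(url)
        soup = BeautifulSoup(browser.page_source, "html.parser")
        return soup.find("img")  # the 'src' field is now populated
    finally:
        browser.quit()
```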

14-pageloadcorrection

Now when we check our output, we get the image URL!

15-correctoutput

Now that we have parsed the snapshot_url, most of our work is done! We can write all our information to the output file and prepare to add it to the database!

Finally, we need to add a quick check to see whether the webcam is active. When the webcam is inactive, src="img/inactive"; we check for this with the re module. Just before we write the snapshot_url, we trim the "?math=(Random Number)" from the end, as it changes on every load and is not part of the permanent image URL.
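Both checks can be combined into one small helper, sketched below (the exact regexes are illustrative; the inactive marker and the "?math=" suffix come from the tutorial's observations of this site):

```python
import re


def clean_snapshot_url(src):
    """Return None for inactive cameras (src of "img/inactive"); otherwise
    strip the cache-busting "?math=..." suffix from the snapshot URL."""
    if src is None or re.search(r"img/inactive", src):
        return None
    return re.sub(r"\?math=.*$", "", src)
```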

16-checkcameraactive

Now let's run the script and check our output file to see that it has correctly written the information.

17-finaloutput

It looks great! You're done! Congratulations, you can now parse cameras. This process varies widely from site to site, but this example should get you started. Now we are ready for the final step of the process: adding the cameras to the database!

Before you add the cameras to the database, you may want to read through the How to Use the Geocoder page for information on helper functions you may find useful in the future.