My Final Year Project for Integrated Computer Science. Written in Python, it uses the Stanford POS tagger and SentiWordNet 3.0.
- NLTK 3.0
- Beautiful Soup 4.3.2
- PPR CSV files, which must be placed in the folder "../PPR" relative to the repository root
- SentiWordNet 3.0 text file (although the code could be modified to use the NLTK API, which would be preferable; see the sketch after this list)
- Stanford POS tagger 3.4.1, which also requires the "english-bidirectional-distsim.tagger" model
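
For reference, the NLTK corpus API mentioned above exposes the same scores as the raw text file; a minimal sketch, assuming the `sentiwordnet` and `wordnet` corpora have been fetched once via `nltk.download`:

```python
# Minimal sketch of reading SentiWordNet through NLTK instead of the raw
# text file. Assumes nltk.download('sentiwordnet') and
# nltk.download('wordnet') have been run once.
from nltk.corpus import sentiwordnet as swn

# Look up a specific synset: "good" as an adjective, first sense.
good = swn.senti_synset('good.a.01')
print(good.pos_score(), good.neg_score(), good.obj_score())

# Or iterate over every sense of a word.
for sense in swn.senti_synsets('good'):
    print(sense)
```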
The following are guidelines for running the different scripts in the project. Make sure to install all of the dependencies listed above before continuing.
- Install Python 2.6 or later (NLTK 3.0 does not support earlier versions)
- Download Beautiful Soup 4 (instructions found here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup). Make sure to get BeautifulSoup4-4.3.2!
- Install MySQL
- Set up the SQL tables "Threads" and "Posts" as described in the backup.sql files in the common folder of this repository
- Change the settings in common/datamanager.py to work with your SQL database
- Change the constant URL values in postscraper.py to the Property Pin pages that you want to scrape (see the sketch after this list)
- Run the command "python postscraper.py"
- Note that logs are printed to "logs/scraper.log"
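
For orientation only, here is a stripped-down sketch of the fetch-parse-store loop the steps above imply. The URL, CSS class, table column, and credentials are all placeholders rather than the values postscraper.py and common/datamanager.py actually use:

```python
# Orientation-only sketch of the scrape loop; everything named here is a
# placeholder -- the real values live in postscraper.py and
# common/datamanager.py.
import urllib2  # urllib.request on Python 3

import MySQLdb
from bs4 import BeautifulSoup

THREAD_URL = 'http://example.com/some-property-pin-thread'  # placeholder

html = urllib2.urlopen(THREAD_URL).read()
soup = BeautifulSoup(html)

db = MySQLdb.connect(host='localhost', user='user', passwd='pass', db='fyp')
cursor = db.cursor()

# The post container class is a placeholder -- inspect the real markup.
for post in soup.find_all('div', class_='post'):
    cursor.execute('INSERT INTO Posts (content) VALUES (%s)',
                   (post.get_text(),))
db.commit()
db.close()
```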
- Download all of the Property Price Register CSV files and place them in "../PPR" relative to this repository's root folder
- Run locationGen.py after scraping at least one thread
- Output is stored in adressIndex.json and addressLookupTable.json
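
As a rough illustration of what locationGen.py's pass over the PPR data involves, the sketch below indexes addresses from the CSVs. The column names follow the published Property Price Register format, but the JSON structure shown is illustrative only, not the project's actual layout:

```python
# Illustrative sketch of indexing the PPR CSVs; the real logic is in
# locationGen.py, and the JSON structure below is an assumption.
import csv
import glob
import json

index = {}
for path in glob.glob('../PPR/*.csv'):
    with open(path, 'rb') as f:  # open(path, newline='') on Python 3
        for row in csv.DictReader(f):
            # "Address" and "County" follow the published PPR headers;
            # verify them against your downloaded files.
            address = row['Address'].strip().lower()
            index.setdefault(address, []).append(row['County'])

with open('addressLookupTable.json', 'w') as out:
    json.dump(index, out)
```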
- Run locationMatcher.py after completing the previous two tasks
- Output is stored in addressMatches.json
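
The matching strategy itself lives in locationMatcher.py; as a hypothetical illustration of the general idea, a fuzzy lookup of post text against the address table could look like this (`match_addresses` and the four-word window are inventions for the example):

```python
# Hypothetical illustration of matching post text against the PPR address
# index -- locationMatcher.py implements the project's actual strategy,
# which may differ.
import difflib
import json

with open('addressLookupTable.json') as f:
    addresses = list(json.load(f))

def match_addresses(post_text, cutoff=0.8):
    """Return PPR addresses that closely resemble phrases in a post."""
    words = post_text.lower().split()
    hits = set()
    for i in range(len(words)):
        phrase = ' '.join(words[i:i + 4])  # sliding 4-word window
        hits.update(difflib.get_close_matches(phrase, addresses,
                                              n=3, cutoff=cutoff))
    return hits
```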
- Download StanfordPosTagger.jar
- Run sentimentAnalysis.py after completing the previous three tasks (make sure at least one entry exists in addressMatches.json, otherwise nothing will happen)
- Output is stored in sentimentAnalysis.csv
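
Schematically, this step tags tokens with the Stanford tagger and scores them against SentiWordNet. A hedged sketch with placeholder paths follows; note that NLTK 3.0 exposes the wrapper as nltk.tag.stanford.POSTagger, while later NLTK releases rename it StanfordPOSTagger:

```python
# Hedged sketch of the tag-then-score pipeline; sentimentAnalysis.py is
# the authoritative version. Model and jar paths are placeholders.
from nltk.corpus import sentiwordnet as swn
from nltk.tag.stanford import POSTagger  # StanfordPOSTagger in later NLTK

tagger = POSTagger('english-bidirectional-distsim.tagger',  # model path
                   'stanford-postagger.jar')                # jar path

# Map Penn Treebank tag prefixes to WordNet parts of speech.
PENN_TO_WN = {'JJ': 'a', 'RB': 'r', 'NN': 'n', 'VB': 'v'}

def sentence_score(tokens):
    """Sum SentiWordNet pos-neg scores over the first sense of each token."""
    score = 0.0
    for word, tag in tagger.tag(tokens):
        wn_pos = PENN_TO_WN.get(tag[:2])
        if wn_pos is None:
            continue
        senses = list(swn.senti_synsets(word.lower(), wn_pos))
        if senses:
            score += senses[0].pos_score() - senses[0].neg_score()
    return score

print(sentence_score('lovely house in a great area'.split()))
```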
- Run aggregatePriceSent.py
- Output is stored in aggregatedData.csv
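
aggregatePriceSent.py defines the real join; purely as a sketch, averaging sentiment per matched address before pairing it with price data might look like this (the two-column CSV layout is an assumption):

```python
# Placeholder sketch of averaging sentiment per address;
# aggregatePriceSent.py holds the real join with the PPR prices.
import csv
from collections import defaultdict

scores = defaultdict(list)
with open('sentimentAnalysis.csv', 'rb') as f:  # drop the 'b' on Python 3
    for address, score in csv.reader(f):        # assumed two-column layout
        scores[address].append(float(score))

with open('aggregatedData.csv', 'wb') as out:
    writer = csv.writer(out)
    for address, values in scores.items():
        writer.writerow([address, sum(values) / len(values)])
```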
- Run aggregateData.py without any arguments
- Output is stored in aggregatedData.csv
- Run aggregateData.py with the argument "bigrams"
- Output is stored in aggregatedData.csv
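
The "bigrams" switch suggests a simple command-line toggle; a hypothetical sketch of how aggregateData.py might branch on it (the token stream here is a stand-in):

```python
# Hypothetical sketch of the "bigrams" toggle; check aggregateData.py
# for the actual argument handling.
import sys

def bigrams(tokens):
    """[a, b, c] -> [(a, b), (b, c)]"""
    return zip(tokens, tokens[1:])

words = 'great location but damp walls'.split()  # stand-in token stream
use_bigrams = len(sys.argv) > 1 and sys.argv[1] == 'bigrams'
units = list(bigrams(words)) if use_bigrams else words
print(units)
```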