Twitter users often associate and socialize with other users based on similar interests. The Tweets of these users can be classified using a trained LDA model to automate the discovery of their similarities.
Python 2.7 is recommended because the pattern library is currently incompatible with most Python 3 versions.
Python 3.6 also works with the pattern library, but it may need to be built from source because most newer Linux distributions do not ship with it pre-installed. The commands to build Python 3.6 from source are provided in the linux_setup_py3.6.sh script.
Download:
git clone https://github.com/kethort/twitter_LDA_topic_modeling.git
Run the bash script:
./linux_setup_py3.6.sh
The Python pip requirements are listed in these files:
# for Python 2.7
pip install -r requirements_py2.txt
# for Python 3
pip install -r requirements_py3.txt
Link to the simple-wikipedia dump:
https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2
The OSX installation is very similar to the Linux installation:
Extra install instructions are in osx_setup_py3.6.info
pip install -r requirements_py3_OSX.txt
- Get user and follower ids by location - twitter_user_grabber.py
- Download Tweets for each user - get_community_tweets.py
- Create an LDA model from a corpus of documents - create_LDA_model.py
- Generate topic probability distributions for Tweet documents - tweets_on_LDA.py
- Calculate distances between Tweet documents and graph them - plot_distances.py
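
The last step above compares users by the distance between their topic probability distributions. As a rough illustration of what such a comparison can look like, here is a minimal sketch using the Jensen-Shannon distance. This metric is an assumption chosen for the example and may not be the one plot_distances.py uses. The function name and the sample distributions are hypothetical and are not taken from the repository.

```python
import math


def jensen_shannon_distance(p, q):
    """Return the Jensen-Shannon distance (log base 2) between two distributions.

    Hypothetical helper for illustration; not part of the repository.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of topics")

    # Normalize so that each vector sums to 1.
    p_total, q_total = float(sum(p)), float(sum(q))
    p = [x / p_total for x in p]
    q = [x / q_total for x in q]

    # Midpoint distribution between p and q.
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; zero-probability terms contribute nothing.
        return sum(x * math.log(x / y) / math.log(2) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by floating point rounding.
    return math.sqrt(max(divergence, 0.0))


# Made-up topic distributions for two users over four topics.
user_a = [0.70, 0.10, 0.10, 0.10]
user_b = [0.10, 0.10, 0.10, 0.70]

print(jensen_shannon_distance(user_a, user_a))  # identical distributions
print(jensen_shannon_distance(user_a, user_b))  # different dominant topics
```

With base-2 logarithms the result ranges from 0 for identical distributions to 1 for distributions that share no topics, which makes the values easy to compare across user pairs.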