A weekday Twitter sentiment analysis on Tesla
The purpose of this project was to build a Twitter sentiment monitoring dashboard. A dataset of over 40K pre-labeled tweets (positive, negative, and neutral) was processed with NLP techniques and used to train and test several machine learning models. New tweets were collected via the Twitter API v2 and passed to the selected model to predict their sentiment. The dashboard presents sentiment by weekday for the chosen topic, Tesla.
- Data Source: SemEval-2017 Twitter data and Twitter API v2
- Language:
- Python, PySpark
- Dependencies: numpy, pandas, nltk, matplotlib, re, scikit-learn
- JavaScript, HTML, CSS
- Libraries: Bootstrap, D3, plotly, anychart
- Software: Jupyter Notebook, VSCode, Google Colab, pgAdmin, AWS
- Code/Notebooks:
- Link to interactive dashboard
- Link to slides
- SemEval Dataset - The dataset contains tweets categorized into three labels: positive, negative, and neutral. The data was gathered for an international workshop on semantic evaluation (SemEval) focusing on message polarity classification.
- API Tweets
- New tweets were collected via the Standard Basic Twitter API v2.
- The endpoints used were search tweets and tweet counts, which return the public tweets and the tweet count for a query over the last 7 days.
- Data was gathered from Nov. 1-14, 2021 (two weeks). A total of 67,200 tweets were collected (200 tweets per hour).
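
The collection window and the 67,200 total can be checked with a short sketch. This is not the project's collection script: the query string, the `max_results` value, and the helper names are assumptions for illustration, and no request is actually sent. `query`, `start_time`, `end_time`, and `max_results` are query parameters of the v2 recent search endpoint.

```python
from datetime import datetime, timedelta, timezone

TWEETS_PER_HOUR = 200  # figure stated in the text above


def hourly_windows(start, end):
    """Yield (window_start, window_end) pairs, one per hour, covering [start, end)."""
    current = start
    while current < end:
        yield current, current + timedelta(hours=1)
        current += timedelta(hours=1)


def build_search_params(query, window_start, window_end, max_results=100):
    """Build the query parameters for one search request (nothing is sent here)."""
    return {
        "query": query,  # hypothetical query string, not taken from the project
        "start_time": window_start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "end_time": window_end.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "max_results": max_results,  # assumed page size
    }


start = datetime(2021, 11, 1, tzinfo=timezone.utc)
end = datetime(2021, 11, 15, tzinfo=timezone.utc)  # exclusive, so Nov. 1-14 is covered

windows = list(hourly_windows(start, end))
expected_total = len(windows) * TWEETS_PER_HOUR  # 14 days * 24 hours * 200 tweets
params = build_search_params("tesla lang:en -is:retweet", *windows[0])
```

Because the search endpoint only reaches back 7 days, a two-week range like this one has to be collected in more than one run.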
- A function was created to clean the tweets, using regular expressions to remove retweets, hyperlinks, hashtags, numbers, emojis, and mentions. NLP was then applied with the NLTK library: the tweets were tokenized, stop words were removed, and the remaining words were lemmatized.
- A second function was created to apply the cleaning function, build the dataframe, and assign scores (1:positive, 0:neutral, -1:negative) to each tweet accordingly.
- Preprocessing of data: Using a .py file, the cleaning function was applied to the newly downloaded tweets from the Twitter API so they could be fed to the model for sentiment prediction.
- Modeling: Three algorithms were tested and compared to see which performed best. As part of the optimization, the neutral tweets were dropped to improve the model scores. Based on the results, the model selected was LinearSVC, with an accuracy score of 0.83.
Model | Accuracy Score | Precision - Recall - F1 Score |
---|---|---|
Bernoulli Naive Bayes | 0.72 | |
Logistic Regression | 0.82 | |
Linear Support Vector Classification | 0.83 | |
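
The cleaning step described earlier (regular expressions followed by tokenizing and stop-word removal) can be sketched roughly as below. This is a simplified illustration, not the project's function: `clean_tweet` and the tiny `STOP_WORDS` set are hypothetical stand-ins, the NLTK stop-word list is replaced by a hardcoded set so the example runs without corpus downloads, lemmatization is omitted, and "removing retweets" is interpreted here as stripping the leading `RT` marker.

```python
import re

# Tiny illustrative stop-word set (assumption); the project used NLTK's list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for", "my", "this", "it"}


def clean_tweet(text):
    """Return a list of lowercase word tokens with tweet-specific noise removed."""
    text = text.lower()
    text = re.sub(r"^rt\s+", "", text)                   # leading retweet marker
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # hyperlinks
    text = re.sub(r"[@#]\w+", " ", text)                 # mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)                # numbers, emojis, punctuation
    tokens = text.split()                                # whitespace tokenization
    return [token for token in tokens if token not in STOP_WORDS]


example = clean_tweet(
    "RT @tesla: Loving the new Model 3!!! \U0001F697 https://t.co/abc123 #EV 2021"
)
print(example)
```

A lemmatizer such as NLTK's `WordNetLemmatizer` would be applied to the returned tokens as a final step.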
- The same cleaning function was applied to the tweets downloaded via the API, turning each tweet into a list of words as a preprocessing step.
- The model and vectorizer used in the machine learning section were saved as pickle files so they could be imported and applied to the new set of tweets.
- The downloaded data and the predicted scores were merged into dataframes, which were stored in the database and also saved as a JSON file for the visualization.
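
The train, pickle, and predict flow described above might look like the following minimal sketch. The toy training texts are made up for illustration, and `TfidfVectorizer` is an assumption, since the text does not say which vectorizer was used. Only `LinearSVC` and the 1/-1 labels (neutral tweets dropped) come from the description.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in for the cleaned, labeled training tweets (1: positive, -1: negative).
train_texts = [
    "love tesla great car",
    "amazing great drive love",
    "great battery love range",
    "hate tesla terrible car",
    "awful terrible service hate",
    "terrible delay hate wait",
]
train_labels = [1, 1, 1, -1, -1, -1]

# Fit the vectorizer and the classifier on the training data.
vectorizer = TfidfVectorizer()  # assumed vectorizer
X_train = vectorizer.fit_transform(train_texts)
model = LinearSVC()
model.fit(X_train, train_labels)

# Serialize both objects together. In-memory bytes are used here;
# pickle.dump / pickle.load on a file object work the same way.
saved = pickle.dumps((vectorizer, model))

# Later, in the prediction script: restore both objects and score new tweets.
loaded_vectorizer, loaded_model = pickle.loads(saved)
new_tweets = ["love great range", "hate terrible delay"]
predictions = loaded_model.predict(loaded_vectorizer.transform(new_tweets)).tolist()
```

The vectorizer has to be saved along with the model because new tweets must be mapped into the same feature space the model was trained on.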
- An AWS RDS instance was created and connected to pgAdmin where a new database was created.
- Tables were generated using a schema.
- Data files were uploaded to an S3 bucket where PySpark was used to extract and transform the data to match the tables in pgAdmin (see ETL process).
- A connection was made to the AWS RDS instance to write the resulting dataframes to the RDS tables.
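
The final write step can be sketched as follows, assuming a PostgreSQL RDS instance reached over JDBC. The host, database name, credentials, and helper names are placeholders rather than the project's actual configuration, and the PostgreSQL JDBC driver is assumed to be available to the Spark session.

```python
def build_jdbc_config(host, port, database, user, password):
    """Return the JDBC URL and connection properties for a PostgreSQL database."""
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    return url, properties


def write_to_rds(df, table, url, properties, mode="append"):
    """Write a PySpark DataFrame to an existing table over JDBC."""
    df.write.jdbc(url=url, table=table, mode=mode, properties=properties)


# Placeholder values (assumptions); substitute the real RDS endpoint and credentials.
url, properties = build_jdbc_config(
    host="<your-rds-endpoint>",
    port=5432,
    database="tweets_db",
    user="postgres",
    password="<password>",
)
```

With a transformed PySpark DataFrame `df` in hand, `write_to_rds(df, "tweets", url, properties)` would append its rows to a hypothetical `tweets` table whose columns match the schema.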
- Date range of the week
- Total number of @tesla tweets
- Average number of tweets per day
- Number of tweets per weekday
- Accumulated weekday tweets
- Number of weeks so far
- Overall sentiment distribution
- Positive rate
- Five positive tweets and five negative tweets
- Top 10 keywords
- Top 100 keywords cloud
- Line chart of sentiment distribution by hour
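
Two of the metrics listed above, tweets per weekday and positive rate, could be computed along these lines. The sample records and field names are made up for illustration, and the positive rate is assumed here to be the share of positive tweets among all scored tweets, which the text does not define.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical scored tweets (1: positive, -1: negative).
tweets = [
    {"created_at": "2021-11-01T09:15:00Z", "score": 1},   # Monday
    {"created_at": "2021-11-01T18:40:00Z", "score": -1},  # Monday
    {"created_at": "2021-11-02T11:05:00Z", "score": 1},   # Tuesday
    {"created_at": "2021-11-03T07:30:00Z", "score": 1},   # Wednesday
]


def weekday_name(timestamp):
    """Map an ISO 8601 UTC timestamp to its weekday name."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return WEEKDAYS[parsed.weekday()]  # weekday() returns 0 for Monday


tweets_per_weekday = Counter(weekday_name(t["created_at"]) for t in tweets)
positive_rate = sum(t["score"] == 1 for t in tweets) / len(tweets)
```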
Tesla.Twitter.Sentiment.Dashboard.mp4
- Rosenthal, S. et al. (n.d.). SemEval-2017 Task 4. ALT QCRI. https://alt.qcri.org/semeval2017/task4/
- Edward, A. (2021, June 13). An Extensive Guide to collecting tweets from Twitter API v2 for academic research using Python 3. Towards Data Science. https://towardsdatascience.com/an-extensive-guide-to-collecting-tweets-from-twitter-api-v2-for-academic-research-using-python-3-518fcb71df2a
- Goyal, G. (2021, June 11). Twitter Sentiment Analysis- A NLP Use-Case for Beginners. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/06/twitter-sentiment-analysis-a-nlp-use-case-for-beginners/
- Effrosynidis, D. (2019). text-preprocessing-techniques. GitHub. https://github.com/Deffro/text-preprocessing-techniques
Other useful articles: