This project aims to classify tweets into two categories: politics and sports. The classification is done using machine learning techniques, specifically logistic regression and XGBoost models. The project follows a structured workflow that includes data preprocessing, exploratory data analysis (EDA), text vectorization using TF-IDF, and model training and evaluation.
The project is organized into several main sections:
- Data Preprocessing:
  - Read the raw data from the dataset.
  - Check for and handle any null or missing values.
  - Clean the data by removing special characters and unwanted symbols.
  - Tokenize the text data into individual words.
  - Convert all words to lowercase.
  - Remove stop words to reduce noise in the data.
  - Lemmatize words to reduce inflections.
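The cleaning steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the `preprocess` function, the tiny `STOP_WORDS` set, and the sample tweet are all assumptions made for the example, and lemmatization is left as a pluggable hook (identity by default) so the sketch runs without downloading any language resources.

```python
import re
from typing import Callable, List

# Tiny illustrative stop-word list (an assumption for this sketch); a real run
# would use a fuller list such as the ones shipped with NLTK or scikit-learn.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of",
              "in", "on", "and", "for", "at", "with"}


def preprocess(text: str,
               lemmatize: Callable[[str], str] = lambda w: w) -> List[str]:
    """Clean one tweet and return its list of normalized tokens."""
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)     # drop special characters and digits
    tokens = text.split()                        # whitespace tokenization
    tokens = [t.lower() for t in tokens]         # lowercase every token
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [lemmatize(t) for t in tokens]        # hook for a real lemmatizer


# Hypothetical sample tweet, not taken from the dataset.
sample = "The Senate is voting on the new budget!!! #politics https://t.co/xyz"
tokens = preprocess(sample)
print(tokens)
```

To plug in a real lemmatizer, pass any function that maps a word to its base form as the `lemmatize` argument.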
- Exploratory Data Analysis (EDA):
  - Check the balance between the two classes (politics and sports) in the dataset.
  - Visualize the class distribution using plots (e.g., bar charts).
  - Analyze the frequency of words within each label using word clouds or bar plots.
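As a rough illustration of the checks above, the counts behind a class-balance bar chart and a per-label word-frequency plot can be computed with the standard library alone. The `labels` and `token_lists` values below are made-up toy data, not the project's dataset, and plotting is left out of the sketch.

```python
from collections import Counter

# Hypothetical toy data: one label and one token list per tweet.
labels = ["politics", "sports", "sports", "politics", "sports"]
token_lists = [
    ["senate", "vote", "budget"],
    ["team", "goal", "match"],
    ["goal", "striker", "win"],
    ["election", "vote", "debate"],
    ["match", "coach", "win"],
]

# Class balance: number of tweets per label.
class_counts = Counter(labels)

# Word frequency within each label.
word_freq = {label: Counter() for label in class_counts}
for label, tokens in zip(labels, token_lists):
    word_freq[label].update(tokens)

print(class_counts)
for label, counts in word_freq.items():
    print(label, counts.most_common(3))
```

The resulting counters can be handed to any plotting library to draw the bar charts or word clouds.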
- Text Vectorization (TF-IDF):
  - Convert the preprocessed text data into numerical vectors using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization.
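A minimal sketch of this step, assuming scikit-learn's `TfidfVectorizer` is the vectorizer in use (the README does not name the library) and using three made-up documents in place of the preprocessed tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical preprocessed tweets, joined back into strings.
docs = [
    "senate vote budget",
    "team win match",
    "senate debate vote",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix: one row per document

vocabulary = sorted(vectorizer.vocabulary_)  # learned terms, alphabetical
print(X.shape)
print(vocabulary)
```

Each row of `X` is one tweet and each column is one vocabulary term, so terms that appear in many tweets receive lower weights than terms specific to a few.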
- Modeling:
  - Train a logistic regression model on the TF-IDF transformed data.
  - Train an XGBoost model as an alternative classifier.
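The two training steps might look roughly like the sketch below. The toy `texts`, the 0/1 label encoding, and all hyperparameters are assumptions chosen for illustration and are not taken from the notebook. The XGBoost part is wrapped in a `try` block so the sketch still runs when `xgboost` is not installed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus; 0 = politics, 1 = sports (an assumed encoding).
texts = [
    "senate vote budget", "election debate vote",
    "president policy senate", "parliament law debate",
    "team win match", "goal striker match",
    "coach team league", "player goal win",
]
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X = TfidfVectorizer().fit_transform(texts)

# Logistic regression on the TF-IDF features.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X, y)
log_reg_pred = log_reg.predict(X)

# XGBoost as the alternative classifier (optional dependency in this sketch).
try:
    from xgboost import XGBClassifier

    xgb = XGBClassifier(n_estimators=50, max_depth=3)
    xgb.fit(X, y)
    xgb_pred = xgb.predict(X)
except ImportError:
    xgb_pred = None  # xgboost not installed; skip this model

print(log_reg_pred)
print(xgb_pred)
```

In a real run the models would be fitted on a training split and used to predict on a held-out test split; the sketch predicts on the training data only to stay short.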
- Evaluation:
  - Evaluate the performance of both models using appropriate evaluation metrics (e.g., accuracy, precision, recall, F1-score).
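The metrics listed above can be computed with `sklearn.metrics`, as in this small sketch. The `y_true` and `y_pred` arrays are invented for the example and do not represent results from either model; treating label 1 (sports) as the positive class is also an assumption.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels: 0 = politics, 1 = sports.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# Precision, recall, and F1 for the positive class (label 1).
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

With 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, all four metrics come out at 0.75 for this toy input. Running the same calls on each model's test-set predictions gives a side-by-side comparison.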
- Conclusion:
  - Summarize the results and insights gained from the project.
  - Reflect on the effectiveness of the models and suggest potential improvements.
- Clone this repository to your local machine.
- Set up the required environment by installing the necessary libraries and dependencies: `pip install -r requirements.txt`
- Run the Jupyter Notebook `deep_tweets_classification.ipynb` to execute the project pipeline.
- Open the Jupyter Notebook `deep_tweets_classification.ipynb`.
- Follow the step-by-step instructions to execute each code cell.
- Review the EDA plots, model training process, and evaluation results.
We welcome your feedback, suggestions, and questions, whether you have ideas for improvements or want to know more about the project.
Made with Love 💌