This project aims to detect spam emails using natural language processing (NLP) techniques and machine learning. The dataset contains emails labeled as spam or non-spam, and the goal is to build a model that can classify emails accurately.
The dataset used in this project is a CSV file containing email data with the following columns:
- label_num: 1 for spam, 0 for non-spam
- text: The content of the email
- Column Renaming: The columns are renamed for better understanding.
- Handling Missing and Duplicate Values: Missing values are checked, and duplicates are removed.
- Text Preprocessing: The email content is preprocessed by converting to lowercase, removing non-alphanumeric characters, removing stop words, and stemming the words.
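The text preprocessing step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function name `preprocess_text`, the small `DEMO_STOP_WORDS` set, and the regex-based tokenization are assumptions made so the example runs without downloaded corpora. Porter stemming is also an assumption, since the stemmer used by the project is not stated.

```python
import re

from nltk.stem import PorterStemmer

# Illustrative stop word subset (an assumption, kept small so the sketch runs
# without downloaded corpora). The project presumably loads the full list via
# nltk.corpus.stopwords.words("english") after nltk.download("stopwords").
DEMO_STOP_WORDS = {"a", "an", "the", "is", "are", "to", "you", "have", "it"}

_stemmer = PorterStemmer()


def preprocess_text(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip non-alphanumerics, drop stop words, stem the rest."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    kept = [tok for tok in tokens if tok not in stop_words]
    return " ".join(_stemmer.stem(tok) for tok in kept)


if __name__ == "__main__":
    print(preprocess_text("The RUNNING emails!!!"))  # prints "run email"
```

Passing the stop word set as a parameter keeps the function easy to test and lets the full NLTK list be swapped in without changing the logic.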
New features are added for better analysis:
- num_characters: Number of characters in the email
- num_words: Number of words in the email
- num_sentences: Number of sentences in the email
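As a rough sketch of how these three features could be computed for one email, the helper below uses only the standard library. The function name `email_features` is hypothetical, and splitting on whitespace and on `.`, `!`, `?` is a simplifying assumption; the project may instead rely on NLTK's tokenizers (it downloads `punkt`), which handle abbreviations and punctuation more carefully.

```python
import re


def email_features(text):
    """Return simple length-based features for one email body."""
    # Approximate sentences as non-empty chunks between ., ! or ? characters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "num_characters": len(text),
        "num_words": len(text.split()),  # whitespace-separated tokens
        "num_sentences": len(sentences),
    }


sample = "Hello there. Win a prize now!"
features = email_features(sample)
print(features)
```

With a pandas DataFrame, the same helper could be applied column-wise, for example through `df["text"].apply(email_features)`.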
- Text Vectorization: The text is transformed using TfidfVectorizer.
- Train-Test Split: The dataset is split into training and testing sets.
- Model Selection: A Multinomial Naive Bayes model is trained on the training set.
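The three steps above might look like the sketch below. The function name `train_spam_classifier`, the toy corpus, and the `test_size` and `random_state` values are all assumptions for illustration. The sketch also splits the data before fitting the vectorizer, so that TF-IDF statistics come from the training set only; the project's script may vectorize first, as the list order suggests.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def train_spam_classifier(texts, labels, test_size=0.25, random_state=42):
    """Split, vectorize with TF-IDF, and fit a Multinomial Naive Bayes model."""
    X_train_text, X_test_text, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state, stratify=labels
    )
    vectorizer = TfidfVectorizer()
    # Fit the vocabulary and IDF weights on training text only.
    X_train = vectorizer.fit_transform(X_train_text)
    X_test = vectorizer.transform(X_test_text)
    model = MultinomialNB()
    model.fit(X_train, y_train)
    return vectorizer, model, X_test, y_test


# Tiny made-up corpus (already preprocessed), 1 = spam, 0 = non-spam.
texts = [
    "win free prize now", "claim free cash reward",
    "cheap loan offer click", "free entri win cash",
    "meet tomorrow project", "pleas review attach report",
    "lunch team friday", "schedul call next week",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

vectorizer, model, X_test, y_test = train_spam_classifier(texts, labels)
predictions = model.predict(X_test)
print(predictions, y_test)
```

Multinomial Naive Bayes accepts the non-negative TF-IDF matrix directly, so no scaling step is needed between the vectorizer and the model.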
The model is evaluated using accuracy, a confusion matrix, and a classification report. The confusion matrix is also plotted as a heatmap for easier inspection.
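A minimal sketch of these metrics with scikit-learn is shown below. The `y_true` and `y_pred` lists are made-up values for illustration, not results from this project, and the class names passed to `target_names` are assumptions. Plotting is left out to keep the example dependency-free beyond scikit-learn.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels and predictions (1 = spam, 0 = non-spam).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
report = classification_report(
    y_true, y_pred, labels=[0, 1], target_names=["non-spam", "spam"]
)

print(f"Accuracy: {accuracy:.3f}")
print(cm)
print(report)
```

For spam filtering, the off-diagonal cells deserve attention: the top-right cell counts legitimate emails flagged as spam, which is usually the more costly error.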
- Clone the repository:

      git clone https://github.com/yourusername/spam-mail-detection.git
      cd spam-mail-detection

- Install the required packages:

      pip install -r requirements.txt

- Download NLTK data:

      import nltk
      nltk.download('stopwords')
      nltk.download('punkt')

- Run the preprocessing and model training script:

      python train_model.py

The model evaluation results are printed to the console, including accuracy, the confusion matrix, and the classification report. The confusion matrix is also visualized as a heatmap.
Contributions are welcome! Please open an issue or submit a pull request.