A machine learning project that detects spam messages using Natural Language Processing (NLP): TF-IDF vectorization, SMOTE for class-imbalance handling, and a Logistic Regression classifier, all wrapped in a Streamlit web app.
Model Accuracy: ~99% on test set
project-root/
│
├── data/
│ ├── spam.csv # Original dataset
│ ├── finalmodel.pkl # Trained ML model
│ ├── vectorizer.pkl # Saved TF-IDF vectorizer
│ ├── feature.pkl
│ └── label.pkl
│
├── notebook/
│ └── eda.ipynb # Exploratory Data Analysis
│
├── src/
│ ├── preprocessing.ipynb # Preprocessing pipeline
│ └── training.ipynb # Model training, tuning, evaluation
│
├── app.py # Streamlit app
└── README.md # You are here!
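
The `.pkl` files under `data/` are the kind of artifacts joblib writes. Below is a minimal sketch of the save-and-reload round trip, assuming `finalmodel.pkl` holds a fitted scikit-learn classifier and `vectorizer.pkl` a fitted `TfidfVectorizer`. The toy training data, the 1/0 label encoding, and the temporary directory are illustrative assumptions, not the project's actual setup.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data for illustration only; the real project trains on spam.csv.
texts = [
    "win a free prize now",
    "free cash claim now",
    "are we still meeting for lunch",
    "see you at home tonight",
]
labels = [1, 1, 0, 0]  # assumption: 1 = spam, 0 = ham

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    # File names mirror the ones listed under data/.
    joblib.dump(model, Path(tmp) / "finalmodel.pkl")
    joblib.dump(vectorizer, Path(tmp) / "vectorizer.pkl")

    loaded_model = joblib.load(Path(tmp) / "finalmodel.pkl")
    loaded_vectorizer = joblib.load(Path(tmp) / "vectorizer.pkl")

# The reloaded pair must be used together: the vectorizer maps raw text
# to the same feature columns the model was trained on.
prediction = loaded_model.predict(
    loaded_vectorizer.transform(["claim your free prize"])
)
```

If the trained model also expects extra columns (such as a text length feature), the same columns have to be rebuilt at prediction time before calling `predict`.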
- Cleans & lemmatizes text
- TF-IDF vectorization
- Text length feature
- SMOTE for handling imbalanced classes
- Classifies using Logistic Regression
- Also tested with Multinomial Naive Bayes
- Saves the trained model and vectorizer with joblib for reuse
- Streamlit app for user interaction
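
The steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the code from the project's notebooks. `clean_text` and `build_features` are hypothetical helpers: the cleaner only lowercases and strips non-letters (the project also lemmatizes, which needs nltk's WordNet data), and measuring length in characters is an assumption. The SMOTE step is left out to keep the example dependency-light; with imblearn it would be a `SMOTE().fit_resample(X, y)` call on the training split before fitting.

```python
import re

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def clean_text(text: str) -> str:
    """Hypothetical cleaner: lowercase, keep letters only, collapse spaces."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def build_features(texts, vectorizer, fit=False):
    """Stack the TF-IDF columns with one text-length column."""
    cleaned = [clean_text(t) for t in texts]
    tfidf = vectorizer.fit_transform(cleaned) if fit else vectorizer.transform(cleaned)
    # Raw message length in characters, as a single extra sparse column.
    lengths = csr_matrix(np.array([[len(t)] for t in texts], dtype=float))
    return hstack([tfidf, lengths]).tocsr()


# Toy data for illustration only; labels assume 1 = spam, 0 = ham.
texts = [
    "WIN a FREE prize now!!!",
    "Free cash, claim now",
    "Are we still meeting for lunch?",
    "See you at home tonight",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X = build_features(texts, vectorizer, fit=True)
model = LogisticRegression(max_iter=1000).fit(X, labels)

# New messages go through the same cleaning and feature stacking.
X_new = build_features(["FREE prize!!! Claim now"], vectorizer)
pred = model.predict(X_new)
```

The length column is on a much larger scale than the TF-IDF weights, so scaling it before fitting may be worth trying.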
- Python
- Pandas, NumPy, Matplotlib, Seaborn
- scikit-learn, imblearn, nltk
- Streamlit
- joblib
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Logistic Regression | 99% | 0.99 | 0.99 | 0.99 |
| MultinomialNB | 96% | 0.93 | 1.00 | 0.96 |
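
For reference, the four metrics relate to confusion-matrix counts as sketched below. The counts in the example are made up for illustration and are not the project's results. The final two lines only check that the F1 values in the table follow from the listed precision and recall.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted spam that really is spam."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of actual spam that was caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all messages classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Illustrative counts only (not taken from the project).
tp, fp, fn, tn = 90, 10, 0, 100
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)
acc = accuracy(tp, tn, fp, fn)

# Consistency check against the precision and recall listed in the table.
nb_f1 = round(f1_score(0.93, 1.00), 2)  # MultinomialNB row
lr_f1 = round(f1_score(0.99, 0.99), 2)  # Logistic Regression row
```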
- Clone the repo: `git clone https://github.com/sarfraspc/spam-detector.git`
- Install requirements: `pip install -r requirements.txt`
- Run the Streamlit app: `streamlit run app.py`
- Source: spam.csv
- 5572 messages labeled as ham or spam
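
A quick way to inspect the class balance that motivates SMOTE, sketched with pandas. The column names `label` and `text` and the inline rows are assumptions for illustration. The real `spam.csv` may use different headers or need an explicit `encoding` argument to `pd.read_csv`.

```python
import io

import pandas as pd

# Stand-in for data/spam.csv; these rows and headers are made up.
csv_text = """label,text
ham,see you at lunch
ham,call me when you get home
ham,ok thanks
spam,win a free prize now
"""

df = pd.read_csv(io.StringIO(csv_text))

# Absolute and relative class frequencies.
counts = df["label"].value_counts()
shares = df["label"].value_counts(normalize=True)
print(counts)
print(shares.round(2))
```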
Sarfras LinkedIn
This project is open-source and available under the MIT License.