A machine learning project that classifies YouTube videos as educational or non-educational from video metadata using Logistic Regression. The predictions are then used to improve efficiency on YouTube by blocking non-educational videos.
This project implements a complete ML pipeline to automatically classify YouTube videos into educational (1) vs non-educational (0) categories. The classification is based on video titles, channel names, and other metadata features extracted from YouTube video data.
- Total Videos: 3,024 (after removing missing labels)
- Educational Videos: 1,170 (38.7%)
- Non-Educational Videos: 1,654 (54.7%)
- Features Used: Video title, channel name, video URL, and engineered numerical features
pip install -r requirements.txt
jupyter notebook model/educational_video_classification.ipynb

Clarigo/
├── data/
│ ├── normalized/ # Cleaned JSONL files with labels
│ ├── processed_data/ # Master dataset (CSV & JSONL)
│ └── labeled_data/ # Original labeled data
├── preprocessing/ # Data processing scripts
├── model/
│ ├── dataset_builder.py # Combines normalized files into master dataset
│ └── educational_video_classification.ipynb # Main ML notebook
├── requirements.txt # Python dependencies
└── README.md # This file
The model uses both text and numerical features:
- Combined Text: Video title + channel name (preprocessed)
- TF-IDF Parameters: 5,000 max features, 1-2 ngrams, English stop words removed
- Title length (characters)
- Title word count
- Channel word count
- Presence of numbers in title
- Uppercase ratio in title
- Educational keywords count
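
The text side of the feature set can be sketched as follows. This is a minimal illustration that assumes scikit-learn's `TfidfVectorizer` produces the TF-IDF features; the `combine_text` helper and its lowercase/whitespace cleanup are hypothetical stand-ins for whatever preprocessing the notebook actually applies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def combine_text(title: str, channel: str) -> str:
    """Join title and channel name into one lowercased string (assumed preprocessing)."""
    return " ".join(f"{title} {channel}".lower().split())


# Parameters from the list above: 5,000 max features, unigrams and bigrams,
# English stop words removed.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words="english")

# Tiny made-up sample, only to show the shapes involved.
samples = [
    combine_text("Python Programming Tutorial for Beginners", "CodeAcademy"),
    combine_text("Funny Cat Compilation 2023", "GamerHub"),
]
tfidf_matrix = vectorizer.fit_transform(samples)
print(tfidf_matrix.shape)  # 2 rows, one column per kept unigram or bigram
```

In the real pipeline the vectorizer would be fitted on the training split only and then reused to transform validation and new data.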
The model recognizes educational content through keywords like:
tutorial, learn, education, course, lesson, teach, training, study, guide, how to, explained, basics, fundamentals, programming, coding, math, science, etc.
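
The six numerical features listed earlier, including the keyword count, might be computed roughly like this. The function name, the feature names, and the exact definitions (uppercase ratio over letters only, substring matching for keywords) are assumptions made for illustration; the notebook may define them differently.

```python
import re

# Keywords named above; the "etc." in the list means the notebook's list is longer.
EDUCATIONAL_KEYWORDS = [
    "tutorial", "learn", "education", "course", "lesson", "teach", "training",
    "study", "guide", "how to", "explained", "basics", "fundamentals",
    "programming", "coding", "math", "science",
]


def numerical_features(title: str, channel: str) -> dict:
    """Compute the six engineered features for one video (illustrative sketch)."""
    letters = [c for c in title if c.isalpha()]
    text = f"{title} {channel}".lower()
    uppercase_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {
        "title_length": len(title),
        "title_word_count": len(title.split()),
        "channel_word_count": len(channel.split()),
        "title_has_number": int(bool(re.search(r"\d", title))),
        # Assumption: uppercase letters divided by all letters in the title.
        "title_uppercase_ratio": uppercase_ratio,
        # Assumption: number of distinct keywords found as substrings,
        # so "learn" also matches "learning".
        "educational_keyword_count": sum(kw in text for kw in EDUCATIONAL_KEYWORDS),
    }


features = numerical_features("Python Programming Tutorial for Beginners", "CodeAcademy")
print(features)
```

For this title the sketch finds two keywords ("tutorial" and "programming"), a title length of 41 characters, and 5 title words.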
The final Logistic Regression model achieves:
- Accuracy: ~87-90% (varies across cross-validation folds)
- ROC-AUC: ~0.93-0.95
- Cross-validation: 5-fold CV with stratified sampling
- Hyperparameter Tuning: GridSearchCV optimized for ROC-AUC
Selected hyperparameters:
- C: 10.0 (inverse regularization strength)
- penalty: 'l2'
- solver: 'liblinear'
- class_weight: 'balanced' (handles class imbalance)
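
A hedged sketch of how a search like this could be set up with scikit-learn. The synthetic data and the candidate values for `C` are assumptions made so the example runs on its own; only the hyperparameter values listed above come from the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data; the real inputs are the TF-IDF and numerical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hypothetical search grid over the inverse regularization strength.
param_grid = {"C": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    # penalty is left at its default, which is 'l2'.
    LogisticRegression(solver="liblinear", class_weight="balanced"),
    param_grid,
    scoring="roc_auc",  # optimize for ROC-AUC
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),  # stratified 5-fold CV
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Stratified folds keep the educational/non-educational ratio roughly constant in every split, which matters when the classes are not evenly sized.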
import joblib
from pathlib import Path
# Load trained model and pipelines
model = joblib.load('models/logistic_regression_model.pkl')
text_pipeline = joblib.load('models/text_preprocessing_pipeline.pkl')
numerical_pipeline = joblib.load('models/numerical_preprocessing_pipeline.pkl')
# Predict a new video
title = "Python Programming Tutorial for Beginners"
channel = "CodeAcademy"
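# predict_video_educational is not defined in this snippet; it is assumed to be
# the helper defined in model/educational_video_classification.ipynb
prediction, probability = predict_video_educational(title, channel, model, text_pipeline, numerical_pipeline)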
print(f"Prediction: {'Educational' if prediction == 1 else 'Non-Educational'}")
print(f"Confidence: {probability[1]:.3f}")

- Text Features Dominate: TF-IDF features from titles and channel names are the strongest predictors
- Educational Keywords: Words like "tutorial", "learn", "python", "explained" strongly indicate educational content
- Channel Context: Channel names provide valuable context (e.g., "CodeAcademy" vs "GamerHub")
- Title Patterns: Educational videos often have longer, more descriptive titles
- Class Imbalance: The dataset is slightly imbalanced (54.7% non-educational), handled with balanced class weights
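
To make the class-weight point concrete, here is a small worked example. scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`; the helper below reimplements that formula in plain Python and applies it to the two class counts from the dataset summary. The helper name is hypothetical, and the weights actually used in training depend on the class counts in the training split.

```python
def balanced_class_weights(class_counts: dict) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}


# Counts from the dataset summary: 1 = educational, 0 = non-educational.
weights = balanced_class_weights({1: 1170, 0: 1654})
print({label: round(w, 3) for label, w in weights.items()})  # {1: 1.207, 0: 0.854}
```

The minority (educational) class gets the larger weight, so misclassifying an educational video costs more during training.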
- More Data: Collect additional labeled videos to improve performance
- Enhanced Features: Include video duration, view count, description text
- Advanced Models: Try Random Forest, XGBoost, or Neural Networks
- Better Text Processing: Use pre-trained embeddings (Word2Vec, BERT)
- Active Learning: Implement uncertainty sampling for efficient labeling
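
As a rough illustration of the uncertainty-sampling idea, the sketch below picks the unlabeled videos whose predicted probability is closest to 0.5, which is where a binary classifier is least certain. The function name and the probabilities are made up for the example.

```python
def select_most_uncertain(probabilities: list, k: int) -> list:
    """Return indices of the k items whose predicted probability is closest to 0.5."""
    ranked = sorted(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


# Hypothetical predicted probabilities of the "educational" class for unlabeled videos.
unlabeled_probs = [0.97, 0.52, 0.10, 0.45, 0.80]
to_label = select_most_uncertain(unlabeled_probs, k=2)
print(to_label)  # [1, 3]
```

Sending only these videos to human labelers would focus labeling effort on the examples the current model finds hardest.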

- model/dataset_builder.py: Combines all normalized JSONL files into a master CSV dataset
- model/educational_video_classification.ipynb: Complete ML pipeline with EDA, feature engineering, training, and evaluation
- data/normalized/: Clean, consistent JSONL files with video metadata and labels
- data/processed_data/master_dataset.csv: Final combined dataset ready for ML
- models/: Saved trained models and preprocessing pipelines (created after training)
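
For orientation, combining the normalized JSONL files into one CSV could look roughly like the sketch below. It is not the actual `dataset_builder.py`: the function name, the `label` field name, and the choice to drop unlabeled records here are assumptions.

```python
import csv
import json
from pathlib import Path


def build_master_dataset(normalized_dir: Path, output_csv: Path, label_field: str = "label") -> int:
    """Merge every *.jsonl file in normalized_dir into one CSV; return rows written."""
    rows = []
    for path in sorted(normalized_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                # Drop records without a label (assumed field name).
                if record.get(label_field) is None:
                    continue
                rows.append(record)

    # Use the union of keys so files with slightly different fields still merge.
    fieldnames = sorted({key for row in rows for key in row})
    output_csv.parent.mkdir(parents=True, exist_ok=True)
    with output_csv.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

With the layout above, a call such as `build_master_dataset(Path("data/normalized"), Path("data/processed_data/master_dataset.csv"))` would write the master CSV and return the number of labeled rows.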
Current Pipeline: YouTube Metadata web scraper → Primary cleaning → Human labeling → Data normalization → Machine Learning Classification
Due Date: September 2nd, 2025
This project is part of the Clarigo educational video classification system. To contribute:
- Ensure consistent data format in normalized files
- Run the dataset builder before training models
- Update documentation for any new features or improvements
- Test model performance with cross-validation
This project is for educational and research purposes.