Clarigo - Educational Video Classification

A machine learning project that classifies YouTube videos as educational or non-educational from video metadata using Logistic Regression. The predictions are then used to improve efficiency on YouTube by blocking non-educational videos.

Project Overview

This project implements a complete ML pipeline to automatically classify YouTube videos into educational (1) vs non-educational (0) categories. The classification is based on video titles, channel names, and other metadata features extracted from YouTube video data.

Dataset Statistics

Total Videos: 3,024 (after removing missing labels)
Educational Videos: 1,170 (38.7%)
Non-Educational Videos: 1,654 (54.7%)
Features Used: Video title, channel name, video URL, and engineered numerical features

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Train and Evaluate Model

jupyter notebook model/educational_video_classification.ipynb

Project Structure

Clarigo/
├── data/
│   ├── normalized/           # Cleaned JSONL files with labels
│   ├── processed_data/       # Master dataset (CSV & JSONL)
│   └── labeled_data/         # Original labeled data
├── preprocessing/            # Data processing scripts
├── model/
│   ├── dataset_builder.py        # Combines normalized files into master dataset
│   └── educational_video_classification.ipynb  # Main ML notebook
├── requirements.txt          # Python dependencies
└── README.md                # This file

Feature Engineering

The model uses both text and numerical features:

Text Features (TF-IDF)

Combined Text: Video title + channel name (preprocessed)
TF-IDF Parameters: 5,000 max features, unigrams and bigrams (n-gram range 1-2), English stop words removed
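
As a rough illustration of the text features above, the listed parameters map onto scikit-learn's TfidfVectorizer as follows. This is a minimal sketch, not the notebook's code: the combine_text helper and its lowercasing step are assumptions standing in for the unspecified preprocessing, and the two sample videos are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def combine_text(title, channel):
    """Join title and channel name into one string (assumed preprocessing)."""
    return f"{title} {channel}".lower().strip()


# Parameters taken from the list above: 5,000 features, 1-2 grams, English stop words.
vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),
    stop_words="english",
)

# Illustrative sample rows, not taken from the dataset.
videos = [
    ("Python Programming Tutorial for Beginners", "CodeAcademy"),
    ("Funny Cat Compilation 2024", "GamerHub"),
]
combined = [combine_text(title, channel) for title, channel in videos]

# Sparse matrix with one row per video and one column per unigram or bigram.
X_text = vectorizer.fit_transform(combined)
print(X_text.shape)
```

In practice the vectorizer would be fitted on the training split only and reused, unchanged, for validation and prediction.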

Numerical Features

Title length (characters)
Title word count
Channel word count
Presence of numbers in title
Uppercase ratio in title
Educational keywords count

Educational Keywords

The model recognizes educational content through keywords like: tutorial, learn, education, course, lesson, teach, training, study, guide, how to, explained, basics, fundamentals, programming, coding, math, science, etc.
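
The numerical features and the keyword count could be computed along these lines. This is a sketch under stated assumptions: the function name numerical_features and the dictionary keys are hypothetical, the keyword list contains only the keywords named above (the original list continues with "etc."), keywords are matched as substrings of the lowercased title, and the uppercase ratio is taken over alphabetic characters only.

```python
import re

# Only the keywords named in this README; the project's full list may be longer.
EDUCATIONAL_KEYWORDS = [
    "tutorial", "learn", "education", "course", "lesson", "teach",
    "training", "study", "guide", "how to", "explained", "basics",
    "fundamentals", "programming", "coding", "math", "science",
]


def numerical_features(title, channel):
    """Compute the six numerical features for one video."""
    letters = [ch for ch in title if ch.isalpha()]
    lowered = title.lower()
    return {
        "title_length": len(title),                        # characters
        "title_word_count": len(title.split()),
        "channel_word_count": len(channel.split()),
        "has_number": int(bool(re.search(r"\d", title))),  # 1 if any digit appears
        # Assumption: share of uppercase letters among alphabetic characters.
        "uppercase_ratio": (
            sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
        ),
        # Assumption: number of distinct keywords found as substrings.
        "keyword_count": sum(kw in lowered for kw in EDUCATIONAL_KEYWORDS),
    }


features = numerical_features("Python Programming Tutorial for Beginners", "CodeAcademy")
print(features)
```

Substring matching means "learn" also matches "learning"; a word-boundary regex would be stricter if that turns out to be too permissive.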

Model Performance

The final Logistic Regression model achieves:

Accuracy: ~87-90% (varies across cross-validation folds)
ROC-AUC: ~0.93-0.95
Cross-validation: 5-fold CV with stratified sampling
Hyperparameter Tuning: GridSearchCV optimized for ROC-AUC

Best Parameters (typical)

C: 10.0 (inverse regularization strength; larger values mean weaker regularization)
penalty: 'l2'
solver: 'liblinear'
class_weight: 'balanced' (handles class imbalance)
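
The tuning setup described above (5-fold stratified CV, GridSearchCV scored on ROC-AUC) might look roughly like this. It is a sketch only: the candidate values in the grid, the build_search helper, and the synthetic data from make_classification are assumptions, since the README reports only the typical winning parameters.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def build_search(random_state=42):
    """Grid search over C and penalty, scored by ROC-AUC on stratified folds."""
    # Hypothetical grid; the actual candidate values are not given in this README.
    param_grid = {"C": [0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}
    base_model = LogisticRegression(
        solver="liblinear",       # supports both l1 and l2 penalties
        class_weight="balanced",  # reweights classes by inverse frequency
        max_iter=1000,
    )
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    return GridSearchCV(base_model, param_grid, scoring="roc_auc", cv=cv)


# Synthetic stand-in for the real feature matrix, with a mild class imbalance.
X, y = make_classification(
    n_samples=200, n_features=20, weights=[0.55, 0.45], random_state=0
)
search = build_search().fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

After fitting, search.best_estimator_ is the model refit on all of the training data with the selected parameters.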

Usage Examples

Predicting New Videos

import joblib

# Load the trained model and preprocessing pipelines saved after training
model = joblib.load('models/logistic_regression_model.pkl')
text_pipeline = joblib.load('models/text_preprocessing_pipeline.pkl')
numerical_pipeline = joblib.load('models/numerical_preprocessing_pipeline.pkl')

# Predict a new video
title = "Python Programming Tutorial for Beginners"
channel = "CodeAcademy"

# NOTE: predict_video_educational is not defined in this snippet;
# it must be defined or imported before this call.
prediction, probability = predict_video_educational(title, channel, model, text_pipeline, numerical_pipeline)
print(f"Prediction: {'Educational' if prediction == 1 else 'Non-Educational'}")
# probability[1] is the predicted probability of the educational class (label 1)
print(f"Confidence: {probability[1]:.3f}")
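
The usage example calls predict_video_educational without showing its body. One plausible shape for such a helper is sketched here; it is not the project's implementation. It assumes text_pipeline.transform accepts a list of combined "title channel" strings and that numerical_pipeline.transform accepts raw (title, channel) pairs and returns a numeric array. The stand-in components built at the bottom (including the _title_and_channel_sizes helper and the four training rows) are made up so the sketch runs end to end.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import FunctionTransformer


def predict_video_educational(title, channel, model, text_pipeline, numerical_pipeline):
    """Return (predicted label, class probabilities) for a single video."""
    X_text = text_pipeline.transform([f"{title} {channel}"])
    raw = np.array([[title, channel]], dtype=object)
    X_num = csr_matrix(np.asarray(numerical_pipeline.transform(raw), dtype=float))
    # Text and numerical columns must be stacked in the same order as in training.
    X = hstack([X_text, X_num]).tocsr()
    prediction = int(model.predict(X)[0])
    probability = model.predict_proba(X)[0]  # [P(label 0), P(label 1)]
    return prediction, probability


def _title_and_channel_sizes(rows):
    """Stand-in numeric features: title length and channel word count."""
    return np.array([[len(t), len(c.split())] for t, c in rows], dtype=float)


# Stand-in training data and components (not the saved project artifacts).
train_rows = [
    ("Python tutorial for beginners", "CodeSchool", 1),
    ("Learn calculus basics explained", "MathLessons", 1),
    ("Funny cat compilation", "DailyLaughs", 0),
    ("Epic gaming fails montage", "GamerHub", 0),
]
texts = [f"{t} {c}" for t, c, _ in train_rows]
raw_train = np.array([[t, c] for t, c, _ in train_rows], dtype=object)
labels = [label for _, _, label in train_rows]

text_pipeline = TfidfVectorizer().fit(texts)
numerical_pipeline = FunctionTransformer(_title_and_channel_sizes).fit(raw_train)
X_train = hstack(
    [text_pipeline.transform(texts), csr_matrix(numerical_pipeline.transform(raw_train))]
).tocsr()
model = LogisticRegression(solver="liblinear").fit(X_train, labels)

prediction, probability = predict_video_educational(
    "Python Programming Tutorial for Beginners", "CodeAcademy",
    model, text_pipeline, numerical_pipeline,
)
print(prediction, probability)
```

If the saved numerical pipeline instead expects precomputed feature columns, the raw array would be replaced by those features before the transform call.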

Key Insights

Text Features Dominate: TF-IDF features from titles and channel names are the strongest predictors
Educational Keywords: Words like "tutorial", "learn", "python", "explained" strongly indicate educational content
Channel Context: Channel names provide valuable context (e.g., "CodeAcademy" vs "GamerHub")
Title Patterns: Educational videos often have longer, more descriptive titles
Class Imbalance: The dataset is slightly imbalanced (54.7% non-educational), handled with balanced class weights
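
For reference, scikit-learn's class_weight='balanced' setting weights each class by n_samples / (n_classes * class_count). The sketch below reproduces that arithmetic with the two class counts from Dataset Statistics; the helper name balanced_class_weights is hypothetical, and only the two labeled counts are used as the sample total.

```python
def balanced_class_weights(counts):
    """Apply the 'balanced' heuristic: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# 0 = non-educational, 1 = educational (counts from Dataset Statistics).
weights = balanced_class_weights({0: 1654, 1: 1170})
for label, weight in sorted(weights.items()):
    print(label, round(weight, 3))
```

The minority (educational) class receives the larger weight, so its misclassifications contribute more to the training loss.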

Recommendations for Improvement

More Data: Collect additional labeled videos to improve performance
Enhanced Features: Include video duration, view count, description text
Advanced Models: Try Random Forest, XGBoost, or Neural Networks
Better Text Processing: Use pre-trained embeddings (Word2Vec, BERT)
Active Learning: Implement uncertainty sampling for efficient labeling
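
To make the active learning suggestion concrete, uncertainty sampling for a binary classifier can be as simple as sending the videos whose predicted probability is closest to 0.5 to a human labeler first. This is a minimal sketch; the function name most_uncertain and the sample probabilities are illustrative only.

```python
def most_uncertain(probabilities, k):
    """Return indices of the k items whose P(educational) is closest to 0.5."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


# Hypothetical predicted probabilities for five unlabeled videos.
probs = [0.97, 0.52, 0.10, 0.45, 0.80]
to_label = most_uncertain(probs, 2)
print(to_label)  # [1, 3]
```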

Files Description

model/dataset_builder.py: Combines all normalized JSONL files into a master CSV dataset
model/educational_video_classification.ipynb: Complete ML pipeline with EDA, feature engineering, training, and evaluation
data/normalized/: Clean, consistent JSONL files with video metadata and labels
data/processed_data/master_dataset.csv: Final combined dataset ready for ML
models/: Saved trained models and preprocessing pipelines (created after training)
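
As a rough picture of what the dataset builder step does, the sketch below merges JSONL files into one CSV using only the standard library. It is not the contents of model/dataset_builder.py: the function name build_master_dataset, the "label" field name, and the rule of skipping records with a null or missing label are assumptions (the last one based on the "after removing missing labels" note in Dataset Statistics).

```python
import csv
import json
from pathlib import Path


def build_master_dataset(normalized_dir, output_csv, label_field="label"):
    """Merge every *.jsonl file in normalized_dir into one CSV; return the row count."""
    records = []
    for path in sorted(Path(normalized_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # ignore blank lines
                record = json.loads(line)
                if record.get(label_field) is None:
                    continue  # assumed rule: drop unlabeled records
                records.append(record)

    # Use the union of all keys so files with extra fields still fit in one table.
    fieldnames = sorted({key for record in records for key in record})
    output_csv = Path(output_csv)
    output_csv.parent.mkdir(parents=True, exist_ok=True)
    with output_csv.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing keys are written as empty cells
    return len(records)


# Example call with the paths from this README:
# build_master_dataset("data/normalized", "data/processed_data/master_dataset.csv")
```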

Pipeline Overview

Current Pipeline: YouTube Metadata web scraper → Primary cleaning → Human labeling → Data normalization → Machine Learning Classification

Due Date: September 2nd, 2025

Contributing

This project is part of the Clarigo educational video classification system. To contribute:

Ensure consistent data format in normalized files
Run the dataset builder before training models
Update documentation for any new features or improvements
Test model performance with cross-validation

License

This project is for educational and research purposes.
