A news processing pipeline that combines content extraction, abstractive summarization, topic classification across 8 categories, and interactive visualizations to analyze news from multiple sources.
- Interactive charts and graphs using Plotly
- Heatmap visualizations for topic analysis
- Word cloud generation for keyword extraction
- Real-time analytics dashboard
- Pegasus-XSum: Advanced abstractive summarization
- T5-Base Fine-tuned: News title classification
- SpaCy: Named Entity Recognition (NER)
- NLTK: Text preprocessing and analysis
- NetworkX: Graph-based content analysis
- Multi-source news aggregation (NewsAPI, RSS feeds)
- Full content extraction with fallback mechanisms
- Topic classification across 8 categories
- Multi-document summarization
- Sentiment analysis and trend detection
- Business: Market trends, financial insights
- Entertainment: Media, arts, cultural analysis
- Health: Medical, wellness insights
- Science: Research, technological advances
- Technology: AI, innovation, digital trends
- Politics: Government, policy analysis
- Sports: Athletic events, team analysis
- World: International affairs, global news
- Python 3.8+: Main programming language
- Transformers (Hugging Face): State-of-the-art NLP models
- PyTorch: Deep learning framework
- Pandas & NumPy: Data manipulation and analysis
- Plotly: Interactive visualizations
- NetworkX: Graph analysis and visualization
- google/pegasus-xsum (2.1GB)
  - Abstractive text summarization
  - Optimized for news content
  - Generates concise, informative summaries
- mrm8488/t5-base-finetuned-news-title-classification (850MB)
  - Fine-tuned T5 model for news categorization
  - 8-category classification system
  - High accuracy on news title classification
- en_core_web_sm (SpaCy)
  - Named Entity Recognition
  - Part-of-speech tagging
  - Dependency parsing
- HTML5/CSS3: Modern, responsive interface
- JavaScript (ES6+): Interactive functionality
- Font Awesome: Icon library
- Google Fonts: Typography
- Python 3.8 or higher
- Git
- 4GB+ RAM (for model loading)
- 5GB+ free disk space
git clone https://github.com/yourusername/News-Extractor-Summarizer.git
cd News-Extractor-Summarizer
# Windows
python -m venv venv
venv\Scripts\activate
# macOS/Linux
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Create a .env file in the root directory:
# Required API Keys
NEWS_API=your_news_api_key_here
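As a rough illustration of how the key in .env might be read at startup, here is a minimal standard-library sketch. The helper names `load_env_file` and `get_news_api_key` are hypothetical and not taken from this repository; the project itself may rely on a package such as python-dotenv instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file, skipping blanks and comments.

    Hypothetical helper for illustration; not part of this project's code.
    """
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_news_api_key(path=".env"):
    """Return NEWS_API from the environment, falling back to the .env file."""
    key = os.environ.get("NEWS_API") or load_env_file(path).get("NEWS_API")
    if not key:
        raise RuntimeError("NEWS_API is not set; add it to .env or export it")
    return key
```

Failing early with a clear message when the key is missing tends to be easier to debug than an authentication error from the news API later in the run.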
The models will be automatically downloaded on first run:
python main.py
# Create cache directory
mkdir -p cache_dir/transformers
# Download Pegasus-XSum model
python -c "from transformers import PegasusForConditionalGeneration, AutoTokenizer; PegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum', cache_dir='cache_dir/transformers'); AutoTokenizer.from_pretrained('google/pegasus-xsum', cache_dir='cache_dir/transformers')"
# Download T5-Base Fine-tuned model
python -c "from transformers import T5ForConditionalGeneration, AutoTokenizer; T5ForConditionalGeneration.from_pretrained('mrm8488/t5-base-finetuned-news-title-classification', cache_dir='cache_dir/transformers'); AutoTokenizer.from_pretrained('mrm8488/t5-base-finetuned-news-title-classification', cache_dir='cache_dir/transformers')"
# Download SpaCy model
python -m spacy download en_core_web_sm
# Run the complete pipeline
python main.py
# Or run individual components
python run.py
Open app/index.html in your web browser, or serve it using a local server:
# Using Python
python -m http.server 8000
# Using Node.js (if installed)
npx serve app
Models are configured in src/utils/config.py:
# Summarization Model
SUMMARIZATION_CONFIG = {
    'model_name': 'google/pegasus-xsum',
    'max_length': 150,
    'num_beams': 4,
    'temperature': 1.0
}
# Topic Classification Model
TOPIC_MODELING_CONFIG = {
    'model_path': 'cache_dir/transformers/mrm8488/t5-base-finetuned-news-title-classification',
    'batch_size': 10
}
# News API Settings
NEWS_API_CONFIG = {
    'query': 'business economy finance market...',
    'language': 'en',
    'sort_by': 'popularity',
    'page_size': 100
}
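To show how the News API settings above could translate into a request, the following sketch maps the config keys onto NewsAPI's query parameters (`q`, `language`, `sortBy`, `pageSize`, `apiKey`). It assumes the `/v2/everything` endpoint; which endpoint the crawler actually calls is not stated here, and `build_newsapi_url` is a hypothetical helper, not a function from this repository.

```python
from urllib.parse import urlencode

# Assumption: the crawler queries NewsAPI's "everything" endpoint.
NEWSAPI_EVERYTHING_URL = "https://newsapi.org/v2/everything"


def build_newsapi_url(config, api_key):
    """Build a request URL from a NEWS_API_CONFIG-style dict (hypothetical helper)."""
    params = {
        "q": config["query"],
        "language": config["language"],
        "sortBy": config["sort_by"],      # snake_case config key -> camelCase API parameter
        "pageSize": config["page_size"],
        "apiKey": api_key,
    }
    return f"{NEWSAPI_EVERYTHING_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder query and key for illustration only.
    demo_config = {"query": "economy", "language": "en",
                   "sort_by": "popularity", "page_size": 100}
    print(build_newsapi_url(demo_config, "your_news_api_key_here"))
```

Keeping the mapping in one function makes it easier to spot a mismatch between the config key names and the parameter names the API expects.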
News-Extractor-Summarizer/
├── app/                 # Frontend web interface
│   ├── index.html       # Main dashboard
│   ├── script.js        # Interactive functionality
│   └── style.css        # Styling
├── src/                 # Backend source code
│   ├── core/            # Core processing modules
│   │   ├── content_extractor.py
│   │   ├── graph_summarizer.py
│   │   ├── news_crawler.py
│   │   ├── summarizer.py
│   │   └── topic_classifier.py
│   ├── pipeline.py      # Main orchestration
│   └── utils/           # Utilities and config
├── cache_dir/           # Model cache (gitignored)
├── dataset/             # Processed data (gitignored)
├── output/              # Generated outputs (gitignored)
├── main.py              # Entry point
├── run.py               # Alternative runner
└── requirements.txt     # Dependencies
from src.pipeline import PipelineOrchestrator
# Initialize and run pipeline
orchestrator = PipelineOrchestrator()
success = orchestrator.run_pipeline()
from src.core.topic_classifier import TopicClassifier
classifier = TopicClassifier()
topic = classifier.classify_text("Your news article text here")
print(f"Classified as: {topic}")
from src.core.summarizer import MultiSummarizer
summarizer = MultiSummarizer()
summary = summarizer.summarize_text("Your long article text here")
print(f"Summary: {summary}")
- Visit NewsAPI.org
- Sign up for a free account
- Get your API key from the dashboard
- Add it to .env: NEWS_API=your_key_here
# Clear cache and retry
rm -rf cache_dir/transformers/*
python main.py
- Reduce batch sizes in config
- Use smaller models for testing
- Ensure sufficient RAM (4GB+ recommended)
- Implement retry mechanisms
- Use multiple API keys
- Respect rate limits in configuration
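The retry suggestion for rate-limited API calls could look roughly like the sketch below, which retries with exponential backoff. `RateLimitError` and `with_retries` are hypothetical names introduced for illustration; the project's crawler may signal rate limiting differently (for example through an HTTP 429 status code).

```python
import time


class RateLimitError(Exception):
    """Hypothetical exception a fetch function raises when the API rate-limits it."""


def with_retries(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying on RateLimitError with exponentially growing delays.

    Delays are base_delay, 2 * base_delay, 4 * base_delay, and so on.
    The final failure is re-raised so the caller can handle it.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting; in normal use the default `time.sleep` applies.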
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Hugging Face for the transformer models
- NewsAPI for news data
- SpaCy for NLP tools
- Plotly for visualizations
Made with ❤️ by Tuhin-SnapD
Transform your news consumption with AI-powered insights and beautiful visualizations.