An automated system for gathering, processing, and analyzing competitive benchmarking data about Intel and AMD processors from web reviews and YouTube videos using AI/ML techniques.
- 📖 Architecture Guide - System architecture, data flow, and pipeline details
- 🔒 Security Guide - Credential management and security best practices
- 📊 Training Data Summary - Model and dataset information
AIBot automates the process of:
- Searching for Intel and AMD processor reviews on the web and YouTube
- Scraping web pages and transcribing YouTube videos
- Extracting benchmark data from review images using Google Cloud Vision OCR
- Using GPT-4 to summarize benchmarking results
- Identifying and classifying barplot images (benchmark performance charts) using a trained neural network
- Collating benchmark data across multiple sources
- Generating competitive analysis reports in Excel format
- Automated Web Scraping: Searches and scrapes processor reviews from trusted tech websites
- YouTube Processing: Downloads, transcribes, and extracts frames (every 5 seconds) from YouTube review videos
- ML-Based Image Classification: Uses a pre-trained MobileNetV2 neural network to identify benchmark charts (see the sketch after this list)
- OCR: Extracts text and benchmark data from images using Google Cloud Vision API
- AI Summarization: Leverages GPT-4 to intelligently summarize benchmark findings and test conditions
- Data Collation: Aggregates and organizes benchmark data from multiple sources
- Excel Reporting: Generates comprehensive competitive analysis reports
💡 See Architecture Guide for detailed system design and data flow diagrams
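As a hedged illustration of the classification step: the repo ships the model at `data/models/barplot_prediction.hdf5`, but the 224x224 input size, MobileNetV2 preprocessing, and single sigmoid output below are assumptions based on standard MobileNetV2 usage, not confirmed details of this model.

```python
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image as keras_image
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

# Load the bundled classifier (assumed to be a Keras HDF5 model).
model = load_model("data/models/barplot_prediction.hdf5")

def is_barplot(path: str, threshold: float = 0.5) -> bool:
    """Return True if the image looks like a benchmark bar chart."""
    # 224x224 is the standard MobileNetV2 input size (an assumption here).
    img = keras_image.load_img(path, target_size=(224, 224))
    batch = preprocess_input(np.expand_dims(keras_image.img_to_array(img), 0))
    # Assumes a single sigmoid output giving P(barplot).
    return float(model.predict(batch, verbose=0)[0][0]) >= threshold
```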
- Python 3.9.18
- Machine Learning: TensorFlow 2.11, PyTorch 1.13
- AI Services: OpenAI GPT-3.5/4, Google Cloud Vision & Speech APIs
- Web Scraping: Selenium, BeautifulSoup, Requests
- Video Processing: Pytube, MoviePy
- Data Processing: Pandas, NumPy, SciPy, Scikit-learn
- Image Processing: OpenCV, ImageIO, EasyOCR
```
aibot/
├── src/
│   └── aibot/                        # Main package
│       ├── aibot_main.py             # Main entry point
│       ├── web_search.py             # Web search functionality
│       ├── video_search.py           # YouTube search
│       ├── scrape_web.py             # Web scraping
│       ├── process_videos.py         # Video processing
│       ├── convert_image_to_text.py  # OCR functionality
│       ├── convert_speech_to_text.py # Audio transcription
│       ├── gpt_wrapper_web.py        # GPT integration for web
│       ├── gpt_wrapper_video.py      # GPT integration for videos
│       └── [30+ other modules]
├── data/
│   ├── models/
│   │   └── barplot_prediction.hdf5   # Pre-trained ML model
│   ├── test_data/
│   │   ├── images/                   # Test benchmark images
│   │   ├── video_frames/             # Test video frames
│   │   └── audio/                    # Test audio transcripts
│   └── model_performance/            # Training metrics & charts
├── tests/                            # Test files
├── scripts/                          # Utility scripts
├── docs/                             # Documentation
├── config.yml                        # Application configuration
├── requirements.txt                  # Python dependencies
├── environment.yml                   # Conda environment
├── setup.py                          # Package setup
├── .env.example                      # Environment variables template
├── .gitignore                        # Git ignore rules
└── README.md                         # This file
```
- Python 3.9.18 or higher
- Conda (recommended) or pip
- Google Cloud account with Vision and Speech APIs enabled
- OpenAI API account
- YouTube Data API key
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd aibot
   ```

2. Create and activate a virtual environment

   Using Conda (recommended):

   ```bash
   conda env create -f environment.yml
   conda activate aibot
   ```

   Using pip:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. Configure credentials (IMPORTANT - see the Security section below)

   Copy the example environment file:

   ```bash
   cp .env.example .env
   ```

   Edit `.env` and add your API keys:

   ```
   OPENAI_API_KEY=your_openai_api_key_here
   GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service-account-key.json
   YOUTUBE_API_KEY=your_youtube_api_key_here
   ```

4. Configure application settings

   Edit `config.yml` to customize:
   - Processors to compare
   - Preferred review sources
   - GPT models to use
   - Timeout settings
   - Output directories

5. Install the package

   ```bash
   pip install -e .
   ```
Run the complete pipeline:

```bash
aibot --xls-name CompetitiveAnalysis.xlsx --num-benchmarks 50
```

- `--xls-name`: Name of the output Excel file (default: `CompetitiveAnalysisSummary.xlsx`)
- `--num-benchmarks`: Maximum number of benchmarks to process (default: 50)
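For orientation, a minimal sketch of how these flags could be parsed with `argparse`; the actual parser in `aibot_main.py` may differ:

```python
import argparse

# Hypothetical reconstruction of the CLI surface described above.
parser = argparse.ArgumentParser(prog="aibot")
parser.add_argument("--xls-name", default="CompetitiveAnalysisSummary.xlsx",
                    help="name of the output Excel file")
parser.add_argument("--num-benchmarks", type=int, default=50,
                    help="maximum number of benchmarks to process")
args = parser.parse_args()
print(args.xls_name, args.num_benchmarks)
```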
The bot executes the following steps sequentially:

1. Search the web and YouTube for processor reviews
2. Prioritize preferred domains and channels
3. Scrape web content (15-minute timeout per URL)
4. Download and transcribe YouTube videos
5. Run the ML model to identify benchmark charts
6. Extract text from images using OCR
7. Summarize benchmarks using GPT-4
8. Collate and deduplicate results
9. Generate the Excel report
10. Clean up temporary files
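During step 4, frames are sampled from each video every 5 seconds (per the feature list above). Here is a minimal sketch with OpenCV; whether the project uses OpenCV or MoviePy for this particular step is an assumption:

```python
import cv2  # OpenCV, listed in the stack above

def extract_frames(video_path: str, out_dir: str, every_s: float = 5.0) -> int:
    """Save one frame every `every_s` seconds; returns the number saved."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(1, round(fps * every_s))
    saved = index = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        if index % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{index:06d}.png", frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```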
Edit `config.yml` to specify which processors to compare:

```yaml
processors:
  intel:
    - Core i9-14900K
    - Core i7-13700K
  amd:
    - Ryzen 9 7950X
    - Ryzen 7 7800X3D
```

Customize which websites and YouTube channels to prioritize:

```yaml
preferred_domains:
  - anandtech
  - tomshardware
  - pcmag
preferred_channels:
  - Dave2D
  - Hardware Canucks
```

Configure which GPT models to use for different tasks:

```yaml
gpt_models:
  video_description_summary: gpt-3.5-turbo-0125
  test_conditions: gpt-4
  benchmark_summarization: gpt-4
```
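A hedged sketch of how this task-to-model mapping might be consumed, assuming the current openai-python client (the project, built against GPT-3.5/4, may use an older interface):

```python
import os
import yaml
from openai import OpenAI  # assumes openai>=1.0; an assumption about this repo

# Load the task -> model mapping shown in the config block above.
with open("config.yml") as fh:
    gpt_models = yaml.safe_load(fh)["gpt_models"]

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def summarize_benchmarks(raw_text: str) -> str:
    """Summarize OCR'd benchmark text with the model configured for this task."""
    response = client.chat.completions.create(
        model=gpt_models["benchmark_summarization"],  # gpt-4 per the config above
        messages=[
            {"role": "system", "content": "Summarize CPU benchmark results."},
            {"role": "user", "content": raw_text},
        ],
    )
    return response.choices[0].message.content
```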
**NEVER commit API keys or credentials to version control.**

This project requires three types of credentials:
- OpenAI API Key: Get from https://platform.openai.com/api-keys
- Google Cloud Service Account: Create at https://console.cloud.google.com/iam-admin/serviceaccounts
- YouTube Data API Key: Get from https://console.cloud.google.com/apis/credentials
All credentials should be stored in the `.env` file (which is ignored by git) or as environment variables.
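One common way to load them at startup is with python-dotenv; note that python-dotenv is not listed in the stack above, so its use here is an assumption:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the project root into os.environ

openai_key = os.environ["OPENAI_API_KEY"]
vision_credentials = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
youtube_key = os.environ["YOUTUBE_API_KEY"]
```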
IMPORTANT: If you cloned this from a public repository that previously contained exposed credentials:
- The old API keys may have been compromised and should NOT be used
- Generate new API keys from scratch
- Review your API usage logs for any suspicious activity
- Enable API key restrictions and quotas in the respective cloud consoles
See SECURITY.md for detailed credential rotation instructions.
The tool generates an Excel file containing:
- Processor comparison data
- Benchmark results across different tests
- Source information (URLs, publication dates)
- Test conditions and configurations
- Performance metrics and analysis
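For a sense of the shape of that file, here is a minimal sketch with pandas; the column names and values are illustrative, not the tool's actual schema:

```python
import pandas as pd  # writing .xlsx also requires an engine such as openpyxl

# Illustrative rows only; the real pipeline collates these from many sources.
rows = [
    {"processor": "Core i9-14900K", "benchmark": "Cinebench R23 (multi-core)",
     "score": 38500, "source_url": "https://example.com/review",
     "published": "2024-01-15", "test_conditions": "DDR5-6000, stock limits"},
]
pd.DataFrame(rows).to_excel("CompetitiveAnalysisSummary.xlsx", index=False)
```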
Tests are located in the `tests/` directory:

```bash
# Run all tests
python -m pytest tests/

# Run specific test
python tests/test_search_web.py
```

Common issues:

1. Import errors
   - Ensure you've activated the virtual environment
   - Reinstall dependencies: `pip install -r requirements.txt`
   - Install the package in development mode: `pip install -e .`

2. API authentication errors
   - Verify your `.env` file exists and contains valid credentials
   - Check that `GOOGLE_APPLICATION_CREDENTIALS` points to a valid JSON file
   - Ensure API keys haven't expired

3. Timeout errors during web scraping
   - Some sites may be slow or may block automated requests
   - Adjust timeout values in `config.yml`
   - Check `robots.txt` compliance for the target site

4. Memory issues
   - Reduce `max_benchmarks` in the config
   - Process fewer videos simultaneously
   - Close other memory-intensive applications

5. Model prediction errors
   - Verify `barplot_prediction.hdf5` exists in `data/models/`
   - Check the TensorFlow installation: `python -c "import tensorflow; print(tensorflow.__version__)"`
- Processing Time: Full pipeline can take 2-6 hours depending on the number of sources
- Memory Usage: Expect 4-8 GB RAM usage during video processing
- API Costs: GPT-4 API calls can incur significant costs; monitor usage
- Rate Limits: Respect API rate limits; the tool includes delays between calls
- Entry Point: `src/aibot/aibot_main.py`
- Search Modules: `src/aibot/web_search.py`, `src/aibot/video_search.py`
- Scraping: `src/aibot/scrape_web.py`, `src/aibot/get_relevant_webpages.py`
- Media Processing: `src/aibot/extract_frames_from_video.py`, `src/aibot/convert_speech_to_text.py`
- ML/AI: `src/aibot/convert_image_to_text.py`, `src/aibot/barplot_prediction_after_training.py`
- GPT Integration: `src/aibot/gpt_wrapper_web.py`, `src/aibot/gpt_wrapper_video.py`
- Output: `src/aibot/final_spreadsheet_write.py`
- Trained Model: `data/models/barplot_prediction.hdf5`
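The OCR step presumably resembles the standard Cloud Vision text-detection call; a minimal sketch follows (how `convert_image_to_text.py` actually wraps it is not shown here):

```python
from google.cloud import vision

# The client picks up GOOGLE_APPLICATION_CREDENTIALS from the environment.
client = vision.ImageAnnotatorClient()

def ocr_image(path: str) -> str:
    """Return all text Cloud Vision finds in a benchmark chart image."""
    with open(path, "rb") as fh:
        image = vision.Image(content=fh.read())
    response = client.text_detection(image=image)
    annotations = response.text_annotations
    # The first annotation aggregates the full detected text.
    return annotations[0].description if annotations else ""
```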
To add new processors:
- Edit `config.yml` to add processor names
- No code changes required: the tool uses dynamic search

To add new review sources:
- Add domain/channel names to `config.yml`
- Sources are automatically prioritized by the tool
- Platform-Specific: Signal-based timeouts work on Unix/macOS only, not Windows (see the sketch after this list)
- Sequential Processing: No parallel processing of multiple sources
- English Only: OCR and text processing optimized for English content
- API Dependencies: Requires active API subscriptions and internet connection
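To illustrate the first limitation, here is a sketch of the Unix-only timeout pattern; the helper name and the use of `urllib` are illustrative, and the repo's actual scraping code may differ:

```python
import signal
import urllib.request

class ScrapeTimeout(Exception):
    """Raised when a single URL exceeds its scraping budget."""

def _on_alarm(signum, frame):
    raise ScrapeTimeout("scraping exceeded the configured timeout")

def fetch_with_timeout(url: str, seconds: int = 15 * 60) -> str:
    """Fetch a page, aborting after `seconds` (15-minute default, as above)."""
    # signal.SIGALRM exists only on Unix-like systems, which is why this
    # timeout style cannot work on Windows.
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="replace")
    finally:
        signal.alarm(0)  # always cancel the pending alarm
```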
[Specify your license here]
[Add contribution guidelines]
For issues or questions:
- Check the Troubleshooting section above
- Review logs in the `logs/` directory
- Open an issue on GitHub
- Initial release
- Automated web and video scraping
- ML-based benchmark chart detection
- GPT-4 integration for summarization
- Excel report generation
- Google Cloud Vision & Speech APIs for OCR and transcription
- OpenAI for GPT models
- TensorFlow/PyTorch for ML capabilities
- Selenium and BeautifulSoup for web scraping