An automated system for gathering, processing, and analyzing competitive benchmarking data about Intel and AMD processors from web reviews and YouTube videos using AI/ML techniques.
- 📖 Architecture Guide - System architecture, data flow, and pipeline details
- 🔒 Security Guide - Credential management and security best practices
- 📊 Training Data Summary - Model and dataset information
AIBot automates the process of:
- Searching for Intel and AMD processor reviews on the web and YouTube
- Scraping web pages and transcribing YouTube videos
- Extracting benchmark data from review images using Google Cloud Vision OCR
- Using GPT-4 to summarize benchmarking results
- Identifying and classifying barplot images (benchmark performance charts) using a trained neural network
- Collating benchmark data across multiple sources
- Generating competitive analysis reports in Excel format
- Automated Web Scraping: Searches and scrapes processor reviews from trusted tech websites
- YouTube Processing: Downloads, transcribes, and extracts frames (every 5 seconds) from YouTube review videos
- ML-Based Image Classification: Uses a pre-trained MobileNetV2 neural network to identify benchmark charts (see the sketch after this list)
- OCR: Extracts text and benchmark data from images using Google Cloud Vision API
- AI Summarization: Leverages GPT-4 to intelligently summarize benchmark findings and test conditions
- Data Collation: Aggregates and organizes benchmark data from multiple sources
- Excel Reporting: Generates comprehensive competitive analysis reports
💡 See Architecture Guide for detailed system design and data flow diagrams
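As a hedged illustration of the classification step: the repo ships the model at `data/models/barplot_prediction.hdf5`, but the 224x224 input size, MobileNetV2 preprocessing, and single sigmoid output below are assumptions based on standard MobileNetV2 usage, not confirmed details of this model.

```python
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image as keras_image
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

# Load the bundled classifier (assumed to be a Keras HDF5 model).
model = load_model("data/models/barplot_prediction.hdf5")

def is_barplot(path: str, threshold: float = 0.5) -> bool:
    """Return True if the image looks like a benchmark bar chart."""
    # 224x224 is the standard MobileNetV2 input size (an assumption here).
    img = keras_image.load_img(path, target_size=(224, 224))
    batch = preprocess_input(np.expand_dims(keras_image.img_to_array(img), 0))
    # Assumes a single sigmoid output giving P(barplot).
    return float(model.predict(batch, verbose=0)[0][0]) >= threshold
```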
- Python 3.9.18
- Machine Learning: TensorFlow 2.11, PyTorch 1.13
- AI Services: OpenAI GPT-3.5/4, Google Cloud Vision & Speech APIs
- Web Scraping: Selenium, BeautifulSoup, Requests
- Video Processing: Pytube, MoviePy
- Data Processing: Pandas, NumPy, SciPy, Scikit-learn
- Image Processing: OpenCV, ImageIO, EasyOCR
```
aibot/
├── src/
│   └── aibot/                        # Main package
│       ├── aibot_main.py             # Main entry point
│       ├── web_search.py             # Web search functionality
│       ├── video_search.py           # YouTube search
│       ├── scrape_web.py             # Web scraping
│       ├── process_videos.py         # Video processing
│       ├── convert_image_to_text.py  # OCR functionality
│       ├── convert_speech_to_text.py # Audio transcription
│       ├── gpt_wrapper_web.py        # GPT integration for web
│       ├── gpt_wrapper_video.py      # GPT integration for videos
│       └── [30+ other modules]
├── data/
│   ├── models/
│   │   └── barplot_prediction.hdf5   # Pre-trained ML model
│   ├── test_data/
│   │   ├── images/                   # Test benchmark images
│   │   ├── video_frames/             # Test video frames
│   │   └── audio/                    # Test audio transcripts
│   └── model_performance/            # Training metrics & charts
├── tests/                            # Test files
├── scripts/                          # Utility scripts
├── docs/                             # Documentation
├── config.yml                        # Application configuration
├── requirements.txt                  # Python dependencies
├── environment.yml                   # Conda environment
├── setup.py                          # Package setup
├── .env.example                      # Environment variables template
├── .gitignore                        # Git ignore rules
└── README.md                         # This file
```
- Python 3.9.18 or higher
- Conda (recommended) or pip
- Google Cloud account with Vision and Speech APIs enabled
- OpenAI API account
- YouTube Data API key
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd aibot
   ```

2. Create and activate a virtual environment

   Using Conda (recommended):

   ```bash
   conda env create -f environment.yml
   conda activate aibot
   ```

   Using pip:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. Configure credentials (IMPORTANT - see the Security section below)

   Copy the example environment file:

   ```bash
   cp .env.example .env
   ```

   Edit `.env` and add your API keys:

   ```
   OPENAI_API_KEY=your_openai_api_key_here
   GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service-account-key.json
   YOUTUBE_API_KEY=your_youtube_api_key_here
   ```

4. Configure application settings

   Edit `config.yml` to customize:
   - Processors to compare
   - Preferred review sources
   - GPT models to use
   - Timeout settings
   - Output directories

5. Install the package

   ```bash
   pip install -e .
   ```
Run the complete pipeline:

```bash
aibot --xls-name CompetitiveAnalysis.xlsx --num-benchmarks 50
```

- `--xls-name`: Name of the output Excel file (default: `CompetitiveAnalysisSummary.xlsx`)
- `--num-benchmarks`: Maximum number of benchmarks to process (default: 50)
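For orientation, a minimal sketch of how these flags could be parsed with `argparse`; the actual parser in `aibot_main.py` may differ:

```python
import argparse

# Hypothetical reconstruction of the CLI surface described above.
parser = argparse.ArgumentParser(prog="aibot")
parser.add_argument("--xls-name", default="CompetitiveAnalysisSummary.xlsx",
                    help="name of the output Excel file")
parser.add_argument("--num-benchmarks", type=int, default=50,
                    help="maximum number of benchmarks to process")
args = parser.parse_args()
print(args.xls_name, args.num_benchmarks)
```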
The bot executes the following steps sequentially:

1. Search the web and YouTube for processor reviews
2. Prioritize preferred domains and channels
3. Scrape web content (15-minute timeout per URL)
4. Download and transcribe YouTube videos
5. Run the ML model to identify benchmark charts
6. Extract text from images using OCR
7. Summarize benchmarks using GPT-4
8. Collate and deduplicate results
9. Generate the Excel report
10. Clean up temporary files
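During step 4, frames are sampled from each video every 5 seconds (per the feature list above). Here is a minimal sketch with OpenCV; whether the project uses OpenCV or MoviePy for this particular step is an assumption:

```python
import cv2  # OpenCV, listed in the stack above

def extract_frames(video_path: str, out_dir: str, every_s: float = 5.0) -> int:
    """Save one frame every `every_s` seconds; returns the number saved."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(1, round(fps * every_s))
    saved = index = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        if index % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{index:06d}.png", frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```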
Edit `config.yml` to specify which processors to compare:

```yaml
processors:
  intel:
    - Core i9-14900K
    - Core i7-13700K
  amd:
    - Ryzen 9 7950X
    - Ryzen 7 7800X3D
```

Customize which websites and YouTube channels to prioritize:

```yaml
preferred_domains:
  - anandtech
  - tomshardware
  - pcmag
preferred_channels:
  - Dave2D
  - Hardware Canucks
```

Configure which GPT models to use for different tasks:

```yaml
gpt_models:
  video_description_summary: gpt-3.5-turbo-0125
  test_conditions: gpt-4
  benchmark_summarization: gpt-4
```
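A hedged sketch of how this task-to-model mapping might be consumed, assuming the current openai-python client (the project, built against GPT-3.5/4, may use an older interface):

```python
import os
import yaml
from openai import OpenAI  # assumes openai>=1.0; an assumption about this repo

# Load the task -> model mapping shown in the config block above.
with open("config.yml") as fh:
    gpt_models = yaml.safe_load(fh)["gpt_models"]

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def summarize_benchmarks(raw_text: str) -> str:
    """Summarize OCR'd benchmark text with the model configured for this task."""
    response = client.chat.completions.create(
        model=gpt_models["benchmark_summarization"],  # gpt-4 per the config above
        messages=[
            {"role": "system", "content": "Summarize CPU benchmark results."},
            {"role": "user", "content": raw_text},
        ],
    )
    return response.choices[0].message.content
```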
**NEVER commit API keys or credentials to version control.**

This project requires three types of credentials:
- OpenAI API Key: Get from https://platform.openai.com/api-keys
- Google Cloud Service Account: Create at https://console.cloud.google.com/iam-admin/serviceaccounts
- YouTube Data API Key: Get from https://console.cloud.google.com/apis/credentials
All credentials should be stored in the `.env` file (which is ignored by git) or as environment variables.
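One common way to load them at startup is with python-dotenv; note that python-dotenv is not listed in the stack above, so its use here is an assumption:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the project root into os.environ

openai_key = os.environ["OPENAI_API_KEY"]
vision_credentials = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
youtube_key = os.environ["YOUTUBE_API_KEY"]
```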
IMPORTANT: If you cloned this from a public repository that previously contained exposed credentials:
- The old API keys may have been compromised and should NOT be used
- Generate new API keys from scratch
- Review your API usage logs for any suspicious activity
- Enable API key restrictions and quotas in the respective cloud consoles
See SECURITY.md for detailed credential rotation instructions.
The tool generates an Excel file containing:
- Processor comparison data
- Benchmark results across different tests
- Source information (URLs, publication dates)
- Test conditions and configurations
- Performance metrics and analysis
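For a sense of the shape of that file, here is a minimal sketch with pandas; the column names and values are illustrative, not the tool's actual schema:

```python
import pandas as pd  # writing .xlsx also requires an engine such as openpyxl

# Illustrative rows only; the real pipeline collates these from many sources.
rows = [
    {"processor": "Core i9-14900K", "benchmark": "Cinebench R23 (multi-core)",
     "score": 38500, "source_url": "https://example.com/review",
     "published": "2024-01-15", "test_conditions": "DDR5-6000, stock limits"},
]
pd.DataFrame(rows).to_excel("CompetitiveAnalysisSummary.xlsx", index=False)
```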
Tests are located in the `tests/` directory:

```bash
# Run all tests
python -m pytest tests/

# Run specific test
python tests/test_search_web.py
```

Common issues:

1. Import errors
   - Ensure you've activated the virtual environment
   - Reinstall dependencies: `pip install -r requirements.txt`
   - Install the package in development mode: `pip install -e .`

2. API authentication errors
   - Verify your `.env` file exists and contains valid credentials
   - Check that `GOOGLE_APPLICATION_CREDENTIALS` points to a valid JSON file
   - Ensure API keys haven't expired

3. Timeout errors during web scraping
   - Some sites may be slow or may block automated requests
   - Adjust timeout values in `config.yml`
   - Check `robots.txt` compliance for the target site

4. Memory issues
   - Reduce `max_benchmarks` in the config
   - Process fewer videos simultaneously
   - Close other memory-intensive applications

5. Model prediction errors
   - Verify `barplot_prediction.hdf5` exists in `data/models/`
   - Check the TensorFlow installation: `python -c "import tensorflow; print(tensorflow.__version__)"`
- Processing Time: Full pipeline can take 2-6 hours depending on the number of sources
- Memory Usage: Expect 4-8 GB RAM usage during video processing
- API Costs: GPT-4 API calls can incur significant costs; monitor usage
- Rate Limits: Respect API rate limits; the tool includes delays between calls
- Entry Point: `src/aibot/aibot_main.py`
- Search Modules: `src/aibot/web_search.py`, `src/aibot/video_search.py`
- Scraping: `src/aibot/scrape_web.py`, `src/aibot/get_relevant_webpages.py`
- Media Processing: `src/aibot/extract_frames_from_video.py`, `src/aibot/convert_speech_to_text.py`
- ML/AI: `src/aibot/convert_image_to_text.py`, `src/aibot/barplot_prediction_after_training.py`
- GPT Integration: `src/aibot/gpt_wrapper_web.py`, `src/aibot/gpt_wrapper_video.py`
- Output: `src/aibot/final_spreadsheet_write.py`
- Trained Model: `data/models/barplot_prediction.hdf5`
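The OCR step presumably resembles the standard Cloud Vision text-detection call; a minimal sketch follows (how `convert_image_to_text.py` actually wraps it is not shown here):

```python
from google.cloud import vision

# The client picks up GOOGLE_APPLICATION_CREDENTIALS from the environment.
client = vision.ImageAnnotatorClient()

def ocr_image(path: str) -> str:
    """Return all text Cloud Vision finds in a benchmark chart image."""
    with open(path, "rb") as fh:
        image = vision.Image(content=fh.read())
    response = client.text_detection(image=image)
    annotations = response.text_annotations
    # The first annotation aggregates the full detected text.
    return annotations[0].description if annotations else ""
```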
To add new processors:
- Edit `config.yml` to add processor names
- No code changes required: the tool uses dynamic search

To add new review sources:
- Add domain/channel names to `config.yml`
- Sources are automatically prioritized by the tool
- Platform-Specific: Signal-based timeouts work on Unix/macOS only, not Windows (see the sketch after this list)
- Sequential Processing: No parallel processing of multiple sources
- English Only: OCR and text processing optimized for English content
- API Dependencies: Requires active API subscriptions and internet connection
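To illustrate the first limitation, here is a sketch of the Unix-only timeout pattern; the helper name and the use of `urllib` are illustrative, and the repo's actual scraping code may differ:

```python
import signal
import urllib.request

class ScrapeTimeout(Exception):
    """Raised when a single URL exceeds its scraping budget."""

def _on_alarm(signum, frame):
    raise ScrapeTimeout("scraping exceeded the configured timeout")

def fetch_with_timeout(url: str, seconds: int = 15 * 60) -> str:
    """Fetch a page, aborting after `seconds` (15-minute default, as above)."""
    # signal.SIGALRM exists only on Unix-like systems, which is why this
    # timeout style cannot work on Windows.
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="replace")
    finally:
        signal.alarm(0)  # always cancel the pending alarm
```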
[Specify your license here]
[Add contribution guidelines]
For issues or questions:
- Check the Troubleshooting section above
- Review logs in the `logs/` directory
- Open an issue on GitHub
- Initial release
- Automated web and video scraping
- ML-based benchmark chart detection
- GPT-4 integration for summarization
- Excel report generation
- Google Cloud Vision & Speech APIs for OCR and transcription
- OpenAI for GPT models
- TensorFlow/PyTorch for ML capabilities
- Selenium and BeautifulSoup for web scraping