
AIBot - Competitive Analysis Automation Tool

An automated system for gathering, processing, and analyzing competitive benchmarking data about Intel and AMD processors from web reviews and YouTube videos using AI/ML techniques.

Overview

AIBot automates the process of:

  • Searching for Intel and AMD processor reviews on the web and YouTube
  • Scraping web pages and transcribing YouTube videos
  • Extracting benchmark data from review images using Google Cloud Vision OCR
  • Using GPT-4 to summarize benchmarking results
  • Identifying bar-chart images (benchmark performance charts) with a trained neural network
  • Collating benchmark data across multiple sources
  • Generating competitive analysis reports in Excel format

Features

  • Automated Web Scraping: Searches for and scrapes processor reviews from a configurable list of preferred tech websites
  • YouTube Processing: Downloads, transcribes, and extracts frames (every 5 seconds; see the sketch after this list) from YouTube review videos
  • ML-Based Image Classification: Uses a pre-trained MobileNetV2 neural network to identify benchmark charts
  • OCR: Extracts text and benchmark data from images using Google Cloud Vision API
  • AI Summarization: Uses GPT-4 to summarize benchmark findings and test conditions
  • Data Collation: Aggregates and organizes benchmark data from multiple sources
  • Excel Reporting: Generates comprehensive competitive analysis reports
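
For example, the five-second frame sampling described above can be sketched with OpenCV. This is a minimal illustration (the helper name extract_frames is hypothetical; the repo's extract_frames_from_video.py may differ):

import os
import cv2

def extract_frames(video_path, out_dir, interval_s=5):
    # Save one frame every interval_s seconds of video.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if FPS is unreadable
    step = max(1, int(fps * interval_s))
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:04d}.png"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved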

💡 See Architecture Guide for detailed system design and data flow diagrams

Technology Stack

  • Python 3.9.18
  • Machine Learning: TensorFlow 2.11, PyTorch 1.13
  • AI Services: OpenAI GPT-3.5/4, Google Cloud Vision & Speech APIs
  • Web Scraping: Selenium, BeautifulSoup, Requests
  • Video Processing: Pytube, MoviePy
  • Data Processing: Pandas, NumPy, SciPy, Scikit-learn
  • Image Processing: OpenCV, ImageIO, EasyOCR

Project Structure

aibot/
├── src/
│   └── aibot/                      # Main package
│       ├── aibot_main.py           # Main entry point
│       ├── web_search.py           # Web search functionality
│       ├── video_search.py         # YouTube search
│       ├── scrape_web.py           # Web scraping
│       ├── process_videos.py       # Video processing
│       ├── convert_image_to_text.py    # OCR functionality
│       ├── convert_speech_to_text.py   # Audio transcription
│       ├── gpt_wrapper_web.py      # GPT integration for web
│       ├── gpt_wrapper_video.py    # GPT integration for videos
│       └── [30+ other modules]
├── data/
│   ├── models/
│   │   └── barplot_prediction.hdf5    # Pre-trained ML model
│   ├── test_data/
│   │   ├── images/                    # Test benchmark images
│   │   ├── video_frames/              # Test video frames
│   │   └── audio/                     # Test audio transcripts
│   └── model_performance/             # Training metrics & charts
├── tests/                             # Test files
├── scripts/                           # Utility scripts
├── docs/                              # Documentation
├── config.yml                         # Application configuration
├── requirements.txt                   # Python dependencies
├── environment.yml                    # Conda environment
├── setup.py                           # Package setup
├── .env.example                       # Environment variables template
├── .gitignore                         # Git ignore rules
└── README.md                          # This file

Installation

Prerequisites

  • Python 3.9 or higher (the project was developed on 3.9.18)
  • Conda (recommended) or pip
  • Google Cloud account with Vision and Speech APIs enabled
  • OpenAI API account
  • YouTube Data API key

Setup Instructions

  1. Clone the repository

    git clone <repository-url>
    cd aibot
  2. Create and activate a virtual environment

    Using Conda (recommended):

    conda env create -f environment.yml
    conda activate aibot

    Using pip:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install -r requirements.txt
  3. Configure credentials (IMPORTANT: see the Security section below)

    Copy the example environment file:

    cp .env.example .env

    Edit .env and add your API keys:

    OPENAI_API_KEY=your_openai_api_key_here
    GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service-account-key.json
    YOUTUBE_API_KEY=your_youtube_api_key_here
    
  4. Configure application settings

    Edit config.yml to customize:

    • Processors to compare
    • Preferred review sources
    • GPT models to use
    • Timeout settings
    • Output directories
  5. Install the package

    pip install -e .
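
With setup complete, the credentials from step 3 can be loaded at runtime. A minimal sketch assuming the python-dotenv package (whether the repo itself uses this library is not shown here):

import os
from dotenv import load_dotenv

# Read key=value pairs from .env into the process environment.
load_dotenv()

openai_key = os.environ["OPENAI_API_KEY"]
google_credentials = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
youtube_key = os.environ["YOUTUBE_API_KEY"]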

Usage

Command Line Interface

Run the complete pipeline:

aibot --xls-name CompetitiveAnalysis.xlsx --num-benchmarks 50

Parameters

  • --xls-name: Name of the output Excel file (default: CompetitiveAnalysisSummary.xlsx)
  • --num-benchmarks: Maximum number of benchmarks to process (default: 50)
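
A hypothetical sketch of how these flags could map onto argparse (the real wiring lives in src/aibot/aibot_main.py and may differ):

import argparse

parser = argparse.ArgumentParser(prog="aibot")
parser.add_argument("--xls-name", default="CompetitiveAnalysisSummary.xlsx",
                    help="Name of the output Excel file")
parser.add_argument("--num-benchmarks", type=int, default=50,
                    help="Maximum number of benchmarks to process")
args = parser.parse_args()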

Workflow

The bot executes the following steps sequentially:

  1. Search web and YouTube for processor reviews
  2. Prioritize preferred domains and channels
  3. Scrape web content (15-minute timeout per URL; sketched below)
  4. Download and transcribe YouTube videos
  5. Run ML model to identify benchmark charts
  6. Extract text from images using OCR
  7. Summarize benchmarks using GPT-4
  8. Collate and deduplicate results
  9. Generate Excel report
  10. Clean up temporary files
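
Step 3's per-URL cap can be enforced with a signal-based alarm. A minimal sketch (the helper is hypothetical; as noted under Limitations, this approach works on Unix/macOS only):

import signal

class ScrapeTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise ScrapeTimeout()

def scrape_with_timeout(url, scrape_fn, timeout_s=15 * 60):
    # Abort scrape_fn(url) if it runs past timeout_s seconds.
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_s)
    try:
        return scrape_fn(url)
    except ScrapeTimeout:
        return None  # skip sites that stall or block automation
    finally:
        signal.alarm(0)  # always cancel the pending alarm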

Configuration

Processors

Edit config.yml to specify which processors to compare:

processors:
  intel:
    - Core i9-14900K
    - Core i7-13700K
  amd:
    - Ryzen 9 7950X
    - Ryzen 7 7800X3D
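
These entries can be read with PyYAML and turned into search queries. A sketch (the query template is illustrative, not necessarily the repo's actual format):

import yaml

with open("config.yml") as f:
    config = yaml.safe_load(f)

intel = config["processors"]["intel"]
amd = config["processors"]["amd"]
# One head-to-head query per Intel/AMD pairing.
queries = [f"{i} vs {a} benchmark review" for i in intel for a in amd]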

Preferred Sources

Customize which websites and YouTube channels to prioritize:

preferred_domains:
  - anandtech
  - tomshardware
  - pcmag

preferred_channels:
  - Dave2D
  - Hardware Canucks
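
Prioritization could be a stable sort that moves preferred domains to the front of the result list. A hypothetical sketch:

def prioritize(urls, preferred_domains):
    # Preferred domains first, in config order; everything else keeps
    # its original ranking (sorted() is stable).
    def rank(url):
        for position, domain in enumerate(preferred_domains):
            if domain in url:
                return position
        return len(preferred_domains)
    return sorted(urls, key=rank)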

GPT Models

Configure which GPT models to use for different tasks:

gpt_models:
  video_description_summary: gpt-3.5-turbo-0125
  test_conditions: gpt-4
  benchmark_summarization: gpt-4
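
A sketch of how a configured model name might be used, written against the current openai Python client (v1+); the repo's gpt_wrapper_* modules may target an older client version:

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def summarize_benchmark(ocr_text, model="gpt-4"):
    # model would come from config["gpt_models"]["benchmark_summarization"].
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize CPU benchmark results."},
            {"role": "user", "content": ocr_text},
        ],
    )
    return response.choices[0].message.content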

Security

CRITICAL: API Credentials

NEVER commit API keys or credentials to version control.

This project requires three types of credentials:

  1. OpenAI API Key: Get from https://platform.openai.com/api-keys
  2. Google Cloud Service Account: Create at https://console.cloud.google.com/iam-admin/serviceaccounts
  3. YouTube Data API Key: Get from https://console.cloud.google.com/apis/credentials

All credentials should be stored in the .env file (which is ignored by git) or as environment variables.

If You Forked This Project

IMPORTANT: If you cloned this from a public repository that previously contained exposed credentials:

  1. The old API keys may have been compromised and should NOT be used
  2. Generate new API keys from scratch
  3. Review your API usage logs for any suspicious activity
  4. Enable API key restrictions and quotas in the respective cloud consoles

See SECURITY.md for detailed credential rotation instructions.

Output

The tool generates an Excel file containing:

  • Processor comparison data
  • Benchmark results across different tests
  • Source information (URLs, publication dates)
  • Test conditions and configurations
  • Performance metrics and analysis
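
A minimal sketch of writing such a report with pandas (the column names and placeholder row are illustrative; the real layout may differ):

import pandas as pd

rows = [
    # Placeholder row; a real run produces one row per benchmark result.
    {"benchmark": "Cinebench R23 (multi-core)",
     "processor": "Core i9-14900K",
     "score": None,
     "source_url": "https://example.com/review",
     "test_conditions": "see source"},
]

df = pd.DataFrame(rows)
# Requires an Excel engine such as openpyxl.
with pd.ExcelWriter("CompetitiveAnalysisSummary.xlsx") as writer:
    df.to_excel(writer, sheet_name="Benchmarks", index=False)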

Testing

Tests are located in the tests/ directory:

# Run all tests
python -m pytest tests/

# Run a specific test file
python -m pytest tests/test_search_web.py

Troubleshooting

Common Issues

1. Import errors

  • Ensure you've activated the virtual environment
  • Reinstall dependencies: pip install -r requirements.txt
  • Install package in development mode: pip install -e .

2. API authentication errors

  • Verify your .env file exists and contains valid credentials
  • Check that GOOGLE_APPLICATION_CREDENTIALS points to a valid JSON file
  • Ensure API keys haven't expired

3. Timeout errors during web scraping

  • Some sites may be slow or blocking automated requests
  • Adjust timeout values in config.yml
  • Check robots.txt compliance for the target site

4. Memory issues

  • Reduce max_benchmarks in config
  • Process fewer videos simultaneously
  • Close other memory-intensive applications

5. Model prediction errors

  • Verify barplot_prediction.hdf5 exists in data/models/
  • Check TensorFlow installation: python -c "import tensorflow; print(tensorflow.__version__)"
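
If TensorFlow imports cleanly, a quick load check can confirm the model file is intact (a sanity-check sketch):

import tensorflow as tf

model = tf.keras.models.load_model("data/models/barplot_prediction.hdf5")
model.summary()  # prints the architecture and expected input shape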

Performance Considerations

  • Processing Time: Full pipeline can take 2-6 hours depending on the number of sources
  • Memory Usage: Expect 4-8 GB RAM usage during video processing
  • API Costs: GPT-4 API calls can incur significant costs; monitor usage
  • Rate Limits: Respect API rate limits; the tool includes delays between calls

Development

Code Organization

  • Entry Point: src/aibot/aibot_main.py
  • Search Modules: src/aibot/web_search.py, src/aibot/video_search.py
  • Scraping: src/aibot/scrape_web.py, src/aibot/get_relevant_webpages.py
  • Media Processing: src/aibot/extract_frames_from_video.py, src/aibot/convert_speech_to_text.py
  • ML/AI: src/aibot/convert_image_to_text.py, src/aibot/barplot_prediction_after_training.py
  • GPT Integration: src/aibot/gpt_wrapper_web.py, src/aibot/gpt_wrapper_video.py
  • Output: src/aibot/final_spreadsheet_write.py
  • Trained Model: data/models/barplot_prediction.hdf5
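
The barplot classifier listed under ML/AI above could be invoked along these lines. This is a hypothetical wrapper; the input size, scaling, and output interpretation depend on how the model was actually trained:

import tensorflow as tf

model = tf.keras.models.load_model("data/models/barplot_prediction.hdf5")

def is_barplot(image_path, threshold=0.5):
    # Assumes 224x224 RGB input with [0, 1] scaling and a single
    # sigmoid output; check model.summary() for the real shapes.
    img = tf.keras.utils.load_img(image_path, target_size=(224, 224))
    x = tf.keras.utils.img_to_array(img)[None, ...] / 255.0
    return float(model.predict(x)[0][0]) >= threshold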

Adding New Processors

  1. Edit config.yml to add processor names
  2. No code changes are required; the tool builds its searches dynamically from the configured processor names

Adding New Sources

  1. Add domain/channel names to config.yml
  2. Sources are automatically prioritized by the tool

Limitations

  • Platform-Specific: Signal-based timeouts work on Unix/macOS only (not Windows)
  • Sequential Processing: No parallel processing of multiple sources
  • English Only: OCR and text processing optimized for English content
  • API Dependencies: Requires active API subscriptions and internet connection

License

[Specify your license here]

Contributing

[Add contribution guidelines]

Support

For issues or questions:

  • Check the Troubleshooting section above
  • Review logs in the logs/ directory
  • Open an issue on GitHub

Changelog

Version 1.0 (Current)

  • Initial release
  • Automated web and video scraping
  • ML-based benchmark chart detection
  • GPT-4 integration for summarization
  • Excel report generation

Acknowledgments

  • Google Cloud Vision & Speech APIs for OCR and transcription
  • OpenAI for GPT models
  • TensorFlow/PyTorch for ML capabilities
  • Selenium and BeautifulSoup for web scraping
