AI DataHarvester

An intelligent web content extraction application that uses natural language processing to transform web scraping into precise data harvesting.

🎯 Overview

AI DataHarvester combines the power of local LLMs (Large Language Models) with web scraping technologies to create an intelligent data extraction tool. Unlike traditional web scrapers that simply download content, this application understands what you're looking for and extracts specifically requested information using natural language queries.

📸 Screenshots

📊 AI DataHarvester - Model Select

AI DataHarvester Model Select

๐Ÿ” Application Scraped Content

Scraped Content

๐Ÿ“ Parsed Content Results

Parsed Content Results

๐Ÿ›ก๏ธ Health Monitoring

Health Monitoring Panel

✨ Features

๐Ÿ” Intelligent Web Extraction

  • Extract specific information from websites using natural language
  • Process and clean web content automatically
  • Handle various website structures and formats

🧠 AI-Powered Parsing

  • Use local LLMs to understand your queries
  • Extract precisely what you need from web content
  • Support for multiple LLM models (llama3.2, gemma, mistral, phi3, etc.)

🔄 Data Management

  • Reset functionality to quickly start new projects
  • Store extracted content in session for further processing
  • View and analyze raw content before extraction

💾 Export Options

  • Download parsed results as structured JSON files
  • Send data directly to webhooks for integration with other systems
  • Well-formatted data with timestamps and metadata

๐Ÿ›ก๏ธ Health Monitoring

  • Real-time monitoring of system components
  • Status indicators for LLM service and web scraping service
  • Troubleshooting guidance and quick fixes

๐Ÿณ Containerization

  • Docker-based deployment for consistent environment
  • Multi-container setup with orchestration
  • Volume persistence for logs and data

📂 Project Structure

ai-dataharvester/
├── .github/                  # GitHub workflows and CI/CD configuration
│   └── workflows/            # CI/CD workflow definitions
│       └── deploy.yml        # Deployment workflow
│
├── logs/                     # Application logs directory
│   ├── scraper.log           # Web scraping logs
│   ├── parser.log            # LLM parsing logs
│   ├── streamlit.log         # UI application logs
│   └── health.log            # Health monitoring logs
│
├── .env                      # Environment variables (credentials)
├── .gitignore                # Git ignore rules
├── docker-compose.yml        # Docker Compose configuration
├── Dockerfile                # Docker image definition
├── README.md                 # Project documentation
├── requirements.txt          # Python dependencies
├── setup.sh                  # Setup script for directory structure
│
├── main.py                   # Main Streamlit application
├── scrape.py                 # Web scraping functionality
├── parse.py                  # LLM parsing functionality
├── health.py                 # Health monitoring system
└── logger_config.py          # Centralized logging configuration

🚀 Getting Started

📋 Prerequisites

  • Ubuntu Desktop system
  • Python 3.11
  • Git
  • Ollama already running locally

🔧 Local Installation

  1. Install Python 3.11 (if not already installed):

    sudo apt install python3.11
    sudo apt install python3.11-venv
  2. Clone the repository:

    # Navigate to your preferred installation directory
    cd /path/to/your/preferred/directory
    git clone https://github.com/urdiales/ai-dataharvester.git
    cd ai-dataharvester
  3. Create and activate a Python virtual environment:

    python3.11 -m venv ai
    source ai/bin/activate
  4. Install the required dependencies:

    pip install -r requirements.txt
  5. Create needed directories (if they don't exist):

    mkdir -p logs
  6. Create an .env file for your credentials:

    touch .env
    nano .env

    Add these lines to the file (replace with your actual credentials):

    BRIGHTDATA_USER=your_brightdata_user
    BRIGHTDATA_PASSWORD=your_brightdata_password
    

    Save and exit (Ctrl+X, then Y, then Enter)
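
The application presumably reads these credentials at startup; the actual loader in the codebase is not reproduced here. As a minimal stdlib-only sketch (function name `load_env_file` is illustrative, not the project's API), a simple `.env` parser could look like this, with process environment variables taking precedence over file values:

```python
import os

def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines from a .env file (sketch)."""
    values = {}
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                # Skip blanks and comments; keep only KEY=VALUE pairs
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    values[key.strip()] = value.strip()
    except FileNotFoundError:
        pass
    # Real environment variables override values from the file
    values.update({k: v for k, v in os.environ.items() if k in values})
    return values
```

In practice a library such as python-dotenv does this job more robustly; the sketch just shows the expected file format.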

๐Ÿณ Docker Installation

  1. Build and start the containers:

    docker-compose up -d
  2. Access the application at http://localhost:8501

🎮 Running the Application

  1. Ensure your Ollama instance is running

  2. Activate the virtual environment (if not already activated):

    source ai/bin/activate
  3. Start the Streamlit application:

    streamlit run main.py
  4. Access the application by opening a web browser and navigating to:

    http://localhost:8501
    

🎮 Usage

๐ŸŒ Scraping a Website

  1. Enter a website URL in the input field
  2. Click "Scrape Website" and wait for the process to complete
  3. The content will be extracted, cleaned, and stored for parsing
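
The cleaning step in scrape.py is not reproduced here; as an illustrative stdlib-only sketch, stripping markup down to visible text while skipping script and style blocks might look like this (class and function names are assumptions, not the project's actual code):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self._chunks.append(data.strip())

def clean_html(html_text: str) -> str:
    """Return newline-joined visible text from an HTML document."""
    parser = TextExtractor()
    parser.feed(html_text)
    return "\n".join(parser._chunks)
```

Libraries like BeautifulSoup offer the same capability with better handling of malformed markup.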

🔎 Parsing Content

  1. With scraped content loaded, enter a natural language query
    • Example: "What is the main topic of this website?"
    • Example: "Extract all product names and prices"
    • Example: "Find the author's contact information"
  2. Select your preferred LLM model
  3. Click "Parse Content" to extract the specific information
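
Under the hood, parse.py sends the scraped content and your query to the local Ollama service. A minimal sketch against Ollama's /api/generate endpoint is below; the prompt template and function names are illustrative assumptions, not the project's actual code (which uses LangChain):

```python
import json
import urllib.request

# Hypothetical prompt template; the app's real template is not shown here
PARSE_TEMPLATE = (
    "You are extracting information from scraped web content.\n"
    "Content:\n{content}\n\n"
    "Instruction: {query}\n"
    "Return only the requested information."
)

def build_prompt(content: str, query: str) -> str:
    """Combine scraped content and the user's natural language query."""
    return PARSE_TEMPLATE.format(content=content, query=query)

def parse_with_ollama(content: str, query: str,
                      model: str = "llama3.2",
                      host: str = "http://localhost:11434") -> str:
    """POST a non-streaming generate request to Ollama and return its text."""
    payload = json.dumps({
        "model": model,
        "prompt": build_prompt(content, query),
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```
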

📊 Managing Results

  1. View the parsed results directly in the interface
  2. Download the results as a JSON file using the download button
  3. Send the results to a webhook for integration with other systems
  4. Reset all data when starting a new project
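
The downloadable JSON described above bundles the parsed output with a timestamp and metadata. A sketch of what such packaging might look like follows; the field names are assumptions, not the app's actual schema:

```python
import json
from datetime import datetime, timezone

def package_results(url: str, query: str, result: str, model: str) -> str:
    """Wrap a parsed result with timestamp and metadata as pretty JSON."""
    payload = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source_url": url,
        "query": query,
        "model": model,
        "result": result,
    }
    return json.dumps(payload, indent=2)
```

The same string could be offered via a Streamlit download button or POSTed to a webhook.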

๐Ÿ—๏ธ Architecture

The application consists of two main components:

  1. ai-dataharvester: Streamlit application for UI and web scraping

    • Handles user interactions
    • Performs web scraping via Bright Data
    • Processes and cleans content
    • Manages the parsing workflow
  2. ollama: Local LLM service for content parsing

    • Provides inference capabilities
    • Supports multiple models
    • Performs natural language understanding
    • Extracts specific information based on queries

🔄 Data Flow

User Request → Streamlit UI → Web Scraping (Selenium/Bright Data) → Content Cleaning
                                                                       ↓
     JSON Export/Webhook ← Result Display ← LLM Parsing (Ollama) ← Content Processing

๐Ÿ“ Logging and Monitoring

Comprehensive logging system with files stored in the logs/ directory:

  • scraper.log - Web scraping operations and errors
  • parser.log - LLM parsing activities and responses
  • streamlit.log - UI and application flow
  • health.log - Health check information and system status
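
logger_config.py centralizes this setup; its actual contents are not shown here, but a per-component file logger along the following lines would produce the files listed above (the `get_logger` name is an assumption):

```python
import logging
from pathlib import Path

def get_logger(component: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger that writes to <log_dir>/<component>.log."""
    Path(log_dir).mkdir(exist_ok=True)
    logger = logging.getLogger(component)
    # Only attach a handler the first time this component is requested
    if not logger.handlers:
        handler = logging.FileHandler(Path(log_dir) / f"{component}.log")
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

Each module (scraper, parser, UI, health) would call this once with its own component name, keeping the logs segregated.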

Health monitoring is available in the sidebar of the application, providing:

  • Real-time status of the Ollama LLM service
  • Connection status for Bright Data service
  • Troubleshooting guidance for common issues
  • Manual override options for development

🔧 Troubleshooting

If you encounter issues:

  1. Check the application logs in the logs/ directory
  2. Verify the health status in the application sidebar
  3. Ensure your Bright Data credentials are correct in the .env file
  4. Make sure the Ollama service is running
  5. Try restarting the application

Configuring Ollama Connection

By default, the application connects to Ollama at http://localhost:11434. If your Ollama instance is running at a different address:

  1. Add the OLLAMA_HOST variable to your .env file:

    OLLAMA_HOST=http://your-ollama-host:11434
    
  2. For troubleshooting Ollama connection issues:

    • Check that Ollama is running using ollama list in terminal
    • Verify your Ollama API is accessible at the configured address
    • Make sure you have the required models installed (llama3.2, etc.)
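
Resolving the host and checking reachability can be sketched as follows; the health checks in health.py may differ, and this is an illustrative stdlib-only version:

```python
import os
import urllib.request

def ollama_host() -> str:
    """Resolve the Ollama address, falling back to the default local port."""
    return os.environ.get("OLLAMA_HOST", "http://localhost:11434")

def ollama_is_up(timeout: float = 2.0) -> bool:
    """Return True if the Ollama API answers on its root endpoint."""
    try:
        with urllib.request.urlopen(ollama_host(), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

A sidebar health indicator could simply call `ollama_is_up()` on each rerun and render a green or red status accordingly.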

Common solutions:

  • Reset the application data if encountering UI issues
  • Check network connectivity for webhook and scraping operations
  • Verify that required LLM models are downloaded in Ollama

🔒 Security Notes

  • Credentials are stored in environment variables, not hardcoded
  • Webhook connections use HTTPS for secure data transmission
  • Logs are segregated by component for better auditing

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📜 License

Copyright (c) 2025 David Urdiales

This project is licensed under the MIT License.

๐Ÿ™ Acknowledgments

  • Streamlit for the UI framework
  • Ollama for local LLM capabilities
  • Bright Data for web scraping infrastructure
  • Selenium for browser automation
  • LangChain for LLM integration
