An intelligent web content extraction application that uses natural language processing to transform web scraping into precise data harvesting.
AI DataHarvester combines the power of local LLMs (Large Language Models) with web scraping technologies to create an intelligent data extraction tool. Unlike traditional web scrapers that simply download content, this application understands what you're looking for and extracts only the information you request, expressed as a natural language query.
- Extract specific information from websites using natural language
- Process and clean web content automatically
- Handle various website structures and formats
- Use local LLMs to understand your queries
- Extract precisely what you need from web content
- Support for multiple LLM models (llama3.2, gemma, mistral, phi3, etc.)
- Reset functionality to quickly start new projects
- Store extracted content in session for further processing
- View and analyze raw content before extraction
- Download parsed results as structured JSON files
- Send data directly to webhooks for integration with other systems
- Well-formatted data with timestamps and metadata
- Real-time monitoring of system components
- Status indicators for LLM service and web scraping service
- Troubleshooting guidance and quick fixes
- Docker-based deployment for consistent environment
- Multi-container setup with orchestration
- Volume persistence for logs and data
```
ai-dataharvester/
├── .github/              # GitHub workflows and CI/CD configuration
│   └── workflows/        # CI/CD workflow definitions
│       └── deploy.yml    # Deployment workflow
│
├── logs/                 # Application logs directory
│   ├── scraper.log       # Web scraping logs
│   ├── parser.log        # LLM parsing logs
│   ├── streamlit.log     # UI application logs
│   └── health.log        # Health monitoring logs
│
├── .env                  # Environment variables (credentials)
├── .gitignore            # Git ignore rules
├── docker-compose.yml    # Docker Compose configuration
├── Dockerfile            # Docker image definition
├── README.md             # Project documentation
├── requirements.txt      # Python dependencies
├── setup.sh              # Setup script for directory structure
│
├── main.py               # Main Streamlit application
├── scrape.py             # Web scraping functionality
├── parse.py              # LLM parsing functionality
├── health.py             # Health monitoring system
└── logger_config.py      # Centralized logging configuration
```
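As an illustration of the content-cleaning step that `scrape.py` performs before parsing, here is a minimal sketch using only the standard library (the project's actual implementation may differ): it strips `<script>`/`<style>` blocks and normalizes whitespace.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def clean_html(raw_html: str) -> str:
    """Return whitespace-normalized visible text from an HTML document."""
    parser = TextExtractor()
    parser.feed(raw_html)
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

print(clean_html("<html><body><script>x=1</script><p>Hello  world</p></body></html>"))
# → Hello world
```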
- Ubuntu Desktop system
- Python 3.11
- Git
- Ollama already running locally
1. Install Python 3.11 (if not already installed):

   ```bash
   sudo apt install python3.11
   sudo apt install python3.11-venv
   ```

2. Clone the repository:

   ```bash
   # Navigate to your preferred installation directory
   cd /path/to/your/preferred/directory
   git clone https://github.com/urdiales/ai-dataharvester.git
   cd ai-dataharvester
   ```

3. Create and activate a Python virtual environment:

   ```bash
   python3.11 -m venv ai
   source ai/bin/activate
   ```

4. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

5. Create needed directories (if they don't exist):

   ```bash
   mkdir -p logs
   ```

6. Create an `.env` file for your credentials:

   ```bash
   touch .env
   nano .env
   ```

   Add these lines to the file (replace with your actual credentials):

   ```
   BRIGHTDATA_USER=your_brightdata_user
   BRIGHTDATA_PASSWORD=your_brightdata_password
   ```

   Save and exit (Ctrl+X, then Y, then Enter).

7. Build and start the containers:

   ```bash
   docker-compose up -d
   ```

8. Access the application at http://localhost:8501

To run the application locally instead of through Docker:

1. Ensure your Ollama instance is running

2. Activate the virtual environment (if not already activated):

   ```bash
   source ai/bin/activate
   ```

3. Start the Streamlit application:

   ```bash
   streamlit run main.py
   ```

4. Access the application by opening a web browser and navigating to http://localhost:8501
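After installation, it can help to verify that both services are reachable before opening the UI. A minimal standard-library sketch (it assumes Ollama's default REST endpoint on port 11434 and the Streamlit front page on port 8501):

```python
from urllib.request import urlopen
from urllib.error import URLError

def check_service(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service at `url` responds with HTTP status 200."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError, ValueError):
        return False

if __name__ == "__main__":
    # Ollama exposes a model-listing endpoint at /api/tags by default.
    print("Ollama:", "up" if check_service("http://localhost:11434/api/tags") else "down")
    # The Streamlit UI serves its landing page on port 8501.
    print("Streamlit:", "up" if check_service("http://localhost:8501") else "down")
```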
- Enter a website URL in the input field
- Click "Scrape Website" and wait for the process to complete
- The content will be extracted, cleaned, and stored for parsing
- With scraped content loaded, enter a natural language query
- Example: "What is the main topic of this website?"
- Example: "Extract all product names and prices"
- Example: "Find the author's contact information"
- Select your preferred LLM model
- Click "Parse Content" to extract the specific information
- View the parsed results directly in the interface
- Download the results as a JSON file using the download button
- Send the results to a webhook for integration with other systems
- Reset all data when starting a new project
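The JSON download and webhook steps above can be sketched as follows. This is a standard-library illustration only: the payload shape, the `source` field, and the helper names are assumptions, not the application's actual schema.

```python
import json
import urllib.request
from datetime import datetime, timezone

def build_payload(query: str, result: str) -> dict:
    """Wrap a parsed result with a timestamp and basic metadata."""
    return {
        "query": query,
        "result": result,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": "ai-dataharvester",
    }

def send_to_webhook(payload: dict, webhook_url: str) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status

payload = build_payload("Extract all product names", "Widget A, Widget B")
print(json.dumps(payload, indent=2))
```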
The application consists of two main components:
- **ai-dataharvester**: Streamlit application for UI and web scraping
  - Handles user interactions
  - Performs web scraping via Bright Data
  - Processes and cleans content
  - Manages the parsing workflow
- **ollama**: Local LLM service for content parsing
  - Provides inference capabilities
  - Supports multiple models
  - Performs natural language understanding
  - Extracts specific information based on queries
```
User Request → Streamlit UI → Web Scraping (Selenium/Bright Data) → Content Cleaning
                                                                          ↓
JSON Export/Webhook ← Result Display ← LLM Parsing (Ollama) ← Content Processing
```
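The LLM parsing stage of this pipeline can be sketched against Ollama's REST API. The project itself integrates via LangChain, so this standalone call, including the prompt wording, is illustrative only:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default REST endpoint

def build_prompt(content: str, query: str) -> str:
    """Combine cleaned page content with the user's natural language query."""
    return (
        "You are extracting information from scraped web content.\n"
        f"Content:\n{content}\n\n"
        f"Instruction: {query}\n"
        "Return only the requested information."
    )

def parse_content(content: str, query: str, model: str = "llama3.2") -> str:
    """Send one non-streaming generation request to the local Ollama service."""
    body = json.dumps({
        "model": model,
        "prompt": build_prompt(content, query),
        "stream": False,
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```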
A comprehensive logging system writes files to the `logs/` directory:

- `scraper.log` - Web scraping operations and errors
- `parser.log` - LLM parsing activities and responses
- `streamlit.log` - UI and application flow
- `health.log` - Health check information and system status
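The per-component log files might be produced by a helper along these lines (a sketch, not the project's actual `logger_config.py`):

```python
import logging
from pathlib import Path

LOG_DIR = Path("logs")

def get_logger(name: str) -> logging.Logger:
    """Return a logger that writes to logs/<name>.log with a shared format."""
    LOG_DIR.mkdir(exist_ok=True)
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(LOG_DIR / f"{name}.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

scraper_log = get_logger("scraper")
scraper_log.info("Scrape started")
```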
Health monitoring is available in the sidebar of the application, providing:
- Real-time status of the Ollama LLM service
- Connection status for Bright Data service
- Troubleshooting guidance for common issues
- Manual override options for development
If you encounter issues:
- Check the application logs in the `logs/` directory
- Verify the health status in the application sidebar
- Ensure your Bright Data credentials are correct in the `.env` file
- Make sure the Ollama service is running
- Try restarting the application
By default, the application connects to Ollama at http://localhost:11434. If your Ollama instance is running at a different address, add the `OLLAMA_HOST` variable to your `.env` file:

```
OLLAMA_HOST=http://your-ollama-host:11434
```

For troubleshooting Ollama connection issues:

- Check that Ollama is running using `ollama list` in a terminal
- Verify your Ollama API is accessible at the configured address
- Make sure you have the required models installed (`llama3.2`, etc.)
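Reading the address override in Python could look like this (a sketch; the helper name `get_ollama_host` is hypothetical, and the project may resolve the setting differently):

```python
import os

# Default matches Ollama's standard local address; override via .env / environment.
DEFAULT_OLLAMA_HOST = "http://localhost:11434"

def get_ollama_host() -> str:
    """Resolve the Ollama base URL, preferring the OLLAMA_HOST variable."""
    return os.environ.get("OLLAMA_HOST", DEFAULT_OLLAMA_HOST)

os.environ["OLLAMA_HOST"] = "http://192.168.1.50:11434"
print(get_ollama_host())  # → http://192.168.1.50:11434
```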
Common solutions:
- Reset the application data if encountering UI issues
- Check network connectivity for webhook and scraping operations
- Verify that required LLM models are downloaded in Ollama
- Credentials are stored in environment variables, not hardcoded
- Webhook connections use HTTPS for secure data transmission
- Logs are segregated by component for better auditing
Contributions are welcome! Please feel free to submit a Pull Request.
Copyright (c) 2025 David Urdiales

This project is licensed under the MIT License.
- Streamlit for the UI framework
- Ollama for local LLM capabilities
- Bright Data for web scraping infrastructure
- Selenium for browser automation
- LangChain for LLM integration




