An intelligent web content extraction application that uses natural language processing to transform web scraping into precise data harvesting.
AI DataHarvester combines the power of local LLMs (Large Language Models) with web scraping technologies to create an intelligent data extraction tool. Unlike traditional web scrapers that simply download content, this application understands what you're looking for and extracts only the information you request, expressed as a natural language query.
- Extract specific information from websites using natural language
- Process and clean web content automatically
- Handle various website structures and formats
- Use local LLMs to understand your queries
- Extract precisely what you need from web content
- Support for multiple LLM models (llama3.2, gemma, mistral, phi3, etc.)
- Reset functionality to quickly start new projects
- Store extracted content in session for further processing
- View and analyze raw content before extraction
- Download parsed results as structured JSON files
- Send data directly to webhooks for integration with other systems
- Well-formatted data with timestamps and metadata
- Real-time monitoring of system components
- Status indicators for LLM service and web scraping service
- Troubleshooting guidance and quick fixes
- Docker-based deployment for consistent environment
- Multi-container setup with orchestration
- Volume persistence for logs and data
```
ai-dataharvester/
├── .github/              # GitHub workflows and CI/CD configuration
│   └── workflows/        # CI/CD workflow definitions
│       └── deploy.yml    # Deployment workflow
│
├── logs/                 # Application logs directory
│   ├── scraper.log       # Web scraping logs
│   ├── parser.log        # LLM parsing logs
│   ├── streamlit.log     # UI application logs
│   └── health.log        # Health monitoring logs
│
├── .env                  # Environment variables (credentials)
├── .gitignore            # Git ignore rules
├── docker-compose.yml    # Docker Compose configuration
├── Dockerfile            # Docker image definition
├── README.md             # Project documentation
├── requirements.txt      # Python dependencies
├── setup.sh              # Setup script for directory structure
│
├── main.py               # Main Streamlit application
├── scrape.py             # Web scraping functionality
├── parse.py              # LLM parsing functionality
├── health.py             # Health monitoring system
└── logger_config.py      # Centralized logging configuration
```
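As an illustration of the content-cleaning step that `scrape.py` performs before parsing, here is a minimal sketch using only the standard library (the project's actual implementation may differ): it strips `<script>`/`<style>` blocks and normalizes whitespace.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def clean_html(raw_html: str) -> str:
    """Return whitespace-normalized visible text from an HTML document."""
    parser = TextExtractor()
    parser.feed(raw_html)
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

print(clean_html("<html><body><script>x=1</script><p>Hello  world</p></body></html>"))
# → Hello world
```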
- Ubuntu Desktop system
- Python 3.11
- Git
- Ollama already running locally
1. Install Python 3.11 (if not already installed):

   ```bash
   sudo apt install python3.11
   sudo apt install python3.11-venv
   ```

2. Clone the repository:

   ```bash
   # Navigate to your preferred installation directory
   cd /path/to/your/preferred/directory
   git clone https://github.com/urdiales/ai-dataharvester.git
   cd ai-dataharvester
   ```

3. Create and activate a Python virtual environment:

   ```bash
   python3.11 -m venv ai
   source ai/bin/activate
   ```

4. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

5. Create needed directories (if they don't exist):

   ```bash
   mkdir -p logs
   ```

6. Create an `.env` file for your credentials:

   ```bash
   touch .env
   nano .env
   ```

   Add these lines to the file (replace with your actual credentials):

   ```
   BRIGHTDATA_USER=your_brightdata_user
   BRIGHTDATA_PASSWORD=your_brightdata_password
   ```

   Save and exit (Ctrl+X, then Y, then Enter).

7. Build and start the containers:

   ```bash
   docker-compose up -d
   ```

8. Access the application at http://localhost:8501

To run the application locally instead of through Docker:

1. Ensure your Ollama instance is running

2. Activate the virtual environment (if not already activated):

   ```bash
   source ai/bin/activate
   ```

3. Start the Streamlit application:

   ```bash
   streamlit run main.py
   ```

4. Access the application by opening a web browser and navigating to http://localhost:8501
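After installation, it can help to verify that both services are reachable before opening the UI. A minimal standard-library sketch (it assumes Ollama's default REST endpoint on port 11434 and the Streamlit front page on port 8501):

```python
from urllib.request import urlopen
from urllib.error import URLError

def check_service(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service at `url` responds with HTTP status 200."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError, ValueError):
        return False

if __name__ == "__main__":
    # Ollama exposes a model-listing endpoint at /api/tags by default.
    print("Ollama:", "up" if check_service("http://localhost:11434/api/tags") else "down")
    # The Streamlit UI serves its landing page on port 8501.
    print("Streamlit:", "up" if check_service("http://localhost:8501") else "down")
```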
- Enter a website URL in the input field
- Click "Scrape Website" and wait for the process to complete
- The content will be extracted, cleaned, and stored for parsing
- With scraped content loaded, enter a natural language query
- Example: "What is the main topic of this website?"
- Example: "Extract all product names and prices"
- Example: "Find the author's contact information"
- Select your preferred LLM model
- Click "Parse Content" to extract the specific information
- View the parsed results directly in the interface
- Download the results as a JSON file using the download button
- Send the results to a webhook for integration with other systems
- Reset all data when starting a new project
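The JSON download and webhook steps above can be sketched as follows. This is a standard-library illustration only: the payload shape, the `source` field, and the helper names are assumptions, not the application's actual schema.

```python
import json
import urllib.request
from datetime import datetime, timezone

def build_payload(query: str, result: str) -> dict:
    """Wrap a parsed result with a timestamp and basic metadata."""
    return {
        "query": query,
        "result": result,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": "ai-dataharvester",
    }

def send_to_webhook(payload: dict, webhook_url: str) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status

payload = build_payload("Extract all product names", "Widget A, Widget B")
print(json.dumps(payload, indent=2))
```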
The application consists of two main components:
- **ai-dataharvester**: Streamlit application for UI and web scraping
  - Handles user interactions
  - Performs web scraping via Bright Data
  - Processes and cleans content
  - Manages the parsing workflow
- **ollama**: Local LLM service for content parsing
  - Provides inference capabilities
  - Supports multiple models
  - Performs natural language understanding
  - Extracts specific information based on queries
```
User Request → Streamlit UI → Web Scraping (Selenium/Bright Data) → Content Cleaning
                                                                          ↓
JSON Export/Webhook ← Result Display ← LLM Parsing (Ollama) ← Content Processing
```
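The LLM parsing stage of this pipeline can be sketched against Ollama's REST API. The project itself integrates via LangChain, so this standalone call, including the prompt wording, is illustrative only:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default REST endpoint

def build_prompt(content: str, query: str) -> str:
    """Combine cleaned page content with the user's natural language query."""
    return (
        "You are extracting information from scraped web content.\n"
        f"Content:\n{content}\n\n"
        f"Instruction: {query}\n"
        "Return only the requested information."
    )

def parse_content(content: str, query: str, model: str = "llama3.2") -> str:
    """Send one non-streaming generation request to the local Ollama service."""
    body = json.dumps({
        "model": model,
        "prompt": build_prompt(content, query),
        "stream": False,
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```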
A comprehensive logging system writes files to the `logs/` directory:

- `scraper.log` - Web scraping operations and errors
- `parser.log` - LLM parsing activities and responses
- `streamlit.log` - UI and application flow
- `health.log` - Health check information and system status
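The per-component log files might be produced by a helper along these lines (a sketch, not the project's actual `logger_config.py`):

```python
import logging
from pathlib import Path

LOG_DIR = Path("logs")

def get_logger(name: str) -> logging.Logger:
    """Return a logger that writes to logs/<name>.log with a shared format."""
    LOG_DIR.mkdir(exist_ok=True)
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(LOG_DIR / f"{name}.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

scraper_log = get_logger("scraper")
scraper_log.info("Scrape started")
```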
Health monitoring is available in the sidebar of the application, providing:
- Real-time status of the Ollama LLM service
- Connection status for Bright Data service
- Troubleshooting guidance for common issues
- Manual override options for development
If you encounter issues:
- Check the application logs in the `logs/` directory
- Verify the health status in the application sidebar
- Ensure your Bright Data credentials are correct in the `.env` file
- Make sure the Ollama service is running
- Try restarting the application
By default, the application connects to Ollama at http://localhost:11434. If your Ollama instance is running at a different address, add the `OLLAMA_HOST` variable to your `.env` file:

```
OLLAMA_HOST=http://your-ollama-host:11434
```

For troubleshooting Ollama connection issues:

- Check that Ollama is running using `ollama list` in a terminal
- Verify your Ollama API is accessible at the configured address
- Make sure you have the required models installed (`llama3.2`, etc.)
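Reading the address override in Python could look like this (a sketch; the helper name `get_ollama_host` is hypothetical, and the project may resolve the setting differently):

```python
import os

# Default matches Ollama's standard local address; override via .env / environment.
DEFAULT_OLLAMA_HOST = "http://localhost:11434"

def get_ollama_host() -> str:
    """Resolve the Ollama base URL, preferring the OLLAMA_HOST variable."""
    return os.environ.get("OLLAMA_HOST", DEFAULT_OLLAMA_HOST)

os.environ["OLLAMA_HOST"] = "http://192.168.1.50:11434"
print(get_ollama_host())  # → http://192.168.1.50:11434
```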
Common solutions:
- Reset the application data if encountering UI issues
- Check network connectivity for webhook and scraping operations
- Verify that required LLM models are downloaded in Ollama
- Credentials are stored in environment variables, not hardcoded
- Webhook connections use HTTPS for secure data transmission
- Logs are segregated by component for better auditing
Contributions are welcome! Please feel free to submit a Pull Request.
Copyright (c) 2025 David Urdiales

This project is licensed under the MIT License.
- Streamlit for the UI framework
- Ollama for local LLM capabilities
- Bright Data for web scraping infrastructure
- Selenium for browser automation
- LangChain for LLM integration




