# Python Utils and Tools

A comprehensive collection of Python utilities and tools for various tasks including transcription, RAG (Retrieval-Augmented Generation), web scraping, and general utilities.
## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/python-utils-and-tools.git
cd python-utils-and-tools

# Install the package
pip install -e .
```
## Transcription

Transcribe and translate audio/video files using OpenAI's Whisper model.

Quick Usage:

```python
from pyut.transcribe import main as transcribe_main

config = {
    "MODEL": "base",
    "GPU": True,
    "FILES": [
        {
            "FILE_PATH": "path/to/your/file.mp4",
            "FILE_LANGUAGE": "en",
            "TRANSLATE": False,
            "TIMESTAMP": True
        }
    ]
}

transcribe_main.convert_to_text(config)
```
## RAG (Retrieval-Augmented Generation)

Process, embed, and manage data for retrieval-augmented generation tasks.

Quick Usage:

```python
from pyut.rag import combine_json, embed, upload_records

# Combine JSON files
combine_json.process_files("input_dir", "output.json")

# Generate embeddings
embed.make_metadata("input.json", "metadata.json")
embed.add_embedding("metadata.json", "embeddings.json")

# Upload to vector database
upload_records.supabase_upload_records("embeddings.json", "collection_name")
```
## Web Scraping

Extract and process data from web sources with JavaScript rendering support.

Quick Usage:

```python
from pyut.web_scrapping import WebScraper

scraper = WebScraper(delay=2, number_of_urls_to_scrape=10)
urls = ["https://example.com/page1", "https://example.com/page2"]
scraping_states = [False] * len(urls)
scraper.scrape_urls(urls, scraping_states, subdir="category1")
```
## Utilities

Common utilities for file operations, logging, and time estimation.

Quick Usage:

```python
from pyut.utils import TimeEstimator, FileSystemProcessor, logger

# Time estimation
estimator = TimeEstimator(number_of_iterations=100)
estimator.start_iteration()
# Your code here
estimator.update_processing_time()

# File processing
fsp = FileSystemProcessor(root_dir="data")
data = fsp.load_json("input.json")
fsp.save_json("output.json", data, backup=True)

# Logging
logger.info("Processing started")
```
## Dependencies

The project uses several key dependencies:

- `rich`: Terminal formatting and progress bars
- `openai`: OpenAI API integration
- `vecs`: Vector operations
- `tiktoken`: Token counting
- `python-dotenv`: Environment variable management
- `openai-whisper`: Audio transcription
- `python-magic`: File type detection
## Setup

1. System Requirements:
   - Python 3.x
   - FFmpeg (for transcription)
   - CUDA-compatible GPU (optional, for faster processing)
   - Chrome browser (for web scraping)

2. Install the Package:

   ```bash
   pip install -e .
   ```

3. Environment Setup:
   - Copy `.env.example` to `.env` in each module directory
   - Configure your API keys and settings
## Transcription Module

A powerful tool for transcribing and translating audio/video files using OpenAI's Whisper model.

Features:

- Transcribe audio/video files to text
- Translate audio/video content to English
- Optional timestamp support for transcriptions
- Progress visualization and duration plotting
- Batch processing capabilities
- GPU acceleration support
- Detailed logging and error handling
Handles file validation and preparation:
- Checks file existence and audio stream presence
- Calculates file durations
- Generates file status reports
- Creates visualization plots of file durations
- Prepares JSON configuration for batch processing
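For illustration, the existence/format part of this validation step might look like the sketch below (`validate_file` and `MEDIA_EXTENSIONS` are hypothetical names; the real module additionally probes for an audio stream, e.g. via ffprobe, and measures durations):

```python
from pathlib import Path

# Illustrative sketch only: checks existence and a known media extension.
# The actual validator also verifies that an audio stream is present.
MEDIA_EXTENSIONS = {".mp4", ".mp3", ".wav", ".m4a", ".mkv"}

def validate_file(path: str) -> dict:
    p = Path(path)
    return {
        "path": str(p),
        "exists": p.exists(),
        "is_media": p.suffix.lower() in MEDIA_EXTENSIONS,
    }
```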
Performs the actual transcription/translation:
- Supports multiple Whisper models
- GPU acceleration when available
- Timestamp generation
- Translation to English
- Progress tracking and logging
Create a JSON configuration file with the following structure:

```jsonc
{
    "MODEL": "base",        // Whisper model size (tiny, base, small, medium, large)
    "GPU": true,            // Enable GPU acceleration
    "FILES": [
        {
            "FILE_PATH": "path/to/your/file.mp4",
            "FILE_LANGUAGE": "en",  // Source language code
            "TRANSLATE": false,     // Whether to translate to English
            "TIMESTAMP": true       // Whether to include timestamps
        }
    ]
}
```
The tool generates:
- Text files with transcriptions/translations
- Optional timestamps for each segment
- Progress visualization plots
- Success logs with processing time
- The tool automatically skips files that have already been processed
- Processing time varies based on file duration and model size
- GPU acceleration significantly improves processing speed
- Supported input formats include MP4, MP3, WAV, and other common audio/video formats
## RAG Module

A powerful module for processing, embedding, and managing data for retrieval-augmented generation tasks.

Features:

- JSON data combination and deduplication
- Text chunking and embedding generation
- Vector database integration
- Time estimation for long-running tasks
- Metadata management
- Environment-based configuration
Handles JSON data processing and deduplication:
- Combines multiple JSON files
- Removes duplicates based on specified keys
- Maintains category information
- Preserves data integrity
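The deduplication step described above can be sketched as follows (`combine_records` and the `"url"` key are illustrative assumptions; the real `combine_json.process_files` reads its records from a directory of JSON files):

```python
# Sketch of combine-and-deduplicate: merge record lists and keep the
# first occurrence for each value of the chosen key.
def combine_records(records_lists, dedup_key="url"):
    seen = set()
    combined = []
    for records in records_lists:
        for record in records:
            key = record.get(dedup_key)
            if key in seen:
                continue  # duplicate: keep only the first occurrence
            seen.add(key)
            combined.append(record)
    return combined
```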
Creates and manages vector embeddings:
- Text chunking with overlap
- Token counting and management
- Metadata generation
- Azure OpenAI integration
- Batch processing support
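The chunking-with-overlap idea can be sketched like this (illustrative only: the real module counts tokens with `tiktoken`, while this dependency-free sketch uses whitespace-separated words as stand-in tokens):

```python
# Split text into windows of at most max_tokens "tokens", with each
# window sharing `overlap` tokens with the previous one.
def chunk_tokens(text, max_tokens=500, overlap=50):
    tokens = text.split()
    chunks = []
    step = max_tokens - overlap  # assumes max_tokens > overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # last window already covers the tail
    return chunks
```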
Manages vector database operations:
- Vector database connection
- Batch record uploading
- Index creation and management
- Progress tracking
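The batch-uploading pattern might look roughly like this (`upload_in_batches` is a hypothetical helper; in the real module each batch is written to the vector database via the `vecs` client, represented here by the `send` callback):

```python
# Upload records in fixed-size batches; `send` stands in for the
# database call (e.g. a vecs collection upsert) and is injected so the
# batching logic stays testable.
def upload_in_batches(records, send, batch_size=100):
    uploaded = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        send(batch)
        uploaded += len(batch)
    return uploaded
```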
Environment variables (in `.env`):

```env
AZURE_OPENAI_KEY=your_api_key
AZURE_OPENAI_MODELID=your_model_id
OPENAI_API_VERSION=your_api_version
AZURE_OPENAI_ENDPOINT=your_endpoint

user=your_db_user
host=your_db_host
port=your_db_port
dbname=your_db_name
password=your_db_password
```
Chunking parameters:

- `max_tokens`: Maximum tokens per chunk (default: 500)
- `overlap`: Token overlap between chunks (default: 50)
The module includes comprehensive error handling for:
- API connection issues
- Database connectivity problems
- File I/O operations
- Data validation
- Environment configuration
## Web Scraping Module

A robust and efficient web scraping tool that provides advanced features for extracting and processing web content.

Features:

- JavaScript-rendered page scraping
- Intelligent content cleaning
- Markdown conversion
- Time estimation and progress tracking
- Content analysis and statistics
- Batch processing capabilities
- Organized output management
Handles web page retrieval:
- Headless Chrome browser automation
- JavaScript rendering
- Anti-detection measures
- Error handling
Processes and cleans web content:
- Removes unwanted elements (ads, scripts, etc.)
- Extracts metadata
- Converts content to Markdown
- Maintains content structure
Main scraping orchestration:
- URL processing
- Content extraction
- File management
- Progress tracking
The scraper generates multiple output formats for each URL:

- `.txt`: Full content with metadata
- `.html`: Original HTML content
- `.md`: Markdown version of the content
- JSON summary of all scraped content
The HTML cleaner removes:
- Scripts and styles
- Images and SVGs
- Headers and footers
- Navigation elements
- Ad-related content
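A stdlib-only sketch of this cleaning step (the actual cleaner is more thorough and also handles images, SVGs, and ad-related markup; `TextExtractor` and `clean_html` are illustrative names):

```python
from html.parser import HTMLParser

# Drop text inside unwanted elements and keep the rest.
class TextExtractor(HTMLParser):
    SKIP = {"script", "style", "svg", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside an unwanted element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def clean_html(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```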
Best Practices:

1. Rate Limiting
   - Set appropriate delays between requests
   - Respect robots.txt
   - Use batch processing for large datasets

2. Content Processing
   - Verify content extraction
   - Check output formats
   - Monitor file sizes

3. Resource Management
   - Close browser instances
   - Clean up temporary files
   - Monitor memory usage
## Utilities Module

A collection of essential utility tools and helpers used across the project.

Features:

- Time estimation for long-running tasks
- File system operations
- Rich logging capabilities
- JSON data handling
- Backup and restore functionality
Provides time estimation for iterative tasks:
- Progress tracking
- Remaining time calculation
- Iteration counting
- Average processing time estimation
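The remaining-time logic might be implemented along these lines (a hypothetical `SimpleTimeEstimator`; the real `TimeEstimator` exposes the `start_iteration`/`update_processing_time` API shown earlier, but its internals may differ):

```python
import time

# Track elapsed time per iteration and project it over the remainder.
class SimpleTimeEstimator:
    def __init__(self, number_of_iterations):
        self.total = number_of_iterations
        self.done = 0
        self.elapsed = 0.0
        self._start = None

    def start_iteration(self):
        self._start = time.perf_counter()

    def update_processing_time(self):
        self.elapsed += time.perf_counter() - self._start
        self.done += 1

    def remaining_seconds(self):
        if self.done == 0:
            return None  # no completed iterations yet
        average = self.elapsed / self.done
        return average * (self.total - self.done)
```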
Handles file operations:
- JSON file loading and saving
- File backup and restore
- Directory management
- Data validation
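The backup behavior could look roughly like this (assumed semantics of the `backup` flag; `save_json_with_backup` is an illustrative name, not the module's actual API):

```python
import json
import shutil
from pathlib import Path

# Copy the existing file to <name>.bak before overwriting it.
def save_json_with_backup(path, data, backup=True):
    path = Path(path)
    if backup and path.exists():
        shutil.copy2(path, path.with_name(path.name + ".bak"))
    with path.open("w", encoding="utf-8") as fh:
        json.dump(data, fh, ensure_ascii=False, indent=2)
```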
Rich logging capabilities:
- Console output formatting
- Log level management
- Traceback handling
- Custom formatting options
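A minimal logger setup in the same spirit (assumption: the module builds on `rich.logging.RichHandler`; this sketch falls back to a plain `StreamHandler` when `rich` is unavailable):

```python
import logging

try:
    from rich.logging import RichHandler
    handler: logging.Handler = RichHandler(rich_tracebacks=True)
except ImportError:
    handler = logging.StreamHandler()  # plain fallback without rich

logger = logging.getLogger("pyut")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("Processing started")
```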
`FileSystemProcessor` options:

- `root_dir`: Base directory for operations
- `process_subdirs`: Whether to process subdirectories
- `backup`: Enable/disable automatic backups
- `append_not_overwrite`: Append to existing files
- `ensure_ascii`: ASCII encoding for JSON

Logging options:

- `rich_print_or_logs`: Output mode (`'rich_print'` or `'logs'`)
- Log level: INFO, ERROR, WARNING, etc.
- Custom formatting options
Best Practices:

1. Time Estimation
   - Initialize with accurate iteration count
   - Update processing time regularly
   - Monitor for significant deviations

2. File Operations
   - Always use backup for critical files
   - Validate JSON data before saving
   - Handle encoding properly

3. Logging
   - Use appropriate log levels
   - Include relevant context
   - Format messages consistently
## Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Author

- Mohamed Nagy - [email protected]

## Acknowledgments

- OpenAI for the Whisper model
- The open-source community for various tools and libraries