This repository contains Python utilities and tools for web scraping, data processing, and other helper tasks. The tools are modular, and each one handles a specific task.
- Clone the repository:
git clone https://github.com/mohamednaji7/python-utils-and-tools.git python-utils-and-tools
- Navigate to the project directory:
cd python-utils-and-tools/src
- Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate
- Install the dependencies:
pip install -r requirements.txt
The web-scrapping tool is designed for scraping data from websites. It uses Selenium to render pages and BeautifulSoup to extract content.
cd web-scrapping
python3 main.py
The rag/ directory contains scripts for processing and embedding data for retrieval-augmented generation tasks.
- Combine JSON Files: Combines multiple JSON files into a single deduplicated file.
python3 rag/combine-json.py
- Embed Metadata: Generates embeddings for text data and stores metadata.
python3 rag/embed.py
- Upload Records: Uploads records to a vector database for retrieval tasks.
python3 rag/upload_records.py
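The Combine JSON Files step above can be sketched as follows. This is a minimal illustration, not the contents of `rag/combine-json.py`: the function name `combine_json_files`, its parameters, and the choice to deduplicate on whole-record equality are all assumptions made for the example.

```python
import json
from pathlib import Path


def combine_json_files(input_dir, output_path):
    """Merge every *.json file in input_dir into one deduplicated list.

    Hypothetical helper: the real script may use different names,
    paths, or a different notion of "duplicate".
    """
    seen = set()
    combined = []
    # sorted() resolves the file list up front and keeps the order stable.
    for path in sorted(Path(input_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Accept either a list of records or a single record per file.
        records = data if isinstance(data, list) else [data]
        for record in records:
            # A canonical serialization makes key order irrelevant.
            key = json.dumps(record, sort_keys=True, ensure_ascii=False)
            if key not in seen:
                seen.add(key)
                combined.append(record)
    with Path(output_path).open("w", encoding="utf-8") as f:
        json.dump(combined, f, ensure_ascii=False, indent=2)
    return combined
```

Records are kept in first-seen order, so the output is reproducible across runs as long as the input files do not change.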
The utils/ directory contains helper scripts for file operations, logging, and time estimation.
- File Processor: Handles file operations like loading and saving JSON files.
python3 utils/file_processor.py
- Logger: Provides rich logging capabilities for debugging and monitoring.
python3 utils/logger.py
- Time Estimator: Estimates the time required for iterative tasks.
python3 utils/time_estimator.py
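One common way to estimate time for an iterative task, as the Time Estimator above does, is to multiply the average duration of completed steps by the number of steps left. The sketch below assumes that approach; the class name `TimeEstimator` and its methods are hypothetical and may not match `utils/time_estimator.py`.

```python
import time


class TimeEstimator:
    """Estimate remaining time from the average duration of completed steps.

    Hypothetical sketch; the repository's own estimator may work differently.
    """

    def __init__(self, total_steps):
        if total_steps <= 0:
            raise ValueError("total_steps must be positive")
        self.total_steps = total_steps
        self.completed = 0
        self.start = time.perf_counter()

    def step(self):
        """Record that one more iteration has finished."""
        self.completed += 1

    def remaining_seconds(self, now=None):
        """Return the estimated seconds left, or None before the first step."""
        if self.completed == 0:
            return None
        now = time.perf_counter() if now is None else now
        average = (now - self.start) / self.completed
        return average * (self.total_steps - self.completed)


if __name__ == "__main__":
    estimator = TimeEstimator(total_steps=5)
    for _ in range(5):
        time.sleep(0.01)  # stand-in for real work
        estimator.step()
        print(f"about {estimator.remaining_seconds():.2f}s remaining")
```

A plain running average is simple but slow to react when step durations drift; a moving window over recent steps is a possible refinement.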
- Follow the .gitignore rules to avoid committing unnecessary files (e.g., __pycache__/, *.pyc, venv/).
- Back up your data files before running scripts that modify them.
- Ensure all environment variables are set correctly for scripts requiring API keys or database connections.
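A quick check for the environment variables mentioned above might look like the sketch below. The helper `missing_env_vars` and the variable names `VECTOR_DB_API_KEY` and `VECTOR_DB_URL` are placeholders invented for illustration; substitute whatever names the scripts you run actually read.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    # Placeholder names, not taken from the repository.
    missing = missing_env_vars(["VECTOR_DB_API_KEY", "VECTOR_DB_URL"])
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
    print("All required environment variables are set.")
```

Running a check like this before a long job fails fast instead of partway through an upload.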
Contributions are welcome! Please fork the repository and submit a pull request with your changes.