Edge-LLM: Optimized Qwen2.5-3B with GPTQ ⚙️🧠

Edge-LLM is a lightweight and efficient deployment of the Qwen2.5-3B model, quantized using GPTQ to enable fast, low-memory inference on edge devices. This project is designed for real-time applications that require high-performance language understanding in constrained environments such as consumer GPUs or embedded systems.

📌 Current Scope (v0.1)

  • ✅ Qwen2.5-3B model integrated and quantized with GPTQ (4-bit precision).
  • ✅ Achieved 66.5% size reduction: 5.75GB ➝ 1.93GB.
  • ✅ Inference time reduced from 7.29 s ➝ 5.99 s (~18% faster).
  • ✅ Optimized to run on consumer-grade edge GPUs (e.g., RTX 3050).
  • ✅ Benchmarking and logging scripts integrated.
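The repository does not include the quantization script itself, but the 4-bit GPTQ step above can be sketched with AutoGPTQ roughly as follows. The Hub model id, group size, and calibration text here are illustrative assumptions, not the project's exact settings:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "Qwen/Qwen2.5-3B"  # assumed Hub id for the base model

# 4-bit quantization config; group_size=128 is a common default, not confirmed by the repo
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# GPTQ needs a small calibration set; these sentences are placeholders
calibration_texts = [
    "Edge devices run language models under tight memory budgets.",
    "Black holes are regions of space where gravity is extremely strong.",
]
examples = [tokenizer(t, return_tensors="pt") for t in calibration_texts]

model.quantize(examples)                              # run GPTQ on the calibration batch
model.save_quantized("models/Qwen2.5-3B-GPTQ")        # writes the ~1.93 GB checkpoint
```

Running this requires a CUDA GPU and downloads the full 5.75 GB base model, so it is a one-time offline step rather than part of inference.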

🚧 Under Development

  • ⏳ Hugging Face integration for seamless model download.
  • ⏳ ONNX export and TensorRT optimization.
  • ⏳ Quantization-aware fine-tuning to reduce response drift.
  • ⏳ Comparative analysis: Qwen2.5-3B vs Phi-2 vs LLaMA2-7B on edge.

🔮 Future Roadmap

  • 🤖 Edge Agent with vision+text multimodal capability.
  • 📲 Integration with Android/iOS for mobile inference.
  • ⚡ LoRA fine-tuning pipeline for domain-specific compression.
  • 🧪 Evaluation suite with perplexity, latency, and accuracy tracking.
  • 🛰️ Federated deployment across IoT medical/industrial nodes.

🧱 Tech Stack

  • Python 3.10+
  • PyTorch + Transformers
  • GPTQ (AutoGPTQ)
  • Hugging Face 🤗 Hub
  • Git LFS
  • CUDA 11.8+

📂 Project Structure

Edge-LLM/
├── models/                      # Quantized model repo (git-lfs tracked)
├── scripts/
│   ├── benchmark.py             # Inference benchmarking script
│   └── inference.py             # Lightweight inference API
├── results/                     # Output logs and generated responses
├── requirements.txt             # Dependencies
└── README.md                    # This file

🚀 Setup & Installation

# Clone repo
git clone https://github.com/STiFLeR7/Edge-LLM.git
cd Edge-LLM

# Create virtual environment
python -m venv env
source env/bin/activate  # For Windows: .\env\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download quantized model
git lfs install
git clone https://huggingface.co/<your-hf-repo> models/Qwen2.5-3B-GPTQ

📊 Benchmark Results

| Metric         | Pre-Quantization | Post-Quantization |
|----------------|------------------|-------------------|
| Model Size     | 5.75 GB          | 1.93 GB           |
| Inference Time | 7.29 s           | 5.99 s            |
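The reported numbers can be reproduced with simple helpers like the ones below (a minimal sketch, not the repo's actual `benchmark.py`; `generate_fn` stands in for whatever callable wraps model generation):

```python
import time

def benchmark(generate_fn, prompt, runs=3):
    """Time a generation callable over several runs and return the mean latency in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn(prompt)
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)

def size_reduction_pct(before_gb, after_gb):
    """Percent size reduction, e.g. 5.75 GB -> 1.93 GB gives ~66.4%."""
    return 100.0 * (1.0 - after_gb / before_gb)
```

Averaging over several runs matters on edge GPUs, where the first call typically pays one-off CUDA kernel and cache warm-up costs.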

🏃 Running Inference

python scripts/benchmark.py

Expected Output:

🔹 Generated Response:
Black holes are regions of space where gravity is so strong that nothing, not even light, can escape...
⏳ Inference Time: ~5.99s
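For a single generation outside the benchmark harness, a minimal call consistent with `scripts/inference.py` might look like this (a sketch; the actual script may differ, and loading a GPTQ checkpoint through Transformers additionally requires the `optimum` and `auto-gptq` packages):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "models/Qwen2.5-3B-GPTQ"  # local path created during setup

tokenizer = AutoTokenizer.from_pretrained(model_dir)
# device_map="auto" places the 4-bit weights on the available GPU (needs `accelerate`)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

prompt = "Explain black holes in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```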

🤝 Contributors

  • Stifler – Researcher & Developer @ NIMS | AI/ML/DL | CudaBit Tech Lead
  • Contributions welcome! If you're passionate about model compression, edge deployment, or LLM optimization, feel free to open issues or submit PRs.

📜 License

This project is licensed under the MIT License. See LICENSE for more details.
