This project provides a complete pipeline for fine-tuning the Qwen2.5-VL-7B-Instruct model on Arabic image captions using LlamaFactory.
Features:
- Complete fine-tuning pipeline using LlamaFactory
- LoRA (Low-Rank Adaptation) for efficient training
- Support for both standard and conservative configurations
- Automated dataset preparation from Excel files
- Model evaluation and caption generation
- Google Colab and local environment support
Hardware requirements:
- Minimum: 12 GB VRAM (RTX 3080/4080, Tesla T4)
- Recommended: 16 GB+ VRAM (RTX 4090, A100)
- 32 GB+ system RAM
- ~50 GB free disk space

Software requirements:
- Python 3.8+
- CUDA 11.8+ or 12.x
- Git
- Install dependencies:
```bash
pip install -r requirements_finetune.txt
```
- The setup script will install LlamaFactory automatically
Project structure:
```
arabic-image-captioning-finetune/
├── finetune_trainer.py          # Main trainer class
├── finetune_config.py           # Configuration settings
├── finetune_utils.py            # Utility functions
├── setup_training.py            # Setup script
├── run_training.py              # Simple training runner
├── evaluate_model.py            # Model evaluation script
├── requirements_finetune.txt    # Dependencies
└── README_FINETUNE.md           # This file
```
In Google Colab:
- Setup and prepare data:
```bash
# In Colab cell
!python setup_training.py --colab
```
- Start training:
```bash
!python run_training.py --colab
```

For local training:
- Prepare your data structure:
```
your_base_dir/
├── Train/
│   ├── TrainSubtask2.xlsx   # Excel file with image names and Arabic descriptions
│   └── images/              # Training images
└── Test/
    └── images/              # Test images (optional)
```
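Before running setup, it can help to sanity-check this layout. A minimal sketch (the `check_data_layout` helper is hypothetical, not part of this repo):

```python
from pathlib import Path

def check_data_layout(base_dir):
    """Return a list of problems with the expected Train/Test layout."""
    base = Path(base_dir)
    problems = []
    if not (base / "Train" / "TrainSubtask2.xlsx").is_file():
        problems.append("missing Train/TrainSubtask2.xlsx")
    if not (base / "Train" / "images").is_dir():
        problems.append("missing Train/images/")
    # Test/images is optional, so only flag it if Test/ exists without it
    test_dir = base / "Test"
    if test_dir.is_dir() and not (test_dir / "images").is_dir():
        problems.append("Test/ exists but has no images/ subfolder")
    return problems
```

An empty returned list means the layout matches the tree above.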
- Setup:
```bash
python setup_training.py \
    --base_dir /path/to/your/data \
    --excel_file /path/to/TrainSubtask2.xlsx \
    --images_dir /path/to/images
```
- Start training:
```bash
python run_training.py --base_dir /path/to/your/data
```

Standard Configuration (for 16GB+ VRAM):
- LoRA rank: 8
- Batch size: 1
- Gradient accumulation: 16 steps
Conservative Configuration (for 12GB VRAM):
- LoRA rank: 4
- Batch size: 1
- Gradient accumulation: 32 steps
- Reduced dataloader and preprocessing workers
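The rank trade-off comes from how LoRA works: instead of updating the full weight matrix, it learns a low-rank product scaled by `lora_alpha / lora_rank`, so halving the rank roughly halves adapter memory. A minimal numpy sketch of the idea (toy sizes, not the actual PEFT implementation):

```python
import numpy as np

# Toy dimensions; real Qwen2.5-VL projection layers are much larger
d_in, d_out, rank, alpha = 64, 64, 8, 16

W = np.zeros((d_out, d_in))             # frozen base weight (stand-in)
A = np.random.randn(rank, d_in) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))             # trainable up-projection, zero-init

# LoRA's effective weight during training: W + (alpha / rank) * B @ A
W_eff = W + (alpha / rank) * B @ A

# Adapter size grows linearly with rank: rank * (d_in + d_out) parameters
adapter_params = A.size + B.size
```

Because `B` starts at zero, the model is initially identical to the base model; training only moves the small `A` and `B` matrices.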
Key parameters you can adjust in `finetune_config.py`:
```python
TRAINING_CONFIG = {
    "lora_rank": 8,            # LoRA rank (4-16)
    "lora_alpha": 16,          # LoRA alpha
    "learning_rate": 2.0e-5,   # Learning rate
    "num_train_epochs": 15.0,  # Number of epochs
    "warmup_ratio": 0.1,       # Warmup ratio
    # ... more options
}
```

Your `TrainSubtask2.xlsx` should have the following columns:
- `File Name`: image filename (without extension)
- `Description`: Arabic caption for the image
| File Name | Description |
|---|---|
| IMG001 | صورة تاريخية تظهر مدينة القدس القديمة |
| IMG002 | مشهد من الحياة اليومية في فلسطين |
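During dataset preparation, rows like these are converted to LlamaFactory's conversation-style multimodal format. A hedged sketch of that conversion (the exact JSON layout and user prompt used by `finetune_utils.py` may differ; in the real pipeline the rows would come from `pandas.read_excel`):

```python
# Toy rows standing in for the Excel sheet
rows = [
    {"File Name": "IMG001", "Description": "صورة تاريخية تظهر مدينة القدس القديمة"},
    {"File Name": "IMG002", "Description": "مشهد من الحياة اليومية في فلسطين"},
]

def to_llamafactory(rows, images_dir="images", ext=".jpg"):
    """Convert (File Name, Description) rows to conversation-style records."""
    records = []
    for row in rows:
        records.append({
            "messages": [
                # Placeholder prompt ("Describe this image."), not the repo's actual one
                {"role": "user", "content": "<image>صف هذه الصورة."},
                {"role": "assistant", "content": row["Description"]},
            ],
            "images": [f"{images_dir}/{row['File Name']}{ext}"],
        })
    return records

dataset = to_llamafactory(rows)
```

Each record pairs one image path with a user/assistant exchange, which is the shape LlamaFactory's multimodal datasets expect.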
The pipeline runs these stages in order:
1. Environment Setup: installs LlamaFactory and dependencies
2. Dataset Preparation: converts Excel data to LlamaFactory format
3. Dataset Registration: registers the dataset with LlamaFactory
4. Configuration Creation: generates the YAML training config
5. Training: runs LoRA fine-tuning
6. Evaluation: tests the model on validation/test images
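These stages can be driven end-to-end from a small wrapper; a sketch using the entry-point scripts and flags shown in this README (the `build_pipeline`/`run_pipeline` helpers are hypothetical, not part of the repo):

```python
import subprocess
import sys

def build_pipeline(base_dir, excel_file, images_dir):
    """Assemble the stage commands in execution order."""
    return [
        [sys.executable, "setup_training.py",
         "--base_dir", base_dir,
         "--excel_file", excel_file,
         "--images_dir", images_dir],
        [sys.executable, "run_training.py", "--base_dir", base_dir],
        [sys.executable, "evaluate_model.py", "--base_dir", base_dir],
    ]

def run_pipeline(base_dir, excel_file, images_dir):
    """Run setup, training, and evaluation, stopping on the first failure."""
    for cmd in build_pipeline(base_dir, excel_file, images_dir):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```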
Training outputs are saved to:
- Model checkpoints: `{output_dir}/checkpoint-{step}/`
- Training logs: console output with loss curves
- Configuration: `{base_dir}/qwen_arabic_*.yaml`
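Since checkpoints are named by global step, the most recent one can be picked numerically. A small sketch (assuming only `checkpoint-{step}` directories exist in the output directory):

```python
from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the checkpoint-{step} directory with the highest step, or None."""
    ckpts = [p for p in Path(output_dir).glob("checkpoint-*") if p.is_dir()]
    if not ckpts:
        return None
    # Sort numerically so checkpoint-100 beats checkpoint-50
    return max(ckpts, key=lambda p: int(p.name.split("-")[-1]))
```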
```bash
# Evaluate latest checkpoint
python evaluate_model.py --base_dir /path/to/data

# Evaluate specific checkpoint
python evaluate_model.py \
    --base_dir /path/to/data \
    --checkpoint checkpoint-50 \
    --max_images 100

# List available checkpoints
python evaluate_model.py \
    --base_dir /path/to/data \
    --list_checkpoints
```

Evaluation generates:
- `generated_arabic_captions.json`: detailed results
- `fine_tune_generated_arabic_captions.csv`: results in CSV format
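One quick way to eyeball the results after evaluation; the record layout below is a guess, so check the fields actually written by `evaluate_model.py`:

```python
import json  # the real file would be loaded with json.load(open(...))

# Hypothetical record layout for generated_arabic_captions.json
results = [
    {"image": "IMG001.jpg", "caption": "صورة تاريخية تظهر مبنى قديم"},
    {"image": "IMG002.jpg", "caption": "مشهد من الحياة اليومية"},
]

# Average caption length in words, a quick sanity signal on output quality
avg_words = sum(len(r["caption"].split()) for r in results) / len(results)
```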
CUDA Out of Memory:
- Use the conservative configuration: `--conservative`
- Reduce the batch size in the config
- Enable gradient checkpointing

Dataset Loading Errors:
- Verify image paths are correct
- Check the Excel file format
- Ensure images are not corrupted

LlamaFactory Installation Issues:
- Install from source: `pip install -e ".[torch,metrics]"`
- Check PyTorch compatibility
For limited VRAM:
```python
# Use these settings in custom_config
custom_config = {
    "lora_rank": 4,
    "gradient_accumulation_steps": 64,
    "dataloader_num_workers": 0,
    "preprocessing_num_workers": 1,
}
```

- Use FP16: enabled by default; reduces memory usage
- Gradient checkpointing: trades compute for memory
- LoRA settings: lower rank means less memory but potentially lower quality
- Batch size: increase gradient accumulation instead of batch size
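The last tip works because the optimizer step sees the product of per-device batch size and accumulation steps. A quick check of the effective batch sizes implied by the two configurations above:

```python
def effective_batch(per_device_batch, grad_accum_steps, num_gpus=1):
    """Samples seen per optimizer step: batch * accumulation * GPU count."""
    return per_device_batch * grad_accum_steps * num_gpus

standard = effective_batch(1, 16)      # standard configuration
conservative = effective_batch(1, 32)  # conservative configuration
```

Both configurations keep the per-device batch at 1 to limit peak VRAM; the conservative one simply accumulates twice as long per optimizer step.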
The fine-tuned model generates Arabic captions in the style of your training data. Example output:

```
Input: Image of a historical building
Output: صورة تاريخية تظهر مبنى قديم في القدس
```

(The output reads: "A historical image showing an old building in Jerusalem.")
This project uses the Qwen2.5-VL model and LlamaFactory. Please refer to their respective licenses for usage terms.