End-to-end workflow for quantizing a TinyLlama LLM CoreML model to int8 on Apple Silicon (M1/M2/M3). Script, logs, and all gotchas included.
Most open-source CoreML LLMs are only provided in float16. This means files are huge (2GB+), RAM usage is high, and iOS/macOS on-device deployment isn’t always efficient.
Official docs and forums rarely provide a working, up-to-date, step-by-step guide for post-conversion quantization (`.mlpackage`) on Mac ARM.
This repo gives you a tested Python script, setup steps, and real logs from a successful int8 quantization (May 2025, Mac ARM, Python 3.11, coremltools 8.3.0).
- Mac ARM (Apple Silicon: M1, M2, M3)
- Python 3.11+ (install with `brew install [email protected]`)
- coremltools >= 8.3.0 (`pip3 install --upgrade coremltools`)
- Download your CoreML LLM. Example: TKDKid1000/TinyLlama-1.1B-Chat-v0.3-CoreML

  ```bash
  huggingface-cli download --local-dir ./Models/TinyLlama TKDKid1000/TinyLlama-1.1B-Chat-v0.3-CoreML
  ```
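If you prefer to stay in Python, the same snapshot can be fetched with the `huggingface_hub` library (assumes `pip3 install huggingface_hub`; not part of the original workflow):

```python
# Alternative to huggingface-cli: download the repo snapshot from Python.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TKDKid1000/TinyLlama-1.1B-Chat-v0.3-CoreML",
    local_dir="./Models/TinyLlama",
)
```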
- Copy `quantize_coreml.py` into the directory containing `float16_model.mlpackage`
- Run the script:

  ```bash
  python3.11 quantize_coreml.py
  ```
Contents of `quantize_coreml.py`:

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

float16_model_path = "float16_model.mlpackage"
quant8_model_path = "quant8_model.mlpackage"

print(f"🔄 Loading float16 model: {float16_model_path}")
model = ct.models.MLModel(float16_model_path)

# Symmetric linear int8 weight quantization, applied globally to all ops
op_config = OpLinearQuantizerConfig(mode="linear_symmetric")
config = OptimizationConfig(global_config=op_config)

print("⚡ Quantizing to 8 bits (int8)...")
try:
    quant8_model = linear_quantize_weights(model, config=config)
    quant8_model.save(quant8_model_path)
    print(f"✅ Quantized int8 model saved: {quant8_model_path}")
except Exception as e:
    print(f"❌ Quantization failed: {e}")
```
- File size reduction: from ~2.2GB (`float16`) to roughly half of that (`quant8`), since int8 weights take half the bytes of float16
- Works on device: iOS 17 / macOS Sonoma, Apple Silicon
- int4 quantization: only available for iOS 18+ (will fail on current OS/devices)
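To confirm which OS generation a saved package targets, you can read the Core ML specification version from Python. A quick check, not part of the original script; to my knowledge spec version 8 corresponds to the iOS 17 / macOS Sonoma op set and 9 to iOS 18:

```python
import coremltools as ct

# Read the Core ML specification version from the saved package.
spec = ct.models.MLModel("quant8_model.mlpackage").get_spec()
print("specification version:", spec.specificationVersion)  # 8 == iOS 17 ops
```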
- Ignore warnings like “inf/-inf not supported by quantization. Skipped.”
- 4-bit quantization (int4) is only supported on iOS 18+ and newer Core ML; a hedged config sketch follows.
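For reference, here is what the int4 variant would look like: a minimal sketch, assuming coremltools 8's `dtype="int4"` option on `OpLinearQuantizerConfig` and a model that targets iOS 18 / macOS 15 or newer. Untested here, and it will fail on older targets as noted above:

```python
# Hypothetical int4 variant (iOS 18+ / newer Core ML only); untested sketch.
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

model = ct.models.MLModel("float16_model.mlpackage")

# Assumption: per-block int4 weight quantization as documented for coremltools 8;
# the source model must target iOS 18 / macOS 15 for 4-bit ops to be accepted.
op_config = OpLinearQuantizerConfig(
    mode="linear_symmetric",
    dtype="int4",
    granularity="per_block",
    block_size=32,
)
config = OptimizationConfig(global_config=op_config)

quant4_model = linear_quantize_weights(model, config=config)
quant4_model.save("quant4_model.mlpackage")
```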
- If you get `ModuleNotFoundError: No module named 'coremltools.optimize.coreml'`, update coremltools and use Python 3.11+ on Mac ARM.
- Make sure you run the script inside the directory containing `float16_model.mlpackage` (or adapt the path).
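A quick way to verify the environment before digging further (nothing here is specific to this repo):

```python
# Print the interpreter version, CPU architecture, and coremltools version.
import platform
import sys

import coremltools as ct

print(sys.version)         # expect 3.11+
print(platform.machine())  # expect 'arm64' on Apple Silicon
print(ct.__version__)      # expect 8.3.0 or newer
```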
- Feel free to open issues or PRs!
- Share this repo on Reddit, Hugging Face forums, or Discord to help others.
Created by GreenBull31 (Morgan CAMILLERI)