A minimal CLI that runs Llama-3.2-1B-Instruct locally in Rust on Apple Silicon (Metal), built on top of candle and Hugging Face's GGUF tooling.
The whole program is one ~90-line main.rs: load the tokenizer + a 4-bit quantized GGUF (~1 GB), feed in a chat-formatted prompt, and stream tokens out until <|end_of_text|> or <|eot_id|>.
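To make "chat-formatted prompt" concrete, here is a minimal sketch of wrapping a single user message in the Llama-3 instruct template. The function name `format_llama3_prompt` is hypothetical. The sketch assumes a single turn with no system prompt, and the actual main.rs may build the string differently.

```rust
/// Wrap one user message in the Llama-3 instruct chat template.
/// Assumption: single turn, no system prompt (hypothetical helper, not taken from main.rs).
fn format_llama3_prompt(user_message: &str) -> String {
    format!(
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n\
         {}<|eot_id|>\
         <|start_header_id|>assistant<|end_header_id|>\n\n",
        user_message
    )
}
```

The string ends with an open assistant header, so the model's first generated token starts the reply. If the tokenizer is configured to add special tokens on its own, the leading `<|begin_of_text|>` may need to be dropped to avoid a duplicate.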
cargo run --release -- --prompt "Explain RoPE in three sentences." -n 200

-p, --prompt <PROMPT>      Prompt to generate from
-n, --sample-len <N>       Max tokens to generate (default: 200)
-t, --temperature <T>      Sampling temperature (default: 0.8)
First run downloads two artifacts from the Hugging Face Hub:
- unsloth/Llama-3.2-1B-Instruct → tokenizer.json
- bartowski/Llama-3.2-1B-Instruct-GGUF → Llama-3.2-1B-Instruct-Q4_K_M.gguf
They're cached under the standard ~/.cache/huggingface directory.
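If you want to inspect or delete the downloaded files, the sketch below computes where a repo is expected to land inside the cache. It assumes the usual Hub layout (`hub/models--<org>--<name>` under the cache root, with files under a `snapshots/<revision>/` subdirectory). The helper name `hub_repo_cache_dir` is hypothetical, and the cache root can move if `HF_HOME` is set.

```rust
use std::path::PathBuf;

/// Expected cache directory for a model repo such as "unsloth/Llama-3.2-1B-Instruct".
/// Assumption: standard Hub cache layout; `cache_root` is e.g. "$HOME/.cache/huggingface".
fn hub_repo_cache_dir(cache_root: &str, repo_id: &str) -> PathBuf {
    // "org/name" becomes "models--org--name"
    let folder = format!("models--{}", repo_id.replace('/', "--"));
    PathBuf::from(cache_root).join("hub").join(folder)
}
```

For example, with a cache root of `/home/u/.cache/huggingface`, the tokenizer repo would be expected at `/home/u/.cache/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct`.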
prompt (formatted with Llama-3 chat tokens)
│
▼
tokenizer.json ──► tokens (Vec<u32>)
│
▼
ModelWeights::from_gguf (Q4_K_M, ~1 GB)
│
▼
For each new token:
• slice the most recent `context_size` tokens
• Tensor::new(...).unsqueeze(0)
• model.forward(input, current_pos) → logits
• LogitsProcessor::sample(logits) → next_token
• print + push, advance current_pos
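The per-token loop in the diagram can be sketched without any model at all. In the sketch below, `forward` is a hypothetical stand-in for `model.forward` followed by `LogitsProcessor::sample`: it takes the context slice and the start position and returns the next token id. The function name `generate` and its shape are assumptions for illustration, not the code in main.rs.

```rust
/// Llama-3 stop tokens: <|end_of_text|> (128001) and <|eot_id|> (128009).
const STOP_TOKENS: [u32; 2] = [128001, 128009];

/// Decode loop over any `forward(context, start_pos) -> next_token` function.
/// `forward` is a hypothetical stand-in for model.forward + sampling.
fn generate<F>(prompt_tokens: &[u32], sample_len: usize, mut forward: F) -> Vec<u32>
where
    F: FnMut(&[u32], usize) -> u32,
{
    let mut tokens = prompt_tokens.to_vec();
    let mut generated = Vec::new();
    let mut current_pos = 0;
    for step in 0..sample_len {
        // The first step feeds the whole prompt; later steps feed only the
        // newest token and rely on the KV cache for everything before it.
        let context_size = if step == 0 { tokens.len() } else { 1 };
        let context = &tokens[tokens.len() - context_size..];
        let next = forward(context, current_pos);
        // Advance by the number of tokens the model just consumed.
        current_pos += context.len();
        if STOP_TOKENS.contains(&next) {
            break;
        }
        tokens.push(next);
        generated.push(next);
    }
    generated
}
```

With a three-token prompt, the stand-in model sees `(3 tokens, pos 0)`, then `(1 token, pos 3)`, `(1 token, pos 4)`, and so on, until it returns a stop token or `sample_len` steps have run.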
Inference runs on Metal when available (Device::new_metal(0)) and falls back to CPU otherwise.
- candle-core, candle-nn, candle-transformers 0.10.2 with the metal feature
- tokenizers 0.22
- hf-hub for model/tokenizer downloads
- clap for the CLI
- Generation uses a simple temperature-based sampler (no top-k / top-p / repetition penalty).
- The KV cache is implicit in model.forward(..., current_pos): the cached context grows by one token each step after the prompt is consumed.
- Stop tokens are hard-coded for Llama-3 (128001 for <|end_of_text|>, 128009 for <|eot_id|>).
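
As an illustration of what a plain temperature sampler does, here is a minimal std-only sketch. It is not candle's `LogitsProcessor` implementation. The function name `sample_with_temperature` is hypothetical, and the caller supplies the uniform random number `u` in [0, 1) so the example needs no RNG crate.

```rust
/// Sample a token index from `logits` after temperature scaling.
/// Assumption: `u` is a uniform random number in [0, 1) supplied by the caller.
fn sample_with_temperature(logits: &[f32], temperature: f64, u: f64) -> usize {
    assert!(!logits.is_empty(), "logits must not be empty");
    // Near-zero temperature degenerates to greedy decoding (argmax).
    if temperature < 1e-7 {
        return logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i)
            .unwrap();
    }
    // Softmax over logits / temperature, shifted by the max for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max) as f64;
    let weights: Vec<f64> = logits
        .iter()
        .map(|&l| ((l as f64 - max) / temperature).exp())
        .collect();
    let total: f64 = weights.iter().sum();
    // Walk the cumulative distribution until it passes `u`.
    let mut cumulative = 0.0;
    for (i, w) in weights.iter().enumerate() {
        cumulative += w / total;
        if u < cumulative {
            return i;
        }
    }
    logits.len() - 1 // guard against rounding when `u` is very close to 1
}
```

Lower temperatures sharpen the distribution toward the highest logit, and higher ones flatten it. Because nothing here truncates the tail (no top-k or top-p), any token with non-zero probability can be drawn.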