rust-micro-llm

A minimal CLI that runs Llama-3.2-1B-Instruct locally in Rust on Apple Silicon (Metal), built on top of candle and Hugging Face's GGUF tooling.

The whole program is one ~90-line main.rs: it loads the tokenizer and a 4-bit quantized GGUF (~1 GB), feeds in a chat-formatted prompt, and streams tokens out until the model emits <|end_of_text|> or <|eot_id|>, or the --sample-len limit is reached.
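
As an illustration of what "chat-formatted" means here, the sketch below wraps a single user message in the Llama-3 instruct template. The helper name format_llama3_prompt is hypothetical, and the actual main.rs may differ (for example by adding a system message, or by leaving <|begin_of_text|> to the tokenizer).

```rust
/// Wrap one user message in the Llama-3 instruct chat template.
/// `format_llama3_prompt` is a hypothetical name, not taken from the repository.
fn format_llama3_prompt(user_message: &str) -> String {
    format!(
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )
}

fn main() {
    // The prompt ends with an open assistant header, so the model's
    // continuation is the assistant's reply.
    let prompt = format_llama3_prompt("Explain RoPE in three sentences.");
    print!("{prompt}");
}
```

Ending the prompt on the assistant header is what lets generation stop cleanly on <|eot_id|>, which marks the end of the assistant turn.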

Usage

cargo run --release -- --prompt "Explain RoPE in three sentences." -n 200

-p, --prompt <PROMPT>     Prompt to generate from
-n, --sample-len <N>      Max tokens to generate (default: 200)
-t, --temperature <T>     Sampling temperature (default: 0.8)

First run downloads two artifacts from the Hugging Face Hub:

unsloth/Llama-3.2-1B-Instruct → tokenizer.json
bartowski/Llama-3.2-1B-Instruct-GGUF → Llama-3.2-1B-Instruct-Q4_K_M.gguf

They're cached under the standard ~/.cache/huggingface directory.
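
To locate (or delete) the cached files, it helps to know how the Hub cache names repository folders. The sketch below assumes the usual layout, a hub subdirectory holding one models--<org>--<name> folder per repository, and the default cache root; both helper functions are hypothetical and std-only, and do not call hf-hub itself.

```rust
use std::path::PathBuf;

/// Folder name used for a model repo in the Hub cache, e.g.
/// "unsloth/Llama-3.2-1B-Instruct" -> "models--unsloth--Llama-3.2-1B-Instruct".
fn repo_folder_name(repo_id: &str) -> String {
    format!("models--{}", repo_id.replace('/', "--"))
}

/// Directory that holds a repo's cached files under a given cache root.
/// Assumption: repos live in a `hub` subdirectory of the root.
fn repo_cache_dir(cache_root: &str, repo_id: &str) -> PathBuf {
    PathBuf::from(cache_root)
        .join("hub")
        .join(repo_folder_name(repo_id))
}

fn main() {
    // Assumes HOME is set and the default cache location is in use.
    let home = std::env::var("HOME").unwrap_or_else(|_| ".".to_string());
    let root = format!("{home}/.cache/huggingface");
    for repo in [
        "unsloth/Llama-3.2-1B-Instruct",
        "bartowski/Llama-3.2-1B-Instruct-GGUF",
    ] {
        println!("{}", repo_cache_dir(&root, repo).display());
    }
}
```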

How it works

prompt (formatted with Llama-3 chat tokens)
   │
   ▼
tokenizer.json  ──►  tokens (Vec<u32>)
                            │
                            ▼
        ModelWeights::from_gguf  (Q4_K_M, ~1 GB)
                            │
                            ▼
        For each new token:
          • slice the most recent `context_size` tokens
          • Tensor::new(...).unsqueeze(0)
          • model.forward(input, current_pos)  → logits
          • LogitsProcessor::sample(logits)    → next_token
          • print + push, advance current_pos
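
The per-token loop above can be sketched without candle by replacing model.forward plus LogitsProcessor::sample with a caller-supplied closure. This is a minimal sketch under assumptions, not the repository's code: it assumes the whole prompt is fed on the first step and a single token on every later step, and that a stop token ends generation without being emitted. The generate function and its step parameter are hypothetical names.

```rust
/// Llama-3 stop tokens: <|end_of_text|> and <|eot_id|>.
const STOP_TOKENS: [u32; 2] = [128001, 128009];

/// Drive generation with `step(input, current_pos) -> next_token`,
/// a stand-in for model.forward followed by LogitsProcessor::sample.
fn generate(
    prompt_tokens: &[u32],
    sample_len: usize,
    mut step: impl FnMut(&[u32], usize) -> u32,
) -> Vec<u32> {
    let mut tokens = prompt_tokens.to_vec();
    let mut generated = Vec::new();
    for index in 0..sample_len {
        // Assumption: full prompt on the first step, then one token per
        // step, because earlier positions are already in the KV cache.
        let context_size = if index == 0 { tokens.len() } else { 1 };
        let current_pos = tokens.len() - context_size;
        let next_token = step(&tokens[current_pos..], current_pos);
        if STOP_TOKENS.contains(&next_token) {
            break;
        }
        tokens.push(next_token);
        generated.push(next_token);
    }
    generated
}

fn main() {
    // Toy step function: emits 1, 2, 3 and then <|eot_id|>.
    let mut script = vec![1u32, 2, 3, 128009].into_iter();
    let out = generate(&[10, 11, 12], 200, |input, pos| {
        println!("forward on {} token(s) at position {pos}", input.len());
        script.next().unwrap()
    });
    println!("{out:?}");
}
```

Passing current_pos alongside the sliced input is what lets the model place the new token after the cached positions instead of recomputing the whole sequence.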

Inference runs on Metal when available (Device::new_metal(0)) and falls back to CPU otherwise.

Stack

candle-core, candle-nn, candle-transformers 0.10.2 with the metal feature
tokenizers 0.22
hf-hub for model/tokenizer downloads
clap for the CLI

Notes

Generation uses a simple temperature-based sampler (no top-k / top-p / repetition penalty).
The KV cache is implicit in model.forward(..., current_pos): the model keeps the cache internally, and the cached context grows by 1 each step after the prompt is consumed.
Stop tokens are hard-coded for Llama-3 (128001, 128009).
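
For readers unfamiliar with temperature-only sampling, the std-only sketch below shows the idea: divide the logits by the temperature, apply softmax, and draw from the resulting distribution. It is an illustration, not candle's LogitsProcessor implementation; both function names are hypothetical, and the uniform draw u is passed in so the example needs no random-number crate.

```rust
/// Turn logits into probabilities at a given temperature: softmax(logits / T).
/// Assumes `temperature` is positive.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the max before exponentiating for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Pick an index from `probs` given a uniform draw `u` in [0, 1).
fn sample_index(probs: &[f32], u: f32) -> usize {
    let mut cumulative = 0.0_f32;
    for (i, &p) in probs.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return i;
        }
    }
    // Guard against rounding leaving the cumulative sum slightly below 1.
    probs.len().saturating_sub(1)
}

fn main() {
    let logits = [2.0_f32, 1.0, 0.1];
    // Lower temperatures concentrate probability on the largest logit.
    for t in [0.2_f32, 0.8, 2.0] {
        let probs = softmax_with_temperature(&logits, t);
        println!("T = {t}: {probs:?}");
    }
    let probs = softmax_with_temperature(&logits, 0.8);
    println!("u = 0.5 picks index {}", sample_index(&probs, 0.5));
}
```

With no top-k or top-p filter, every token keeps a nonzero probability, so higher temperatures can occasionally pick very unlikely tokens.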
