The easiest way to run AI models locally.
Quick Start • Why native-llm • Models • Documentation
```bash
npm install native-llm
```

```typescript
import { LLMEngine } from "native-llm"

const engine = new LLMEngine({ model: "gemma" })

const result = await engine.generate({
  prompt: "Explain quantum computing to a 5-year-old"
})

console.log(result.text)
```

That's it. Model downloads automatically. GPU detected automatically. It just works.
A friendly wrapper around llama.cpp that handles the hard parts:
| Without native-llm | With native-llm |
|---|---|
| Hunt down GGUF model URLs | `model: "gemma"` |
| Configure Hugging Face auth | Picked up automatically from `HF_TOKEN` |
| 20+ lines of boilerplate | 3 lines |
| Research model benchmarks | Curated recommendations |
| | Cloud AI | native-llm |
|---|---|---|
| Cost | $0.001 - $0.10 per query | Free forever |
| Speed | 1-20 seconds | < 100ms |
| Privacy | Data sent to servers | 100% local |
| Limits | Rate limits & quotas | Unlimited |
| Offline | ❌ Requires internet | ✅ Works offline |
```typescript
new LLMEngine({ model: "gemma" })       // Best balance (default)
new LLMEngine({ model: "gemma-fast" })  // Maximum speed
new LLMEngine({ model: "qwen-coder" })  // Code generation
new LLMEngine({ model: "deepseek" })    // Complex reasoning
```

```typescript
import { LLMEngine } from "native-llm"

// Get the right model for your use case
const codeModel = LLMEngine.getModelForUseCase("code")       // → qwen-2.5-coder-7b
const fastModel = LLMEngine.getModelForUseCase("fast")       // → gemma-3n-e2b
const qualityModel = LLMEngine.getModelForUseCase("quality") // → gemma-3-27b

// List all available models
const models = LLMEngine.listModels()
// → [{ id: "gemma-3n-e4b", name: "Gemma 3n E4B", size: "5 GB", ... }, ...]
```

| Model | Size | Speed | Best For |
|---|---|---|---|
| Gemma 3n E2B | 3 GB | 36 tok/s | Maximum speed |
| Gemma 3n E4B | 5 GB | 18 tok/s | Best balance |
| Qwen 2.5 Coder | 5 GB | 23 tok/s | Code generation |
| DeepSeek R1 | 5 GB | 9 tok/s | Complex reasoning |
| Gemma 3 27B | 18 GB | 5 tok/s | Maximum quality |
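The sizes in the table also make memory budgeting easy. A hypothetical helper (my own sketch, not part of native-llm) that picks the largest listed model fitting a given amount of free RAM:

```typescript
// Hypothetical helper: pick the largest curated model that fits a memory
// budget, using the download sizes from the table above (in GB).
const MODELS: ReadonlyArray<{ id: string; sizeGb: number }> = [
  { id: "gemma-3-27b", sizeGb: 18 },
  { id: "qwen-2.5-coder-7b", sizeGb: 5 },
  { id: "gemma-3n-e4b", sizeGb: 5 },
  { id: "gemma-3n-e2b", sizeGb: 3 },
]

function pickModel(freeGb: number): string | undefined {
  // Models are sorted largest-first, so the first fit is the biggest one.
  return MODELS.find((m) => m.sizeGb <= freeGb)?.id
}

console.log(pickModel(8)) // "qwen-2.5-coder-7b"
console.log(pickModel(2)) // undefined (nothing fits)
```

Note that the size listed is the download size; actual RAM needed at runtime also depends on context length and quantization.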
| Feature | Description |
|---|---|
| Zero Config | Models download automatically, GPU detected automatically |
| Smart Defaults | Curated models, sensible parameters, thinking mode handled |
| Native Speed | Direct llama.cpp bindings: no Python, no subprocess |
| Metal GPU | Full Apple Silicon acceleration out of the box |
| Cross-Platform | macOS, Linux, and Windows, with CUDA support |
| Streaming | Real-time token-by-token output |
| TypeScript | Full type definitions included |
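Streaming output in wrappers like this is typically consumed as an async iterable. A self-contained sketch of the pattern (no native-llm dependency; `tokenStream` is a stand-in for whatever a real streaming `generate()` call would return):

```typescript
// Self-contained sketch of token-by-token streaming. `tokenStream` simulates
// an engine yielding tokens as they are decoded; the real API may differ.
async function* tokenStream(tokens: string[]): AsyncGenerator<string> {
  for (const token of tokens) {
    yield token
  }
}

async function run(): Promise<string> {
  let text = ""
  for await (const token of tokenStream(["Hello", ", ", "world"])) {
    text += token // in a CLI you would write each token to stdout immediately
  }
  return text
}

run().then((text) => console.log(text)) // prints "Hello, world"
```

The point of the pattern is that the consumer sees each token the moment it is produced, rather than waiting for the full completion.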
Gemma models require a free HuggingFace token:
```bash
export HF_TOKEN="hf_your_token_here"
```

Get yours in 30 seconds: huggingface.co/settings/tokens
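A quick shell check (my own sketch, not shipped with native-llm) to catch a missing token before the first model download:

```shell
# Warn early if HF_TOKEN is missing; gated Gemma downloads fail without it.
if [ -z "${HF_TOKEN:-}" ]; then
  echo "HF_TOKEN is not set; gated model downloads will fail" >&2
else
  echo "HF_TOKEN is configured"
fi
```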
→ Full Documentation: streaming, chat API, custom models, and more.
MIT License · Made with ❤️ by Sebastian Software
Powered by llama.cpp & node-llama-cpp