

# native-llm

The easiest way to run AI models locally.


Quick Start • Why native-llm • Models • Documentation


## 🚀 Quick Start

```bash
npm install native-llm
```

```typescript
import { LLMEngine } from "native-llm"

const engine = new LLMEngine({ model: "gemma" })

const result = await engine.generate({
  prompt: "Explain quantum computing to a 5-year-old"
})

console.log(result.text)
```

That's it. Model downloads automatically. GPU detected automatically. Just works.


## 🎯 Why native-llm?

A friendly wrapper around llama.cpp that handles the hard parts:

| Without native-llm | With native-llm |
| --- | --- |
| Find GGUF model URLs | `model: "gemma"` |
| Configure HuggingFace auth | Auto from `HF_TOKEN` |
| 20+ lines of boilerplate | 3 lines |
| Research model benchmarks | Curated recommendations |

### Local vs Cloud

| | ☁️ Cloud AI | 🏠 native-llm |
| --- | --- | --- |
| Cost | $0.001 – $0.10 per query | Free forever |
| Speed | 1–20 seconds | < 100 ms |
| Privacy | Data sent to servers | 100% local |
| Limits | Rate limits & quotas | Unlimited |
| Offline | ❌ Requires internet | ✅ Works offline |

## 🎨 Models

### Simple Aliases

```typescript
new LLMEngine({ model: "gemma" })      // Best balance (default)
new LLMEngine({ model: "gemma-fast" }) // Maximum speed
new LLMEngine({ model: "qwen-coder" }) // Code generation
new LLMEngine({ model: "deepseek" })   // Complex reasoning
```

### Smart Recommendations

```typescript
import { LLMEngine } from "native-llm"

// Get the right model for your use case
const codeModel = LLMEngine.getModelForUseCase("code")       // → qwen-2.5-coder-7b
const fastModel = LLMEngine.getModelForUseCase("fast")       // → gemma-3n-e2b
const qualityModel = LLMEngine.getModelForUseCase("quality") // → gemma-3-27b

// List all available models
const models = LLMEngine.listModels()
// → [{ id: "gemma-3n-e4b", name: "Gemma 3n E4B", size: "5 GB", ... }, ...]
```
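For illustration, a lookup like `getModelForUseCase` boils down to a plain use-case → model-id mapping. The sketch below is a hypothetical, self-contained reimplementation mirroring the recommendations listed above — it is not native-llm's actual source:

```typescript
// Hypothetical sketch of a use-case → model lookup, mirroring the
// recommendations above; native-llm's real implementation may differ.
type UseCase = "code" | "fast" | "quality"

const USE_CASE_MODELS: Record<UseCase, string> = {
  code: "qwen-2.5-coder-7b",
  fast: "gemma-3n-e2b",
  quality: "gemma-3-27b",
}

function getModelForUseCase(useCase: UseCase): string {
  return USE_CASE_MODELS[useCase]
}

console.log(getModelForUseCase("code")) // → qwen-2.5-coder-7b
```

Keeping the mapping in a typed record means an unknown use case is a compile-time error rather than a runtime surprise.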

### Performance (M1 Ultra)

| Model | Size | Speed | Best For |
| --- | --- | --- | --- |
| 🚀 Gemma 3n E2B | 3 GB | 36 tok/s | Maximum speed |
| ⭐ Gemma 3n E4B | 5 GB | 18 tok/s | Best balance |
| 💻 Qwen 2.5 Coder | 5 GB | 23 tok/s | Code generation |
| 🧠 DeepSeek R1 | 5 GB | 9 tok/s | Complex reasoning |
| 👑 Gemma 3 27B | 18 GB | 5 tok/s | Maximum quality |

## ✨ Features

| Feature | Description |
| --- | --- |
| 📦 Zero Config | Models download automatically, GPU detected automatically |
| 🎯 Smart Defaults | Curated models, sensible parameters, thinking-mode handled |
| 🔥 Native Speed | Direct llama.cpp bindings — no Python, no subprocess |
| 🍎 Metal GPU | Full Apple Silicon acceleration out of the box |
| 🖥️ Cross-Platform | macOS, Linux, Windows with CUDA support |
| 🌊 Streaming | Real-time token-by-token output |
| 📝 TypeScript | Full type definitions included |

## 🔑 Setup for Gemma Models

Gemma models require a free HuggingFace token:

```bash
export HF_TOKEN="hf_your_token_here"
```

Get yours in 30 seconds: huggingface.co/settings/tokens
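Picking the token up "auto from `HF_TOKEN`" amounts to reading the environment at startup. A minimal sketch of that idea — the `resolveHfToken` helper is illustrative, not native-llm's API:

```typescript
// Illustrative helper: read a HuggingFace token from the environment,
// as native-llm does with HF_TOKEN. Returns undefined when unset or empty.
function resolveHfToken(
  env: Record<string, string | undefined> = process.env
): string | undefined {
  const token = env.HF_TOKEN
  return token && token.length > 0 ? token : undefined
}

console.log(resolveHfToken({ HF_TOKEN: "hf_example" })) // → hf_example
```

Treating an empty string the same as an unset variable avoids sending a blank Authorization header to HuggingFace.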


## 📚 Documentation

→ Full Documentation — Streaming, chat API, custom models, and more.

MIT License · Made with ❤️ by Sebastian Software
Powered by llama.cpp & node-llama-cpp
