Problem
When running gno embed, CPU utilization stays low (~30-50%) even on multi-core systems. On an AMD Ryzen with 16 logical cores and 40 GB RAM (no GPU), embedding speed is ~0.7 chunks/sec with bge-m3-Q4_K_M.gguf.
Root Causes
1. createEmbeddingContext() called without threads option
In src/llm/nodeLlamaCpp/embedding.ts, the embedding context is created with no options:
this.context = await llamaModel.createEmbeddingContext();
node-llama-cpp's LlamaEmbeddingContextOptions exposes a threads parameter:
threads?: number — number of threads to use to evaluate tokens. Set to 0 to use the maximum threads supported by the current machine hardware.
Suggested fix:
this.context = await llamaModel.createEmbeddingContext({ threads: 0 });
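If hard-coding 0 turns out to be too coarse for some machines, the thread count could be made overridable. The following is only a sketch under assumptions: resolveEmbeddingThreads and the GNO_EMBED_THREADS variable are hypothetical names invented for illustration and do not exist in GNO or node-llama-cpp. Only the threads option itself comes from the documentation quoted above.

```typescript
// Hypothetical helper, not part of GNO or node-llama-cpp: turns an optional
// user-supplied override (for example from an environment variable) into the
// value passed as `threads` to createEmbeddingContext().
function resolveEmbeddingThreads(override: string | undefined, cpuCount: number): number {
  // No override: 0 asks node-llama-cpp for the maximum supported thread count.
  if (override === undefined || override.trim() === "") return 0;

  const requested = Number(override);
  // Anything that is not a non-negative integer falls back to 0.
  if (!Number.isInteger(requested) || requested < 0) return 0;

  // Never request more threads than the machine reports (at least 1 assumed).
  return Math.min(requested, Math.max(1, cpuCount));
}

// Usage sketch (GNO_EMBED_THREADS is an assumed, hypothetical variable name):
// const threads = resolveEmbeddingThreads(process.env.GNO_EMBED_THREADS, os.cpus().length);
// this.context = await llamaModel.createEmbeddingContext({ threads });
```

Falling back to 0 on invalid input keeps the behaviour identical to the one-line fix whenever no valid override is supplied.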
2. Single embedding context — sequential inference
MAX_CONCURRENT_EMBEDDINGS = 16 dispatches 16 JS promises concurrently, but they all share one LlamaEmbeddingContext. llama.cpp serializes requests through a single context, so real parallelism is 1x regardless of JS concurrency.
Suggested fix: Create a pool of N embedding contexts (e.g. Math.min(4, cpuCount / 4)), and distribute chunks across them in round-robin. Each context uses its own thread set. With 4 parallel contexts on a 16-core CPU, throughput could theoretically increase 3-4x.
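A minimal sketch of the suggested pool, under stated assumptions: EmbeddingContextLike is a hypothetical structural stand-in for LlamaEmbeddingContext (only a getEmbeddingFor method returning an object with a vector is assumed), and poolSize, threadsPerContext and EmbeddingContextPool are illustrative names, not existing GNO code. The sizing formula is the one from the suggestion above, floored and clamped so that machines with fewer than 4 cores still get one context.

```typescript
// Hypothetical minimal shape of an embedding context; the real
// LlamaEmbeddingContext has more members than are modelled here.
interface EmbeddingContextLike {
  getEmbeddingFor(text: string): Promise<{ vector: readonly number[] }>;
}

// Pool size from the formula above, floored to an integer and clamped to >= 1.
function poolSize(cpuCount: number, maxContexts: number = 4): number {
  return Math.max(1, Math.min(maxContexts, Math.floor(cpuCount / 4)));
}

// Threads per context so the pool as a whole does not oversubscribe the CPU.
function threadsPerContext(cpuCount: number, contexts: number): number {
  return Math.max(1, Math.floor(cpuCount / contexts));
}

class EmbeddingContextPool {
  private readonly contexts: readonly EmbeddingContextLike[];
  private nextIndex = 0;

  constructor(contexts: readonly EmbeddingContextLike[]) {
    if (contexts.length === 0) throw new Error("pool needs at least one context");
    this.contexts = contexts;
  }

  // Round-robin: hand out contexts in a fixed rotating order.
  private pick(): EmbeddingContextLike {
    const ctx = this.contexts[this.nextIndex];
    this.nextIndex = (this.nextIndex + 1) % this.contexts.length;
    return ctx;
  }

  // Chunk i is assigned to context i % poolSize; results keep the input order.
  async embedAll(chunks: readonly string[]): Promise<(readonly number[])[]> {
    return Promise.all(
      chunks.map(async (chunk) => (await this.pick().getEmbeddingFor(chunk)).vector),
    );
  }
}
```

Round-robin is the simplest distribution policy. If chunk sizes vary a lot, handing each chunk to whichever context is idle would balance load better, at the cost of slightly more bookkeeping. Whether several contexts actually run in parallel on CPU depends on node-llama-cpp and llama.cpp behaviour and would need to be measured.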
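3. getLlama() also missing numThreads
In src/llm/nodeLlamaCpp/lifecycle.ts, getLlama() is called without any thread setting. (This report calls the option numThreads; node-llama-cpp 3.x appears to document it as maxThreads, so the name should be checked against the installed version.)
this.llama = await getLlama({
  build: "autoAttempt",
  logLevel: LlamaLogLevel.error,
  // numThreads not set → uses llama.cpp default (often conservative)
});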
Environment
- OS: Windows 11 Pro
- CPU: AMD Ryzen (16 logical cores)
- RAM: 40 GB
- GPU: None
- GNO version: 0.24.0
- Model: hf:gpustack/bge-m3-GGUF/bge-m3-Q4_K_M.gguf (slim preset)
- node-llama-cpp: 3.18.1
Expected vs Actual
| Metric | Actual | Expected |
| --- | --- | --- |
| Embed speed | ~0.7 chunks/sec | ~2-3 chunks/sec |
| CPU usage during embed | ~30-50% | ~80-90% |
| Threads used | ~4-6 (estimated) | 14-16 |
Workaround Attempts
- Tried --batch-size 64 → no improvement (bottleneck is not JS batching)
- Tried Ollama HTTP backend (nomic-embed-text) → slower (0.5 chunks/sec) due to per-request HTTP overhead
- Tried patching createEmbeddingContext({ threads: 0 }) and getLlama({ numThreads }) in a local build → marginal improvement, suggesting the real bottleneck is the single-context serialization
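To compare attempts like these on equal terms, throughput can be measured with a small harness. This is a hedged sketch, not how the figures above were obtained: measureChunksPerSecond is a hypothetical helper, and the embed callback stands in for whatever embedding call is being tested.

```typescript
// Hypothetical benchmark helper: embeds every chunk via the supplied function
// and reports throughput in chunks per second.
async function measureChunksPerSecond(
  embed: (chunk: string) => Promise<unknown>,
  chunks: readonly string[],
  now: () => number = () => Date.now(), // injectable clock, in milliseconds
): Promise<number> {
  const start = now();
  // All chunks are dispatched at once; the callee decides real parallelism.
  await Promise.all(chunks.map((chunk) => embed(chunk)));
  const elapsedSeconds = (now() - start) / 1000;
  // Guard against a zero-length interval on very fast runs.
  return elapsedSeconds > 0 ? chunks.length / elapsedSeconds : Infinity;
}
```

Running the same chunk set through each configuration (default context, threads: 0, a context pool) gives directly comparable chunks/sec numbers.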
Additional Notes
The fix for root cause 1 (threads: 0) is a one-liner and low-risk. The fix for root cause 2 (context pool) is more involved but would likely give the biggest speedup for CPU-only users. Both changes would benefit anyone running GNO without a GPU.