Performance: Low CPU utilization during embedding due to limited thread count and sequential context #74

Description

@riyadist

Problem

When running gno embed, CPU utilization stays low (~30-50%) even on multi-core systems. On a 16-core AMD Ryzen with 40GB RAM (no GPU), embedding speed is ~0.7 chunks/sec with bge-m3-Q4_K_M.gguf.

Root Causes

1. createEmbeddingContext() called without threads option

In src/llm/nodeLlamaCpp/embedding.ts, the embedding context is created with no options:

this.context = await llamaModel.createEmbeddingContext();

node-llama-cpp's LlamaEmbeddingContextOptions exposes a threads parameter:

threads?: number — number of threads to use to evaluate tokens. Set to 0 to use the maximum threads supported by the current machine hardware.

Suggested fix:

this.context = await llamaModel.createEmbeddingContext({ threads: 0 });

2. Single embedding context — sequential inference

MAX_CONCURRENT_EMBEDDINGS = 16 dispatches 16 JS promises concurrently, but they all share one LlamaEmbeddingContext. llama.cpp serializes requests through a single context, so real parallelism is 1x regardless of JS concurrency.
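
To illustrate the claim, here is a toy model of one context that queues requests behind each other. It is only a sketch: SerializedContext, its promise-chain queue and the 1 ms simulated work are hypothetical stand-ins, not node-llama-cpp's actual internals. However many promises are dispatched at once, the observed number of evaluations running simultaneously stays at 1.

```typescript
/** Toy stand-in for a single embedding context (hypothetical, not the real class). */
class SerializedContext {
  private tail: Promise<unknown> = Promise.resolve();
  private active = 0;
  maxActive = 0; // highest number of evaluations observed running at once

  embed(text: string): Promise<number[]> {
    const run = async (): Promise<number[]> => {
      this.active++;
      this.maxActive = Math.max(this.maxActive, this.active);
      await new Promise<void>((resolve) => setTimeout(resolve, 1)); // simulated work
      this.active--;
      return [text.length]; // placeholder "embedding"
    };
    // Each request starts only after the previous one has settled.
    const result = this.tail.then(run);
    this.tail = result.catch(() => undefined); // keep the queue alive after a failure
    return result;
  }
}

/** Dispatch `concurrency` requests at once and report the peak parallelism seen. */
async function peakParallelism(concurrency: number): Promise<number> {
  const ctx = new SerializedContext();
  const chunks = Array.from({ length: concurrency }, (_, i) => `chunk ${i}`);
  await Promise.all(chunks.map((c) => ctx.embed(c)));
  return ctx.maxActive;
}
```

Calling peakParallelism(16) mirrors MAX_CONCURRENT_EMBEDDINGS = 16 against one shared context.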

Suggested fix: Create a pool of N embedding contexts (e.g. N = Math.max(1, Math.min(4, Math.floor(cpuCount / 4))), so the size is a whole number and never 0) and distribute chunks across them round-robin. Each context uses its own thread set. With 4 parallel contexts on a 16-core CPU, throughput could theoretically increase 3-4x.
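
A minimal sketch of such a pool, under stated assumptions: EmbeddingContextLike is a hypothetical, simplified interface rather than the real LlamaEmbeddingContext type, and poolSize and embedWithPool are illustrative names that do not exist in GNO. Chunk i goes to context i % N, and each context works through its own share one request at a time.

```typescript
// Hypothetical minimal interface; the real context type has a richer API.
interface EmbeddingContextLike {
  embed(text: string): Promise<number[]>;
}

/** Pool size heuristic, clamped to a whole number of at least 1. */
function poolSize(cpuCount: number, maxContexts = 4): number {
  return Math.max(1, Math.min(maxContexts, Math.floor(cpuCount / 4)));
}

/** Embed all chunks round-robin across the pool; results keep input order. */
async function embedWithPool(
  contexts: EmbeddingContextLike[],
  chunks: string[],
): Promise<number[][]> {
  if (contexts.length === 0) {
    throw new Error("pool must contain at least one context");
  }
  const results = new Array<number[]>(chunks.length);
  // One worker per context: each walks its own stride of the chunk list
  // sequentially, so a context never has more than one request in flight.
  const workers = contexts.map(async (ctx, slot) => {
    for (let i = slot; i < chunks.length; i += contexts.length) {
      results[i] = await ctx.embed(chunks[i]);
    }
  });
  await Promise.all(workers);
  return results;
}
```

Static round-robin is the simplest distribution, but it can leave a context idle when chunk lengths vary a lot. A shared work queue that each worker pulls from would balance better, at the cost of a little more code. Each extra context also holds its own memory, so the pool size is worth making configurable.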

3. getLlama() also missing numThreads

In src/llm/nodeLlamaCpp/lifecycle.ts:

this.llama = await getLlama({
  build: "autoAttempt",
  logLevel: LlamaLogLevel.error,
  // numThreads not set → uses llama.cpp default (often conservative)
});

Environment

  • OS: Windows 11 Pro
  • CPU: AMD Ryzen (16 logical cores)
  • RAM: 40 GB
  • GPU: None
  • GNO version: 0.24.0
  • Model: hf:gpustack/bge-m3-GGUF/bge-m3-Q4_K_M.gguf (slim preset)
  • node-llama-cpp: 3.18.1

Expected vs Actual

Metric                   Actual             Expected
Embed speed              ~0.7 chunks/sec    ~2-3 chunks/sec
CPU usage during embed   ~30-50%            ~80-90%
Threads used             ~4-6 (estimated)   14-16

Workaround Attempts

  • Tried --batch-size 64 → no improvement (bottleneck is not JS batching)
  • Tried Ollama HTTP backend (nomic-embed-text) → slower (0.5 chunks/sec) due to per-request HTTP overhead
  • Tried patching createEmbeddingContext({ threads: 0 }) and getLlama({ numThreads }) in a local build → marginal improvement, suggesting the real bottleneck is the single-context serialization

Additional Notes

The fix for root cause 1 (threads: 0) is a one-liner and low-risk. The fix for root cause 2 (context pool) is more involved but would give the biggest speedup for CPU-only users. Both changes would benefit anyone running GNO without a GPU.
