Performance: Low CPU utilization during embedding due to limited thread count and sequential context #74

## Problem

When running `gno embed`, CPU utilization stays low (~30-50%) even on multi-core systems. On an AMD Ryzen with 16 logical cores and 40 GB RAM (no GPU), embedding speed is ~0.7 chunks/sec with `bge-m3-Q4_K_M.gguf`.

## Root Causes

### 1. `createEmbeddingContext()` called without `threads` option

In `src/llm/nodeLlamaCpp/embedding.ts`, the embedding context is created with no options:

```typescript
this.context = await llamaModel.createEmbeddingContext();
```

`node-llama-cpp`'s `LlamaEmbeddingContextOptions` exposes a `threads` parameter:

> **`threads?: number`** — number of threads to use to evaluate tokens. Set to `0` to use the maximum threads supported by the current machine hardware.

**Suggested fix:**

```typescript
this.context = await llamaModel.createEmbeddingContext({ threads: 0 });
```
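
If hard-coding `threads: 0` turns out to be too blunt, the value could come from a user setting instead. The following is a minimal sketch under assumptions: `resolveEmbeddingThreads` and the `GNO_EMBED_THREADS` variable mentioned in the comments are hypothetical names, not existing GNO or `node-llama-cpp` APIs. Invalid or missing input falls back to `0`, which the documentation quoted above describes as "use the maximum threads supported".

```typescript
// Hypothetical helper (not part of GNO or node-llama-cpp): map an optional
// user setting, e.g. a GNO_EMBED_THREADS environment variable, to a value
// suitable for the `threads` option of createEmbeddingContext().
function resolveEmbeddingThreads(raw: string | undefined, cpuCount: number): number {
  // No setting: 0 lets the library pick the hardware maximum.
  if (raw === undefined || raw.trim() === "") return 0;

  const parsed = Number(raw);

  // Reject non-numeric, fractional, or negative input and fall back to 0.
  if (!Number.isInteger(parsed) || parsed < 0) return 0;

  // Never request more threads than the machine has logical cores.
  return Math.min(parsed, Math.max(1, cpuCount));
}
```

The result would be passed as `createEmbeddingContext({ threads: resolveEmbeddingThreads(setting, cpuCount) })`, with `cpuCount` taken from something like `os.availableParallelism()`.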

### 2. Single embedding context — sequential inference

`MAX_CONCURRENT_EMBEDDINGS = 16` dispatches 16 JS promises concurrently, but they all share **one** `LlamaEmbeddingContext`. llama.cpp serializes requests through a single context, so real parallelism is 1x regardless of JS concurrency.

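**Suggested fix:** Create a pool of `N` embedding contexts (e.g. `Math.max(1, Math.min(4, Math.floor(cpuCount / 4)))`, so that small machines still get one context) and distribute chunks across them round-robin. Give each context its own thread budget, for example `cpuCount / N` threads, so the contexts do not oversubscribe the CPU. With 4 parallel contexts on a 16-core CPU, throughput could theoretically increase 3-4x.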
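
The pool idea can be sketched roughly as follows. This is an illustration under assumptions, not a tested patch: `EmbeddingContextLike`, `EmbeddingContextPool`, `poolSize`, and `threadsPerContext` are hypothetical names, the interface only models a `getEmbeddingFor()` call with a simplified return type, and whether several contexts really evaluate in parallel depends on how `node-llama-cpp` and llama.cpp schedule them.

```typescript
// Minimal shape the pool needs. The real LlamaEmbeddingContext is assumed to
// offer a compatible getEmbeddingFor(); the return type is simplified here.
interface EmbeddingContextLike<T> {
  getEmbeddingFor(text: string): Promise<T>;
}

// One context per 4 logical cores, clamped to the range [1, 4].
function poolSize(cpuCount: number): number {
  return Math.max(1, Math.min(4, Math.floor(cpuCount / 4)));
}

// Split the cores evenly so the contexts do not oversubscribe the CPU.
function threadsPerContext(cpuCount: number, contexts: number): number {
  return Math.max(1, Math.floor(cpuCount / contexts));
}

class EmbeddingContextPool<T> {
  private readonly contexts: EmbeddingContextLike<T>[];
  private next = 0;

  constructor(contexts: EmbeddingContextLike<T>[]) {
    if (contexts.length === 0) throw new Error("pool needs at least one context");
    this.contexts = contexts;
  }

  // Round-robin dispatch: the i-th call goes to context i mod N.
  embed(text: string): Promise<T> {
    const ctx = this.contexts[this.next];
    this.next = (this.next + 1) % this.contexts.length;
    return ctx.getEmbeddingFor(text);
  }

  // Embed every chunk; Promise.all keeps results in input order.
  embedAll(chunks: string[]): Promise<T[]> {
    return Promise.all(chunks.map((chunk) => this.embed(chunk)));
  }
}
```

In GNO, the pool would presumably be filled by calling `createEmbeddingContext({ threads: threadsPerContext(cpuCount, n) })` `n` times. Each extra context also costs memory, which is one reason for the cap of 4.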

### 3. `getLlama()` also missing `numThreads`

In `src/llm/nodeLlamaCpp/lifecycle.ts`:

```typescript
this.llama = await getLlama({
  build: "autoAttempt",
  logLevel: LlamaLogLevel.error,
  // No thread limit set → uses the library default (often conservative).
  // Note: node-llama-cpp 3.x appears to name this option `maxThreads`, not `numThreads`.
});
```

## Environment

- **OS:** Windows 11 Pro
- **CPU:** AMD Ryzen (16 logical cores)
- **RAM:** 40 GB
- **GPU:** None
- **GNO version:** 0.24.0
- **Model:** `hf:gpustack/bge-m3-GGUF/bge-m3-Q4_K_M.gguf` (slim preset)
- **node-llama-cpp:** 3.18.1

## Expected vs Actual

| Metric | Actual | Expected |
|---|---|---|
| Embed speed | ~0.7 chunks/sec | ~2-3 chunks/sec |
| CPU usage during embed | ~30-50% | ~80-90% |
| Threads used | ~4-6 (estimated) | 14-16 |
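
To make the chunks/sec figures easier to reproduce when testing a fix, a small timing helper could be used. This is a sketch only: `measureThroughput` is a hypothetical name, the `embed` callback stands in for whatever function GNO uses to embed one chunk, and the injectable `now` clock exists purely to make the helper testable.

```typescript
// Hypothetical helper: time an embedding run and report chunks per second.
async function measureThroughput(
  chunks: string[],
  embed: (text: string) => Promise<unknown>,
  now: () => number = () => Date.now(),
): Promise<{ chunks: number; seconds: number; chunksPerSec: number }> {
  const start = now();

  // Dispatch all chunks at once, mirroring the concurrent JS dispatch
  // described above, and wait for every embedding to finish.
  await Promise.all(chunks.map((chunk) => embed(chunk)));

  const seconds = (now() - start) / 1000;

  return {
    chunks: chunks.length,
    seconds,
    // Guard against a zero-length interval on very small inputs.
    chunksPerSec: seconds > 0 ? chunks.length / seconds : Infinity,
  };
}
```

Running it on the same chunk set before and after a change gives numbers directly comparable to the table.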

## Workaround Attempts

- Tried `--batch-size 64` → no improvement (bottleneck is not JS batching)
- Tried Ollama HTTP backend (`nomic-embed-text`) → slower (0.5 chunks/sec) due to per-request HTTP overhead
- Tried patching `createEmbeddingContext({ threads: 0 })` and `getLlama({ numThreads })` in a local build → marginal improvement, suggesting the real bottleneck is the single-context serialization

## Additional Notes

The fix for root cause 1 (`threads: 0`) is a one-liner and low-risk. The fix for root cause 2 (a context pool) is more involved but would give the biggest speedup for CPU-only users. Both changes would benefit anyone running GNO without a GPU.

