Description
What happened?
When using LEANN to compute embeddings with the Ollama provider (--embedding-mode ollama), there is a significant performance issue during the document indexing phase. The process becomes extremely slow, even for a relatively small number of documents.
Analysis of the underlying code shows that the compute_embeddings_ollama function processes text chunks serially instead of batching them, even though the Ollama embedding API supports batched requests and handles them far more efficiently.
Specifically, the code iterates through each text chunk and sends an individual HTTP request to the Ollama API for each one. This results in massive overhead from repeated network calls and prevents the GPU from being used efficiently, as it cannot process chunks in parallel.
How to reproduce
1. Prepare a small to medium-sized codebase or a set of documents for indexing (e.g., 100-200 documents, resulting in ~2000-3000 text chunks).
2. Run the LEANN build/indexing command using an Ollama embedding model:

   ```bash
   leann build --embedding-model qwen3-embedding:0.6b --embedding-mode ollama
   ```

3. Observe the progress bar during the "Computing Ollama embeddings" step.
Expected Behavior:
The embedding process should be fast, leveraging the batch processing capabilities of the Ollama API. For a few thousand chunks, this process should complete in a few minutes at most on modern hardware, with each batch taking only a few seconds.
Actual Behavior:
The embedding process is extremely slow, taking 40-50 seconds or more per "batch" as shown by the progress bar, and the total time is estimated at close to an hour for just a few thousand chunks. This indicates that although the progress bar tracks batches, the underlying API calls are made serially for each chunk within each batch.
Looking at compute_embeddings_ollama (lines 722-769):
```python
def get_batch_embeddings(batch_texts):
    """Get embeddings for a batch of texts."""
    all_embeddings = []
    failed_indices = []
    for i, text in enumerate(batch_texts):  # ← Sequential processing!
        # ... retry logic ...
        response = requests.post(
            f"{resolved_host}/api/embeddings",
            json={"model": model_name, "prompt": truncated_text},  # ← One text at a time
            timeout=30,
        )
```
The problem: even though texts are divided into "batches" (lines 784-801), each text in a batch still makes its own sequential HTTP request to Ollama, which is extremely slow.
Suggested Solution:
The compute_embeddings_ollama function should be refactored to send a list of text chunks (a batch) in a single API call. Ollama's /api/embed endpoint accepts a list of strings in its "input" field (the older /api/embeddings endpoint only takes a single "prompt" string), so one request can embed an entire batch.
This change would dramatically reduce the number of HTTP requests and allow the backend (Ollama and the GPU) to process the embeddings in parallel, leading to a significant performance improvement that aligns with user expectations for local embedding generation.
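For illustration, here is a minimal sketch of what a batched variant could look like. It assumes Ollama's /api/embed endpoint (available in recent Ollama releases), which takes a list of strings under "input" and returns one embedding per input under "embeddings"; the function name, batch_size, and timeout values are placeholders for illustration, not LEANN's actual API:

```python
import requests


def compute_embeddings_ollama_batched(texts, model_name, resolved_host, batch_size=64):
    """Sketch of a batched variant: one HTTP request per batch instead of per chunk.

    Assumes the /api/embed endpoint, which accepts a list of strings in "input"
    and returns a matching list of vectors under "embeddings".
    """
    all_embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        response = requests.post(
            f"{resolved_host}/api/embed",
            json={"model": model_name, "input": batch},  # whole batch in a single call
            timeout=120,  # larger timeout, since one request now covers many chunks
        )
        response.raise_for_status()
        all_embeddings.extend(response.json()["embeddings"])
    return all_embeddings
```

With a batch size of 64, ~3,000 chunks would need roughly 47 HTTP requests instead of ~3,000, removing almost all of the per-request overhead and letting Ollama schedule each batch on the GPU at once. The retry and truncation logic from the existing function would still need to be carried over.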
Error message
LEANN Version
0.3.4
Operating System
macOS