Implement true LanceDB hybrid retrieval #2040

Open
jioffe502 wants to merge 3 commits into NVIDIA:main from jioffe502:codex/lancedb-true-hybrid-search

@jioffe502
Collaborator

Summary

  • Implements real LanceDB hybrid retrieval by passing aligned raw query_texts alongside precomputed vectors.
  • Keeps query_texts execution-only: it is stripped from persistent VDB constructor kwargs and forwarded only for hybrid=True retrieval calls.
  • Replaces the LanceDB hybrid=True retrieval NotImplementedError with LanceDB 0.30.2 hybrid query construction: table.search(query_type="hybrid", vector_column_name=..., fts_columns="text").vector(vector).text(query_text).
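
The builder chain in the last bullet can be sketched as below. This is a minimal illustration rather than the code in this PR: `build_hybrid_query` and `RecordingTable` are hypothetical names, the stub stands in for a real LanceDB table so the snippet runs without a database, and applying `top_k` through `.limit()` is an assumption.

```python
from typing import Any, List, Sequence, Tuple


def build_hybrid_query(
    table: Any,
    vector: Sequence[float],
    query_text: str,
    *,
    vector_column_name: str = "vector",
    fts_columns: str = "text",
    top_k: int = 10,
) -> Any:
    """Chain the hybrid builder calls named in the summary for one query."""
    return (
        table.search(
            query_type="hybrid",
            vector_column_name=vector_column_name,
            fts_columns=fts_columns,
        )
        .vector(list(vector))  # precomputed embedding for the dense side
        .text(query_text)      # raw query string for the FTS side
        .limit(top_k)          # assumption: top_k is applied via limit()
    )


class RecordingTable:
    """Hypothetical stand-in for a LanceDB table; records each builder call."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Any]] = []

    def search(self, **kwargs: Any) -> "RecordingTable":
        self.calls.append(("search", kwargs))
        return self

    def vector(self, vector: List[float]) -> "RecordingTable":
        self.calls.append(("vector", vector))
        return self

    def text(self, query_text: str) -> "RecordingTable":
        self.calls.append(("text", query_text))
        return self

    def limit(self, k: int) -> "RecordingTable":
        self.calls.append(("limit", k))
        return self
```

Passing a `RecordingTable()` to `build_hybrid_query` and inspecting `.calls` shows the order in which the vector and the text are attached to the same hybrid query.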

Behavioral Notes

  • No CLI surface changes.
  • Existing overwrite/append semantics are unchanged.
  • Dense retrieval stays VDB-agnostic; query_texts is not forwarded for dense retrieval.
  • Hybrid LanceDB retrieval now requires query_texts and validates that query/vector counts match.
  • where / _filter, top_k, refine_factor, n_probe / nprobes, result_fields, and search_kwargs behavior are preserved.
  • A conflicting non-hybrid search_kwargs["query_type"] now raises a clear ValueError.
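
The hybrid guards described above can be sketched as one helper. This is an illustration under stated assumptions, not the PR's implementation: `prepare_hybrid_inputs` is a hypothetical name, and raising `ValueError` for missing or mismatched `query_texts` is assumed (the notes only confirm `ValueError` for the conflicting `query_type` case).

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple


def prepare_hybrid_inputs(
    vectors: Sequence[Sequence[float]],
    query_texts: Optional[Sequence[str]],
    search_kwargs: Optional[Dict[str, Any]] = None,
) -> Tuple[List[List[float]], List[str], Dict[str, Any]]:
    """Validate hybrid inputs and return aligned lists plus search kwargs."""
    kwargs = dict(search_kwargs or {})  # copy so the caller's dict is untouched

    # A caller-supplied query_type other than "hybrid" conflicts with hybrid=True.
    requested = kwargs.get("query_type")
    if requested not in (None, "hybrid"):
        raise ValueError(
            f"search_kwargs['query_type']={requested!r} conflicts with hybrid=True"
        )

    # Assumption: a missing query_texts is reported as a ValueError.
    if query_texts is None:
        raise ValueError("hybrid retrieval requires query_texts")

    # Materialise both inputs so they can be counted and indexed together.
    vector_list = [list(v) for v in vectors]
    text_list = list(query_texts)
    if len(vector_list) != len(text_list):
        raise ValueError(
            f"got {len(vector_list)} vectors but {len(text_list)} query_texts"
        )

    kwargs["query_type"] = "hybrid"
    kwargs.setdefault("fts_columns", "text")
    return vector_list, text_list, kwargs
```

Other entries in `search_kwargs` (for example `nprobes`) pass through unchanged, which mirrors the "behavior preserved" note above.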

Validation

  • cd /localhome/local-jioffe/nv-ingest-lancedb/nemo_retriever
  • /localhome/local-jioffe/.local/bin/uv run --extra dev pytest -q tests/test_retriever_queries.py tests/test_nv_ingest_vdb_operator.py tests/test_lancedb_retrieval_where.py
    • 32 passed
  • /localhome/local-jioffe/.local/bin/uv run --extra dev pytest -q tests/test_root_cli_workflow.py tests/test_graph_pipeline_cli.py tests/test_lancedb_write_policy.py
    • 21 passed, 1 warning
  • git diff --check
    • clean

E2E Findings

JP20 LanceDB Hybrid

  • Page image extraction enabled: yes (default path; --no-extract-page-as-image was not passed).
  • Pages processed: 1,940
  • Graph rows: 3,192
  • Persisted/uploadable rows: 3,185
  • Recall:
    • recall@1: 0.6609
    • recall@3: 0.8522
    • recall@5: 0.9304
    • recall@10: 0.9565
  • LanceDB indexes confirmed: vector IvfHnswSq plus FTS text_idx.

BO767 LanceDB Hybrid

  • Pages processed: 54,730
  • Graph rows: 80,436
  • Persisted LanceDB rows: 76,299
  • BEIR queries: 1,005
  • Total time: 1484.75s / 0:24:44.753
  • Throughput: 36.86 PPS
  • Recall:
    • recall@1: 0.5811
    • recall@3: 0.7950
    • recall@5: 0.8488
    • recall@10: 0.8985
  • NDCG:
    • ndcg@1: 0.5811
    • ndcg@3: 0.7076
    • ndcg@5: 0.7297
    • ndcg@10: 0.7460
  • LanceDB table confirmed with 76,299 rows and indexes:
    • Index(IvfHnswSq, columns=["vector"], name="vector_idx")
    • Index(FTS, columns=["text"], name="text_idx")
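
For readers less familiar with the metrics above, here is a minimal sketch of recall@k and NDCG@k for a single query. It assumes binary relevance and is not the evaluation code used for these runs; the function names are hypothetical.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2) = 1
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

Under this definition, a query with exactly one relevant document has ndcg@1 equal to recall@1. That would be consistent with both being 0.5811 for BO767, although the PR does not state how many relevant documents each query has.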

Dense vs Hybrid Observability

  • BEIR metric names are unchanged (recall@k, ndcg@k) and do not themselves indicate dense vs hybrid.
  • The run summary stdout includes VDB kwargs: {"hybrid": true, ...}.
  • The runtime summary JSON currently records vdb_op: "lancedb" and metrics but does not include vdb_kwargs or an explicit retrieval mode.
  • Recommended follow-up: persist vdb_kwargs or retrieval_mode: hybrid|dense into run.runtime.summary.json for easier auditability in future runs.
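
One possible shape for the recommended follow-up is sketched below. It is only a proposal: `annotate_summary` is a hypothetical helper, and the summary file is assumed to be a flat JSON object. The field names `vdb_kwargs` and `retrieval_mode` come from the recommendation above.

```python
import json
from pathlib import Path
from typing import Any, Dict


def annotate_summary(summary_path: Path, vdb_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Record the VDB kwargs and an explicit retrieval mode in a summary JSON file."""
    summary = json.loads(summary_path.read_text(encoding="utf-8"))

    # query_texts is execution-only, so it is dropped before persisting.
    persisted = {k: v for k, v in vdb_kwargs.items() if k != "query_texts"}

    summary["vdb_kwargs"] = persisted
    summary["retrieval_mode"] = "hybrid" if persisted.get("hybrid") else "dense"

    summary_path.write_text(
        json.dumps(summary, indent=2, sort_keys=True), encoding="utf-8"
    )
    return summary
```

With this in place, a later audit could read `retrieval_mode` directly instead of inferring it from stdout.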

jioffe502 requested review from a team as code owners May 14, 2026 21:21
jioffe502 requested a review from edknv May 14, 2026 21:21
@greptile-apps
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR replaces the NotImplementedError stub for LanceDB hybrid retrieval with a working implementation using LanceDB 0.30.2's chained builder API. It threads query_texts through the operator stack as an execution-time parameter, stripped from persistent kwargs and re-injected only for hybrid calls.

  • Core retrieval path: LanceDB.retrieval materialises vectors/query_texts into aligned lists, validates lengths, and builds a per-query hybrid builder chain.
  • Operator plumbing: RetrieveVdbOperator strips stale query_texts at construction and re-injects runtime values only when hybrid=True.
  • Test coverage: Five new integration tests and two new unit tests cover the happy path, missing/mismatched guards, WHERE filter, and conflicting query_type detection.
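
The operator plumbing in the second bullet can be illustrated with a small sketch. The class names here are hypothetical stand-ins (they are not `RetrieveVdbOperator` or the test suite's FakeVDB), and the `retrieval(vectors, **kwargs)` signature is an assumption made for the example.

```python
from typing import Any, Dict, List, Optional, Sequence


class RecordingVDB:
    """Hypothetical VDB stub that records the kwargs of each retrieval call."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def retrieval(self, vectors: Sequence[Sequence[float]], **kwargs: Any) -> List[list]:
        self.calls.append(kwargs)
        return [[] for _ in vectors]


class RetrieveOperatorSketch:
    """Strips query_texts at construction and re-injects it only for hybrid calls."""

    def __init__(self, vdb: Any, **vdb_kwargs: Any) -> None:
        self._vdb = vdb
        vdb_kwargs.pop("query_texts", None)  # never persist execution-time values
        self._vdb_kwargs = vdb_kwargs

    def process(
        self,
        vectors: Sequence[Sequence[float]],
        query_texts: Optional[Sequence[str]] = None,
    ) -> List[list]:
        call_kwargs = dict(self._vdb_kwargs)
        if call_kwargs.get("hybrid") and query_texts is not None:
            call_kwargs["query_texts"] = list(query_texts)  # runtime value only
        return self._vdb.retrieval(vectors, **call_kwargs)
```

Constructing the operator with a stale `query_texts` and then calling `process` with fresh texts should forward only the fresh ones when `hybrid=True`, and nothing at all for dense retrieval.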

Confidence Score: 5/5

Safe to merge; the hybrid path is well-guarded and does not affect the dense retrieval path.

The implementation correctly strips stale query_texts from persistent kwargs, re-injects runtime values only for hybrid, validates alignment before the query loop, and passes 53 author-run tests with no broken contracts identified.

No files require special attention; the two style observations in lancedb.py are non-blocking.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_retriever/src/nemo_retriever/vdb/lancedb.py | Core hybrid retrieval implementation; adds query_texts plumbing, query builder branching, and defensive validation. Type annotation on retrieval() is now complete. Hardcoded fts_columns default may silently mismatch non-standard schemas. |
| nemo_retriever/src/nemo_retriever/vdb/operators.py | Correctly strips query_texts from persistent kwargs and re-injects at process() time only for hybrid=True. Verified by new unit tests. |
| nemo_retriever/src/nemo_retriever/retriever.py | Minimal: adds alignment-safety comment and passes query_texts into exec_kwargs for graph execution. |
| nemo_retriever/tests/test_lancedb_retrieval_where.py | Five new hybrid tests covering happy path, missing query_texts, length mismatch, WHERE filter, and conflicting query_type. |
| nemo_retriever/tests/test_nv_ingest_vdb_operator.py | Two new unit tests via FakeVDB verifying runtime query_texts replaces stale constructor value for hybrid, and is not forwarded for dense. |

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:587-588
The FTS column name is hardcoded to `"text"` as the default via `setdefault`. Tables whose text column is named differently (e.g. `"content"`, `"document"`, `"body"`) will silently search the wrong column — or fail at query time if no FTS index exists on `"text"`. The `LanceDB` constructor already accepts `text_column_name` for the write path; aligning the retrieval default with that value would keep both sides consistent.

```suggestion
            search_kwargs["query_type"] = "hybrid"
            search_kwargs.setdefault("fts_columns", getattr(self, "text_column_name", "text"))
```

### Issue 2 of 2
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:612-620
**`idx` computed but unused in the dense path**

`enumerate` is used for both branches but `idx` is only read in the hybrid branch. A cleaner split would be two separate `for` loops — the hybrid one using `enumerate`, the dense one without — making the control flow self-documenting and removing the implicit coupling.
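
A minimal sketch of the two-loop split, assuming the per-query builders are passed in as callables. `iter_queries`, `build_dense`, and `build_hybrid` are hypothetical names for illustration and do not appear in lancedb.py.

```python
from typing import Any, Callable, Iterable, Iterator, Optional, Sequence


def iter_queries(
    vectors: Iterable[Sequence[float]],
    query_texts: Optional[Sequence[str]],
    hybrid: bool,
    build_dense: Callable[[Sequence[float]], Any],
    build_hybrid: Callable[[Sequence[float], str], Any],
) -> Iterator[Any]:
    """Yield one query per vector, with a separate loop for each retrieval mode."""
    if hybrid:
        if query_texts is None:
            raise ValueError("hybrid retrieval requires query_texts")
        # Only the hybrid branch needs the index to look up the aligned text.
        for idx, vector in enumerate(vectors):
            yield build_hybrid(vector, query_texts[idx])
    else:
        # The dense branch consumes vectors lazily and never touches an index.
        for vector in vectors:
            yield build_dense(vector)
```

Because the dense loop does not index into anything, it also works when `vectors` is a generator.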

Reviews (2): Last reviewed commit: "Address LanceDB hybrid review comments"

Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py Outdated
@jioffe502
Collaborator Author

Greptile follow-up addressed in d96dcf47: added retrieval type hints, restored dense lazy iteration, and clarified the missing query_texts error.
