Implement true LanceDB hybrid retrieval#2040
Conversation
Greptile SummaryThis PR replaces the
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/vdb/lancedb.py | Core hybrid retrieval implementation; adds query_texts plumbing, query builder branching, and defensive validation. Type annotation on retrieval() is now complete. Hardcoded fts_columns default may silently mismatch non-standard schemas. |
| nemo_retriever/src/nemo_retriever/vdb/operators.py | Correctly strips query_texts from persistent kwargs and re-injects at process() time only for hybrid=True. Verified by new unit tests. |
| nemo_retriever/src/nemo_retriever/retriever.py | Minimal: adds alignment-safety comment and passes query_texts into exec_kwargs for graph execution. |
| nemo_retriever/tests/test_lancedb_retrieval_where.py | Five new hybrid tests covering happy path, missing query_texts, length mismatch, WHERE filter, and conflicting query_type. |
| nemo_retriever/tests/test_nv_ingest_vdb_operator.py | Two new unit tests via FakeVDB verifying runtime query_texts replaces stale constructor value for hybrid, and is not forwarded for dense. |
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:587-588
The FTS column name is hardcoded to `"text"` as the default via `setdefault`. Tables whose text column is named differently (e.g. `"content"`, `"document"`, `"body"`) will silently search the wrong column — or fail at query time if no FTS index exists on `"text"`. The `LanceDB` constructor already accepts `text_column_name` for the write path; aligning the retrieval default with that value would keep both sides consistent.
```suggestion
search_kwargs["query_type"] = "hybrid"
search_kwargs.setdefault("fts_columns", self.text_column_name if hasattr(self, "text_column_name") else "text")
```
### Issue 2 of 2
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:612-620
**`idx` computed but unused in the dense path**
`enumerate` is used for both branches but `idx` is only read in the hybrid branch. A cleaner split would be two separate `for` loops — the hybrid one using `enumerate`, the dense one without — making the control flow self-documenting and removing the implicit coupling.
Reviews (2): Last reviewed commit: "Address LanceDB hybrid review comments" | Re-trigger Greptile
|
Greptile follow-up addressed in |
Summary
query_textsalongside precomputed vectors.query_textsexecution-only: it is stripped from persistent VDB constructor kwargs and forwarded only forhybrid=Trueretrieval calls.hybrid=TrueretrievalNotImplementedErrorwith LanceDB 0.30.2 hybrid query construction:table.search(query_type="hybrid", vector_column_name=..., fts_columns="text").vector(vector).text(query_text).Behavioral Notes
query_textsis not forwarded for dense retrieval.query_textsand validates that query/vector counts match.where/_filter,top_k,refine_factor,n_probe/nprobes,result_fields, andsearch_kwargsbehavior are preserved.search_kwargs["query_type"]now raises a clearValueError.Validation
cd /localhome/local-jioffe/nv-ingest-lancedb/nemo_retriever/localhome/local-jioffe/.local/bin/uv run --extra dev pytest -q tests/test_retriever_queries.py tests/test_nv_ingest_vdb_operator.py tests/test_lancedb_retrieval_where.py32 passed/localhome/local-jioffe/.local/bin/uv run --extra dev pytest -q tests/test_root_cli_workflow.py tests/test_graph_pipeline_cli.py tests/test_lancedb_write_policy.py21 passed, 1 warninggit diff --checkE2E Findings
JP20 LanceDB Hybrid
--no-extract-page-as-image.1,9403,1923,185recall@1: 0.6609recall@3: 0.8522recall@5: 0.9304recall@10: 0.9565IvfHnswSqplus FTStext_idx.BO767 LanceDB Hybrid
54,73080,43676,2991,0051484.75s/0:24:44.75336.86 PPSrecall@1: 0.5811recall@3: 0.7950recall@5: 0.8488recall@10: 0.8985ndcg@1: 0.5811ndcg@3: 0.7076ndcg@5: 0.7297ndcg@10: 0.746076,299rows and indexes:Index(IvfHnswSq, columns=["vector"], name="vector_idx")Index(FTS, columns=["text"], name="text_idx")Dense vs Hybrid Observability
recall@k,ndcg@k) and do not themselves indicate dense vs hybrid.VDB kwargs: {"hybrid": true, ...}.vdb_op: "lancedb"and metrics but does not includevdb_kwargsor an explicit retrieval mode.vdb_kwargsorretrieval_mode: hybrid|denseintorun.runtime.summary.jsonfor easier auditability in future runs.