feat(optimizer): rewrite ORDER BY <distance> LIMIT k into Lance KNN scan #215
Open
martji wants to merge 1 commit into
Standard SQL `ORDER BY array_distance(vec, q) LIMIT k` previously fell through DuckDB's TopN over a full Lance scan and ignored any IVF / HNSW vector index attached to the column. Using the index meant calling the bespoke `lance_vector_search` table function, breaking SQL idiom.

The new OptimizerExtension matches `TopN -> Projection -> [Filter ->] GET(__lance_scan*)` and rewrites the GET to a new internal table function `__lance_knn_scan`, collapsing TopN and any pushable Filter into a single GET. A covering WHERE clause is encoded as Lance Filter IR so Lance pre-filters inside the index lookup.

Session vars (`lance_knn_nprobes`, `lance_knn_refine_factor`, `lance_knn_prefilter`, `lance_knn_use_index`, `lance_knn_max_k`) flow into the rewritten scan; `use_index = false` acts as a kill switch, and `max_k` (default 10000, hard ceiling 1000000) caps the rewrite to bound k inside the index scan.

Distance functions covered: `array_distance` / `list_distance` (l2), `array_cosine_distance` / `list_cosine_distance` (cosine), `array_negative_inner_product` / `list_negative_inner_product` (dot).

The metric is cross-checked against the column's vector index via a new `lance_dataset_vector_index_metric` FFI, cached per-column on LanceDatasetCacheEntry to avoid reopening index_statistics on every plan. On mismatch (or any of the 13 fallback conditions: missing index, OFFSET, DESC, multi-ORDER-BY, non-constant query vector, k > max_k, unpushable WHERE, ...) the optimizer returns the original plan unchanged.

Hardening:
- Dedup filter IR fragments across LogicalFilter, LIKE pushdown, and table_filters so LIKE + KNN no longer push the same predicate twice.
- Re-verify index/metric in the `__lance_knn_scan` rebind path; a clear "re-prepare" error replaces an opaque downstream failure on cross-context drops.
- Cap the rebound query blob at 1 MiB to keep a typo'd internal call from allocating eagerly.

Tests cover the rewrite path (basic), filter pushdown including the selective `prefilter=true` vs `false` row-count gap and IR-dedup byte window (filter), and every fallback condition plus the `max_k` lowering and `0 = default` branches (fallback). All three suites cross-check results against `lance_vector_search` via bidirectional EXCEPT.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Standard SQL `SELECT ... ORDER BY array_distance(vec, q) LIMIT k` previously fell through DuckDB's `LogicalTopN` over a full Lance scan and silently ignored any IVF / HNSW index on the column; the only way to use the index was to switch to the bespoke `lance_vector_search` table function.

This PR adds an `OptimizerExtension` that rewrites the standard pattern into an index-backed plan with no user-visible change.

Design
The optimizer matches:
LogicalTopN(limit=k, offset=0, ASC dist)
└── LogicalProjection(... dist_func(vec, const_query) ...)
└── [LogicalFilter ...] // optional
└── LogicalGet(__lance_scan | __lance_table_scan | __lance_namespace_scan)
and replaces the underlying `LogicalGet` with a new internal `__lance_knn_scan`, collapsing `TopN` and any pushable `Filter` into a single GET. A covering `WHERE` clause is encoded as Lance Filter IR so Lance pre-filters inside the index lookup.
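The shape check described above can be illustrated with a small, language-neutral sketch. This is not the extension's C++ code: the `Node` class and `rewrite` function are hypothetical stand-ins for DuckDB's logical operators, and keeping the projection above the new GET is an assumption, since the description only says that `TopN` and the pushable `Filter` are collapsed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a DuckDB logical operator."""
    kind: str                       # "TOPN", "PROJECTION", "FILTER" or "GET"
    child: Optional["Node"] = None
    name: str = ""                  # table function name (GET nodes only)
    limit: int = 0                  # k (TOPN, and the rewritten GET)
    offset: int = 0                 # TOPN only
    ascending: bool = True          # TOPN only


LANCE_SCANS = {"__lance_scan", "__lance_table_scan", "__lance_namespace_scan"}


def rewrite(plan: Node) -> Node:
    """Return the rewritten plan if the shape matches, else the same plan object."""
    # The root must be an ascending TopN without OFFSET.
    if plan.kind != "TOPN" or plan.offset != 0 or not plan.ascending:
        return plan
    proj = plan.child
    if proj is None or proj.kind != "PROJECTION":
        return plan
    # The Filter between Projection and GET is optional.
    below = proj.child
    if below is not None and below.kind == "FILTER":
        below = below.child
    if below is None or below.kind != "GET" or below.name not in LANCE_SCANS:
        return plan
    # Assumption: the projection stays; TopN and Filter fold into the KNN GET.
    knn = Node(kind="GET", name="__lance_knn_scan", limit=plan.limit)
    return Node(kind="PROJECTION", child=knn)
```

Returning the original object untouched on any mismatch mirrors the fallback rule: a plan that does not match is never partially modified.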
Distance functions

| Function | Metric |
| --- | --- |
| `array_distance` / `list_distance` | `l2` |
| `array_cosine_distance` / `list_cosine_distance` | `cosine` |
| `array_negative_inner_product` / `list_negative_inner_product` | `dot` |

`l2` and `cosine` are commutative; for `dot`, exactly one side must be a constant `FLOAT[N]` and the other a column reference.

Metric consistency
A new FFI `lance_dataset_vector_index_metric` returns the index's metric for a column, cached per-column on `LanceDatasetCacheEntry` (positive lookups only, so transient errors don't poison the entry and a late `CREATE INDEX` is picked up on the next plan). The optimizer falls back if the column has no vector index or the metric doesn't match the distance function.
Session variables
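
| Variable | Default | Notes |
| --- | --- | --- |
| `lance_knn_nprobes` | `0` (Lance default) | |
| `lance_knn_refine_factor` | `0` | |
| `lance_knn_prefilter` | `true` | |
| `lance_knn_use_index` | `true` | `false` disables the rewrite |
| `lance_knn_max_k` | `10000` (hard ceiling 1,000,000) | caps `k` to bound index scan cost |

Fallback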
Thirteen conditions keep the original plan unchanged (`DESC`, `OFFSET`, multi-`ORDER BY`, non-constant query vector, unpushable `WHERE`, missing index, metric mismatch, `k > max_k`, etc.). Correctness is unchanged in every fallback case.
Hardening
- `LogicalFilter`, LIKE pushdown, and `table_filters` can each contribute IR fragments; byte-equal fragments are deduplicated so LIKE + KNN don't push the same predicate twice.
- The `__lance_knn_scan` rebind path (prepared statement / `EXPLAIN ANALYZE` replan) re-verifies the index and metric. A clear "re-prepare" error replaces an opaque downstream failure when an index is dropped across contexts.
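The byte-equal deduplication in the first item can be sketched in a few lines. The function name and the use of raw `bytes` for an encoded filter fragment are assumptions made for illustration; the actual IR encoding is not shown in this PR description.

```python
from typing import Iterable, List


def dedup_fragments(fragments: Iterable[bytes]) -> List[bytes]:
    """Drop byte-equal duplicates while keeping first-seen order."""
    seen = set()
    unique: List[bytes] = []
    for frag in fragments:
        if frag not in seen:
            seen.add(frag)
            unique.append(frag)
    return unique
```

Comparing encoded bytes is a purely syntactic check: two predicates that are logically equivalent but encoded differently would both be kept, which is redundant but still correct.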
Test plan
- `test/sql/optimizer_knn_basic.test`: rewrite shape; bidirectional `EXCEPT` against `lance_vector_search` for all three metrics
- `test/sql/optimizer_knn_filter.test`: `WHERE` pushdown, `prefilter=true` vs `false` row-count gap, IR-dedup byte window, LIKE + KNN interaction
- `test/sql/optimizer_knn_fallback.test`: every fallback condition, `lance_knn_max_k` lowering + `0 = default` branch, late `CREATE INDEX` proving the metric cache only stores positive lookups
- `GEN=ninja make test`: 4203 assertions / 53 cases pass
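The bidirectional `EXCEPT` check used by the suites can be illustrated with a self-contained sketch. It runs against SQLite from the Python standard library purely as a stand-in engine; the table names `rewritten` and `reference` and the helper function are hypothetical, and the real tests are `.test` files that compare the rewritten query with `lance_vector_search`.

```python
import sqlite3


def symmetric_difference_rows(conn, left, right):
    """Rows in exactly one of the two tables; an empty list means they agree."""
    sql = (
        f"SELECT * FROM (SELECT * FROM {left} EXCEPT SELECT * FROM {right}) "
        f"UNION ALL "
        f"SELECT * FROM (SELECT * FROM {right} EXCEPT SELECT * FROM {left})"
    )
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rewritten(id INTEGER, dist REAL);
CREATE TABLE reference(id INTEGER, dist REAL);
INSERT INTO rewritten VALUES (1, 0.1), (2, 0.2), (3, 0.3);
INSERT INTO reference VALUES (3, 0.3), (1, 0.1), (2, 0.2);
""")

# Same rows in a different order: both EXCEPT directions come back empty.
print(symmetric_difference_rows(conn, "rewritten", "reference"))
```

Running `EXCEPT` in both directions matters because a single direction only proves that one result is a subset of the other. `EXCEPT` compares sets, so this check ignores row order and duplicate rows.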