feat(optimizer): rewrite ORDER BY <distance> LIMIT k into Lance KNN scan #215

Open
martji wants to merge 1 commit into lance-format:main from martji:feat/optimizer-knn-rewrite

martji commented May 26, 2026

Summary

Standard SQL SELECT ... ORDER BY array_distance(vec, q) LIMIT k previously fell through to DuckDB's LogicalTopN over a full Lance scan and silently ignored any IVF / HNSW
index on the column. The only way to use the index was to switch to the bespoke lance_vector_search table function.

This PR adds an OptimizerExtension that rewrites the standard pattern into an index-backed plan with no user-visible change.

Design

The optimizer matches:

LogicalTopN(limit=k, offset=0, ASC dist)
└── LogicalProjection(... dist_func(vec, const_query) ...)
    └── [LogicalFilter ...]   // optional
        └── LogicalGet(__lance_scan | __lance_table_scan | __lance_namespace_scan)

and replaces the underlying LogicalGet with a new internal __lance_knn_scan, collapsing TopN and any pushable Filter into a single GET. A covering WHERE clause is
encoded as Lance Filter IR so Lance pre-filters inside the index lookup.
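
The shape check described above can be modelled in a few lines. This is only an illustrative Python sketch, not the extension's C++ code: the `Node` class, its `attrs` keys, and `match_knn_pattern` are hypothetical names invented here, and the real optimizer works on DuckDB logical operators rather than on this toy tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Scan functions named in the PR description.
LANCE_SCANS = {"__lance_scan", "__lance_table_scan", "__lance_namespace_scan"}


@dataclass
class Node:
    """Toy stand-in for a logical operator (hypothetical, for illustration)."""
    kind: str  # "TopN", "Projection", "Filter" or "Get"
    children: List["Node"] = field(default_factory=list)
    attrs: dict = field(default_factory=dict)


def match_knn_pattern(root: Node) -> Optional[dict]:
    """Return the matched pieces, or None to keep the original plan."""
    if root.kind != "TopN" or len(root.children) != 1:
        return None
    limit = root.attrs.get("limit")
    # Require a limit, no OFFSET, and exactly one ascending ORDER BY key.
    if limit is None or root.attrs.get("offset", 0) != 0:
        return None
    if root.attrs.get("orders") != ["ASC"]:
        return None

    proj = root.children[0]
    if proj.kind != "Projection" or len(proj.children) != 1:
        return None

    child = proj.children[0]
    flt = None
    if child.kind == "Filter":  # the Filter level is optional
        if len(child.children) != 1:
            return None
        flt, child = child, child.children[0]

    if child.kind != "Get" or child.attrs.get("function") not in LANCE_SCANS:
        return None
    return {"k": limit, "projection": proj, "filter": flt, "get": child}
```

Returning `None` for anything that does not match mirrors the fallback behaviour: the caller leaves the plan untouched.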

Distance functions

| SQL function | Metric |
| --- | --- |
| array_distance / list_distance | l2 |
| array_cosine_distance / list_cosine_distance | cosine |
| array_negative_inner_product / list_negative_inner_product | dot |

l2 and cosine are commutative; for dot, exactly one side must be a constant FLOAT[N] and the other a column reference.
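
As a rough illustration of the mapping above, the following Python sketch resolves a function name to its metric and separates the column from the constant query vector. The tuple encoding of arguments and the helper names are assumptions made for this example, and it assumes the two arguments may appear in either order for all three metrics.

```python
from typing import Optional, Tuple

# Function-to-metric table taken from the PR description.
METRIC_BY_FUNCTION = {
    "array_distance": "l2",
    "list_distance": "l2",
    "array_cosine_distance": "cosine",
    "list_cosine_distance": "cosine",
    "array_negative_inner_product": "dot",
    "list_negative_inner_product": "dot",
}


def metric_for(function_name: str) -> Optional[str]:
    """Return the Lance metric for a distance function, or None if unsupported."""
    return METRIC_BY_FUNCTION.get(function_name.lower())


def split_arguments(left: tuple, right: tuple) -> Optional[Tuple[str, list]]:
    """Return (column_name, query_vector) when exactly one side is a column
    and the other a constant; otherwise None, meaning fall back.

    Arguments use a hypothetical encoding: ("column", name) or
    ("constant", list_of_floats).
    """
    if {left[0], right[0]} != {"column", "constant"}:
        return None
    column, constant = (left, right) if left[0] == "column" else (right, left)
    return column[1], constant[1]
```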

Metric consistency

A new FFI lance_dataset_vector_index_metric returns the index's metric for a column, cached per-column on LanceDatasetCacheEntry (positive lookups only, so transient
errors don't poison the entry and late CREATE INDEX is picked up on the next plan). The optimizer falls back if the column has no vector index or the metric doesn't match
the distance function.
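
The positive-only caching policy can be sketched as follows. This is a simplified Python model under stated assumptions: `MetricCache`, `can_rewrite`, and the `lookup` callable are hypothetical stand-ins (the callable plays the role of the lance_dataset_vector_index_metric FFI call), not the actual LanceDatasetCacheEntry code.

```python
from typing import Callable, Dict, Optional


class MetricCache:
    """Caches the index metric per column, storing successful lookups only."""

    def __init__(self, lookup: Callable[[str], Optional[str]]):
        self._lookup = lookup  # hypothetical stand-in for the FFI call
        self._cache: Dict[str, str] = {}

    def metric(self, column: str) -> Optional[str]:
        hit = self._cache.get(column)
        if hit is not None:
            return hit
        try:
            found = self._lookup(column)
        except Exception:
            # A transient error is treated as "no index" and is not cached.
            found = None
        if found is not None:
            self._cache[column] = found
        # A miss is never stored, so a later CREATE INDEX is seen next time.
        return found


def can_rewrite(cache: MetricCache, column: str, wanted_metric: str) -> bool:
    """Rewrite only if the column has an index whose metric matches."""
    found = cache.metric(column)
    return found is not None and found == wanted_metric
```

Because misses are looked up again on every plan, the cost saving applies only to columns that already have an index, which is the case the rewrite cares about.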

Session variables

| Variable | Default | Purpose |
| --- | --- | --- |
| lance_knn_nprobes | 0 (Lance default) | IVF probe count |
| lance_knn_refine_factor | 0 | PQ refine factor |
| lance_knn_prefilter | true | Apply filter before / after the scan |
| lance_knn_use_index | true | Kill switch: false disables the rewrite |
| lance_knn_max_k | 10000 (hard ceiling 1,000,000) | Caps k to bound index scan cost |
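
One plausible reading of how lance_knn_use_index and lance_knn_max_k gate the rewrite is sketched below in Python. The function names are hypothetical, and two details are assumptions rather than facts from this PR: that a setting of 0 (or less) selects the default of 10000, and that settings above the hard ceiling are clamped rather than rejected.

```python
DEFAULT_MAX_K = 10_000      # default from the table above
HARD_CEILING = 1_000_000    # hard ceiling from the table above


def effective_max_k(setting: int) -> int:
    """Resolve the lance_knn_max_k setting to the cap actually applied.

    Assumption: 0 or a negative value means "use the default", and values
    above the ceiling are clamped to it.
    """
    if setting <= 0:
        return DEFAULT_MAX_K
    return min(setting, HARD_CEILING)


def should_rewrite(k: int, use_index: bool, max_k_setting: int) -> bool:
    """Rewrite only when the kill switch is on and k fits under the cap."""
    return use_index and 0 < k <= effective_max_k(max_k_setting)
```

For example, `should_rewrite(50, True, 20)` is false because k exceeds the lowered cap, so the query would keep its original TopN plan.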

Fallback

Thirteen conditions leave the original plan unchanged (DESC, OFFSET, multi-ORDER-BY, non-constant query vector, unpushable WHERE, missing index, metric mismatch, k > max_k,
etc.). Query results are unaffected in every fallback case.

Hardening

  • Filter IR dedup: LogicalFilter, LIKE pushdown, and table_filters can each contribute IR fragments; byte-equal fragments are deduplicated so LIKE + KNN don't push
    the same predicate twice.
  • Rebind verification: The __lance_knn_scan rebind path (prepared statement / EXPLAIN ANALYZE replan) re-verifies the index and metric. A clear "re-prepare" error
    replaces an opaque downstream failure when an index is dropped across contexts.
  • 1 MiB blob cap: Rebound query / filter IR blobs are capped at 1 MiB; a typo'd internal call cannot trigger an eager multi-GB allocation.
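
Two of the hardening steps above lend themselves to a short sketch. The Python below is illustrative only: `dedup_fragments` and `check_blob_size` are hypothetical names, fragments are modelled as plain `bytes`, and the size check runs on an already materialized blob, whereas the real guard presumably rejects an oversized length before allocating.

```python
from typing import Iterable, List

MAX_BLOB_BYTES = 1024 * 1024  # 1 MiB cap from the PR description


def dedup_fragments(fragments: Iterable[bytes]) -> List[bytes]:
    """Drop byte-equal filter IR fragments, keeping first-seen order."""
    seen = set()
    unique = []
    for fragment in fragments:
        if fragment not in seen:
            seen.add(fragment)
            unique.append(fragment)
    return unique


def check_blob_size(blob: bytes) -> bytes:
    """Reject rebound query / filter blobs larger than the cap."""
    if len(blob) > MAX_BLOB_BYTES:
        raise ValueError(
            f"blob of {len(blob)} bytes exceeds the {MAX_BLOB_BYTES}-byte cap"
        )
    return blob
```

Comparing raw bytes only removes exact duplicates; two fragments that encode the same predicate differently would both be kept, which is safe but redundant.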

Test plan

  • test/sql/optimizer_knn_basic.test: rewrite shape; bidirectional EXCEPT against lance_vector_search for all three metrics
  • test/sql/optimizer_knn_filter.test: WHERE pushdown, prefilter=true vs false row-count gap, IR-dedup byte window, LIKE + KNN interaction
  • test/sql/optimizer_knn_fallback.test: every fallback condition, lance_knn_max_k lowering + 0 = default branch, late CREATE INDEX proving the metric cache only
    stores positive lookups
  • Full regression (GEN=ninja make test): 4203 assertions / 53 cases pass

Standard SQL `ORDER BY array_distance(vec, q) LIMIT k` previously fell
through DuckDB's TopN over a full Lance scan and ignored any IVF / HNSW
vector index attached to the column. Using the index meant calling the
bespoke `lance_vector_search` table function, breaking SQL idiom.

The new OptimizerExtension matches `TopN -> Projection -> [Filter ->]
GET(__lance_scan*)` and rewrites the GET to a new internal table
function `__lance_knn_scan`, collapsing TopN and any pushable Filter
into a single GET. A covering WHERE clause is encoded as Lance Filter IR
so Lance pre-filters inside the index lookup. Session vars
(`lance_knn_nprobes`, `lance_knn_refine_factor`, `lance_knn_prefilter`,
`lance_knn_use_index`, `lance_knn_max_k`) flow into the rewritten scan;
`use_index = false` acts as a kill switch, `max_k` (default 10000, hard
ceiling 1000000) caps the rewrite to bound k inside the index scan.

Distance functions covered: `array_distance` / `list_distance` (l2),
`array_cosine_distance` / `list_cosine_distance` (cosine),
`array_negative_inner_product` / `list_negative_inner_product` (dot).
The metric is cross-checked against the column's vector index via a
new `lance_dataset_vector_index_metric` FFI, cached per-column on
LanceDatasetCacheEntry to avoid reopening index_statistics on every
plan. On mismatch (or any of the 13 fallback conditions: missing
index, OFFSET, DESC, multi-ORDER-BY, non-constant query vector,
k > max_k, unpushable WHERE, ...) the optimizer returns the original
plan unchanged.

Hardening:
- Dedup filter IR fragments across LogicalFilter, LIKE pushdown, and
  table_filters so LIKE + KNN no longer push the same predicate twice.
- Re-verify index/metric in __lance_knn_scan rebind path; a clear
  "re-prepare" error replaces opaque downstream failure on cross-
  context drops.
- Cap rebound query blob at 1 MiB to keep a typo'd internal call from
  allocating eagerly.

Tests cover the rewrite path (basic), filter pushdown including the
selective `prefilter=true` vs `false` row-count gap and IR-dedup byte
window (filter), and every fallback condition plus the `max_k`
lowering and `0 = default` branches (fallback). All three suites
cross-check results against `lance_vector_search` via bidirectional
EXCEPT.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>