feat: add SQL vector_search table-valued function #574
Open
summaryzb wants to merge 6 commits into
Change-Id: Iea19cc653222173dc193c43ce5313e19491d3046
Change-Id: Ibe410a13a838d8f605a828d30d14da40ad39e353
Change-Id: Ia82a56d2b2f9e1bec7c0aef350e7f4b3a6997a02
LuciferYang
reviewed
Jun 2, 2026
Change-Id: Ic3bb4811236e7528326608e2462ced8f1e0e625d
Change-Id: Id25498ba739a765d04ca71f6787d43b476724073
LuciferYang
reviewed
Jun 3, 2026
Change-Id: Id229c9dbce89128210c7460490835d98592c18f7
Summary
This PR adds a Spark SQL `vector_search` table-valued function for querying Lance vector columns through the Lance Spark SQL extension. The implementation resolves catalog tables or direct Lance paths, executes Lance nearest-neighbor scans per fragment, and applies a Spark-side global TopK by Lance's internal `_distance` value while keeping `_distance` out of user-visible results.
Problem
Resolves #65
Lance Spark already supports nearest-neighbor reads through read options, but SQL users did not have a first-class query form for vector search. Users needed a SQL extension entry point that works with both Lance catalog tables and direct dataset paths, validates query arguments consistently, preserves fragment-parallel scans, and returns globally ordered TopK results without exposing connector-internal distance metadata.
Approach
The PR registers Lance table-valued functions through `LanceSparkSessionExtensions` and adds a resolution rule that rewrites unresolved Lance TVF logical nodes once their arguments are resolved. `vector_search(table, column, query_vector, limit)` validates foldable literal arguments, resolves the target as either a catalog `LanceDataset` or a path-based Lance dataset, builds a Lance nearest-neighbor `Query`, and passes it through `LanceSparkReadOptions.CONFIG_NEAREST` into a `DataSourceV2Relation`.
To support final ranking, the relation schema is widened with a reserved internal `_distance` column. Fragment scanners request `_distance` from Lance only for nearest-neighbor scans, while normal column projection continues to filter internal columns. The logical plan then sorts globally by `_distance`, applies local and global limits, and projects `_distance` away so the public result schema matches the source table. This explicitly adds the cross-fragment TopK merge that the earlier DataFrame nearest API did not perform. TopN pushdown is disabled when a nearest-neighbor query is active so Spark keeps control of the final global ordering step.
Notice
The PathOnly table pattern changes from "file://" to "file:/".
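The practical effect of the shorter prefix can be illustrated with a small Python sketch. It assumes the pattern is used as a simple prefix check, which this PR description does not confirm; `is_path_only` and the sample URIs are hypothetical names chosen for illustration. The point it shows is only that `file:/` also accepts single-slash file URIs (the form produced by Java's `File.toURI()`), which `file://` rejects.

```python
def is_path_only(identifier: str, prefix: str) -> bool:
    """Hypothetical helper: treat an identifier as a direct path if it starts with prefix."""
    return identifier.startswith(prefix)


OLD_PREFIX = "file://"
NEW_PREFIX = "file:/"

# Two common spellings of the same local dataset location (illustrative values).
single_slash = "file:/tmp/data.lance"    # form produced by java.io.File.toURI()
triple_slash = "file:///tmp/data.lance"  # form with an empty authority component

# The old prefix misses the single-slash form; the new prefix accepts both.
old_matches = [is_path_only(u, OLD_PREFIX) for u in (single_slash, triple_slash)]
new_matches = [is_path_only(u, NEW_PREFIX) for u in (single_slash, triple_slash)]
```

Whether the connector compares prefixes exactly this way is an assumption; the sketch only demonstrates why the shorter pattern is the more permissive of the two.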
Test Coverage
- `vector_search` returns the expected global TopK result ordering.
- `vector_search` keeps multiple fragment scan partitions and leaves final TopK ordering to Spark.
- `_distance` is not exposed in `SELECT *` results.
- `vector_search` works against a Lance dataset URI and applies the requested global limit.
- Invalid arguments are rejected with `IllegalArgumentException`.
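The cross-fragment TopK merge described under Approach can be sketched in plain Python, without Spark or Lance. This is an illustration of the idea only, not the PR's implementation: `fragment_top_k`, `global_top_k`, the `vec` column name, and the use of squared Euclidean distance as the metric are all assumptions made for the example.

```python
import heapq
from typing import Dict, List, Sequence

Row = Dict[str, object]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in distance metric (assumption): squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fragment_top_k(rows: List[Row], query: Sequence[float], k: int) -> List[Row]:
    """Simulate one per-fragment nearest-neighbor scan that also emits _distance."""
    scored = [{**row, "_distance": squared_l2(row["vec"], query)} for row in rows]
    return heapq.nsmallest(k, scored, key=lambda r: r["_distance"])


def global_top_k(fragments: List[List[Row]], query: Sequence[float], k: int) -> List[Row]:
    """Merge per-fragment candidates, keep the global k best, then hide _distance."""
    candidates = [r for frag in fragments for r in fragment_top_k(frag, query, k)]
    best = heapq.nsmallest(k, candidates, key=lambda r: r["_distance"])
    # Project the internal column away so the result schema matches the source rows.
    return [{c: v for c, v in r.items() if c != "_distance"} for r in best]


# Two hypothetical fragments; each alone would return its own local top 2.
fragments = [
    [{"id": 1, "vec": [0.0, 0.0]}, {"id": 2, "vec": [5.0, 5.0]}],
    [{"id": 3, "vec": [1.0, 0.0]}, {"id": 4, "vec": [0.5, 0.0]}],
]
result = global_top_k(fragments, query=[0.0, 0.0], k=2)
```

Simply concatenating the per-fragment results here would yield four rows in fragment order. The final sort and limit reduce them to the two globally nearest rows (ids 1 and 4), and the returned rows carry no `_distance` key.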