
feat: add SQL vector_search table-valued function #574

Open
summaryzb wants to merge 6 commits into lance-format:main from summaryzb:sql-vec

Conversation

@summaryzb
Contributor

@summaryzb summaryzb commented Jun 2, 2026

Summary

This PR adds a Spark SQL vector_search table-valued function for querying Lance vector columns through the Lance Spark SQL extension. The implementation resolves catalog tables or direct Lance paths, executes Lance nearest-neighbor scans per fragment, and applies a Spark-side global TopK ordered by Lance's internal _distance value, while keeping _distance out of user-visible results.

Problem

Resolves #65.
Lance Spark already supports nearest-neighbor reads through read options, but SQL users did not have a first-class query form for vector search. Users needed a SQL extension entry point that works with both Lance catalog tables and direct dataset paths, validates query arguments consistently, preserves fragment-parallel scans, and returns globally ordered TopK results without exposing connector-internal distance metadata.

Approach

The PR registers Lance table-valued functions through LanceSparkSessionExtensions and adds a resolution rule that rewrites unresolved Lance TVF logical nodes once their arguments are resolved. vector_search(table, column, query_vector, limit) validates foldable literal arguments, resolves the target as either a catalog LanceDataset or path-based Lance dataset, builds a Lance nearest-neighbor Query, and passes it through LanceSparkReadOptions.CONFIG_NEAREST into a DataSourceV2Relation.
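
As a rough illustration of the argument validation described above, here is a minimal Python sketch. It is not the PR's implementation (which lives in the Spark extension); the function name validate_vector_search_args and every check other than the positive-limit check are assumptions made for illustration, and ValueError stands in for the IllegalArgumentException mentioned under Test Coverage.

```python
from numbers import Real
from typing import Sequence


def validate_vector_search_args(
    table: str, column: str, query_vector: Sequence[float], limit: int
) -> None:
    """Hypothetical stand-in for the checks vector_search might run on its literals."""
    # Assumption: the table identifier (catalog name or dataset path) must be non-empty.
    if not isinstance(table, str) or not table.strip():
        raise ValueError("table must be a non-empty string")
    # Assumption: the vector column name must be non-empty.
    if not isinstance(column, str) or not column.strip():
        raise ValueError("column must be a non-empty string")
    # Assumption: the query vector must be a non-empty sequence of real numbers.
    if len(query_vector) == 0 or not all(
        isinstance(x, Real) and not isinstance(x, bool) for x in query_vector
    ):
        raise ValueError("query_vector must be a non-empty numeric sequence")
    # From the PR: non-positive limits are rejected.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
```

Validating up front means a bad limit or an empty vector fails before any fragment scan is planned.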

To support final ranking, the relation schema is widened with a reserved internal _distance column. Fragment scanners request _distance from Lance only for nearest-neighbor scans, while normal column projection continues to filter internal columns. The logical plan then sorts globally by _distance, applies local and global limits, and projects _distance away so the public result schema matches the source table. This explicitly adds the cross-fragment TopK merge that the earlier DataFrame nearest API did not perform. TopN pushdown is disabled when a nearest-neighbor query is active so Spark keeps control of the final global ordering step.
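
The cross-fragment merge described above can be sketched in plain Python. This is a conceptual model only, not the Spark or Lance code: the helper names, the in-memory row format, and the use of squared L2 distance are all assumptions. Each fragment returns its own k nearest rows annotated with _distance, a global step keeps the k smallest distances overall, and _distance is dropped before rows are returned.

```python
import heapq
from typing import Dict, List, Sequence

Row = Dict[str, object]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumption: squared L2 is used here purely for illustration.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fragment_top_k(rows: List[Row], column: str, query: Sequence[float], k: int) -> List[Row]:
    # Per-fragment scan: annotate each row with the internal _distance column.
    scored = [{**row, "_distance": squared_l2(row[column], query)} for row in rows]
    return heapq.nsmallest(k, scored, key=lambda r: r["_distance"])


def global_top_k(
    fragments: List[List[Row]], column: str, query: Sequence[float], k: int
) -> List[Row]:
    # Collect each fragment's local TopK, then rank all candidates globally.
    candidates = [r for frag in fragments for r in fragment_top_k(frag, column, query, k)]
    best = heapq.nsmallest(k, candidates, key=lambda r: r["_distance"])
    # Project _distance away so the result schema matches the source table.
    return [{name: v for name, v in r.items() if name != "_distance"} for r in best]


fragments = [
    [{"id": 1, "vec": [0.0, 0.0]}, {"id": 2, "vec": [5.0, 5.0]}],
    [{"id": 3, "vec": [1.0, 0.0]}, {"id": 4, "vec": [0.5, 0.0]}],
]
result = global_top_k(fragments, "vec", [0.0, 0.0], 2)
```

In this toy data the two nearest rows sit in different fragments, so simply concatenating per-fragment results would return four rows, including one far from the query. The global sort and limit are what turn per-fragment answers into a single ordered TopK.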

Notice

The PathOnly table pattern changes from "file://" to "file:/".

Test Coverage

  • Catalog-table vector_search returns the expected global TopK result ordering.
  • vector_search keeps multiple fragment scan partitions and leaves final TopK ordering to Spark.
  • Internal _distance is not exposed in SELECT * results.
  • Direct-path vector_search works against a Lance dataset URI and applies the requested global limit.
  • Invalid non-positive limits are rejected with an IllegalArgumentException.

Change-Id: Iea19cc653222173dc193c43ce5313e19491d3046
@github-actions github-actions Bot added the enhancement New feature or request label Jun 2, 2026
summaryzb added 2 commits June 2, 2026 17:38
Change-Id: Ibe410a13a838d8f605a828d30d14da40ad39e353
Change-Id: Ia82a56d2b2f9e1bec7c0aef350e7f4b3a6997a02
summaryzb added 2 commits June 2, 2026 23:04
Change-Id: Ic3bb4811236e7528326608e2462ced8f1e0e625d
Change-Id: Id25498ba739a765d04ca71f6787d43b476724073
Comment thread docs/src/operations/dql/select.md Outdated
Comment thread docs/src/config.md Outdated
Change-Id: Id229c9dbce89128210c7460490835d98592c18f7
@LuciferYang
Contributor

cc @hamersaw
also cc @Xuanwo



Development

Successfully merging this pull request may close these issues.

vector search SQL extension

2 participants