Skip to content

feat: add ANN proto codecs and extract table_identifier module#6503

Open
LuQQiu wants to merge 13 commits intolance-format:mainfrom
LuQQiu:lu/ann
Open

feat: add ANN proto codecs and extract table_identifier module#6503
LuQQiu wants to merge 13 commits intolance-format:mainfrom
LuQQiu:lu/ann

Conversation

@LuQQiu
Copy link
Copy Markdown
Contributor

@LuQQiu LuQQiu commented Apr 13, 2026

Add protobuf encode/decode for ANNIvfSubIndexExec

@github-actions github-actions bot added the enhancement New feature or request label Apr 13, 2026
@LuQQiu LuQQiu requested review from BubbleCal and westonpace April 13, 2026 23:50
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 14, 2026

message ANNIvfSubIndexExecProto {
VectorQueryProto query = 1;
TableIdentifier table = 2;
repeated lance.table.IndexMetadata indices = 3;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to serialize prefilter_source for this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefilter source is

  Why this works: PreFilterSource is one of three variants:
  - FilteredRowIds(Arc<dyn ExecutionPlan>) — an execution
  plan child
  - ScalarIndexQuery(Arc<dyn ExecutionPlan>) — an execution
  plan child
  - None — no prefilter

When ANNIvfSubIndexExec has a prefilter, it reports the prefilter plan as a child in children() .
DataFusion's proto serialization framework automatically serializes/deserializes all children. So the codec receives the decoded children in the inputs slice and reconstructs the PreFilterSource from them. It doesn' need to be in the proto. @westonpace FYI

Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits but approve pending @BubbleCal 's observation re: prefilter source

Is the eventual goal to use PhysicalExtensionCodec?

float dist_q_c = 12;
}

message ANNIvfSubIndexExecProto {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just add a brief comment that the purpose of this message is to serialize the ANNIvfSubIndexExec node? (Though that might be obvious from the name)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, added

protos/ann.proto Outdated
// Serialized vector query parameters.
message VectorQueryProto {
// Query vector as Arrow IPC bytes (supports Float16, Float32, Float64, UInt8, etc.)
bytes key_arrow_ipc = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do query_arrow_ipc or query_vector_arrow_ipc? We don't use the term key anywhere else that I know of.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to query_vector_arrow_ipc

protos/ann.proto Outdated
optional uint32 refine_factor = 9;
// Distance metric type as string ("l2", "cosine", "dot", "hamming").
// Absent means None (use the index's default metric).
optional string metric_type = 10;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use

enum VectorMetricType {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to use that
but because lance-index depends on lance-datafusion
protos/ann.proto cannot belongs to lance-datafusion
i have to move that to lance crate

protos/ann.proto Outdated
// Absent means None (use the index's default metric).
optional string metric_type = 10;
bool use_index = 11;
float dist_q_c = 12;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be optional?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to optional

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double checked float dist_q_c is not optional in Query struct, callers send 0.0 explicitly

@westonpace
Copy link
Copy Markdown
Member

The Cargo.lock for python needs updated.

LuQQiu and others added 8 commits April 14, 2026 10:19
Add serialization support for ANNIvfPartitionExec and ANNIvfSubIndexExec
execution plan nodes, enabling distributed ANN query execution.

- Add ann_ivf.proto with VectorQueryProto, ANNIvfPartitionExecProto,
  ANNIvfSubIndexExecProto
- Add knn_proto.rs with encode/decode for ANN IVF exec nodes
- Extract table_identifier.rs from filtered_read_proto.rs for reuse
- Add query/dataset/indices accessors on ANNIvfSubIndexExec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove ANNIvfPartitionExecProto (will be handled separately)
- Use lance.table.IndexMetadata directly via extern_path instead of
  raw bytes encoding
- Move table_identifier tests from filtered_read_proto to
  table_identifier module

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rename test_array_ipc_roundtrip_* to test_query_key_ipc_roundtrip_*
- Move resolve_dataset to table_identifier.rs, use from both
  ann_proto.rs and filtered_read_proto.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Type enum

Address PR review feedback:
- Move ann.proto from lance-datafusion to lance crate (package lance.pb)
  so it can reference types from lance-index without circular deps
- Use lance.index.pb.VectorMetricType enum instead of string for metric_type
- Rename key_arrow_ipc to query_vector_arrow_ipc
- Make minimum_nprobes and dist_q_c optional
- Add comment about prefilter_source handled by DataFusion child serialization
- Add comment explaining ANNIvfSubIndexExecProto purpose

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ery_key → query_vector

- Remove extern_path and lance-table dep from lance-datafusion (only
  needed when ann.proto was compiled there)
- Rename query_key_to/from_ipc_bytes → query_vector_to/from_ipc_bytes
  to match the proto field name query_vector_arrow_ipc

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LuQQiu and others added 5 commits April 14, 2026 11:13
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t to 1

- use_index and dist_q_c are required values (no centralized default),
  keep them as bare proto fields
- minimum_nprobes: optional with unwrap_or(1) matching documented default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Needed by distributed execution codecs to inspect the prefilter
source type when rebuilding plans for remote PEs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@LuQQiu
Copy link
Copy Markdown
Contributor Author

LuQQiu commented Apr 14, 2026

@westonpace @BubbleCal resolve the comments, please take a look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants