feat: add ANN proto codecs and extract table_identifier module#6503
feat: add ANN proto codecs and extract table_identifier module#6503LuQQiu wants to merge 13 commits intolance-format:mainfrom
Conversation
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
| message ANNIvfSubIndexExecProto { | ||
| VectorQueryProto query = 1; | ||
| TableIdentifier table = 2; | ||
| repeated lance.table.IndexMetadata indices = 3; |
There was a problem hiding this comment.
do we need to serialize prefilter_source for this?
There was a problem hiding this comment.
prefilter source is
Why this works: PreFilterSource is one of three variants:
- FilteredRowIds(Arc<dyn ExecutionPlan>) — an execution
plan child
- ScalarIndexQuery(Arc<dyn ExecutionPlan>) — an execution
plan child
- None — no prefilter
When ANNIvfSubIndexExec has a prefilter, it reports the prefilter plan as a child in children() .
DataFusion's proto serialization framework automatically serializes/deserializes all children. So the codec receives the decoded children in the inputs slice and reconstructs the PreFilterSource from them. It doesn' need to be in the proto. @westonpace FYI
westonpace
left a comment
There was a problem hiding this comment.
A few nits but approve pending @BubbleCal 's observation re: prefilter source
Is the eventual goal to use PhysicalExtensionCodec?
| float dist_q_c = 12; | ||
| } | ||
|
|
||
| message ANNIvfSubIndexExecProto { |
There was a problem hiding this comment.
Can we just add a brief comment that the purpose of this message is to serialize the ANNIvfSubIndexExec node? (Though that might be obvious from the name)
protos/ann.proto
Outdated
| // Serialized vector query parameters. | ||
| message VectorQueryProto { | ||
| // Query vector as Arrow IPC bytes (supports Float16, Float32, Float64, UInt8, etc.) | ||
| bytes key_arrow_ipc = 1; |
There was a problem hiding this comment.
Can we do query_arrow_ipc or query_vector_arrow_ipc? We don't use the term key anywhere else that I know of.
There was a problem hiding this comment.
changed to query_vector_arrow_ipc
protos/ann.proto
Outdated
| optional uint32 refine_factor = 9; | ||
| // Distance metric type as string ("l2", "cosine", "dot", "hamming"). | ||
| // Absent means None (use the index's default metric). | ||
| optional string metric_type = 10; |
There was a problem hiding this comment.
Change to use that
but because lance-index depends on lance-datafusion
protos/ann.proto cannot belongs to lance-datafusion
i have to move that to lance crate
protos/ann.proto
Outdated
| // Absent means None (use the index's default metric). | ||
| optional string metric_type = 10; | ||
| bool use_index = 11; | ||
| float dist_q_c = 12; |
There was a problem hiding this comment.
Double checked float dist_q_c is not optional in Query struct, callers send 0.0 explicitly
|
The |
Add serialization support for ANNIvfPartitionExec and ANNIvfSubIndexExec execution plan nodes, enabling distributed ANN query execution. - Add ann_ivf.proto with VectorQueryProto, ANNIvfPartitionExecProto, ANNIvfSubIndexExecProto - Add knn_proto.rs with encode/decode for ANN IVF exec nodes - Extract table_identifier.rs from filtered_read_proto.rs for reuse - Add query/dataset/indices accessors on ANNIvfSubIndexExec Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove ANNIvfPartitionExecProto (will be handled separately) - Use lance.table.IndexMetadata directly via extern_path instead of raw bytes encoding - Move table_identifier tests from filtered_read_proto to table_identifier module Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rename test_array_ipc_roundtrip_* to test_query_key_ipc_roundtrip_* - Move resolve_dataset to table_identifier.rs, use from both ann_proto.rs and filtered_read_proto.rs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Type enum Address PR review feedback: - Move ann.proto from lance-datafusion to lance crate (package lance.pb) so it can reference types from lance-index without circular deps - Use lance.index.pb.VectorMetricType enum instead of string for metric_type - Rename key_arrow_ipc to query_vector_arrow_ipc - Make minimum_nprobes and dist_q_c optional - Add comment about prefilter_source handled by DataFusion child serialization - Add comment explaining ANNIvfSubIndexExecProto purpose Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ery_key → query_vector - Remove extern_path and lance-table dep from lance-datafusion (only needed when ann.proto was compiled there) - Rename query_key_to/from_ipc_bytes → query_vector_to/from_ipc_bytes to match the proto field name query_vector_arrow_ipc Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t to 1 - use_index and dist_q_c are required values (no centralized default), keep them as bare proto fields - minimum_nprobes: optional with unwrap_or(1) matching documented default Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Needed by distributed execution codecs to inspect the prefilter source type when rebuilding plans for remote PEs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@westonpace @BubbleCal resolve the comments, please take a look |
Add protobuf encode/decode for
ANNIvfSubIndexExec