Benchmark updates on duckdb#202
Merged
Collaborator
Thanks, this makes sense to me.

The original goal of this benchmark was to cover a few different retrieval shapes on the same dataset: lexical search, exact vector search, indexed vector search, hybrid search, and blob-heavy reads. Some of the original workloads were also intended to reflect filtered retrieval, not just best-case indexed latency.

That said, I agree with the direction here. In particular, it makes sense to avoid query shapes that prevent HNSW from firing, and I agree that the previous …

I’m supportive of merging this. We can follow up on our side later to clarify the workload definitions and benchmark notes if needed.
Xuanwo
approved these changes
May 7, 2026
The following has been modified:
- `vector_exact`: DuckDB does not really do well with `vss_match`, so we would rather keep the same plain approach as parquet.
- `vector_index`: This is the biggest one. `HNSW` was not really firing because of the `WHERE` clause (a current limitation). It also only accepts one column in the `ORDER BY` clause, which is why `sample_id` is removed everywhere for consistency. This does not change the performance of lance.
- `hybrid_search`: The same performance optimization from `vector_index` applies: we needed `HNSW` to fire, so we removed the extra column in `ORDER BY`.
- `blob_read`: Most of the time went to the table scan, which DuckDB could not optimize much without an index on the join column. We changed that, and instead of returning `octet_length` we now deserialize the whole image, since that feels more in line with "blob_read". I don't mind if we use `octet_length`, but then what are we really proving with this query?
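For readers who have not hit the `HNSW` limitation before, the query shapes in question can be sketched roughly as follows. This is an illustration only, not the benchmark's actual queries: the table `samples`, the columns `category` and `embedding`, the 3-dimensional vectors, and the choice of `array_distance` as the metric are all assumptions made for the example. Only `sample_id` comes from the description above.

```sql
INSTALL vss;
LOAD vss;

-- Hypothetical table: names and vector dimension are assumptions for this sketch.
CREATE TABLE samples (sample_id INTEGER, category VARCHAR, embedding FLOAT[3]);
INSERT INTO samples VALUES
    (1, 'a', [1.0, 2.0, 3.0]::FLOAT[3]),
    (2, 'b', [2.0, 2.0, 3.0]::FLOAT[3]),
    (3, 'a', [9.0, 9.0, 9.0]::FLOAT[3]);

CREATE INDEX samples_hnsw ON samples USING HNSW (embedding);

-- Shape the index can serve: no WHERE clause, a single distance
-- expression in ORDER BY, and a LIMIT.
EXPLAIN
SELECT sample_id
FROM samples
ORDER BY array_distance(embedding, [1.0, 2.0, 3.0]::FLOAT[3])
LIMIT 2;

-- Shapes that, per the description above, kept HNSW from firing:
-- (a) a filter in WHERE
EXPLAIN
SELECT sample_id
FROM samples
WHERE category = 'a'
ORDER BY array_distance(embedding, [1.0, 2.0, 3.0]::FLOAT[3])
LIMIT 2;

-- (b) an extra tie-breaking column in ORDER BY
EXPLAIN
SELECT sample_id
FROM samples
ORDER BY array_distance(embedding, [1.0, 2.0, 3.0]::FLOAT[3]), sample_id
LIMIT 2;
```

Comparing the three `EXPLAIN` plans is a quick way to check whether the index is picked up: the first plan should contain an `HNSW_INDEX_SCAN` node, while the other two are expected to fall back to a regular scan plus sort, given the limitation described above.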