Benchmark updates on duckdb #202

Merged
Xuanwo merged 2 commits into lance-format:main from guillesd:benchmark-updates on May 7, 2026
@guillesd guillesd commented May 4, 2026

The following has been modified:

  • vector_exact: DuckDB does not perform well with vss_match, so we use the same plain approach as for Parquet.
  • vector_index: This is the biggest change. HNSW was not firing because of the WHERE clause (a current limitation). It also only accepts one column in the ORDER BY clause, which is why sample_id is removed everywhere for consistency. This does not change Lance's performance.
  • hybrid_search: The same optimization from vector_index applies: we needed HNSW to fire, so we removed the extra column in ORDER BY.
  • blob_read: Most of the time was spent in the table scan, which DuckDB could not optimize much without an index on the join column. We changed that, and instead of returning octet_length we now deserialize the whole image, since that feels more in line with "blob_read". I don't mind if we do octet_length, but then what are we really proving with this query?

HNSW was also dormant because SET hnsw_enable_experimental_persistence = true was never set in the query session.
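
For readers unfamiliar with the constraints above, here is a minimal sketch of the query shape that lets DuckDB's vss extension use an HNSW index. The table `docs`, its columns, the index name `docs_hnsw`, the 3-dimensional vectors, and the cosine metric are all hypothetical stand-ins. They are not the benchmark's actual schema or queries.

```sql
-- Hypothetical schema and data; only the query shape matters here.
INSTALL vss;
LOAD vss;

-- Per the note above, this setting must be enabled in the query session
-- when the HNSW index lives in a persistent database file.
SET hnsw_enable_experimental_persistence = true;

CREATE TABLE docs (sample_id BIGINT, category VARCHAR, embedding FLOAT[3]);
INSERT INTO docs VALUES
    (1, 'a', [0.1, 0.2, 0.3]::FLOAT[3]),
    (2, 'b', [0.9, 0.1, 0.0]::FLOAT[3]);

CREATE INDEX docs_hnsw ON docs USING HNSW (embedding) WITH (metric = 'cosine');

-- Index-friendly shape: one distance expression in ORDER BY, a LIMIT,
-- no WHERE clause, and no tie-breaker column such as sample_id.
SELECT sample_id
FROM docs
ORDER BY array_cosine_distance(embedding, [0.1, 0.2, 0.3]::FLOAT[3])
LIMIT 10;

-- Shapes described above as preventing the index from firing:
--   ... WHERE category = 'a' ORDER BY array_cosine_distance(...) LIMIT 10
--   ... ORDER BY array_cosine_distance(...), sample_id LIMIT 10
```

Prefixing the SELECT with EXPLAIN is one way to check whether the plan uses an HNSW index scan or falls back to a sequential scan plus top-N.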

@prrao87 prrao87 requested a review from Xuanwo May 4, 2026 14:37
Xuanwo commented May 7, 2026

Thanks, this makes sense to me.

The original goal of this benchmark was to cover a few different retrieval shapes on the same dataset: lexical search, exact vector search, indexed vector search, hybrid search, and blob-heavy reads. Some of the original workloads were also intended to reflect filtered retrieval, not just best-case indexed latency.

That said, I agree with the direction here. In particular, it makes sense to avoid query shapes that prevent HNSW from firing, and I agree that the previous vector_exact path was not a good representation for DuckDB.

I’m supportive of merging this. We can follow up on our side later to clarify the workload definitions and benchmark notes if needed.

@Xuanwo Xuanwo changed the title from "Benchmark updates" to "Benchmark updates on duckdb" May 7, 2026
@Xuanwo Xuanwo merged commit 5b1e3a4 into lance-format:main May 7, 2026
5 checks passed