feat: batch chopping fallback for filtered read#6482

Open
westonpace wants to merge 3 commits into lance-format:main from westonpace:feat-batch-chopping-fallback-filtered-read

Conversation

@westonpace
Member

Summary

  • Adds ChopBatchesStream, a stream wrapper that splits oversized batches (>1.5x target batch_size_bytes) into smaller sub-batches using zero-copy RecordBatch::slice
  • Wraps the filtered read output stream with ChopBatchesStream when batch_size_bytes is configured via FileReaderOptions
  • Serves as a safety net when the underlying file reader doesn't estimate batch sizes accurately enough
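
The chopping rule in the summary above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Batch` is a hypothetical stand-in for an Arrow `RecordBatch` (a window over shared rows, so slicing copies nothing), the uniform `bytes_per_row` is an assumption made to keep the arithmetic simple, and the way `chop` picks rows per sub-batch is a guess, since the PR description only states the 1.5x threshold.

```rust
/// Hypothetical stand-in for an Arrow RecordBatch: a window over shared rows.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    offset: usize,
    num_rows: usize,
    /// Assumption: every row has the same width in bytes.
    bytes_per_row: usize,
}

impl Batch {
    fn size_bytes(&self) -> usize {
        self.num_rows * self.bytes_per_row
    }

    /// Mimics a zero-copy slice: only the window changes, no data is copied.
    fn slice(&self, offset: usize, len: usize) -> Batch {
        assert!(offset + len <= self.num_rows);
        Batch {
            offset: self.offset + offset,
            num_rows: len,
            bytes_per_row: self.bytes_per_row,
        }
    }
}

/// Splits `batch` only when it exceeds 1.5x `target_bytes`.
fn chop(batch: &Batch, target_bytes: usize) -> Vec<Batch> {
    // size <= 1.5 * target, written with integers to avoid floats.
    if batch.num_rows == 0 || batch.size_bytes() * 2 <= target_bytes * 3 {
        return vec![batch.clone()];
    }
    // Assumed policy: aim for roughly `target_bytes` per sub-batch,
    // but always emit at least one row so the loop makes progress.
    let rows_per_chunk = (target_bytes / batch.bytes_per_row).max(1);
    let mut out = Vec::new();
    let mut offset = 0;
    while offset < batch.num_rows {
        let len = rows_per_chunk.min(batch.num_rows - offset);
        out.push(batch.slice(offset, len));
        offset += len;
    }
    out
}
```

Under these assumptions, a 1000-row batch of 100-byte rows with a 30,000-byte target is split into sub-batches of 300, 300, 300, and 100 rows, while a 400-row batch (40,000 bytes, under the 45,000-byte threshold) passes through untouched.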

Stacked on feat/byte-sized-batches-file-reader — wait for that to merge first, then rebase this PR.

Test plan

  • Unit tests for ChopBatchesStream: splits large batches, passes small batches through, wrap_if_needed(None) is a no-op
  • cargo clippy clean
  • cargo fmt clean

🤖 Generated with Claude Code

@github-actions github-actions bot added the enhancement and python labels Apr 10, 2026
…ytes

When batch_size_bytes is configured, wrap the filtered read stream in a
ChopBatchesStream that splits oversized batches (>1.5x target) into
smaller sub-batches. This provides a safety net when the underlying file
reader doesn't estimate batch sizes accurately enough.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@westonpace westonpace force-pushed the feat-batch-chopping-fallback-filtered-read branch from 909ebad to 3d60bdc Compare April 10, 2026 22:41
@codecov

codecov bot commented Apr 10, 2026

Codecov Report

❌ Patch coverage is 61.90476% with 16 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/io/exec/filtered_read.rs | 42.30% | 12 Missing and 3 partials ⚠️ |
| rust/lance-arrow/src/stream.rs | 93.75% | 1 Missing ⚠️ |

westonpace and others added 2 commits April 13, 2026 10:00
…tream

Replace the custom ChopBatchesStream with the existing
rechunk_stream_by_size utility from lance-arrow, passing min_bytes=0
to disable coalescing and only perform the batch splitting behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The accumulation loop condition `acc_bytes < min_bytes` was never
entered when both started at 0, causing the stream to immediately
return None. Fix by always pulling at least one batch per iteration.

Add Rust unit tests for the min_bytes=0 case and a Python integration
test that verifies batch chopping works for large-string rows with
batch_size_bytes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
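
The accumulation-loop bug described in the commit message above can be reproduced in miniature. The sketch below is an assumption-laden illustration, not the code in `rechunk_stream_by_size`: the function names `next_group_buggy` and `next_group_fixed` are hypothetical, and plain byte counts stand in for record batches.

```rust
/// Buggy shape: the loop condition is checked before anything is pulled,
/// so with `min_bytes == 0` the body never runs and the caller sees `None`
/// even though input is still available.
fn next_group_buggy(
    input: &mut impl Iterator<Item = usize>,
    min_bytes: usize,
) -> Option<Vec<usize>> {
    let mut acc = Vec::new();
    let mut acc_bytes = 0;
    while acc_bytes < min_bytes {
        match input.next() {
            Some(bytes) => {
                acc_bytes += bytes;
                acc.push(bytes);
            }
            None => break,
        }
    }
    if acc.is_empty() { None } else { Some(acc) }
}

/// Fixed shape: always pull at least one item, then test the threshold.
fn next_group_fixed(
    input: &mut impl Iterator<Item = usize>,
    min_bytes: usize,
) -> Option<Vec<usize>> {
    let mut acc = Vec::new();
    let mut acc_bytes = 0;
    while let Some(bytes) = input.next() {
        acc_bytes += bytes;
        acc.push(bytes);
        if acc_bytes >= min_bytes {
            break;
        }
    }
    if acc.is_empty() { None } else { Some(acc) }
}
```

With `min_bytes = 0`, the fixed version yields each input item as its own group (no coalescing), which matches the behavior the commit relies on; the buggy version ends the stream before reading anything.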
@westonpace westonpace marked this pull request as ready for review April 13, 2026 17:26
@westonpace
Member Author

Windows CI failure seems unrelated
