perf: intern DataFile fields/column_indices to reduce manifest memory by beinan · Pull Request #6477 · lance-format/lance

beinan · 2026-04-10T17:40:25Z

Summary

Change DataFile.fields and DataFile.column_indices from Vec<i32> to Arc<[i32]> so that fragments with identical field lists share a single heap allocation
Add DataFileFieldInterner that deduplicates these slices during manifest deserialization
In homogeneous tables (the common case), every fragment carries the same field list, so at 20M fragments this saves ~2.4 GB of redundant heap allocations
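
The deduplication idea in the summary can be sketched as follows. This is a minimal illustration, not the PR's code: `FieldListInterner`, `intern`, and `unique_lists` are hypothetical names, and the real `DataFileFieldInterner` may be structured differently.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for the PR's `DataFileFieldInterner`; the real
/// type's fields and method names are not shown in this description.
#[derive(Default)]
pub struct FieldListInterner {
    seen: HashSet<Arc<[i32]>>,
}

impl FieldListInterner {
    /// Returns a shared handle for `list`, allocating a new `Arc<[i32]>`
    /// only the first time a given sequence of ids is seen.
    pub fn intern(&mut self, list: Vec<i32>) -> Arc<[i32]> {
        // `Arc<[i32]>` implements `Borrow<[i32]>`, so the set can be probed
        // with a plain slice without allocating.
        if let Some(existing) = self.seen.get(list.as_slice()) {
            return Arc::clone(existing);
        }
        let shared: Arc<[i32]> = Arc::from(list);
        self.seen.insert(Arc::clone(&shared));
        shared
    }

    /// Number of distinct field lists seen so far.
    pub fn unique_lists(&self) -> usize {
        self.seen.len()
    }
}
```

Interning the same list twice yields handles for which `Arc::ptr_eq` is true, so a homogeneous table ends up with one heap allocation per distinct list rather than one per fragment.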

Motivation

When dataset manifests grow large (>1 GB with millions of fragments), opening the dataset becomes very expensive in terms of memory. Each DataFile previously owned its own Vec<i32> for fields and column_indices, even though in most tables every fragment has the exact same field list. This PR deduplicates those allocations at deserialization time.

Per-fragment memory breakdown (before)

Field	Size per fragment
`fields: Vec<i32>` (10 fields)	~64 bytes
`column_indices: Vec<i32>` (10 cols)	~64 bytes
Total redundant	~128 bytes x 20M = ~2.4 GiB (2.56 GB decimal)
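
The figures in the table can be reproduced with a short calculation. This sketch assumes a 64-bit target (where a `Vec<i32>` header is 24 bytes), capacity equal to length, and no allocator overhead; the function names are illustrative only.

```rust
use std::mem::size_of;

/// Approximate bytes for one `Vec<i32>` holding `len` ids: the inline
/// (pointer, capacity, length) header plus the heap buffer. Assumes
/// capacity == len and ignores allocator overhead.
pub fn vec_i32_bytes(len: usize) -> usize {
    size_of::<Vec<i32>>() + len * size_of::<i32>()
}

/// Bytes spent across `fragments` data files that each own two such Vecs
/// (`fields` and `column_indices`).
pub fn redundant_bytes(fragments: u64, len: usize) -> u64 {
    fragments * 2 * vec_i32_bytes(len) as u64
}
```

With 10 ids per list this gives 64 bytes per `Vec` and 2,560,000,000 bytes at 20M fragments, which is about 2.4 GiB.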

After this change

With interning, all 20M fragments share a single Arc<[i32]> allocation (~80 bytes total instead of 2.4 GB).
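
As a rough illustration of what sharing means here: every fragment still keeps its own `Arc<[i32]>` handle (a pointer plus a length, 16 bytes on a 64-bit target), but all handles point at the same heap buffer. The helper name below is hypothetical.

```rust
use std::sync::Arc;

/// Builds `n` handles to one shared field list, the way interned fragments
/// would hold it. Only one heap buffer is allocated for the ids.
pub fn shared_handles(list: Vec<i32>, n: usize) -> Vec<Arc<[i32]>> {
    let shared: Arc<[i32]> = Arc::from(list);
    // Each clone bumps the reference count; the id buffer is not copied.
    (0..n).map(|_| Arc::clone(&shared)).collect()
}
```

`Arc::strong_count` on any returned handle equals `n`, and `Arc::ptr_eq` holds between any two of them.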

Changes

lance-table/src/format/fragment.rs — Core struct change (Vec<i32> → Arc<[i32]>), custom Serialize/Deserialize impls, and DataFileFieldInterner
lance-table/src/format/manifest.rs — Use interner during manifest deserialization
lance/src/dataset/fragment.rs, merge_insert.rs, io/commit.rs — Tombstoning and field-remapping rebuilt as new Arc<[i32]> instead of in-place mutation
python/src/fragment.rs, java/lance-jni/src/fragment.rs — FFI boundary conversions
Various test files — Updated struct literals and assertions
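
Because an `Arc<[i32]>` may be shared by many fragments, it cannot be mutated in place; tombstoning has to build a fresh slice. A minimal sketch of that pattern, assuming a tombstone marker of `-2` (the constant and function names are hypothetical, not taken from the PR):

```rust
use std::sync::Arc;

/// Assumed marker for a removed field id; the constant name is hypothetical.
pub const TOMBSTONE: i32 = -2;

/// Returns a new field list in which every id listed in `removed` is
/// replaced by the tombstone marker. The input is left untouched because
/// other fragments may hold handles to the same allocation.
pub fn tombstone_fields(fields: &Arc<[i32]>, removed: &[i32]) -> Arc<[i32]> {
    fields
        .iter()
        .map(|id| if removed.contains(id) { TOMBSTONE } else { *id })
        .collect()
}
```

Collecting straight into `Arc<[i32]>` works because the standard library implements `FromIterator<T>` for `Arc<[T]>`.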

Compatibility

No format change — protobuf schema is unchanged
Serde JSON output is identical (custom impl serializes Arc<[i32]> as [i32])
All public API signatures that take Vec<i32> (e.g., DataFile::new(), Fragment::add_file()) still accept Vec<i32> and convert internally
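
The "accept `Vec<i32>` and convert internally" point could look roughly like this. `DataFileSketch` is a trimmed, hypothetical stand-in; the real `DataFile` has more fields and its constructor may take other arguments.

```rust
use std::sync::Arc;

/// Trimmed, hypothetical stand-in for `DataFile`.
#[derive(Debug, Clone, PartialEq)]
pub struct DataFileSketch {
    pub path: String,
    pub fields: Arc<[i32]>,
    pub column_indices: Arc<[i32]>,
}

impl DataFileSketch {
    /// Callers keep passing `Vec<i32>`; the conversion to `Arc<[i32]>`
    /// happens inside the constructor.
    pub fn new(path: impl Into<String>, fields: Vec<i32>, column_indices: Vec<i32>) -> Self {
        Self {
            path: path.into(),
            fields: Arc::from(fields),
            column_indices: Arc::from(column_indices),
        }
    }
}
```

Existing call sites such as `DataFileSketch::new("data/0.lance", vec![0, 1], vec![0, 1])` compile unchanged under this shape.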

Test plan

cargo check --workspace --tests passes
cargo clippy -p lance-table -p lance -- -D warnings passes
All 88 lance-table tests pass
Fragment JSON serialization round-trip test passes
Fragment write, tombstoning, binary copy, batch commit tests pass
CI

🤖 Generated with Claude Code

Change DataFile.fields and DataFile.column_indices from Vec<i32> to Arc<[i32]> and add a DataFileFieldInterner that deduplicates identical slices during manifest deserialization. In homogeneous tables every fragment carries the same field list, so at 20M fragments the interning saves ~2.4 GB of redundant heap allocations (~1.2 GB for fields + ~1.2 GB for column_indices).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jackye1995

thanks for the change, looks good to me!

codecov · 2026-04-13T18:09:54Z

Codecov Report

❌ Patch coverage is 89.04110% with 16 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-table/src/format/fragment.rs	80.55%	0 Missing and 14 partials ⚠️
rust/lance/src/io/commit/conflict_resolver.rs	33.33%	2 Missing ⚠️

…mory (#6499)

## Summary

- Change `RowDatasetVersionMeta::Inline` from `Vec<u8>` to `Arc<[u8]>` so that fragments with identical version metadata share a single heap allocation
- Extend `DataFileFieldInterner` to deduplicate these inline byte payloads during manifest deserialization
- Introduce `InternCache<T>`: a hybrid cache that uses Vec linear scan for ≤16 entries and upgrades to HashMap for larger caches
- Add custom `Serialize`/`Deserialize` impls for `RowDatasetVersionMeta` to handle `Arc<[u8]>` transparently

## Motivation

Follow-up to #6477 (interning `DataFile.fields`/`column_indices`). After a compaction, all fragments are stamped with the same version metadata (both `last_updated_at_version_meta` and `created_at_version_meta`), but each fragment previously owned its own `Vec<u8>` copy.

### Per-fragment memory breakdown (before)

| Field | Size per fragment |
|-------|------------------|
| `last_updated_at_version_meta: Inline(Vec<u8>)` | ~24 bytes + payload |
| `created_at_version_meta: Inline(Vec<u8>)` | ~24 bytes + payload |
| **Total redundant at 20M fragments** | **~480 MB+** |

### After this change

With interning, all 20M fragments share a single `Arc<[u8]>` allocation per unique payload.

## Benchmark results

Microbenchmark at 100K fragments (10 fields per fragment):

| Scenario | No interning | With interning | Delta |
|----------|-------------|----------------|-------|
| **Uniform (1 unique version)** | 24.5 ms | 17.9 ms | **27% faster** |
| **Diverse (10 unique)** | 25.7 ms | 19.7 ms | **23% faster** |
| **Diverse (100 unique)** | 26.0 ms | 23.4 ms | **10% faster** |
| **Diverse (500 unique)** | 26.0 ms | 22.8 ms | **12% faster** |

| Memory (100K fragments) | No interning | With interning | Savings |
|------------------------|-------------|----------------|---------|
| **10 fields** | 39.47 MB | 29.74 MB | **24.6%** |
| **50 fields** | 69.99 MB | 29.74 MB | **57.5%** |

Both memory and speed improve across all scenarios. The hybrid `InternCache` uses fast Vec scan for the common case (1-3 unique values) and upgrades to HashMap when diversity exceeds 16 entries.

Run with: `cargo bench -p lance-table --bench manifest_intern`

## Changes

- **`rust/lance-table/src/rowids/version.rs`**: `Inline(Vec<u8>)` → `Inline(Arc<[u8]>)`, custom serde impls, updated protobuf conversions
- **`rust/lance-table/src/format/fragment.rs`**: `InternCache<T>` (Vec/HashMap hybrid), extended `DataFileFieldInterner` with version meta interning
- **`rust/lance-table/benches/manifest_intern.rs`**: Microbenchmark covering uniform and diverse scenarios

## Compatibility

- No format change: protobuf schema is unchanged
- Serde JSON output is identical (custom impl serializes `Arc<[u8]>` as `[u8]`)
- `from_sequence()` still works as before (converts internally)

## Test plan

- [x] `cargo check --workspace --tests` passes
- [x] `cargo clippy -p lance-table -p lance -- -D warnings` passes
- [x] All 88 `lance-table` tests pass
- [x] `cargo fmt --all -- --check` passes
- [x] Microbenchmark validates performance across uniform and diverse scenarios
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
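
The Vec/HashMap hybrid described in this follow-up can be sketched as below. This is an assumption-laden illustration, not the code from #6499: it is specialized to `[u8]` rather than generic over `T`, it uses a `HashSet` for the hashed stage, and `InternCacheSketch`, `LINEAR_SCAN_LIMIT`, `is_hashed`, and `len` are hypothetical names.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Entries kept in the linear-scan stage before upgrading (16 per the text).
const LINEAR_SCAN_LIMIT: usize = 16;

/// Hypothetical sketch of a hybrid intern cache for byte payloads.
pub enum InternCacheSketch {
    Small(Vec<Arc<[u8]>>),
    Large(HashSet<Arc<[u8]>>),
}

impl InternCacheSketch {
    pub fn new() -> Self {
        Self::Small(Vec::new())
    }

    /// Returns a shared handle for `bytes`, reusing an existing entry if any.
    pub fn intern(&mut self, bytes: &[u8]) -> Arc<[u8]> {
        match self {
            Self::Small(entries) => {
                // Linear scan: cheap when only a few distinct payloads exist.
                for entry in entries.iter() {
                    if &**entry == bytes {
                        return Arc::clone(entry);
                    }
                }
                let shared: Arc<[u8]> = Arc::from(bytes);
                if entries.len() < LINEAR_SCAN_LIMIT {
                    entries.push(Arc::clone(&shared));
                } else {
                    // Too diverse for scanning: move entries into a hash set.
                    let mut set: HashSet<Arc<[u8]>> =
                        std::mem::take(entries).into_iter().collect();
                    set.insert(Arc::clone(&shared));
                    *self = Self::Large(set);
                }
                shared
            }
            Self::Large(set) => {
                if let Some(hit) = set.get(bytes) {
                    return Arc::clone(hit);
                }
                let shared: Arc<[u8]> = Arc::from(bytes);
                set.insert(Arc::clone(&shared));
                shared
            }
        }
    }

    /// True once the cache has upgraded to the hashed stage.
    pub fn is_hashed(&self) -> bool {
        matches!(self, Self::Large(_))
    }

    /// Number of distinct payloads stored.
    pub fn len(&self) -> usize {
        match self {
            Self::Small(entries) => entries.len(),
            Self::Large(set) => set.len(),
        }
    }
}
```

Handles returned before the upgrade stay valid afterwards, because the existing `Arc`s are moved into the set rather than re-allocated.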

github-actions bot added python performance java labels Apr 10, 2026

beinan and others added 3 commits April 13, 2026 17:32

style: apply cargo fmt

13c7c72

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: update debug.rs to use to_vec() for Arc<[i32]> fields

90fa6d7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

beinan force-pushed the beinan/intern-datafile-fields branch from 933f46a to 90fa6d7 Compare April 13, 2026 17:35

jackye1995 approved these changes Apr 13, 2026

jackye1995 merged commit 9931ace into lance-format:main Apr 13, 2026
29 checks passed

beinan mentioned this pull request Apr 13, 2026

perf: intern RowDatasetVersionMeta inline bytes to reduce manifest memory #6499

Merged

6 tasks

jackye1995 merged 3 commits into lance-format:main from beinan:beinan/intern-datafile-fields
