
perf: intern DataFile fields/column_indices to reduce manifest memory #6477

Merged
jackye1995 merged 3 commits into lance-format:main from beinan:beinan/intern-datafile-fields
Apr 13, 2026

Contributor

@beinan beinan commented Apr 10, 2026

Summary

  • Change DataFile.fields and DataFile.column_indices from Vec<i32> to Arc<[i32]> so that fragments with identical field lists share a single heap allocation
  • Add DataFileFieldInterner that deduplicates these slices during manifest deserialization
  • In homogeneous tables (the common case), every fragment carries the same field list, so at 20M fragments this saves ~2.4 GB of redundant heap allocations
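
As a rough illustration of the deduplication idea in the summary, here is a minimal sketch of an interner for `Arc<[i32]>` slices. The type `FieldInterner` and its `intern` method are hypothetical stand-ins, not the actual `DataFileFieldInterner` API, which may be structured differently.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for the PR's `DataFileFieldInterner`; the real type
/// and its method names may differ.
#[derive(Default)]
struct FieldInterner {
    seen: HashSet<Arc<[i32]>>,
}

impl FieldInterner {
    /// Returns a shared `Arc<[i32]>` for `values`, allocating only the first
    /// time a given slice content is seen.
    fn intern(&mut self, values: &[i32]) -> Arc<[i32]> {
        // `Arc<[i32]>` implements `Borrow<[i32]>`, so the set can be probed
        // with a plain slice without allocating.
        if let Some(existing) = self.seen.get(values) {
            return Arc::clone(existing);
        }
        let fresh: Arc<[i32]> = Arc::from(values);
        self.seen.insert(Arc::clone(&fresh));
        fresh
    }
}

fn main() {
    let mut interner = FieldInterner::default();
    let a = interner.intern(&[0, 1, 2]);
    let b = interner.intern(&[0, 1, 2]);
    let c = interner.intern(&[0, 1, 3]);
    // Identical contents share one allocation; different contents do not.
    assert!(Arc::ptr_eq(&a, &b));
    assert!(!Arc::ptr_eq(&a, &c));
    println!("unique slices: {}", interner.seen.len());
}
```

In a homogeneous table the set would hold a single entry, so every fragment after the first only pays for a reference-count increment.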

Motivation

When dataset manifests grow large (>1 GB with millions of fragments), opening the dataset becomes very expensive in terms of memory. Each DataFile previously owned its own Vec<i32> for fields and column_indices, even though in most tables every fragment has the exact same field list. This PR deduplicates those allocations at deserialization time.

Per-fragment memory breakdown (before)

| Field | Size per fragment |
|-------|-------------------|
| fields: Vec<i32> (10 fields) | ~64 bytes |
| column_indices: Vec<i32> (10 cols) | ~64 bytes |
| Total redundant | ~128 bytes x 20M = ~2.4 GB |

After this change

With interning, all 20M fragments share one Arc<[i32]> heap allocation per distinct list (~80 bytes total instead of ~2.4 GB).
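
The per-fragment figures can be reproduced with a small calculation. This is a sketch that assumes each `Vec<i32>` has capacity equal to its length and ignores allocator overhead; the helper `vec_bytes` is introduced here only for illustration.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Inline header plus heap payload for one `Vec<i32>` holding `n` elements,
/// assuming capacity == length and no allocator overhead.
fn vec_bytes(n: usize) -> usize {
    size_of::<Vec<i32>>() + n * size_of::<i32>()
}

fn main() {
    // On a typical 64-bit target the Vec header is 24 bytes, so a
    // 10-element Vec<i32> accounts for 24 + 40 = 64 bytes.
    println!("Vec<i32> with 10 fields: {} bytes", vec_bytes(10));

    // Two such vectors per fragment, across 20M fragments.
    let per_fragment = 2 * vec_bytes(10) as u64;
    let total = per_fragment * 20_000_000;
    println!(
        "20M fragments: {:.2} GiB",
        total as f64 / (1u64 << 30) as f64
    );

    // After interning, each DataFile still stores one fat-pointer handle
    // (two machine words) per field; only the payload is shared.
    println!("Arc<[i32]> handle: {} bytes", size_of::<Arc<[i32]>>());
}
```

Note that the handle itself is not free: only the heap payload is deduplicated, while every `DataFile` keeps its own pointer-sized handles.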

Changes

  • lance-table/src/format/fragment.rs — Core struct change (Vec<i32> → Arc<[i32]>), custom Serialize/Deserialize impls, and DataFileFieldInterner
  • lance-table/src/format/manifest.rs — Use interner during manifest deserialization
  • lance/src/dataset/fragment.rs, merge_insert.rs, io/commit.rs — Tombstoning and field remapping now build a new Arc<[i32]> instead of mutating in place
  • python/src/fragment.rs, java/lance-jni/src/fragment.rs — FFI boundary conversions
  • Various test files — Updated struct literals and assertions
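
Because an `Arc<[i32]>` may be shared by many fragments, edits such as tombstoning have to produce a new slice rather than mutate the shared one. A minimal sketch of that pattern follows; the function `tombstone_fields` and the marker value `TOMBSTONE` are assumptions made for illustration, not code taken from this PR.

```rust
use std::sync::Arc;

/// Hypothetical tombstone marker; treat the exact value as an assumption.
const TOMBSTONE: i32 = -2;

/// Builds a new slice in which every id listed in `dropped` is replaced by
/// the tombstone marker. The input `Arc` is never modified, so other
/// fragments that share it are unaffected.
fn tombstone_fields(fields: &Arc<[i32]>, dropped: &[i32]) -> Arc<[i32]> {
    fields
        .iter()
        .map(|id| if dropped.contains(id) { TOMBSTONE } else { *id })
        .collect()
}

fn main() {
    let shared: Arc<[i32]> = Arc::from(vec![0, 1, 2, 3]);
    let other_fragment = Arc::clone(&shared);

    let updated = tombstone_fields(&shared, &[1, 3]);

    assert_eq!(updated.to_vec(), vec![0, TOMBSTONE, 2, TOMBSTONE]);
    // The fragment that still holds the original slice sees no change.
    assert_eq!(other_fragment.to_vec(), vec![0, 1, 2, 3]);
}
```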

Compatibility

  • No format change — protobuf schema is unchanged
  • Serde JSON output is identical (custom impl serializes Arc<[i32]> as [i32])
  • All public API signatures that take Vec<i32> (e.g., DataFile::new(), Fragment::add_file()) still accept Vec<i32> and convert internally

Test plan

  • cargo check --workspace --tests passes
  • cargo clippy -p lance-table -p lance -- -D warnings passes
  • All 88 lance-table tests pass
  • Fragment JSON serialization round-trip test passes
  • Fragment write, tombstoning, binary copy, batch commit tests pass
  • CI

🤖 Generated with Claude Code

beinan and others added 3 commits April 13, 2026 17:32
Change DataFile.fields and DataFile.column_indices from Vec<i32> to
Arc<[i32]> and add a DataFileFieldInterner that deduplicates identical
slices during manifest deserialization.

In homogeneous tables every fragment carries the same field list, so at
20M fragments the interning saves ~2.4 GB of redundant heap allocations
(~1.2 GB for fields + ~1.2 GB for column_indices).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@beinan beinan force-pushed the beinan/intern-datafile-fields branch from 933f46a to 90fa6d7 Compare April 13, 2026 17:35
Contributor

@jackye1995 jackye1995 left a comment

thanks for the change, looks good to me!

@codecov

codecov bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 89.04110% with 16 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| rust/lance-table/src/format/fragment.rs | 80.55% | 0 Missing and 14 partials ⚠️ |
| rust/lance/src/io/commit/conflict_resolver.rs | 33.33% | 2 Missing ⚠️ |

@jackye1995 jackye1995 merged commit 9931ace into lance-format:main Apr 13, 2026
29 checks passed
jackye1995 pushed a commit that referenced this pull request Apr 14, 2026
…mory (#6499)

## Summary

- Change `RowDatasetVersionMeta::Inline` from `Vec<u8>` to `Arc<[u8]>`
so that fragments with identical version metadata share a single heap
allocation
- Extend `DataFileFieldInterner` to deduplicate these inline byte
payloads during manifest deserialization
- Introduce `InternCache<T>`: a hybrid cache that uses Vec linear scan
for ≤16 entries and upgrades to HashMap for larger caches
- Add custom `Serialize`/`Deserialize` impls for `RowDatasetVersionMeta`
to handle `Arc<[u8]>` transparently

## Motivation

Follow-up to #6477 (interning `DataFile.fields`/`column_indices`). After
a compaction, all fragments are stamped with the same version metadata
(both `last_updated_at_version_meta` and `created_at_version_meta`), but
each fragment previously owned its own `Vec<u8>` copy.

### Per-fragment memory breakdown (before)

| Field | Size per fragment |
|-------|------------------|
| `last_updated_at_version_meta: Inline(Vec<u8>)` | ~24 bytes + payload |
| `created_at_version_meta: Inline(Vec<u8>)` | ~24 bytes + payload |
| **Total redundant at 20M fragments** | **~480 MB+** |

### After this change

With interning, all 20M fragments share a single `Arc<[u8]>` allocation
per unique payload.

## Benchmark results

Microbenchmark at 100K fragments (10 fields per fragment):

| Scenario | No interning | With interning | Delta |
|----------|-------------|----------------|-------|
| **Uniform (1 unique version)** | 24.5 ms | 17.9 ms | **27% faster** |
| **Diverse (10 unique)** | 25.7 ms | 19.7 ms | **23% faster** |
| **Diverse (100 unique)** | 26.0 ms | 23.4 ms | **10% faster** |
| **Diverse (500 unique)** | 26.0 ms | 22.8 ms | **12% faster** |

| Memory (100K fragments) | No interning | With interning | Savings |
|------------------------|-------------|----------------|---------|
| **10 fields** | 39.47 MB | 29.74 MB | **24.6%** |
| **50 fields** | 69.99 MB | 29.74 MB | **57.5%** |

Both memory and speed improve across all scenarios. The hybrid
`InternCache` uses fast Vec scan for the common case (1-3 unique values)
and upgrades to HashMap when diversity exceeds 16 entries.

Run with: `cargo bench -p lance-table --bench manifest_intern`
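
The hybrid cache described above can be sketched roughly as follows. `InternCacheSketch` is a hypothetical illustration, not the actual `InternCache<T>` implementation; the threshold of 16 is taken from the description, and a `HashSet` is used here for the large case although the real code may use a `HashMap`.

```rust
use std::collections::HashSet;
use std::hash::Hash;
use std::sync::Arc;

/// Assumed threshold, taken from the "≤16 entries" description.
const SMALL_LIMIT: usize = 16;

/// Hypothetical sketch: linear scan while small, hashed lookup once large.
enum InternCacheSketch<T> {
    Small(Vec<Arc<[T]>>),
    Large(HashSet<Arc<[T]>>),
}

impl<T: Eq + Hash + Clone> InternCacheSketch<T> {
    fn new() -> Self {
        Self::Small(Vec::new())
    }

    fn is_large(&self) -> bool {
        matches!(self, Self::Large(_))
    }

    /// Returns a shared handle for `values`, allocating only on first sight.
    fn intern(&mut self, values: &[T]) -> Arc<[T]> {
        match self {
            Self::Small(entries) => {
                // Linear scan: cheap when only a few unique values exist.
                for e in entries.iter() {
                    if e[..] == *values {
                        return Arc::clone(e);
                    }
                }
                let fresh: Arc<[T]> = Arc::from(values);
                entries.push(Arc::clone(&fresh));
                if entries.len() > SMALL_LIMIT {
                    // Too diverse for scanning: move the entries into a set.
                    let set: HashSet<Arc<[T]>> = entries.drain(..).collect();
                    *self = Self::Large(set);
                }
                fresh
            }
            Self::Large(set) => {
                if let Some(hit) = set.get(values) {
                    return Arc::clone(hit);
                }
                let fresh: Arc<[T]> = Arc::from(values);
                set.insert(Arc::clone(&fresh));
                fresh
            }
        }
    }
}

fn main() {
    let mut cache: InternCacheSketch<u8> = InternCacheSketch::new();
    let a = cache.intern(&[1, 2, 3]);
    let b = cache.intern(&[1, 2, 3]);
    assert!(Arc::ptr_eq(&a, &b));
    assert!(!cache.is_large());

    // Push the cache past the threshold with distinct payloads.
    for i in 0..20u8 {
        cache.intern(&[i]);
    }
    assert!(cache.is_large());

    // Earlier entries survive the upgrade and are still shared.
    let c = cache.intern(&[1, 2, 3]);
    assert!(Arc::ptr_eq(&a, &c));
}
```

The linear scan avoids hashing the payload on every lookup, which is plausibly where the speedup in the uniform case comes from; that interpretation is an inference from the numbers above rather than something the benchmark isolates.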

## Changes

- **`rust/lance-table/src/rowids/version.rs`** — `Inline(Vec<u8>)` →
`Inline(Arc<[u8]>)`, custom serde impls, updated protobuf conversions
- **`rust/lance-table/src/format/fragment.rs`** — `InternCache<T>`
(Vec/HashMap hybrid), extended `DataFileFieldInterner` with version meta
interning
- **`rust/lance-table/benches/manifest_intern.rs`** — Microbenchmark
covering uniform and diverse scenarios

## Compatibility

- No format change — protobuf schema is unchanged
- Serde JSON output is identical (custom impl serializes `Arc<[u8]>` as
`[u8]`)
- `from_sequence()` still works as before (converts internally)

## Test plan

- [x] `cargo check --workspace --tests` passes
- [x] `cargo clippy -p lance-table -p lance -- -D warnings` passes
- [x] All 88 `lance-table` tests pass
- [x] `cargo fmt --all -- --check` passes
- [x] Microbenchmark validates performance across uniform and diverse
scenarios
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>