feat: add deterministic sampling with optional seed #6501

Closed
beinan wants to merge 2 commits into lance-format:main from beinan:beinan/deterministic-sample

beinan (Contributor) commented Apr 13, 2026

Summary

  • Add optional seed: Option<u64> parameter to Rust core Dataset::sample() for deterministic sampling
  • When a seed is provided, uses StdRng::seed_from_u64(seed) for reproducible random index selection
  • Without a seed, behavior is unchanged (uses rand::rng(), the thread-local generator seeded from OS entropy)
  • Expose through Java JNI with Optional<Long> seed parameter
  • Useful for reproducible ML training/validation splits
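
The seeded versus unseeded behavior described above can be illustrated with a minimal, std-only sketch. This is not the code from this PR: `sample_indices` is a hypothetical helper, the small `SplitMix64` generator stands in for `StdRng`, and hashing with a fresh `RandomState` stands in for the OS-entropy path of `rand::rng()`. It only shows the pattern of an `Option<u64>` seed selecting between a reproducible and a non-reproducible index selection.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Tiny SplitMix64 generator. Assumption: used here only as a std-only
/// stand-in for rand's StdRng so the sketch has no external dependencies.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical helper (not the Lance API): pick `n` distinct row indices
/// out of `num_rows`. `Some(seed)` gives a reproducible result; `None`
/// draws a fresh, unpredictable seed on every call.
fn sample_indices(num_rows: usize, n: usize, seed: Option<u64>) -> Vec<usize> {
    // Unseeded path: RandomState is randomly keyed, so finishing an empty
    // hasher yields an unpredictable value (stand-in for OS entropy).
    let seed = seed.unwrap_or_else(|| RandomState::new().build_hasher().finish());
    let mut rng = SplitMix64(seed);

    let n = n.min(num_rows);
    let mut indices: Vec<usize> = (0..num_rows).collect();

    // Partial Fisher-Yates shuffle: each step moves one randomly chosen
    // remaining index into position i. Modulo bias is ignored for brevity.
    for i in 0..n {
        let j = i + (rng.next_u64() % (num_rows - i) as u64) as usize;
        indices.swap(i, j);
    }
    indices.truncate(n);
    indices
}
```

Calling `sample_indices(1000, 10, Some(42))` twice returns the same ten indices, while passing `None` is expected to give a different selection on each call. As the commit message notes, reproducibility also depends on the dataset state being the same between calls.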

Changes

  • rust/lance/src/dataset.rs: Add seed parameter; use StdRng when seeded, rand::rng() otherwise
  • rust/lance/src/dataset/tests/dataset_io.rs: Add test verifying same seed = same result, different seed = different result
  • rust/lance/src/index/vector/utils.rs, rust/lance/src/dataset/write/merge_insert.rs, rust/lance/benches/take.rs: Pass None for existing callers
  • java/lance-jni/src/blocking_dataset.rs: Pass seed through JNI
  • java/src/main/java/org/lance/Dataset.java: Add sample(n, columns, fragmentIds, seed) overload
  • java/src/test/java/org/lance/DatasetTest.java: Test deterministic sampling via seed

Test plan

  • Rust test test_sample_deterministic_with_seed — verifies same seed produces identical results, different seed produces different results
  • Java test testSample — verifies deterministic sampling with seed=42 produces identical results across two calls
  • All existing sample callers pass None for backward compatibility

Builds on #6500.

🤖 Generated with Claude Code

beinan and others added 2 commits April 13, 2026 20:43
Exposes the Rust core Dataset::sample() method through the Java JNI
interface, allowing Java users to randomly sample n rows with optional
fragment filtering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ic sampling

When a seed is provided, the same seed with the same dataset state will
always produce the same result. This is useful for reproducible ML
training/validation splits. Without a seed, sampling remains
non-deterministic as before.

Changes:
- Rust core: add `seed: Option<u64>` param to Dataset::sample(), use
  StdRng::seed_from_u64 when provided, fall back to rand::rng() when None
- Java JNI: pass seed through as Optional<Long>
- Tests: verify determinism (same seed = same result, different seed =
  different result)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions bot added the enhancement (New feature or request) and java labels Apr 13, 2026
beinan (Contributor, Author) commented Apr 13, 2026

Superseded by #6502 — this PR had contamination from unrelated changes.

beinan closed this Apr 13, 2026