feat: add deterministic sampling with optional seed #6501
Closed
beinan wants to merge 2 commits into lance-format:main from
Exposes the Rust core Dataset::sample() method through the Java JNI interface, allowing Java users to randomly sample n rows with optional fragment filtering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ic sampling

When a seed is provided, the same seed with the same dataset state will always produce the same result. This is useful for reproducible ML training/validation splits. Without a seed, sampling remains non-deterministic as before.

Changes:
- Rust core: add `seed: Option<u64>` param to Dataset::sample(), use StdRng::seed_from_u64 when provided, fall back to rand::rng() when None
- Java JNI: pass seed through as Optional<Long>
- Tests: verify determinism (same seed = same result, different seed = different result)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Superseded by #6502; this PR had contamination from unrelated changes.
Summary

- Add a `seed: Option<u64>` parameter to the Rust core `Dataset::sample()` for deterministic sampling
- Use `StdRng::seed_from_u64(seed)` for reproducible random index selection
- Without a seed, sampling stays non-deterministic (`rand::rng()` with OS entropy)
- Java JNI: pass the seed through as an `Optional<Long> seed` parameter

Changes

- `rust/lance/src/dataset.rs`: Add `seed` parameter; use `StdRng` when seeded, `rand::rng()` otherwise
- `rust/lance/src/dataset/tests/dataset_io.rs`: Add test verifying same seed = same result, different seed = different result
- `rust/lance/src/index/vector/utils.rs`, `rust/lance/src/dataset/write/merge_insert.rs`, `rust/lance/benches/take.rs`: Pass `None` for existing callers
- `java/lance-jni/src/blocking_dataset.rs`: Pass seed through JNI
- `java/src/main/java/org/lance/Dataset.java`: Add `sample(n, columns, fragmentIds, seed)` overload
- `java/src/test/java/org/lance/DatasetTest.java`: Test deterministic sampling via seed

Test plan

- `test_sample_deterministic_with_seed`: verifies same seed produces identical results, different seed produces different results
- `testSample`: verifies deterministic sampling with seed=42 produces identical results across two calls
- Existing `sample` callers pass `None` for backward compatibility

Builds on #6500.
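For readers unfamiliar with the pattern, the seeded-versus-unseeded behavior described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in this PR: `SeededSampleSketch` and `sampleIndices` are hypothetical names, `java.util.Random` stands in for the Rust `StdRng`, and both the partial shuffle and the sorted output are choices made here for clarity.

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.Random;

// Hypothetical illustration only; this is not Lance's Dataset.sample().
class SeededSampleSketch {

    // Picks n distinct row indices from [0, numRows), returned in ascending order.
    // A present seed gives a reproducible choice; an empty seed does not.
    static int[] sampleIndices(int numRows, int n, Optional<Long> seed) {
        if (n < 0 || n > numRows) {
            throw new IllegalArgumentException("n must be in [0, numRows]");
        }
        // Seeded generator when a seed is given, default-seeded generator otherwise.
        Random rng = seed.isPresent() ? new Random(seed.get()) : new Random();

        int[] indices = new int[numRows];
        for (int i = 0; i < numRows; i++) {
            indices[i] = i;
        }
        // Partial Fisher-Yates shuffle: after n swaps the first n slots hold the sample.
        for (int i = 0; i < n; i++) {
            int j = i + rng.nextInt(numRows - i);
            int tmp = indices[i];
            indices[i] = indices[j];
            indices[j] = tmp;
        }
        int[] sample = Arrays.copyOf(indices, n);
        Arrays.sort(sample);
        return sample;
    }

    public static void main(String[] args) {
        int[] first = sampleIndices(1000, 10, Optional.of(42L));
        int[] second = sampleIndices(1000, 10, Optional.of(42L));
        // Same seed and same inputs: the two samples are identical.
        System.out.println(Arrays.equals(first, second)); // prints true
    }
}
```

Passing `Optional.empty()` in this sketch mirrors passing `None` on the Rust side: the call still returns n distinct indices, but repeated calls are not expected to agree.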
🤖 Generated with Claude Code