Conversation
Release v0.5.3
Release 0.6.0
Release v0.6.1
Implements trim, promoters, reduce, setdiff, and pintersect as a trait on RegionSet in gtars-genomicdist. Uses natural chromosome sort order, preserves zero-width intervals, and saturates at 0 for promoters. Includes 26 unit tests, demo binary, and benchmark example. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4218d9c to
6f465f2
Compare
khoroshevskyi
left a comment
There was a problem hiding this comment.
LGTM
I have just few questions, and suggestions!
| fn chrom_cmp(a: &str, b: &str) -> Ordering { | ||
| chrom_sort_key(a).cmp(&chrom_sort_key(b)) | ||
| } | ||
|
|
||
| /// Returns (priority, numeric_value, suffix) for sorting. | ||
| /// Priority 0 = numeric chrN, 1 = chrX/Y/M, 2 = everything else. | ||
| fn chrom_sort_key(chr: &str) -> (u8, u32, String) { | ||
| let suffix = chr | ||
| .strip_prefix("chr") | ||
| .or_else(|| chr.strip_prefix("Chr")) | ||
| .unwrap_or(chr); | ||
|
|
||
| if let Ok(n) = suffix.parse::<u32>() { | ||
| (0, n, String::new()) | ||
| } else { | ||
| match suffix { | ||
| "X" | "x" => (1, 0, String::new()), | ||
| "Y" | "y" => (1, 1, String::new()), | ||
| "M" | "m" | "MT" | "mt" => (1, 2, String::new()), | ||
| _ => (2, 0, chr.to_string()), | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Why do we need to have natural sort for bed files?
There was a problem hiding this comment.
Addressed. Switched from natural sort to lexicographic sort, which matches BED convention and what bedToBigBed expects. Removed the chrom_cmp/chrom_sort_key helpers entirely.
| let mut sorted = self.regions.clone(); | ||
| sorted.sort_by(|a, b| chrom_cmp(&a.chr, &b.chr).then_with(|| a.start.cmp(&b.start))); |
There was a problem hiding this comment.
Again here, why do we want to have natural sort. All bed files are already sorted, but they are sorted using usual sort, and that's what bedtobigbed also wants
There was a problem hiding this comment.
Looking in the code, I think we don't need to sort rs
There was a problem hiding this comment.
Addressed both points:
-
Natural sort → lexicographic: Switched to
a.chr.cmp(&b.chr), matching BED convention. Removed the natural sort helpers. -
"Don't need to sort": The sort is still necessary because
reduce()can receive unsorted input frommerge_regionsets()(used ingenome_partition_list),promoters(), or programmatically constructed RegionSets. Added a comment explaining this.
There was a problem hiding this comment.
What do you think of: having two different versions of some of these functions, one that assumes sorted and one that does not? then you can stream in the case of assuming sorted, and sometimes you can make that bound.
There was a problem hiding this comment.
Good idea. We actually added an is_sorted flag on RegionSet that handles this automatically — reduce() checks the flag and skips the clone+sort when the data is known-sorted. This gave ~27% speedup on partition construction where reduce() is called ~8 times on already-sorted data.
The flag approach works well here because genome_partition_list chains multiple operations (promoters() → reduce() → setdiff() → reduce()) and it's non-obvious which intermediate RegionSets are sorted. The flag tracks that through the chain so the right path is chosen without the caller having to think about it.
Exposing explicit reduce() / reduce_sorted() variants could also work as a power-user API. The risk is that calling reduce_sorted() on unsorted data gives silently wrong results (missed merges). The flag approach avoids that — it's set by sort() and checked internally, so correctness is guaranteed.
Could add both: flag for the automatic path, plus explicit variants for users who want to bypass the check. But it's more API surface for the same outcome, so holding off unless there's a concrete use case.
| /// | ||
| /// Removes portions of `self` that overlap with `other`. Both inputs are | ||
| /// reduced internally before subtraction. Operates per-chromosome with a | ||
| /// sweep-line algorithm. |
There was a problem hiding this comment.
| /// sweep-line algorithm. | |
| /// sweep-line algorithm. | |
| /// | |
| /// Example: | |
| /// Input: | |
| /// A: chr1 100–200 | |
| /// B: chr1 120–140 | |
| /// chr1 160–180 | |
| /// Output: | |
| /// 100–120 | |
| /// 140–160 | |
| /// 180–200 | |
There was a problem hiding this comment.
Added. The setdiff docstring now includes a concrete example:
A: chr1 100–200
B: chr1 120–140, chr1 160–180
setdiff(A, B): chr1 100–120, chr1 140–160, chr1 180–200
| /// Pairwise intersection of two region sets of equal length. | ||
| /// | ||
| /// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`. | ||
| /// Pairs with no overlap or mismatched chromosomes are dropped. |
There was a problem hiding this comment.
| /// Pairs with no overlap or mismatched chromosomes are dropped. | |
| /// Pairs with no overlap or mismatched chromosomes are dropped. | |
| /// | |
| /// Example: | |
| /// A: chr1 100–200 | |
| /// chr1 300–400 | |
| /// chr2 100–150 | |
| /// | |
| /// B: chr1 150–250 | |
| /// chr1 350–360 | |
| /// chr2 200–300 | |
| /// Return: | |
| /// chr1 150–200 | |
| /// chr1 350–360 | |
| /// | |
| /// | |
| /// ! It intersects by index, not by genomic overlap across all regions. |
There was a problem hiding this comment.
Added. The pintersect docstring now includes a concrete example and the clarifying note:
Note: this intersects by index position (1st with 1st, 2nd with 2nd, etc.), not by genomic overlap across all regions.
Also documented the truncation behavior when sets differ in length.
There was a problem hiding this comment.
Pull request overview
This PR introduces genomic interval set algebra operations to the gtars-genomicdist crate, porting functionality similar to R's GenomicRanges/IRanges package. The implementation provides five core operations (trim, promoters, reduce, setdiff, pintersect) with comprehensive tests and examples.
Changes:
- Added
IntervalRangestrait with five genomic interval operations using 0-based half-open BED coordinates - Implemented natural chromosome sorting (chr1, chr2, chr10, chrX, chrY, chrM) rather than lexicographic
- Added demo and benchmark examples with cross-language validation support
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| gtars-genomicdist/src/interval_ranges.rs | Core implementation of IntervalRanges trait with trim, promoters, reduce, setdiff, and pintersect operations, plus comprehensive test suite |
| gtars-genomicdist/src/lib.rs | Module declaration and public exports for interval_ranges |
| tests/data/regionset/dummy_b.bed | Test data file for setdiff operation validation |
| gtars-genomicdist/examples/interval_ranges_demo.rs | Interactive demo showing all five operations on user-provided BED files |
| gtars-genomicdist/examples/benchmark_interval_ranges.rs | Comprehensive benchmark suite with synthetic data generation and CSV output |
| gtars-genomicdist/Cargo.toml | Added rand dependency for benchmark synthetic data generation |
| .github/workflows/rust-publish.yml | Added gtars-genomicdist to publishable crates and cleaned trailing whitespace |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pub trait IntervalRanges { | ||
| /// Clamp regions to chromosome boundaries. | ||
| /// | ||
| /// Regions extending past chromosome ends are trimmed to `[0, chrom_size)`. | ||
| /// Regions on chromosomes not present in `chrom_sizes` are dropped. | ||
| /// Empty regions (start >= end after clamping) are dropped. | ||
| fn trim(&self, chrom_sizes: &HashMap<String, u32>) -> RegionSet; | ||
|
|
||
| /// Generate promoter regions relative to each region's start position. | ||
| /// | ||
| /// For each region, produces `[start - upstream, start + downstream)`. | ||
| /// Uses saturating subtraction at coordinate 0. | ||
| fn promoters(&self, upstream: u32, downstream: u32) -> RegionSet; | ||
|
|
||
| /// Merge overlapping and adjacent intervals per chromosome. | ||
| /// | ||
| /// Sorts by (chr, start), then sweeps to merge intervals where | ||
| /// `next.start <= current.end`. Returns a minimal set of non-overlapping regions. | ||
| fn reduce(&self) -> RegionSet; | ||
|
|
||
| /// Subtract one region set from another (set difference). | ||
| /// | ||
| /// Removes portions of `self` that overlap with `other`. Both inputs are | ||
| /// reduced internally before subtraction. Operates per-chromosome with a | ||
| /// sweep-line algorithm. | ||
| fn setdiff(&self, other: &RegionSet) -> RegionSet; | ||
|
|
||
| /// Pairwise intersection of two region sets of equal length. | ||
| /// | ||
| /// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`. | ||
| /// Pairs with no overlap or mismatched chromosomes are dropped. | ||
| fn pintersect(&self, other: &RegionSet) -> RegionSet; | ||
| } |
There was a problem hiding this comment.
All IntervalRanges operations discard the rest field from input regions by setting it to None in the output. This behavior should be documented in the trait documentation to make it clear that extra BED columns (name, score, strand, etc.) are not preserved through these operations. Consider adding a note to the module-level documentation or trait documentation.
There was a problem hiding this comment.
Addressed. Added a note to the IntervalRanges trait doc:
Note: All operations produce regions with
rest: None. Metadata from therestfield of input regions is not preserved — operations likereduce()merge multiple regions, so there is no unambiguousrestto carry forward.
| /// | ||
| /// Regions extending past chromosome ends are trimmed to `[0, chrom_size)`. | ||
| /// Regions on chromosomes not present in `chrom_sizes` are dropped. | ||
| /// Empty regions (start >= end after clamping) are dropped. |
There was a problem hiding this comment.
The documentation states "Empty regions (start >= end after clamping) are dropped" but the implementation keeps zero-width regions (where start == end). The test test_trim_keeps_zero_width_after_clamp explicitly verifies this behavior. Either the documentation should be updated to clarify that zero-width regions are kept (changing ">=" to ">"), or the implementation should be changed to drop zero-width regions (changing line 87 from start > end to start >= end).
| /// Empty regions (start >= end after clamping) are dropped. | |
| /// Empty regions (start > end after clamping) are dropped; zero-width regions where `start == end` are kept. |
There was a problem hiding this comment.
Addressed. Updated the trim doc to: "Empty regions (start > end after clamping) are dropped; zero-width regions where start == end are kept."
| } | ||
| } | ||
|
|
||
| fn make_regionset(regions: Vec<(& str, u32, u32)>) -> RegionSet { |
There was a problem hiding this comment.
Inconsistent spacing in reference type. Should be &str without a space, consistent with the rest of the codebase (see lines 16, 257).
| fn make_regionset(regions: Vec<(& str, u32, u32)>) -> RegionSet { | |
| fn make_regionset(regions: Vec<(&str, u32, u32)>) -> RegionSet { |
There was a problem hiding this comment.
Fixed. Changed & str → &str.
| "chr22", | ||
| ]; | ||
|
|
||
| const CHROM_SIZES: &[(& str, u32)] = &[ |
There was a problem hiding this comment.
Inconsistent spacing in reference type. Should be &str without a space for consistency with Rust conventions.
| const CHROM_SIZES: &[(& str, u32)] = &[ | |
| const CHROM_SIZES: &[(&str, u32)] = &[ |
There was a problem hiding this comment.
Fixed. Changed & str → &str.
| /// Pairwise intersection of two region sets of equal length. | ||
| /// | ||
| /// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`. | ||
| /// Pairs with no overlap or mismatched chromosomes are dropped. |
There was a problem hiding this comment.
The documentation states "Pairwise intersection of two region sets of equal length" but doesn't specify what happens when the lengths differ. The implementation uses zip which will silently truncate to the shorter length. Consider either: (1) adding a length check and returning an error/empty result for mismatched lengths, or (2) documenting that excess regions are ignored when lengths differ.
| /// Pairwise intersection of two region sets of equal length. | |
| /// | |
| /// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`. | |
| /// Pairs with no overlap or mismatched chromosomes are dropped. | |
| /// Pairwise intersection of two region sets. | |
| /// | |
| /// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`. | |
| /// Pairs with no overlap or mismatched chromosomes are dropped. | |
| /// If the region sets differ in length, intersections are computed only up to the length | |
| /// of the shorter set; any excess regions in the longer set are ignored. |
There was a problem hiding this comment.
Addressed. Updated the doc to state: "If the region sets differ in length, intersections are computed only up to the length of the shorter set; excess regions are ignored." Also removed the "of equal length" phrasing from the summary line.
| i += 1; | ||
| scale = Some(args[i].parse().expect("Invalid scale value")); | ||
| } | ||
| "--input" => { | ||
| i += 1; | ||
| input_path = Some(args[i].clone()); | ||
| } | ||
| "--output" => { | ||
| i += 1; | ||
| output_csv = Some(args[i].clone()); | ||
| } | ||
| "--output-beds" => { | ||
| i += 1; |
There was a problem hiding this comment.
Missing bounds check before accessing args[i] after incrementing i. If a flag like --scale, --input, --output, or --output-beds is provided as the last argument without a value, this will panic with an index out of bounds error. Add a check like if i >= args.len() before accessing the argument value.
| i += 1; | |
| scale = Some(args[i].parse().expect("Invalid scale value")); | |
| } | |
| "--input" => { | |
| i += 1; | |
| input_path = Some(args[i].clone()); | |
| } | |
| "--output" => { | |
| i += 1; | |
| output_csv = Some(args[i].clone()); | |
| } | |
| "--output-beds" => { | |
| i += 1; | |
| i += 1; | |
| if i >= args.len() { | |
| eprintln!("Missing value for --scale"); | |
| std::process::exit(1); | |
| } | |
| scale = Some(args[i].parse().expect("Invalid scale value")); | |
| } | |
| "--input" => { | |
| i += 1; | |
| if i >= args.len() { | |
| eprintln!("Missing value for --input"); | |
| std::process::exit(1); | |
| } | |
| input_path = Some(args[i].clone()); | |
| } | |
| "--output" => { | |
| i += 1; | |
| if i >= args.len() { | |
| eprintln!("Missing value for --output"); | |
| std::process::exit(1); | |
| } | |
| output_csv = Some(args[i].clone()); | |
| } | |
| "--output-beds" => { | |
| i += 1; | |
| if i >= args.len() { | |
| eprintln!("Missing value for --output-beds"); | |
| std::process::exit(1); | |
| } |
There was a problem hiding this comment.
Addressed. Added bounds checks (if i >= args.len()) for all flags in both benchmark_interval_ranges.rs and benchmark_partitions.rs.
|
Note: most of the replies to review comments on this PR were written by Claude (AI assistant), addressing the feedback from khoroshevskyi and Copilot. |
Ports GenomicDistributions calc functions to Rust: Partition system: - genome_partition_list, calc_partitions, calc_expected_partitions - GTF/BED gene model loading with GENCODE UTR classification - Chi-square expected partition analysis with regularized incomplete gamma Statistics: - calc_nearest_neighbors (min upstream/downstream per region) - calc_widths (region end - start) - calc_feature_distances (signed distance to nearest feature, matching R convention) Performance: - is_sorted flag on RegionSet: reduce() checks this flag and skips the clone+sort when input is known-sorted (e.g. after BED loading or sort()). Cuts ~27% off genome_partition_list, which calls reduce() ~8 times on already-sorted intermediate RegionSets. PR review feedback addressed: - Lexicographic chromosome sort in reduce() (matches BED convention) - setdiff/pintersect docstring examples - Document rest field not preserved, zero-width region behavior - pintersect truncation behavior documented - &str spacing fixes All 57 tests pass. Cross-validated against R on 4 ENCODE BED files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No description provided.