working PR for genomic distributions port by sanghoonio · Pull Request #231 · databio/gtars

sanghoonio · 2026-02-19T00:07:24Z

No description provided.

Release v0.5.3

Release 0.6.0

Release v0.6.1

Implements trim, promoters, reduce, setdiff, and pintersect as a trait on RegionSet in gtars-genomicdist. Uses natural chromosome sort order, preserves zero-width intervals, and saturates at 0 for promoters. Includes 26 unit tests, demo binary, and benchmark example. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

khoroshevskyi

LGTM
I have just few questions, and suggestions!

khoroshevskyi · 2026-02-19T16:25:47Z

gtars-genomicdist/src/interval_ranges.rs

+fn chrom_cmp(a: &str, b: &str) -> Ordering {
+    chrom_sort_key(a).cmp(&chrom_sort_key(b))
+}
+
+/// Returns (priority, numeric_value, suffix) for sorting.
+/// Priority 0 = numeric chrN, 1 = chrX/Y/M, 2 = everything else.
+fn chrom_sort_key(chr: &str) -> (u8, u32, String) {
+    let suffix = chr
+        .strip_prefix("chr")
+        .or_else(|| chr.strip_prefix("Chr"))
+        .unwrap_or(chr);
+
+    if let Ok(n) = suffix.parse::<u32>() {
+        (0, n, String::new())
+    } else {
+        match suffix {
+            "X" | "x" => (1, 0, String::new()),
+            "Y" | "y" => (1, 1, String::new()),
+            "M" | "m" | "MT" | "mt" => (1, 2, String::new()),
+            _ => (2, 0, chr.to_string()),
+        }
+    }
+}


Why do we need to have natural sort for bed files?

Addressed. Switched from natural sort to lexicographic sort, which matches BED convention and what bedToBigBed expects. Removed the chrom_cmp/chrom_sort_key helpers entirely.

khoroshevskyi · 2026-02-19T16:34:58Z

gtars-genomicdist/src/interval_ranges.rs

+        let mut sorted = self.regions.clone();
+        sorted.sort_by(|a, b| chrom_cmp(&a.chr, &b.chr).then_with(|| a.start.cmp(&b.start)));


Again here, why do we want to have natural sort. All bed files are already sorted, but they are sorted using usual sort, and that's what bedtobigbed also wants

Looking in the code, I think we don't need to sort rs

Addressed both points:

Natural sort → lexicographic: Switched to a.chr.cmp(&b.chr), matching BED convention. Removed the natural sort helpers.

"Don't need to sort": The sort is still necessary because reduce() can receive unsorted input from merge_regionsets() (used in genome_partition_list), promoters(), or programmatically constructed RegionSets. Added a comment explaining this.

What do you think of: having two different versions of some of these functions, one that assumes sorted and one that does not? then you can stream in the case of assuming sorted, and sometimes you can make that bound.

Good idea. We actually added an is_sorted flag on RegionSet that handles this automatically — reduce() checks the flag and skips the clone+sort when the data is known-sorted. This gave ~27% speedup on partition construction where reduce() is called ~8 times on already-sorted data.

The flag approach works well here because genome_partition_list chains multiple operations (promoters() → reduce() → setdiff() → reduce()) and it's non-obvious which intermediate RegionSets are sorted. The flag tracks that through the chain so the right path is chosen without the caller having to think about it.

Exposing explicit reduce() / reduce_sorted() variants could also work as a power-user API. The risk is that calling reduce_sorted() on unsorted data gives silently wrong results (missed merges). The flag approach avoids that — it's set by sort() and checked internally, so correctness is guaranteed.

Could add both: flag for the automatic path, plus explicit variants for users who want to bypass the check. But it's more API surface for the same outcome, so holding off unless there's a concrete use case.

khoroshevskyi · 2026-02-19T17:14:10Z

gtars-genomicdist/src/interval_ranges.rs

+    ///
+    /// Removes portions of `self` that overlap with `other`. Both inputs are
+    /// reduced internally before subtraction. Operates per-chromosome with a
+    /// sweep-line algorithm.


Suggested change

/// sweep-line algorithm.

/// sweep-line algorithm.

///

/// Example:

/// Input:

/// A: chr1 100–200

/// B: chr1 120–140

/// chr1 160–180

/// Output:

/// 100–120

/// 140–160

/// 180–200

Added. The setdiff docstring now includes a concrete example:

A: chr1 100–200 B: chr1 120–140, chr1 160–180 setdiff(A, B): chr1 100–120, chr1 140–160, chr1 180–200

khoroshevskyi · 2026-02-19T17:22:37Z

gtars-genomicdist/src/interval_ranges.rs

+    /// Pairwise intersection of two region sets of equal length.
+    ///
+    /// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`.
+    /// Pairs with no overlap or mismatched chromosomes are dropped.


Suggested change

/// Pairs with no overlap or mismatched chromosomes are dropped.

/// Pairs with no overlap or mismatched chromosomes are dropped.

///

/// Example:

/// A: chr1 100–200

/// chr1 300–400

/// chr2 100–150

///

/// B: chr1 150–250

/// chr1 350–360

/// chr2 200–300

/// Return:

/// chr1 150–200

/// chr1 350–360

///

///

/// ! It intersects by index, not by genomic overlap across all regions.

Added. The pintersect docstring now includes a concrete example and the clarifying note:

Note: this intersects by index position (1st with 1st, 2nd with 2nd, etc.), not by genomic overlap across all regions.

Also documented the truncation behavior when sets differ in length.

Copilot

Pull request overview

This PR introduces genomic interval set algebra operations to the gtars-genomicdist crate, porting functionality similar to R's GenomicRanges/IRanges package. The implementation provides five core operations (trim, promoters, reduce, setdiff, pintersect) with comprehensive tests and examples.

Changes:

Added IntervalRanges trait with five genomic interval operations using 0-based half-open BED coordinates
Implemented natural chromosome sorting (chr1, chr2, chr10, chrX, chrY, chrM) rather than lexicographic
Added demo and benchmark examples with cross-language validation support

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
gtars-genomicdist/src/interval_ranges.rs	Core implementation of IntervalRanges trait with trim, promoters, reduce, setdiff, and pintersect operations, plus comprehensive test suite
gtars-genomicdist/src/lib.rs	Module declaration and public exports for interval_ranges
tests/data/regionset/dummy_b.bed	Test data file for setdiff operation validation
gtars-genomicdist/examples/interval_ranges_demo.rs	Interactive demo showing all five operations on user-provided BED files
gtars-genomicdist/examples/benchmark_interval_ranges.rs	Comprehensive benchmark suite with synthetic data generation and CSV output
gtars-genomicdist/Cargo.toml	Added rand dependency for benchmark synthetic data generation
.github/workflows/rust-publish.yml	Added gtars-genomicdist to publishable crates and cleaned trailing whitespace

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-19T17:29:40Z

gtars-genomicdist/src/interval_ranges.rs

+pub trait IntervalRanges {
+    /// Clamp regions to chromosome boundaries.
+    ///
+    /// Regions extending past chromosome ends are trimmed to `[0, chrom_size)`.
+    /// Regions on chromosomes not present in `chrom_sizes` are dropped.
+    /// Empty regions (start >= end after clamping) are dropped.
+    fn trim(&self, chrom_sizes: &HashMap<String, u32>) -> RegionSet;
+
+    /// Generate promoter regions relative to each region's start position.
+    ///
+    /// For each region, produces `[start - upstream, start + downstream)`.
+    /// Uses saturating subtraction at coordinate 0.
+    fn promoters(&self, upstream: u32, downstream: u32) -> RegionSet;
+
+    /// Merge overlapping and adjacent intervals per chromosome.
+    ///
+    /// Sorts by (chr, start), then sweeps to merge intervals where
+    /// `next.start <= current.end`. Returns a minimal set of non-overlapping regions.
+    fn reduce(&self) -> RegionSet;
+
+    /// Subtract one region set from another (set difference).
+    ///
+    /// Removes portions of `self` that overlap with `other`. Both inputs are
+    /// reduced internally before subtraction. Operates per-chromosome with a
+    /// sweep-line algorithm.
+    fn setdiff(&self, other: &RegionSet) -> RegionSet;
+
+    /// Pairwise intersection of two region sets of equal length.
+    ///
+    /// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`.
+    /// Pairs with no overlap or mismatched chromosomes are dropped.
+    fn pintersect(&self, other: &RegionSet) -> RegionSet;
+}


All IntervalRanges operations discard the rest field from input regions by setting it to None in the output. This behavior should be documented in the trait documentation to make it clear that extra BED columns (name, score, strand, etc.) are not preserved through these operations. Consider adding a note to the module-level documentation or trait documentation.

Addressed. Added a note to the IntervalRanges trait doc:

Note: All operations produce regions with rest: None. Metadata from the rest field of input regions is not preserved — operations like reduce() merge multiple regions, so there is no unambiguous rest to carry forward.

Copilot · 2026-02-19T17:29:41Z

gtars-genomicdist/src/interval_ranges.rs

+    ///
+    /// Regions extending past chromosome ends are trimmed to `[0, chrom_size)`.
+    /// Regions on chromosomes not present in `chrom_sizes` are dropped.
+    /// Empty regions (start >= end after clamping) are dropped.


The documentation states "Empty regions (start >= end after clamping) are dropped" but the implementation keeps zero-width regions (where start == end). The test test_trim_keeps_zero_width_after_clamp explicitly verifies this behavior. Either the documentation should be updated to clarify that zero-width regions are kept (changing ">=" to ">"), or the implementation should be changed to drop zero-width regions (changing line 87 from start > end to start >= end).

Suggested change

/// Empty regions (start >= end after clamping) are dropped.

/// Empty regions (start > end after clamping) are dropped; zero-width regions where `start == end` are kept.

Addressed. Updated the trim doc to: "Empty regions (start > end after clamping) are dropped; zero-width regions where start == end are kept."

Copilot · 2026-02-19T17:29:41Z

gtars-genomicdist/src/interval_ranges.rs

+        }
+    }
+
+    fn make_regionset(regions: Vec<(& str, u32, u32)>) -> RegionSet {


Inconsistent spacing in reference type. Should be &str without a space, consistent with the rest of the codebase (see lines 16, 257).

Suggested change

fn make_regionset(regions: Vec<(& str, u32, u32)>) -> RegionSet {

fn make_regionset(regions: Vec<(&str, u32, u32)>) -> RegionSet {

Fixed. Changed & str → &str.

Copilot · 2026-02-19T17:29:41Z

gtars-genomicdist/examples/benchmark_interval_ranges.rs

+    "chr22",
+];
+
+const CHROM_SIZES: &[(& str, u32)] = &[


Inconsistent spacing in reference type. Should be &str without a space for consistency with Rust conventions.

Suggested change

const CHROM_SIZES: &[(& str, u32)] = &[

const CHROM_SIZES: &[(&str, u32)] = &[

Fixed. Changed & str → &str.

Copilot · 2026-02-19T17:29:41Z

gtars-genomicdist/src/interval_ranges.rs

+    /// Pairwise intersection of two region sets of equal length.
+    ///
+    /// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`.
+    /// Pairs with no overlap or mismatched chromosomes are dropped.


The documentation states "Pairwise intersection of two region sets of equal length" but doesn't specify what happens when the lengths differ. The implementation uses zip which will silently truncate to the shorter length. Consider either: (1) adding a length check and returning an error/empty result for mismatched lengths, or (2) documenting that excess regions are ignored when lengths differ.

Suggested change

/// Pairwise intersection of two region sets of equal length.

///

/// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`.

/// Pairs with no overlap or mismatched chromosomes are dropped.

/// Pairwise intersection of two region sets.

///

/// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`.

/// Pairs with no overlap or mismatched chromosomes are dropped.

/// If the region sets differ in length, intersections are computed only up to the length

/// of the shorter set; any excess regions in the longer set are ignored.

Addressed. Updated the doc to state: "If the region sets differ in length, intersections are computed only up to the length of the shorter set; excess regions are ignored." Also removed the "of equal length" phrasing from the summary line.

Copilot · 2026-02-19T17:29:42Z

gtars-genomicdist/examples/benchmark_interval_ranges.rs

+                i += 1;
+                scale = Some(args[i].parse().expect("Invalid scale value"));
+            }
+            "--input" => {
+                i += 1;
+                input_path = Some(args[i].clone());
+            }
+            "--output" => {
+                i += 1;
+                output_csv = Some(args[i].clone());
+            }
+            "--output-beds" => {
+                i += 1;


Missing bounds check before accessing args[i] after incrementing i. If a flag like --scale, --input, --output, or --output-beds is provided as the last argument without a value, this will panic with an index out of bounds error. Add a check like if i >= args.len() before accessing the argument value.

Suggested change

i += 1;

scale = Some(args[i].parse().expect("Invalid scale value"));

}

"--input" => {

i += 1;

input_path = Some(args[i].clone());

}

"--output" => {

i += 1;

output_csv = Some(args[i].clone());

}

"--output-beds" => {

i += 1;

i += 1;

if i >= args.len() {

eprintln!("Missing value for --scale");

std::process::exit(1);

}

scale = Some(args[i].parse().expect("Invalid scale value"));

}

"--input" => {

i += 1;

if i >= args.len() {

eprintln!("Missing value for --input");

std::process::exit(1);

}

input_path = Some(args[i].clone());

}

"--output" => {

i += 1;

if i >= args.len() {

eprintln!("Missing value for --output");

std::process::exit(1);

}

output_csv = Some(args[i].clone());

}

"--output-beds" => {

i += 1;

if i >= args.len() {

eprintln!("Missing value for --output-beds");

std::process::exit(1);

}

Addressed. Added bounds checks (if i >= args.len()) for all flags in both benchmark_interval_ranges.rs and benchmark_partitions.rs.

sanghoonio · 2026-02-20T04:29:44Z

Note: most of the replies to review comments on this PR were written by Claude (AI assistant), addressing the feedback from khoroshevskyi and Copilot.

Ports GenomicDistributions calc functions to Rust: Partition system: - genome_partition_list, calc_partitions, calc_expected_partitions - GTF/BED gene model loading with GENCODE UTR classification - Chi-square expected partition analysis with regularized incomplete gamma Statistics: - calc_nearest_neighbors (min upstream/downstream per region) - calc_widths (region end - start) - calc_feature_distances (signed distance to nearest feature, matching R convention) Performance: - is_sorted flag on RegionSet: reduce() checks this flag and skips the clone+sort when input is known-sorted (e.g. after BED loading or sort()). Cuts ~27% off genome_partition_list, which calls reduce() ~8 times on already-sorted intermediate RegionSets. PR review feedback addressed: - Lexicographic chromosome sort in reduce() (matches BED convention) - setdiff/pintersect docstring examples - Document rest field not preserved, zero-width region behavior - pintersect truncation behavior documented - &str spacing fixes All 57 tests pass. Cross-validated against R on 4 ENCODE BED files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

khoroshevskyi and others added 6 commits January 20, 2026 22:51

Merge pull request #221 from databio/dev

2e674fd

Release v0.5.3

Add gtars-genomicdist to workflow dependencies

96f4038

Merge pull request #226 from databio/dev

a0443bb

Release 0.6.0

Merge pull request #228 from databio/dev

4221004

Release v0.6.1

Merge branch 'dev'

00405cc

sanghoonio force-pushed the genomicdist_partitions branch from 4218d9c to 6f465f2 Compare February 19, 2026 00:25

khoroshevskyi reviewed Feb 19, 2026

View reviewed changes

khoroshevskyi requested review from Copilot and nleroy917 February 19, 2026 17:24

Copilot started reviewing on behalf of khoroshevskyi February 19, 2026 17:24 View session

Copilot AI reviewed Feb 19, 2026

View reviewed changes

		let mut sorted = self.regions.clone();
		sorted.sort_by(\|a, b\| chrom_cmp(&a.chr, &b.chr).then_with(\|\| a.start.cmp(&b.start)));

-    /// sweep-line algorithm.
+    /// sweep-line algorithm.
+    ///
+    /// Example:
+    /// Input:
+    ///   A: chr1 100–200
+    ///   B: chr1 120–140
+    ///       chr1 160–180
+    /// Output:
+    ///   100–120
+    ///   140–160
+    ///   180–200

-    /// Pairs with no overlap or mismatched chromosomes are dropped.
+    /// Pairs with no overlap or mismatched chromosomes are dropped.
+    ///
+    /// Example:
+    /// A: chr1 100–200
+    ///     chr1 300–400
+    ///     chr2 100–150
+    ///
+    /// B: chr1 150–250
+    ///     chr1 350–360
+    ///     chr2 200–300
+    /// Return:
+    ///     chr1 150–200
+    ///     chr1 350–360
+    ///
+    ///
+    /// ! It intersects by index, not by genomic overlap across all regions.

	/// Empty regions (start >= end after clamping) are dropped.
	/// Empty regions (start > end after clamping) are dropped; zero-width regions where `start == end` are kept.

	fn make_regionset(regions: Vec<(& str, u32, u32)>) -> RegionSet {
	fn make_regionset(regions: Vec<(&str, u32, u32)>) -> RegionSet {

	const CHROM_SIZES: &[(& str, u32)] = &[
	const CHROM_SIZES: &[(&str, u32)] = &[

Conversation

sanghoonio commented Feb 19, 2026

Uh oh!

khoroshevskyi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sanghoonio commented Feb 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments