Skip to content

working PR for genomic distributions port#231

Open
sanghoonio wants to merge 7 commits intodevfrom
genomicdist_partitions
Open

working PR for genomic distributions port#231
sanghoonio wants to merge 7 commits intodevfrom
genomicdist_partitions

Conversation

@sanghoonio
Copy link
Member

No description provided.

khoroshevskyi and others added 6 commits January 20, 2026 22:51
Implements trim, promoters, reduce, setdiff, and pintersect as a trait
on RegionSet in gtars-genomicdist. Uses natural chromosome sort order,
preserves zero-width intervals, and saturates at 0 for promoters.
Includes 26 unit tests, demo binary, and benchmark example.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sanghoonio sanghoonio force-pushed the genomicdist_partitions branch from 4218d9c to 6f465f2 Compare February 19, 2026 00:25
Copy link
Member

@khoroshevskyi khoroshevskyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
I have just few questions, and suggestions!

Comment on lines 16 to 38
fn chrom_cmp(a: &str, b: &str) -> Ordering {
chrom_sort_key(a).cmp(&chrom_sort_key(b))
}

/// Returns (priority, numeric_value, suffix) for sorting.
/// Priority 0 = numeric chrN, 1 = chrX/Y/M, 2 = everything else.
fn chrom_sort_key(chr: &str) -> (u8, u32, String) {
let suffix = chr
.strip_prefix("chr")
.or_else(|| chr.strip_prefix("Chr"))
.unwrap_or(chr);

if let Ok(n) = suffix.parse::<u32>() {
(0, n, String::new())
} else {
match suffix {
"X" | "x" => (1, 0, String::new()),
"Y" | "y" => (1, 1, String::new()),
"M" | "m" | "MT" | "mt" => (1, 2, String::new()),
_ => (2, 0, chr.to_string()),
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to have natural sort for bed files?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. Switched from natural sort to lexicographic sort, which matches BED convention and what bedToBigBed expects. Removed the chrom_cmp/chrom_sort_key helpers entirely.

Comment on lines 121 to 122
let mut sorted = self.regions.clone();
sorted.sort_by(|a, b| chrom_cmp(&a.chr, &b.chr).then_with(|| a.start.cmp(&b.start)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again here, why do we want to have natural sort. All bed files are already sorted, but they are sorted using usual sort, and that's what bedtobigbed also wants

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking in the code, I think we don't need to sort rs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed both points:

  1. Natural sort → lexicographic: Switched to a.chr.cmp(&b.chr), matching BED convention. Removed the natural sort helpers.

  2. "Don't need to sort": The sort is still necessary because reduce() can receive unsorted input from merge_regionsets() (used in genome_partition_list), promoters(), or programmatically constructed RegionSets. Added a comment explaining this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of: having two different versions of some of these functions, one that assumes sorted and one that does not? then you can stream in the case of assuming sorted, and sometimes you can make that bound.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. We actually added an is_sorted flag on RegionSet that handles this automatically — reduce() checks the flag and skips the clone+sort when the data is known-sorted. This gave ~27% speedup on partition construction where reduce() is called ~8 times on already-sorted data.

The flag approach works well here because genome_partition_list chains multiple operations (promoters()reduce()setdiff()reduce()) and it's non-obvious which intermediate RegionSets are sorted. The flag tracks that through the chain so the right path is chosen without the caller having to think about it.

Exposing explicit reduce() / reduce_sorted() variants could also work as a power-user API. The risk is that calling reduce_sorted() on unsorted data gives silently wrong results (missed merges). The flag approach avoids that — it's set by sort() and checked internally, so correctness is guaranteed.

Could add both: flag for the automatic path, plus explicit variants for users who want to bypass the check. But it's more API surface for the same outcome, so holding off unless there's a concrete use case.

///
/// Removes portions of `self` that overlap with `other`. Both inputs are
/// reduced internally before subtraction. Operates per-chromosome with a
/// sweep-line algorithm.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// sweep-line algorithm.
/// sweep-line algorithm.
///
/// Example:
/// Input:
/// A: chr1 100–200
/// B: chr1 120–140
/// chr1 160–180
/// Output:
/// 100–120
/// 140–160
/// 180–200

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. The setdiff docstring now includes a concrete example:

A: chr1 100–200
B: chr1 120–140, chr1 160–180
setdiff(A, B): chr1 100–120, chr1 140–160, chr1 180–200

/// Pairwise intersection of two region sets of equal length.
///
/// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`.
/// Pairs with no overlap or mismatched chromosomes are dropped.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Pairs with no overlap or mismatched chromosomes are dropped.
/// Pairs with no overlap or mismatched chromosomes are dropped.
///
/// Example:
/// A: chr1 100–200
/// chr1 300–400
/// chr2 100–150
///
/// B: chr1 150–250
/// chr1 350–360
/// chr2 200–300
/// Return:
/// chr1 150–200
/// chr1 350–360
///
///
/// ! It intersects by index, not by genomic overlap across all regions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. The pintersect docstring now includes a concrete example and the clarifying note:

Note: this intersects by index position (1st with 1st, 2nd with 2nd, etc.), not by genomic overlap across all regions.

Also documented the truncation behavior when sets differ in length.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces genomic interval set algebra operations to the gtars-genomicdist crate, porting functionality similar to R's GenomicRanges/IRanges package. The implementation provides five core operations (trim, promoters, reduce, setdiff, pintersect) with comprehensive tests and examples.

Changes:

  • Added IntervalRanges trait with five genomic interval operations using 0-based half-open BED coordinates
  • Implemented natural chromosome sorting (chr1, chr2, chr10, chrX, chrY, chrM) rather than lexicographic
  • Added demo and benchmark examples with cross-language validation support

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
gtars-genomicdist/src/interval_ranges.rs Core implementation of IntervalRanges trait with trim, promoters, reduce, setdiff, and pintersect operations, plus comprehensive test suite
gtars-genomicdist/src/lib.rs Module declaration and public exports for interval_ranges
tests/data/regionset/dummy_b.bed Test data file for setdiff operation validation
gtars-genomicdist/examples/interval_ranges_demo.rs Interactive demo showing all five operations on user-provided BED files
gtars-genomicdist/examples/benchmark_interval_ranges.rs Comprehensive benchmark suite with synthetic data generation and CSV output
gtars-genomicdist/Cargo.toml Added rand dependency for benchmark synthetic data generation
.github/workflows/rust-publish.yml Added gtars-genomicdist to publishable crates and cleaned trailing whitespace

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 44 to 76
pub trait IntervalRanges {
/// Clamp regions to chromosome boundaries.
///
/// Regions extending past chromosome ends are trimmed to `[0, chrom_size)`.
/// Regions on chromosomes not present in `chrom_sizes` are dropped.
/// Empty regions (start >= end after clamping) are dropped.
fn trim(&self, chrom_sizes: &HashMap<String, u32>) -> RegionSet;

/// Generate promoter regions relative to each region's start position.
///
/// For each region, produces `[start - upstream, start + downstream)`.
/// Uses saturating subtraction at coordinate 0.
fn promoters(&self, upstream: u32, downstream: u32) -> RegionSet;

/// Merge overlapping and adjacent intervals per chromosome.
///
/// Sorts by (chr, start), then sweeps to merge intervals where
/// `next.start <= current.end`. Returns a minimal set of non-overlapping regions.
fn reduce(&self) -> RegionSet;

/// Subtract one region set from another (set difference).
///
/// Removes portions of `self` that overlap with `other`. Both inputs are
/// reduced internally before subtraction. Operates per-chromosome with a
/// sweep-line algorithm.
fn setdiff(&self, other: &RegionSet) -> RegionSet;

/// Pairwise intersection of two region sets of equal length.
///
/// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`.
/// Pairs with no overlap or mismatched chromosomes are dropped.
fn pintersect(&self, other: &RegionSet) -> RegionSet;
}
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All IntervalRanges operations discard the rest field from input regions by setting it to None in the output. This behavior should be documented in the trait documentation to make it clear that extra BED columns (name, score, strand, etc.) are not preserved through these operations. Consider adding a note to the module-level documentation or trait documentation.

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. Added a note to the IntervalRanges trait doc:

Note: All operations produce regions with rest: None. Metadata from the rest field of input regions is not preserved — operations like reduce() merge multiple regions, so there is no unambiguous rest to carry forward.

///
/// Regions extending past chromosome ends are trimmed to `[0, chrom_size)`.
/// Regions on chromosomes not present in `chrom_sizes` are dropped.
/// Empty regions (start >= end after clamping) are dropped.
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation states "Empty regions (start >= end after clamping) are dropped" but the implementation keeps zero-width regions (where start == end). The test test_trim_keeps_zero_width_after_clamp explicitly verifies this behavior. Either the documentation should be updated to clarify that zero-width regions are kept (changing ">=" to ">"), or the implementation should be changed to drop zero-width regions (changing line 87 from start > end to start >= end).

Suggested change
/// Empty regions (start >= end after clamping) are dropped.
/// Empty regions (start > end after clamping) are dropped; zero-width regions where `start == end` are kept.

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. Updated the trim doc to: "Empty regions (start > end after clamping) are dropped; zero-width regions where start == end are kept."

}
}

fn make_regionset(regions: Vec<(& str, u32, u32)>) -> RegionSet {
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent spacing in reference type. Should be &str without a space, consistent with the rest of the codebase (see lines 16, 257).

Suggested change
fn make_regionset(regions: Vec<(& str, u32, u32)>) -> RegionSet {
fn make_regionset(regions: Vec<(&str, u32, u32)>) -> RegionSet {

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Changed & str&str.

"chr22",
];

const CHROM_SIZES: &[(& str, u32)] = &[
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent spacing in reference type. Should be &str without a space for consistency with Rust conventions.

Suggested change
const CHROM_SIZES: &[(& str, u32)] = &[
const CHROM_SIZES: &[(&str, u32)] = &[

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Changed & str&str.

Comment on lines 71 to 74
/// Pairwise intersection of two region sets of equal length.
///
/// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`.
/// Pairs with no overlap or mismatched chromosomes are dropped.
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation states "Pairwise intersection of two region sets of equal length" but doesn't specify what happens when the lengths differ. The implementation uses zip which will silently truncate to the shorter length. Consider either: (1) adding a length check and returning an error/empty result for mismatched lengths, or (2) documenting that excess regions are ignored when lengths differ.

Suggested change
/// Pairwise intersection of two region sets of equal length.
///
/// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`.
/// Pairs with no overlap or mismatched chromosomes are dropped.
/// Pairwise intersection of two region sets.
///
/// For each pair at the same index, computes `[max(a.start, b.start), min(a.end, b.end))`.
/// Pairs with no overlap or mismatched chromosomes are dropped.
/// If the region sets differ in length, intersections are computed only up to the length
/// of the shorter set; any excess regions in the longer set are ignored.

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. Updated the doc to state: "If the region sets differ in length, intersections are computed only up to the length of the shorter set; excess regions are ignored." Also removed the "of equal length" phrasing from the summary line.

Comment on lines +217 to +229
i += 1;
scale = Some(args[i].parse().expect("Invalid scale value"));
}
"--input" => {
i += 1;
input_path = Some(args[i].clone());
}
"--output" => {
i += 1;
output_csv = Some(args[i].clone());
}
"--output-beds" => {
i += 1;
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing bounds check before accessing args[i] after incrementing i. If a flag like --scale, --input, --output, or --output-beds is provided as the last argument without a value, this will panic with an index out of bounds error. Add a check like if i >= args.len() before accessing the argument value.

Suggested change
i += 1;
scale = Some(args[i].parse().expect("Invalid scale value"));
}
"--input" => {
i += 1;
input_path = Some(args[i].clone());
}
"--output" => {
i += 1;
output_csv = Some(args[i].clone());
}
"--output-beds" => {
i += 1;
i += 1;
if i >= args.len() {
eprintln!("Missing value for --scale");
std::process::exit(1);
}
scale = Some(args[i].parse().expect("Invalid scale value"));
}
"--input" => {
i += 1;
if i >= args.len() {
eprintln!("Missing value for --input");
std::process::exit(1);
}
input_path = Some(args[i].clone());
}
"--output" => {
i += 1;
if i >= args.len() {
eprintln!("Missing value for --output");
std::process::exit(1);
}
output_csv = Some(args[i].clone());
}
"--output-beds" => {
i += 1;
if i >= args.len() {
eprintln!("Missing value for --output-beds");
std::process::exit(1);
}

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. Added bounds checks (if i >= args.len()) for all flags in both benchmark_interval_ranges.rs and benchmark_partitions.rs.

@sanghoonio
Copy link
Member Author

Note: most of the replies to review comments on this PR were written by Claude (AI assistant), addressing the feedback from khoroshevskyi and Copilot.

Ports GenomicDistributions calc functions to Rust:

Partition system:
- genome_partition_list, calc_partitions, calc_expected_partitions
- GTF/BED gene model loading with GENCODE UTR classification
- Chi-square expected partition analysis with regularized incomplete gamma

Statistics:
- calc_nearest_neighbors (min upstream/downstream per region)
- calc_widths (region end - start)
- calc_feature_distances (signed distance to nearest feature, matching R convention)

Performance:
- is_sorted flag on RegionSet: reduce() checks this flag and skips the
  clone+sort when input is known-sorted (e.g. after BED loading or sort()).
  Cuts ~27% off genome_partition_list, which calls reduce() ~8 times on
  already-sorted intermediate RegionSets.

PR review feedback addressed:
- Lexicographic chromosome sort in reduce() (matches BED convention)
- setdiff/pintersect docstring examples
- Document rest field not preserved, zero-width region behavior
- pintersect truncation behavior documented
- &str spacing fixes

All 57 tests pass. Cross-validated against R on 4 ENCODE BED files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments