Cache exon counts before transcript sort by riasc · Pull Request #22 · ylab-hi/atroplex

riasc · 2026-04-09T17:28:47Z

Summary

Removes O(T·E·log T) rescans from the transcript sort comparator in process_gene().

Problem

The comparator called count_exons() on every comparison, which scans the entire entry vector:

std::sort(tx_order.begin(), tx_order.end(),
    [&](const auto& a, const auto& b) {
        return count_exons(transcripts[a]) > count_exons(transcripts[b]);  // O(E) per comparison
    });

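For a gene with T transcripts and E entries per transcript, this was O(T·E·log T) work: std::sort makes on the order of T·log T comparisons, and each comparison rescans two E-entry vectors. For ENCODE genes with 10+ transcripts and 50+ entries each, that is already thousands of exon-type checks per gene at the low end (T = 10, E = 50: roughly 33 comparisons × 100 entries ≈ 3,300), and it grows to hundreds of thousands for genes with around 100 transcripts and a couple hundred entries each.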
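To make the cost concrete, here is a minimal sketch that instruments a stand-in for count_exons() and reports how many entries the old-style comparator touches during one sort. The Entry struct, its is_exon flag, the g_entries_scanned counter, and scans_with_naive_sort() are all hypothetical names for illustration; the real atroplex types are not shown in this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one annotation entry (assumption, not the real type).
struct Entry {
    bool is_exon;
};

// Incremented for every entry inspected, so the comparator's cost is observable.
static std::size_t g_entries_scanned = 0;

// Assumed shape of count_exons(): a linear scan over one transcript's entries.
std::size_t count_exons(const std::vector<Entry>& entries) {
    std::size_t n = 0;
    for (const Entry& e : entries) {
        ++g_entries_scanned;
        if (e.is_exon) ++n;
    }
    return n;
}

// Sorts transcript ids by exon count (descending) with the count recomputed
// inside the comparator, and returns the number of entries scanned.
std::size_t scans_with_naive_sort(
        const std::unordered_map<std::string, std::vector<Entry>>& transcripts) {
    std::vector<std::string> tx_order;
    tx_order.reserve(transcripts.size());
    for (const auto& kv : transcripts) tx_order.push_back(kv.first);

    g_entries_scanned = 0;
    std::sort(tx_order.begin(), tx_order.end(),
        [&](const std::string& a, const std::string& b) {
            // Two full scans per comparison: this is the O(E) cost being removed.
            return count_exons(transcripts.at(a)) > count_exons(transcripts.at(b));
        });
    return g_entries_scanned;
}
```

Every comparison scans both operands in full, so the returned total is always a multiple of 2·E when all transcripts have E entries. The exact number of comparisons depends on the standard library's sort implementation.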

Fix

Precompute exon counts once per transcript before sorting:

std::vector<std::pair<std::string, size_t>> tx_exon_counts;
// one pass to count
std::sort(..., [](const auto& a, const auto& b) { return a.second > b.second; });

Now it's O(T·E + T·log T) — the counting is done once, and the sort compares integers.
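A self-contained sketch of the precompute-then-sort pattern described above. It assumes transcripts is an unordered_map from transcript id to a vector of entries, and uses a hypothetical Entry struct and order_by_exon_count() helper; the actual process_gene() code may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one annotation entry (assumption, not the real type).
struct Entry {
    bool is_exon;
};

// Assumed shape of count_exons(): a linear scan over one transcript's entries.
std::size_t count_exons(const std::vector<Entry>& entries) {
    return static_cast<std::size_t>(
        std::count_if(entries.begin(), entries.end(),
                      [](const Entry& e) { return e.is_exon; }));
}

// Returns (transcript id, exon count) pairs sorted by exon count, descending.
std::vector<std::pair<std::string, std::size_t>> order_by_exon_count(
        const std::unordered_map<std::string, std::vector<Entry>>& transcripts) {
    std::vector<std::pair<std::string, std::size_t>> tx_exon_counts;
    tx_exon_counts.reserve(transcripts.size());

    // One pass to count: O(T·E) in total, each transcript scanned exactly once.
    for (const auto& kv : transcripts) {
        tx_exon_counts.emplace_back(kv.first, count_exons(kv.second));
    }

    // The comparator now only compares cached integers: O(T·log T) comparisons.
    std::sort(tx_exon_counts.begin(), tx_exon_counts.end(),
        [](const auto& a, const auto& b) { return a.second > b.second; });
    return tx_exon_counts;
}
```

As in the original, std::sort is not stable, so transcripts with equal exon counts may appear in any relative order.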

Impact

Estimated 2-3x speedup per gene in the transcript sort phase.

QC

I, as a human being, have checked each line of code in this pull request
Project builds successfully in CLion
All CI checks pass (GCC 13/14, Clang 18, macOS)
All tests pass (absorption, discover, query)

🤖 Generated with Claude Code

The sort comparator previously called count_exons() on every comparison, which scans the entire entry vector. For a gene with T transcripts and E entries per transcript, this was O(T·E·log T) work. Now exon counts are computed once per transcript before sorting, making it O(T·E + T·log T).

riasc · 2026-04-09T17:55:14Z

Code Review — PR #22

Verdict: Approve

Diff: 1 file, +16/-11

| Check | Assessment |
| --- | --- |
| Correctness | Sort order is preserved. Previous code used unstable `std::sort` on the unordered_map iteration order; new code does the same. `std::move(tx_id)` after pair destructuring is correct (tx_exon_counts isn't used after). |
| Behavior | Identical output: same transcripts processed in the same (exon-count descending) order. Only the work done by the comparator changes. |
| Complexity | From O(T·E·log T) rescans to O(T·E + T·log T): counts computed once, sort compares integers. |
| Absorption impact | None: absorption logic downstream is unchanged. |
| Tests | Existing absorption tests cover this code path and verify final segment state, so any regression would be caught. |
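The `std::move(tx_id)` point in the Correctness row can be illustrated with a small sketch. It assumes the sorted pairs are consumed in a range-for loop with a non-const structured binding; take_ids() is a hypothetical helper, not a function from this PR.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Consumes the sorted (id, exon count) pairs and returns just the ids, in order.
// The vector is taken by value, so nothing else can observe the moved-from strings.
std::vector<std::string> take_ids(
        std::vector<std::pair<std::string, std::size_t>> tx_exon_counts) {
    std::vector<std::string> ids;
    ids.reserve(tx_exon_counts.size());
    // auto& (not const auto&) binds tx_id to the pair's own string, so
    // std::move(tx_id) really moves; with const auto& it would silently copy.
    for (auto& [tx_id, n_exons] : tx_exon_counts) {
        (void)n_exons;  // the count is not needed once the order is fixed
        ids.push_back(std::move(tx_id));
    }
    return ids;
}
```

Moving out of each pair is safe only because the container is never read again afterwards, which is the condition the review states.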

No issues found

🤖 Generated with Claude Code

riasc added the enhancement (New feature or request) label Apr 9, 2026

Update CHANGELOG with PR #22 changes

9669a85

riasc merged commit 3e88419 into main Apr 9, 2026
8 checks passed

riasc deleted the perf/cache-exon-counts branch April 9, 2026 18:49

riasc merged 2 commits into main from perf/cache-exon-counts
