Cache exon counts before transcript sort#22

Merged
riasc merged 2 commits into main from perf/cache-exon-counts on Apr 9, 2026
@riasc riasc commented Apr 9, 2026

Summary

Removes O(T·E·log T) rescans from the transcript sort comparator in process_gene().

Problem

The comparator called count_exons() on every comparison, which scans the entire entry vector:

std::sort(tx_order.begin(), tx_order.end(),
    [&](const auto& a, const auto& b) {
        return count_exons(transcripts[a]) > count_exons(transcripts[b]);  // O(E) per comparison
    });

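For a gene with T transcripts and E entries per transcript, this was O(T·E·log T) work. For ENCODE genes with 10+ transcripts and 50+ entries each, a sort needs roughly T·log2(T) ≈ 33 comparisons, each rescanning two transcripts of 50+ entries, so every gene costs thousands of exon-type checks where about T·E ≈ 500 would suffice.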
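The cost of the rescanning comparator can be measured directly. The following is a minimal sketch, not the project's code: `Entry`, `Transcript`, `make_transcripts`, `checks_with_rescanning_sort`, and the `g_entry_checks` counter are hypothetical stand-ins, and `count_exons` is assumed to be a plain linear scan as described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the project's types; names are illustrative only.
struct Entry { std::string type; };
using Transcript = std::vector<Entry>;

// Tally of inspected entries, used only to make the cost visible.
static std::size_t g_entry_checks = 0;

// Assumed shape of count_exons(): a linear scan over all entries.
std::size_t count_exons(const Transcript& tx) {
    std::size_t n = 0;
    for (const auto& e : tx) {
        ++g_entry_checks;
        if (e.type == "exon") ++n;
    }
    return n;
}

// Builds T transcripts of E entries each. Transcript i gets i "exon" entries
// and the rest "CDS", so all exon counts are distinct.
std::vector<Transcript> make_transcripts(std::size_t T, std::size_t E) {
    assert(T <= E);
    std::vector<Transcript> txs(T, Transcript(E, Entry{"CDS"}));
    for (std::size_t i = 0; i < T; ++i)
        for (std::size_t j = 0; j < i; ++j) txs[i][j].type = "exon";
    return txs;
}

// Sorts transcript indices by exon count (descending) with the rescanning
// comparator and returns how many entries were inspected in total.
std::size_t checks_with_rescanning_sort(const std::vector<Transcript>& txs) {
    std::vector<std::size_t> order(txs.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    g_entry_checks = 0;
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return count_exons(txs[a]) > count_exons(txs[b]);  // two O(E) scans
    });
    return g_entry_checks;
}
```

Calling `checks_with_rescanning_sort(make_transcripts(10, 50))` returns a multiple of 100, because every comparison scans two transcripts of 50 entries. The exact value depends on how many comparisons the standard library's `std::sort` performs, but it always exceeds the 500 checks needed to count each transcript once.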

Fix

Precompute exon counts once per transcript before sorting:

std::vector<std::pair<std::string, size_t>> tx_exon_counts;
// one pass to count
std::sort(..., [](const auto& a, const auto& b) { return a.second > b.second; });

Now it's O(T·E + T·log T) — the counting is done once, and the sort compares integers.
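A fuller version of the precompute-then-sort idea might look like the sketch below. It assumes `transcripts` is an `std::unordered_map` from transcript ID to an entry vector and that an entry exposes a `type` string; `Entry`, `Transcript`, and `sorted_exon_counts` are hypothetical names, not taken from the actual diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the project's types.
struct Entry { std::string type; };
using Transcript = std::vector<Entry>;

// Assumed shape of count_exons(): one linear scan over the entries.
std::size_t count_exons(const Transcript& tx) {
    return static_cast<std::size_t>(std::count_if(
        tx.begin(), tx.end(),
        [](const Entry& e) { return e.type == "exon"; }));
}

// Counts exons once per transcript (O(T*E)), then sorts the cached
// (id, count) pairs by count, descending (O(T log T) integer comparisons).
std::vector<std::pair<std::string, std::size_t>>
sorted_exon_counts(const std::unordered_map<std::string, Transcript>& transcripts) {
    std::vector<std::pair<std::string, std::size_t>> tx_exon_counts;
    tx_exon_counts.reserve(transcripts.size());
    for (const auto& [tx_id, entries] : transcripts)
        tx_exon_counts.emplace_back(tx_id, count_exons(entries));

    std::sort(tx_exon_counts.begin(), tx_exon_counts.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return tx_exon_counts;
}
```

As with the original comparator, transcripts with equal exon counts end up in an unspecified relative order, since `std::sort` is not stable and `unordered_map` iteration order is not defined.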

Impact

Estimated 2-3x speedup per gene in the transcript sort phase.

QC

  • I, as a human being, have checked each line of code in this pull request
  • Project builds successfully in CLion
  • All CI checks pass (GCC 13/14, Clang 18, macOS)
  • All tests pass (absorption, discover, query)

🤖 Generated with Claude Code

The sort comparator previously called count_exons() on every comparison,
which scans the entire entry vector. For a gene with T transcripts and
E entries per transcript, this was O(T·E·log T) work. Now exon counts
are computed once per transcript before sorting, making it O(T·E + T·log T).
@riasc riasc added the enhancement label Apr 9, 2026
riasc commented Apr 9, 2026

Code Review — PR #22

Verdict: Approve

Diff: 1 file, +16/-11

| Check | Assessment |
| --- | --- |
| Correctness | Sort order is preserved. The previous code ran an unstable std::sort over transcripts collected in unordered_map iteration order, and the new code does the same, so the relative order of transcripts with equal exon counts is unspecified in both versions. std::move(tx_id) after pair destructuring is correct because tx_exon_counts isn't used afterwards. |
| Behavior | Identical output: the same transcripts are processed in the same (exon-count descending) order. Only the work done by the comparator changes. |
| Complexity | From O(T·E·log T) rescans to O(T·E + T·log T): counts are computed once and the sort compares integers. |
| Absorption impact | None. The absorption logic downstream is unchanged. |
| Tests | Existing absorption tests cover this code path and verify the final segment state, so any regression would be caught. |
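The std::move(tx_id) point from the correctness check can be illustrated with a small sketch. It assumes the sorted (ID, count) pairs are consumed exactly once to build the processing order; `take_sorted_ids` and `tx_order` are hypothetical names and the real loop body in process_gene() may differ.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Consumes an already-sorted (id, exon count) list and returns only the ids,
// moving each string out instead of copying it. This is safe only because
// the pair list is not read again afterwards.
std::vector<std::string>
take_sorted_ids(std::vector<std::pair<std::string, std::size_t>>&& tx_exon_counts) {
    std::vector<std::string> tx_order;
    tx_order.reserve(tx_exon_counts.size());
    for (auto& [tx_id, exon_count] : tx_exon_counts) {
        (void)exon_count;                      // count is no longer needed here
        tx_order.push_back(std::move(tx_id));  // leaves tx_id valid but unspecified
    }
    tx_exon_counts.clear();  // drop the moved-from pairs explicitly
    return tx_order;
}
```

Binding with `auto&` rather than `const auto&` is what makes the move effective; with a const binding, `std::move` would silently fall back to a copy.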

No issues found

🤖 Generated with Claude Code

@riasc riasc merged commit 3e88419 into main Apr 9, 2026
8 checks passed
@riasc riasc deleted the perf/cache-exon-counts branch April 9, 2026 18:49