
Enabling multiprocessing in sv_inference.py #17

Open
heesuallykim wants to merge 3 commits into SUwonglab:master from heesuallykim:multiprocessing

Conversation

@heesuallykim

I've added multiprocessing steps to the sv_inference.py code, as well as an n_processes option in the CLI. I'm pasting Claude's visual guide to the multiprocessing steps below.

Visual Guide: How Multiprocessing Works in sv_inference

Overview Diagram

┌─────────────────────────────────────────────────────────────┐
│              SV INFERENCE PIPELINE                          │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  1. Load Data & Build Graph                                 │
│     (Sequential - not parallelized)                         │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  2. Decompose Graph into Subgraphs                          │
│     (Sequential - very fast)                                │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  3. Test for Insertions (PARALLELIZED! 🚀)                 │
│                                                             │
│     ORIGINAL (Sequential):                                  │
│     Block 1 ──> Block 2 ──> Block 3 ──> ... ──> Block N    │
│     Time: T1 + T2 + T3 + ... + TN                          │
│                                                             │
│     OPTIMIZED (Parallel):                                   │
│     Block 1 ─┐                                              │
│     Block 2 ─┤                                              │
│     Block 3 ─┼──> Process Pool ──> Results                 │
│     Block 4 ─┤                                              │
│     Block N ─┘                                              │
│     Time: (T1 + T2 + ... + TN) / num_cores                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  4. Expand & Merge Subgraphs                                │
│     (Sequential - very fast)                                │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  5. Process Subgraphs (PARALLELIZED! 🚀🚀🚀)               │
│     (This is the BIGGEST optimization)                      │
│                                                             │
│     ORIGINAL (Sequential):                                  │
│     Subgraph 1 ──> Subgraph 2 ──> ... ──> Subgraph M       │
│     Time: S1 + S2 + ... + SM                               │
│                                                             │
│     OPTIMIZED (Parallel):                                   │
│     Subgraph 1 ─┐                                           │
│     Subgraph 2 ─┤                                           │
│     Subgraph 3 ─┼──> Process Pool ──> SV Calls            │
│     Subgraph 4 ─┤                                           │
│     Subgraph M ─┘                                           │
│     Time: (S1 + S2 + ... + SM) / num_cores                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  6. Write Output Files                                      │
│     (Sequential - must maintain order)                      │
└─────────────────────────────────────────────────────────────┘

Data Flow Diagram

┌──────────────────────────────────────────────────────────┐
│                    Main Process                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │  Input Data: blocks, graph, options, ...          │  │
│  └────────────────────────────────────────────────────┘  │
│                          │                               │
│                          │ Pickle & Send                 │
│                          ▼                               │
│  ┌─────────────────────────────────────────────────┐    │
│  │         Process Pool (multiprocessing)          │    │
│  │                                                  │    │
│  │  ┌──────────────┐  ┌──────────────┐            │    │
│  │  │  Worker 1    │  │  Worker 2    │            │    │
│  │  │              │  │              │   ...      │    │
│  │  │ [Subgraph 1] │  │ [Subgraph 2] │            │    │
│  │  │              │  │              │            │    │
│  │  │  Compute     │  │  Compute     │            │    │
│  │  │  Likelihood  │  │  Likelihood  │            │    │
│  │  │  Call SVs    │  │  Call SVs    │            │    │
│  │  │              │  │              │            │    │
│  │  └──────┬───────┘  └──────┬───────┘            │    │
│  │         │                 │                     │    │
│  │         └─────────┬───────┘                     │    │
│  │                   │ Pickle & Return             │    │
│  └───────────────────┼─────────────────────────────┘    │
│                      ▼                                   │
│  ┌────────────────────────────────────────────────────┐ │
│  │  Collect Results: SV calls, VCF lines, stats      │ │
│  └────────────────────────────────────────────────────┘ │
│                          │                              │
│                          ▼                              │
│  ┌────────────────────────────────────────────────────┐ │
│  │  Write Output Files (sequential)                   │ │
│  └────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘

Multiprocessing Optimization for sv_inference.py

Overview

This document explains the multiprocessing optimizations added to sv_inference.py to improve runtime performance. The optimized version (sv_inference_mp.py) uses Python's multiprocessing module to parallelize computationally expensive operations.

Key Changes

1. Parallel Insertion Testing

The original code tested for insertions sequentially across all blocks. This is now parallelized:

Original (Sequential):

for b in range(0, len(blocks) - 1):
    # Test insertion for each block one at a time
    # This can be slow when there are many blocks

Optimized (Parallel):

# Create a pool of worker processes
with Pool(processes=n_processes) as pool:
    results = pool.map(test_single_insertion, block_args)

Why this helps: Each block's insertion test is independent, so they can run simultaneously on different CPU cores.

2. Parallel Subgraph Processing

The most significant optimization is parallelizing subgraph processing:

Original (Sequential):

for sub in subgraphs:
    # Process each subgraph one at a time
    # This is the most time-consuming part
    process_subgraph(...)

Optimized (Parallel):

with Pool(processes=n_processes) as pool:
    results = pool.map(process_single_subgraph, subgraph_args)

Why this helps: Each subgraph represents an independent region of the genome. Processing them in parallel can dramatically reduce runtime when you have many subgraphs.
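Since per-subgraph runtimes can vary widely, `imap_unordered` (which yields results as workers finish) can balance load better than `map`. A sketch with a hypothetical worker:

```python
from multiprocessing import Pool

def process_single_subgraph(args):
    """Hypothetical stand-in: call SVs on one independent subgraph."""
    sub_id, nodes = args
    return sub_id, len(nodes)  # placeholder for the real SV calls

if __name__ == "__main__":
    subgraph_args = [(0, ["a", "b"]), (1, ["c"]), (2, ["d", "e", "f"])]
    with Pool(processes=2) as pool:
        # imap_unordered yields results as they complete, so one huge
        # subgraph does not stall the collection of finished ones
        calls = dict(pool.imap_unordered(process_single_subgraph, subgraph_args))
    print(calls)
```

Keying each result by its subgraph id lets the main process reassemble output deterministically even though completion order is arbitrary.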

3. Automatic CPU Detection

The code automatically determines the optimal number of processes to use:

n_processes = opts.get('n_processes', max(1, cpu_count() - 1))

This uses all available CPU cores minus one (to keep the system responsive).
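The same default can be wrapped in a small helper (the function name here is illustrative):

```python
from multiprocessing import cpu_count

def resolve_n_processes(opts):
    """Use the caller's setting if present, else all cores minus one."""
    return opts.get('n_processes', max(1, cpu_count() - 1))

print(resolve_n_processes({'n_processes': 4}))  # honors an explicit setting
print(resolve_n_processes({}))                  # machine-dependent default
```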

Expected Performance Improvements

Performance gains depend on your specific workload:

  1. Insertion Testing:

    • Speedup: ~2-8x (depends on number of blocks)
    • Most beneficial when: Testing many blocks for insertions
  2. Subgraph Processing:

    • Speedup: ~2-16x (depends on number of subgraphs and CPU cores)
    • Most beneficial when: You have many independent subgraphs
  3. Overall Runtime:

    • Typical speedup: 2-6x for most genomic regions
    • Best case: 10-15x for highly parallelizable regions

Important Notes

Memory Considerations

Multiprocessing creates separate processes, each with its own memory:

  • Each process gets a copy of the data it needs
  • Peak memory usage can approach single_process_memory × n_processes
  • Monitor your system's RAM usage
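One way to bound worker memory growth is Pool's maxtasksperchild parameter, which recycles each worker process after a fixed number of tasks; a generic sketch (the task is a stand-in):

```python
from multiprocessing import Pool

def heavy_task(n):
    """Stand-in for a memory-hungry per-item computation."""
    return sum(range(n))

if __name__ == "__main__":
    # maxtasksperchild=10 replaces each worker after 10 tasks, releasing
    # any memory a long-lived worker process would otherwise accumulate
    with Pool(processes=2, maxtasksperchild=10) as pool:
        totals = pool.map(heavy_task, [10_000] * 50)
    print(totals[0])
```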

Limitations

Some parts remain sequential because they:

  1. Require file I/O: Writing to output files must be sequential to maintain order
  2. Share state: Graph decomposition needs to see the full graph
  3. Are already fast: Some operations are quick enough that parallelization overhead wouldn't help

Technical Details

Worker Function Design

Each parallelized operation has a dedicated worker function:

  1. test_single_insertion(args): Tests one block for insertions
  2. process_single_subgraph(args): Processes one subgraph

These functions:

  • Take a tuple of arguments (required by multiprocessing.Pool.map)
  • Are independent (don't share state)
  • Return serializable results (no file handles or complex objects)
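If packing arguments into a single tuple feels awkward, Pool.starmap unpacks each tuple into positional parameters instead; a sketch with a hypothetical worker:

```python
from multiprocessing import Pool

def score_block(block_id, threshold):
    """Hypothetical worker taking ordinary positional arguments."""
    return block_id, block_id >= threshold

if __name__ == "__main__":
    args = [(0, 2), (1, 2), (2, 2)]
    with Pool(processes=2) as pool:
        # starmap unpacks each tuple: score_block(0, 2), score_block(1, 2), ...
        out = pool.starmap(score_block, args)
    print(out)  # [(0, False), (1, False), (2, True)]
```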

Data Serialization

Python's multiprocessing uses pickle to send data between processes:

  • Arguments are serialized (pickled) and sent to worker processes
  • Results are serialized (pickled) and sent back to main process
  • This is why we can't pass file handles or certain complex objects
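This constraint is easy to check directly with pickle: plain data round-trips, while an open file handle raises TypeError:

```python
import pickle
import tempfile

# Plain containers of numbers and strings round-trip cleanly
payload = {"sv_calls": [("chr1", 100, "DEL")], "stats": {"n": 1}}
assert pickle.loads(pickle.dumps(payload)) == payload

# Open file handles cannot be pickled, so workers must not return them
with tempfile.TemporaryFile() as fh:
    try:
        pickle.dumps(fh)
    except TypeError as exc:
        print("unpicklable:", exc)
```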

Debugging Tips

If you encounter issues:

  1. Start with 1 process to verify the code works:

    opts['n_processes'] = 1
  2. Check memory usage:

    # While running, monitor memory in another terminal
    watch -n 1 'free -h'
  3. Enable verbose output:

    opts['verbosity'] = 2  # See detailed progress
  4. Test with small regions first:

    • Process a small genomic region to verify functionality
    • Then scale up to larger regions

Benchmarking

To measure speedup:

import time

# Original version
start = time.perf_counter()  # perf_counter is monotonic, better for timing
do_inference(opts, ...)  # Original function
original_time = time.perf_counter() - start

# Optimized version
start = time.perf_counter()
do_inference(opts, ...)  # Optimized function with n_processes set
optimized_time = time.perf_counter() - start

speedup = original_time / optimized_time
print(f"Speedup: {speedup:.2f}x")

Further Optimization Possibilities

Future enhancements could include:

  1. Path likelihood computation: Parallelize likelihood calculations for different paths within a subgraph
  2. Hybrid parallelization: Use multiprocessing + multithreading for I/O-bound operations
  3. GPU acceleration: For likelihood computations (requires significant refactoring)
  4. Distributed computing: Process different chromosomes on different machines

Conclusion

The multiprocessing optimization provides significant speedups for most genomic SV calling workloads, especially when:

  • Processing large regions with many blocks
  • Analyzing regions with multiple subgraphs
  • Running on multi-core systems

The changes are backward-compatible: setting n_processes to 1 makes the code behave like the original sequential version.

@jgarthur jgarthur self-requested a review November 21, 2025 20:04
@jgarthur
Collaborator

jgarthur commented Dec 7, 2025

@heesuallykim were you able to test this and confirm the speedup?

@heesuallykim
Author

Hi, not yet - give me another 2 weeks for me to get back to you. There are a couple bugs I'm working on. I apologize for the delay!

@jgarthur
Collaborator

> Hi, not yet - give me another 2 weeks for me to get back to you. There are a couple bugs I'm working on. I apologize for the delay!

thanks! no hurry on my end

