Enabling multiprocessing in sv_inference.py #17
Open
heesuallykim wants to merge 3 commits into SUwonglab:master from
Conversation
Collaborator
@heesuallykim were you able to test this and confirm the speedup?
Author
Hi, not yet - give me another 2 weeks to get back to you. There are a couple of bugs I'm working on. I apologize for the delay!
Collaborator
thanks! no hurry on my end
I've added multiprocessing steps to the sv_inference.py code, as well as adding an n_processes option in the CLI. I'm pasting Claude's interpretation of the multiprocessing steps below in a visual guide.
Visual Guide: How Multiprocessing Works in sv_inference
Overview Diagram
Data Flow Diagram
Multiprocessing Optimization for sv_inference.py
Overview
This document explains the multiprocessing optimizations added to sv_inference.py to improve runtime performance. The optimized version (sv_inference_mp.py) uses Python's multiprocessing module to parallelize computationally expensive operations.
Key Changes
1. Parallel Insertion Testing
The original code tested for insertions sequentially across all blocks. This is now parallelized:
Original (Sequential):
Optimized (Parallel):
Why this helps: Each block's insertion test is independent, so they can run simultaneously on different CPU cores.
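The sequential-versus-parallel pattern described here can be sketched roughly as follows. This is not the PR's code: the block representation (an id plus a coverage value), the threshold check, and the names `check_single_insertion`, `insertions_sequential`, and `insertions_parallel` are all hypothetical placeholders. The only parts taken from the description are the idea of one worker call per block and the use of `multiprocessing.Pool.map`.

```python
import multiprocessing


def check_single_insertion(args):
    """Hypothetical stand-in for the per-block insertion test.

    Takes a single tuple so it can be passed directly to Pool.map.
    The comparison below is placeholder logic, not the real test.
    """
    block_id, coverage, threshold = args
    return block_id, coverage > threshold


def insertions_sequential(blocks, threshold):
    # One block after another, in a plain loop.
    return [check_single_insertion((b, cov, threshold)) for b, cov in blocks]


def insertions_parallel(blocks, threshold, n_processes):
    # Bundle everything a worker needs into one tuple per block.
    tasks = [(b, cov, threshold) for b, cov in blocks]
    with multiprocessing.Pool(processes=n_processes) as pool:
        # Pool.map returns results in the same order as the input tasks.
        return pool.map(check_single_insertion, tasks)


if __name__ == "__main__":
    blocks = [(0, 12.0), (1, 48.5), (2, 30.1)]
    print(insertions_parallel(blocks, threshold=25.0, n_processes=2))
```

Because `Pool.map` preserves input order, the parallel version should return the same list as the sequential one, which makes the two easy to compare in a test.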
2. Parallel Subgraph Processing
The most significant optimization is parallelizing subgraph processing:
Original (Sequential):
Optimized (Parallel):
Why this helps: Each subgraph represents an independent region of the genome. Processing them in parallel can dramatically reduce runtime when you have many subgraphs.
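A minimal sketch of how per-subgraph results might be computed in parallel and then merged into one list. The subgraph layout (a list of weighted edges), the `min_support` filter, and the `process_subgraphs` wrapper are assumptions made for illustration; only the worker name `process_single_subgraph(args)` comes from the description, and its body here is a placeholder.

```python
import multiprocessing
from itertools import chain


def process_single_subgraph(args):
    """Worker for one subgraph; receives a single tuple for Pool.map.

    Placeholder logic: keep each edge whose weight reaches min_support.
    """
    subgraph_id, edges, min_support = args
    return [(subgraph_id, u, v) for u, v, weight in edges if weight >= min_support]


def process_subgraphs(subgraphs, min_support, n_processes):
    # Each task carries its own subgraph, so workers share no state.
    tasks = [(i, edges, min_support) for i, edges in enumerate(subgraphs)]
    with multiprocessing.Pool(processes=n_processes) as pool:
        per_subgraph = pool.map(process_single_subgraph, tasks)
    # Flatten the per-subgraph lists into one combined result list.
    return list(chain.from_iterable(per_subgraph))


if __name__ == "__main__":
    subgraphs = [[("a", "b", 3), ("b", "c", 1)], [("x", "y", 5)]]
    print(process_subgraphs(subgraphs, min_support=2, n_processes=2))
```

Each task is pickled and sent to a worker, so very large subgraphs add transfer overhead that can offset some of the gain.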
3. Automatic CPU Detection
The code automatically determines the optimal number of processes to use:
This uses all available CPU cores minus one (to keep the system responsive).
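The "all cores minus one" rule could be written along these lines. The helper names `default_n_processes` and `resolve_n_processes` are hypothetical, and the handling of an explicit user value is an assumption about how the `n_processes` option might be combined with the default.

```python
import multiprocessing


def default_n_processes():
    # All logical CPUs minus one, but never fewer than one process.
    return max(1, multiprocessing.cpu_count() - 1)


def resolve_n_processes(requested=None):
    """Use the caller's value if given, otherwise fall back to the default."""
    if requested is None:
        return default_n_processes()
    if requested < 1:
        raise ValueError("n_processes must be at least 1")
    return requested


if __name__ == "__main__":
    print(resolve_n_processes())   # machine-dependent
    print(resolve_n_processes(4))
```

Note that `multiprocessing.cpu_count()` reports the CPUs in the machine, not the CPUs the current process is allowed to use, so on a shared cluster node or in a container an explicit `n_processes` value may be the safer choice.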
Expected Performance Improvements
Performance gains depend on your specific workload:
Insertion Testing:
Subgraph Processing:
Overall Runtime:
Important Notes
Memory Considerations
Multiprocessing creates separate processes, each with its own memory:
Limitations
Some parts remain sequential because they:
Technical Details
Worker Function Design
Each parallelized operation has a dedicated worker function:
test_single_insertion(args): Tests one block for insertions
process_single_subgraph(args): Processes one subgraph
These functions:
multiprocessing.Pool.map)
Data Serialization
Python's multiprocessing uses pickle to send data between processes:
Debugging Tips
If you encounter issues:
Start with 1 process to verify the code works:
Check memory usage:
Enable verbose output:
Test with small regions first:
Benchmarking
To measure speedup:
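One way to measure speedup from within Python is a small timing helper like the sketch below. `time_call` and `speedup` are hypothetical helpers, not part of the PR; the function being timed is left to the caller.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, parallel_seconds):
    # Values above 1.0 mean the parallel run was faster.
    return baseline_seconds / parallel_seconds


if __name__ == "__main__":
    # Stand-in workload; replace with the sequential and parallel runs.
    t = time_call(sum, range(1_000_000))
    print(f"best of 3: {t:.4f} s")
```

Timing the same input once with `n_processes` set to 1 and once with a larger value, then passing both times to `speedup`, gives a direct comparison. Taking the best of several runs reduces noise from other load on the machine.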
Further Optimization Possibilities
Future enhancements could include:
Conclusion
The multiprocessing optimization is expected to provide speedups for most genomic SV calling workloads, especially when:
The changes are backward-compatible: setting n_processes to 1 causes the code to behave like the original sequential version.
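A sequential fallback of this kind is often written as a single dispatch helper. The sketch below shows the general pattern under that assumption; `run_tasks` is a hypothetical name and this is not necessarily how the PR implements it.

```python
import multiprocessing


def run_tasks(worker, tasks, n_processes=1):
    """Run worker over tasks, in-process when n_processes is 1."""
    if n_processes <= 1:
        # No pool is created, so behavior matches a plain loop and
        # tracebacks and debuggers work as in the sequential code.
        return [worker(task) for task in tasks]
    with multiprocessing.Pool(processes=n_processes) as pool:
        # Results come back in input order, same as the loop above.
        return pool.map(worker, tasks)


if __name__ == "__main__":
    print(run_tasks(abs, [-1, 2, -3], n_processes=1))
    print(run_tasks(abs, [-1, 2, -3], n_processes=2))
```

Keeping both paths behind one function means the sequential and parallel results can be compared directly on the same input.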