Note: Refactored and documented with GitHub Copilot assistance. Based on ATLAS Collaboration code for ML-driven track overlay routing.
Train a neural network to intelligently route ATLAS simulation events:
- MC-overlay: Full simulation (accurate but slow)
- Track-overlay: Fast simulation (approximation)
- Goal: Use Track-overlay when it matches MC-overlay (MatchProb > 0.5), otherwise use MC-overlay
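The routing rule above can be sketched in a few lines (a minimal illustration only; the function name and the score input are assumptions for the sketch, not the framework's actual API):

```python
import numpy as np

def route_events(scores, threshold=0.5):
    """Route each event by its classifier score.

    The scores approximate P(Track-overlay matches MC-overlay); events
    above the threshold use the fast Track-overlay, the rest fall back
    to the full MC-overlay simulation.
    """
    scores = np.asarray(scores, dtype=float)
    return np.where(scores > threshold, "Track-overlay", "MC-overlay")

routes = route_events([0.9, 0.2, 0.7])
```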
The framework requires ATLAS simulation data in specific formats:
/eos/user/f/fatsai/TrackOverlayDATA/matched_JZ7W_data.h5
/eos/user/f/fatsai/TrackOverlayDATA/unmatched_JZ7W_data.h5
/eos/user/f/fatsai/TrackOverlayDATA/MCOverlay_JZ7W/*.csv
/eos/user/f/fatsai/TrackOverlayDATA/TrackOverlay_JZ7W/*.csv
Access: These datasets are stored on CERN EOS and require ATLAS collaboration access rights.
To request access:
- Contact: [email protected]
Expected directory structure:
data/
├── MC-overlay_JZ7W/
│ ├── file1.csv
│ ├── file2.csv
│ └── ...
└── Track-overlay_JZ7W/
├── file1.csv
├── file2.csv
└── ...
For other samples, substitute the sample name in the directory names:
data/
├── MC-overlay_ttbar/
│ └── *.csv
├── Track-overlay_ttbar/
│ └── *.csv
├── MC-overlay_JZ7W/
│ └── *.csv
└── Track-overlay_JZ7W/
└── *.csv
Setting up your data:
# Create directories for your sample
mkdir -p data/MC-overlay_ttbar
mkdir -p data/Track-overlay_ttbar
and copy or link your CSV files accordingly.

The easiest way to run this framework is with the pre-built Singularity container, which includes all dependencies:
# Pull the container (only needed once)
singularity pull docker://fyingtsai/dsnnr_4gpu:v5
# or on Perlmutter
podman-hpc pull docker://fyingtsai/dsnnr_4gpu:v5

If you cannot use Singularity, install dependencies locally:
Option A: Using uv
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
# Optionally install the project as a package
# uv pip install -e .
uv run python scripts/prepare_data.py --sample JZ7W --path data

Option B: Using Conda
# Create environment from file
conda env create -f environment.yml --prefix /path/to/your/scratch/trackoverlay-ml
# Activate environment
conda activate /path/to/your/scratch/trackoverlay-ml

Option C: Using pip
pip install "tensorflow>=2.8.0" "numpy>=1.21.0" "pandas>=1.3.0" "scikit-learn>=1.0.0" "matplotlib>=3.5.0,<3.9.0" "seaborn>=0.11.0" "tables>=3.7.0" "statsmodels>=0.13.0" "mplhep>=0.3.28,<0.4.0" "xarray>=0.20.0"

Note: All examples in this README assume Singularity usage. For a local installation, drop the singularity exec dsnnr_4gpu_v5.sif prefix.
Example:
# Full pipeline
singularity exec dsnnr_4gpu_v5.sif python scripts/run_pipeline.py --sample JZ7W --epochs 5
# Or run steps individually (recommended: easier debugging and finer control)
singularity exec dsnnr_4gpu_v5.sif python scripts/prepare_data.py --sample JZ7W --path data
singularity exec dsnnr_4gpu_v5.sif python scripts/train_model.py --sample JZ7W --path data --epochs 5
singularity exec dsnnr_4gpu_v5.sif python scripts/evaluate_model.py --sample JZ7W

# Train on a balanced 10k + 10k subset
singularity exec dsnnr_4gpu_v5.sif python scripts/train_model.py --sample ttbar --matched_size 10000 --unmatched_size 10000
# Train on realistic imbalanced ratio (1:10)
singularity exec dsnnr_4gpu_v5.sif python scripts/train_model.py --sample ttbar --matched_size 5000 --unmatched_size 50000
# Use all matched, but limit unmatched
singularity exec dsnnr_4gpu_v5.sif python scripts/train_model.py --sample ttbar --unmatched_size 20000
# Full pipeline with balanced training
singularity exec dsnnr_4gpu_v5.sif python scripts/run_pipeline.py --stage all --sample ttbar --path data --matched_size 5000 --unmatched_size 5000 --epochs 20
# Full pipeline with cross-sample evaluation
singularity exec dsnnr_4gpu_v5.sif python scripts/run_pipeline.py --stage all --sample ttbar --eval_sample JZ7W
# Just train on subset
singularity exec dsnnr_4gpu_v5.sif python scripts/run_pipeline.py --stage train --sample ttbar --matched_size 10000 --unmatched_size 10000

TrackOverlayML/
├── data/ # Data directory (--path to customize)
│ ├── MC-overlay_{sample}/ # MC workflow CSVs (required unless the h5 dataframes already exist)
│ ├── Track-overlay_{sample}/ # Track workflow CSVs (required unless the h5 dataframes already exist)
│ ├── matched_{sample}_data.h5 # Good matches (pre-created)
│ └── unmatched_{sample}_data.h5 # Poor matches (pre-created)
├── scripts/ # Main entry points
│ ├── prepare_data.py # Merge MC/Track, compute features
│ ├── train_model.py # Train classifier
│ ├── evaluate_model.py # Evaluate performance
│ └── run_pipeline.py # Run all steps
├── network/classifier.py # Model architecture
├── utils/ # Evaluation & plotting
└── results/ # Outputs (models, plots, logs)
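The classifier in network/classifier.py is a small feed-forward network. A hedged sketch of what the default configuration (--layers 45 35 30, sigmoid output, Adam with lr 0.001) plausibly looks like in Keras follows; the actual architecture in the repository may differ:

```python
from tensorflow import keras

def build_classifier(n_features, hidden=(45, 35, 30), lr=1e-3):
    # Simple MLP: ReLU hidden layers sized per --layers, single sigmoid
    # output for the binary matched/unmatched decision.
    inputs = keras.Input(shape=(n_features,))
    x = inputs
    for units in hidden:
        x = keras.layers.Dense(units, activation="relu")(x)
    outputs = keras.layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_classifier(n_features=10)
```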
# Step 1: Prepare data (merge MC/Track workflows)
singularity exec dsnnr_4gpu_v5.sif python scripts/prepare_data.py --sample ttbar --trainsplit 0.8
# Step 2: Train model
singularity exec dsnnr_4gpu_v5.sif python scripts/train_model.py --sample ttbar --epochs 200
# Step 3: Evaluate (same sample)
singularity exec dsnnr_4gpu_v5.sif python scripts/evaluate_model.py --sample ttbar
# Step 3b: Evaluate on different sample
singularity exec dsnnr_4gpu_v5.sif python scripts/evaluate_model.py --sample ttbar --eval_sample JZ7W

Train multiple models on the same data:
singularity exec dsnnr_4gpu_v5.sif python scripts/prepare_data.py --sample ttbar
singularity exec dsnnr_4gpu_v5.sif python scripts/train_model.py --sample ttbar --layers 32 16 8
singularity exec dsnnr_4gpu_v5.sif python scripts/train_model.py --sample ttbar --layers 64 32 16

Cross-sample evaluation:
# Train on ttbar, test on JZ7W
singularity exec dsnnr_4gpu_v5.sif python scripts/train_model.py --sample ttbar
singularity exec dsnnr_4gpu_v5.sif python scripts/prepare_data.py --sample JZ7W
singularity exec dsnnr_4gpu_v5.sif python scripts/evaluate_model.py --sample ttbar --eval_sample JZ7W

Quick evaluation on a subset:
singularity exec dsnnr_4gpu_v5.sif python scripts/evaluate_model.py --sample ttbar --matched_size 5000 --unmatched_size 50000

| Argument | Default | Description |
|---|---|---|
| --path | data | Data directory path |
| --sample | JZ7W | Sample name (ttbar, JZ7W, etc.) |
| --eval_sample | None | Different sample for evaluation |
| --trainsplit | 0.8 | Train/test split ratio |
| --epochs | 100 | Training epochs |
| --batchsize | 80 | Batch size |
| --lr | 0.001 | Learning rate |
| --layers | 45 35 30 | Hidden layer sizes |
| --patience | 20 | Early stopping patience |
| --rouletter | smart | Roulette type (smart/hard) |
| --matched_size | None | Limit matched samples for training/eval |
| --unmatched_size | None | Limit unmatched samples for training/eval |
Run python scripts/run_pipeline.py --help for full list.
MC-overlay_{sample}/ Track-overlay_{sample}/
└── *.csv └── *.csv
↓ ↓
└──── Merge on EventNumber ─┘
↓
Create labels (MatchProb > 0.5)
↓
matched_*.h5 (good) & unmatched_*.h5 (poor)
↓
Train/Test split
↓
Train classifier
↓
Evaluate performance
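The merge-and-label steps above can be sketched with pandas on toy data (the EventNumber, MatchProb, and TargetLabel names follow this README's descriptions; the real scripts use many more feature columns):

```python
import pandas as pd

# Toy stand-ins for the per-workflow CSVs
mc = pd.DataFrame({"EventNumber": [1, 2, 3], "MatchProb": [0.9, 0.3, 0.6]})
track = pd.DataFrame({"EventNumber": [1, 2, 3], "trk_feature": [10.1, 5.2, 7.7]})

# Merge on EventNumber, then label: MatchProb > 0.5 means Track-overlay is accurate
merged = mc.merge(track, on="EventNumber")
merged["TargetLabel"] = (merged["MatchProb"] > 0.5).astype(int)

matched = merged[merged["TargetLabel"] == 1]    # would go to matched_*.h5
unmatched = merged[merged["TargetLabel"] == 0]  # would go to unmatched_*.h5
```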
results/{sample}/
├── classifier/
│ ├── classifier.h5 # Trained model
│ └── history.pkl # Training history
├── logs/ # Logs for each step
└── {xscore}/{rouletter}/
└── plots/ # ROC, efficiency, fraction plots
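To inspect a training run, history.pkl can be loaded and plotted. A self-contained sketch (the exact keys stored in history.pkl are an assumption, so a stand-in dict is written first):

```python
import pickle
import matplotlib
matplotlib.use("Agg")  # headless backend for batch nodes
import matplotlib.pyplot as plt

# Stand-in for results/{sample}/classifier/history.pkl
history = {"loss": [0.69, 0.55, 0.48], "val_loss": [0.70, 0.58, 0.52]}
with open("history.pkl", "wb") as f:
    pickle.dump(history, f)

with open("history.pkl", "rb") as f:
    hist = pickle.load(f)

plt.plot(hist["loss"], label="train")
plt.plot(hist["val_loss"], label="validation")
plt.xlabel("epoch")
plt.ylabel("binary cross-entropy")
plt.legend()
plt.savefig("loss_curve.png")
```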
- Matched (TargetLabel=1): MatchProb > 0.5 (Track-overlay accurate)
- Unmatched (TargetLabel=0): MatchProb ≤ 0.5 (needs MC-overlay)
- Preprocessed HDF5 files are cached for faster reruns
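The HDF5 caching works through pandas' HDF5 I/O (which requires the tables package); a minimal sketch, with a demo file name mirroring the matched_{sample}_data.h5 convention:

```python
import pandas as pd

# Write the preprocessed dataframe once; requires the `tables` package
df = pd.DataFrame({"EventNumber": [1, 2], "MatchProb": [0.9, 0.2]})
df.to_hdf("matched_demo_data.h5", key="df", mode="w")

# Subsequent runs read the cache instead of re-merging the CSVs
cached = pd.read_hdf("matched_demo_data.h5", key="df")
```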
When making changes:
- Keep function docstrings updated
- Add inline comments for complex physics calculations
- Update this README if workflow changes