This repository provides Apache Airflow-based data-processing pipelines.
Airflow pipelines are defined as DAGs (Directed Acyclic Graphs) in Python, allowing for flexible, modular, and maintainable workflows.
Currently we have one pipeline for automated mass spectrometry file conversion using ProteoWizard.
It is designed to watch a shared storage directory for new raw .d directories, convert them to open formats (mzML or mzXML), and optionally archive the original data.
```
.
├── dags/                 # All DAGs live here
│   └── msconvert_dag.py  # Main conversion DAG
├── docker-compose.yml    # Airflow deployment configuration
├── Dockerfile            # Optional image for testing / debugging
├── requirements.txt
└── README.md
```
- `dags/` – each DAG is a self-contained Python file. Airflow automatically loads valid DAGs from here.
- `docker-compose.yml` – sets up Airflow components and mounts required directories.
- `Dockerfile` – optional base image for testing conversion locally.
- `requirements.txt` – Python dependencies for optional utilities.
- Docker and Docker Compose installed.
- A host directory with incoming mass spectrometry data in `.d` format.
- A `.env` file setting the environment variables Airflow needs. At minimum it must define `FERNET_KEY` and `SECRET_KEY` (see the sketch after this list).
- A pre-seeded Wine prefix with `msconvert.exe` installed (seeding commands below).
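One common way to generate suitable values for those two keys (a sketch, not the only option; it assumes the `cryptography` package is available, which ships as an Airflow dependency):

```bash
# Generate a Fernet key (Airflow uses it to encrypt connections/variables):
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

# Generate a webserver secret:
python -c "import secrets; print(secrets.token_hex(32))"
```

Then place the outputs in `.env`:

```
FERNET_KEY=<output of the first command>
SECRET_KEY=<output of the second command>
```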
To seed the Wine prefix, run:

```bash
docker run --rm -d --name seed_wine proteowizard/pwiz-skyline-i-agree-to-the-vendor-licenses sleep infinity
sudo docker cp seed_wine:/wineprefix64 /var/lib/msconvert/wineprefix64
```

Then bring the stack up:

```bash
sudo docker compose up -d
```

Then open http://localhost:8080 to access the Airflow UI.
(Default credentials are defined in docker-compose.yml as admin / admin.)
Before running the DAG, set the following Airflow Variables (in Admin → Variables or via CLI):
| Variable Name | Example Value | Description |
|---|---|---|
| `MS_HOST_DATA_DIR` | `/mnt/aichemyrds/live/roar_ms_data` | Host path to the watched data directory |
| `MS_WATCH_DIR` | `/data` | Path inside container to watch for `.d` dirs |
| `MS_OUTPUT_DIR` | `/data/mzML` | Output directory for converted files |
| `MS_ARCHIVE_DIR` | `/data/archives` | Where to store archived originals |
| `MS_HOST_WINECACHE_DIR` | `/var/lib/msconvert/wineprefix64` | Pre-seeded wineprefix with `msconvert.exe` |
| `MS_FORMAT` | `mzML` | Output format (`mzML` or `mzXML`) |
Other optional variables exist (e.g. concurrency pool, archiving policy).
Check dags/msconvert_dag.py for the full list.
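Variables can also be managed from the CLI; for example (the JSON path below is a placeholder):

```bash
# Set variables one at a time via the scheduler container:
sudo docker compose exec airflow-scheduler airflow variables set MS_FORMAT mzML
sudo docker compose exec airflow-scheduler airflow variables set MS_WATCH_DIR /data

# Or import several at once from a JSON file:
sudo docker compose exec airflow-scheduler airflow variables import /path/to/variables.json
```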
The main DAG is called `msconvert_watch_simple`.
- Ensure your input data directory (e.g. `/mnt/aichemyrds/live/roar_ms_data`) is mounted and contains `.d` directories.
- Enable the DAG in the Airflow UI.
- Either let it run on its 2-minute polling schedule or trigger it manually (CLI example below).
- Monitor logs and task progress in the UI.
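A manual run can be triggered from the CLI as well:

```bash
# Trigger one run of the conversion DAG immediately:
sudo docker compose exec airflow-scheduler airflow dags trigger msconvert_watch_simple

# Optionally follow its runs from the CLI instead of the UI:
sudo docker compose exec airflow-scheduler airflow dags list-runs -d msconvert_watch_simple
```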
The `msconvert_watch_simple` DAG performs:
- Discover New Runs – scans the watch directory for `.d` directories not yet converted.
- Wait Until Quiet – ensures the directory is no longer being written to.
- Convert One – uses `DockerOperator` to launch a container running `msconvert.exe` under Wine, outputting `.mzML` or `.mzXML` (see the sketch after this list).
- Archive Original (optional) – tars and gzips the source directory, then optionally deletes it.
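As a rough illustration of the conversion step (not the repository's exact code – the task id, mount paths, templated run name, and `msconvert` flags below are assumptions; see `dags/msconvert_dag.py` for the real task), a `DockerOperator` invocation might look like:

```python
# Hypothetical sketch of the "Convert One" task. The image matches the
# seeding step above; everything else is illustrative.
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount

convert_one = DockerOperator(
    task_id="convert_one",
    image="proteowizard/pwiz-skyline-i-agree-to-the-vendor-licenses",
    # Convert a single .d directory to mzML; in the real DAG the run name
    # would come from the discovery task (XCom) and the paths from Variables.
    command=(
        "wine msconvert /data/{{ dag_run.conf['run_name'] }}.d "
        "--mzML -o /data/mzML"
    ),
    mounts=[
        # Watched data directory (MS_HOST_DATA_DIR on the host).
        Mount(source="/mnt/aichemyrds/live/roar_ms_data", target="/data", type="bind"),
        # Pre-seeded wineprefix (MS_HOST_WINECACHE_DIR on the host).
        Mount(source="/var/lib/msconvert/wineprefix64", target="/wineprefix64", type="bind"),
    ],
    auto_remove="success",  # a bool in older docker-provider versions
    docker_url="unix://var/run/docker.sock",
)
```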
To add new pipelines:
- Create a new Python file in `dags/` (e.g. `new_pipeline_dag.py`).
- Define a DAG object following Airflow conventions (a minimal skeleton is sketched after this list).
- Airflow will auto-discover the new DAG within a minute or after a scheduler restart.
- Set any required Variables.
- Trigger via UI or API.
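A minimal skeleton under Airflow 2 conventions (the dag id, schedule, and task here are placeholders, not part of this repository):

```python
# dags/new_pipeline_dag.py -- placeholder name from the list above.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def hello():
    print("new pipeline task ran")


with DAG(
    dag_id="new_pipeline",          # placeholder id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",              # `schedule_interval` on Airflow < 2.4
    catchup=False,
) as dag:
    PythonOperator(task_id="hello", python_callable=hello)
```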
```bash
# Bring stack up / down
sudo docker compose up -d
sudo docker compose down

# View container logs
sudo docker compose logs -f airflow-scheduler
sudo docker compose logs -f airflow-webserver

# Manage Airflow Variables
sudo docker compose exec airflow-scheduler airflow variables list
```

✅ Note: This deployment uses the LocalExecutor with a single scheduler. For higher throughput, you can scale out with the CeleryExecutor or KubernetesExecutor.
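In stock Airflow images the executor is selected via the `AIRFLOW__CORE__EXECUTOR` environment variable; as a sketch (the exact shape of this repository's compose file may differ, and CeleryExecutor additionally needs a broker, result backend, and worker services):

```yaml
# Illustrative docker-compose.yml fragment, not this repository's actual file.
environment:
  AIRFLOW__CORE__EXECUTOR: CeleryExecutor   # default here is LocalExecutor
```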