This repository collects, harmonizes and enriches experiment metadata for EpiATLAS. It combines multiple provenance sources (EpiRR, ENCODE, local DEEP TSVs and local Excel sheets) to produce a single, analysis-ready experiment metadata table used by downstream pipelines.
## Quick summary
- Purpose: Resolve and merge antibody & experiment metadata from ENCODE, EpiRR and local sources.
- Primary outputs: `EpiATLAS_experiment_metadata_extended.csv` (enriched metadata) and the unfiltered long table `experiment_metadata_extended_long.csv`, which contains all information gathered in this effort (not only EpiATLAS reprocessed data).
- Primary entry point: `collate_experiment_metadata.ipynb` — the main workflow, implemented as a Jupyter notebook.
- `utils.py` — helper functions for HTTP requests, ENCODE/EpiRR lookups, DEEP TSV parsing, XML/JSON flattening, and dataframe utilities.
- `EpiATLAS_experiment_metadata.csv` — input table from SFTP that drives lookups (place your updated input here before running).
- `environment.yml` — conda/micromamba environment specification for reproducible installs.
- Local DEEP TSV files under `deep_experiments/`, following the repository's expected naming conventions.
- Optional local Excel sheets (e.g., CREST exports) for additional antibody provenance.
- `EpiATLAS_experiment_metadata_extended.csv` — enriched, merged experiment metadata (primary deliverable).
- `experiment_metadata_extended_long.csv` — unfiltered, uncleaned long table containing all information gathered in this effort (not only EpiATLAS reprocessed data).
- `encode_roadmap_antibody_dicts.json` — cached per-experiment antibody extracts from the ENCODE portal.
- `epirr_experiment_data.json` — cached EpiRR experiment responses.
- Load the input `EpiATLAS_experiment_metadata.csv` into a Polars DataFrame and validate basic assumptions (no entirely-null columns, expected key columns present).
- Download the IHEC metadata spec from GitHub and materialize a mapping of expected columns per assay.
- Query EpiRR for experiment details (caching responses to `epirr_experiment_data.json`) and normalize returned keys to match the EpiATLAS schema.
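The lookup-with-caching step above can be sketched as a small generic helper (the name `cached_lookup` and its signature are illustrative, not the actual API in `utils.py`):

```python
import json
from pathlib import Path


def cached_lookup(ids, cache_path, fetch_one):
    """Fetch metadata per ID, reusing a JSON cache on disk.

    fetch_one is called only for IDs missing from the cache; results are
    persisted so repeated notebook runs skip the network entirely.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    for _id in ids:
        if _id not in cache:
            cache[_id] = fetch_one(_id)  # e.g., an HTTP GET to the EpiRR API
    path.write_text(json.dumps(cache))
    return cache
```

In the notebook, `fetch_one` would wrap the actual EpiRR HTTP request; the same pattern backs `encode_roadmap_antibody_dicts.json`.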
- Discover and read local experiment metadata files (EGAX XML/JSON, DEEP TSVs, CREST Excel, EpiHK files, etc.), flattening nested JSON/XML into tables and adding a file/`PRIMARY_ID` provenance column.
- Coerce and normalize assay and experiment type labels (e.g., map Bisulfite-Seq → WGBS, normalize RNA-Seq subtypes) so joins are consistent.
- Merge EpiRR-derived records with local files using `PRIMARY_ID` and `experiment_type` where available; fall back to merging by `epirr_id_without_version` + `assay_type` when `experiment_type` is missing.
- Enrich merged rows with provenance-specific lookups:
  - fetch per-experiment antibody metadata from ENCODE / Roadmap (cached to `encode_roadmap_antibody_dicts.json`) and join on `PRIMARY_ID`;
  - ingest local DEEP, CREST, EpiHK, GIS and manual CEMT metadata and coalesce them into the master table.
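The two-level merge key can be sketched with plain dicts for clarity (the notebook operates on Polars DataFrames; only the column names are taken from the text above):

```python
def merge_with_fallback(epirr_rows, local_rows):
    """Join EpiRR-derived rows to local metadata rows.

    Prefer the (PRIMARY_ID, experiment_type) key; fall back to
    (epirr_id_without_version, assay_type) when experiment_type is missing.
    """
    # Index local rows by both candidate keys.
    primary = {(r["PRIMARY_ID"], r["experiment_type"]): r
               for r in local_rows if r.get("experiment_type")}
    fallback = {(r["epirr_id_without_version"], r["assay_type"]): r
                for r in local_rows}
    merged = []
    for row in epirr_rows:
        if row.get("experiment_type"):
            match = primary.get((row["PRIMARY_ID"], row["experiment_type"]))
        else:
            match = fallback.get((row["epirr_id_without_version"],
                                  row["assay_type"]))
        # EpiRR-derived values win on conflicts; local columns fill the rest.
        merged.append({**(match or {}), **row})
    return merged
```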
- Resolve duplicated and near-duplicate column names (case and minor spelling variants) — group and coalesce known groups and use fuzzy matching (Levenshtein) to find candidates for manual review.
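The near-duplicate detection can be sketched with the standard library (the notebook uses Levenshtein distance; `difflib.SequenceMatcher` is a stdlib stand-in with the same intent, and the threshold here is an assumption):

```python
from difflib import SequenceMatcher


def near_duplicate_columns(columns, threshold=0.85):
    """Flag column-name pairs that differ only by case or minor spelling.

    Returns candidate pairs for manual review, not automatic merges.
    """
    candidates = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            same_ignoring_case = a.lower() == b.lower()
            similar = SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
            if same_ignoring_case or similar:
                candidates.append((a, b))
    return candidates
```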
- Coalesce column groups so each logical metadata field has a single canonical column; enforce that at most one non-null value exists per canonical group per row.
- Collapse and aggregate enrichment columns into list-valued fields where multiple provenance sources supply different values; drop placeholder/null tokens and normalize empty lists to null.
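The list-valued aggregation might look like this (the placeholder-token set is an assumption; the notebook's actual token list may differ):

```python
# Placeholder values treated as "no information" (illustrative set).
NULL_TOKENS = {"", "NA", "N/A", "none", "unknown", None}


def aggregate_values(values):
    """Collapse values from multiple provenance sources into a deduplicated
    list, dropping placeholder/null tokens; empty results become None."""
    cleaned = []
    for v in values:
        if v in NULL_TOKENS or (isinstance(v, str) and v.strip() in NULL_TOKENS):
            continue
        if v not in cleaned:  # preserve first-seen order, drop duplicates
            cleaned.append(v)
    return cleaned or None
```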
- Validate that the enriched dataset still maps 1:1 to the input EpiATLAS experiments (no accidental loss or uncontrolled duplication), compute per-column completeness statistics, and produce diagnostic plots (heatmaps of completeness by centre/assay).
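The 1:1 validation and completeness profiling can be sketched as follows (plain-Python sketch; the notebook computes the same statistics on a Polars DataFrame):

```python
def validate_and_profile(input_ids, enriched_rows):
    """Check the enriched table still maps 1:1 onto the input experiments,
    then return per-column completeness (fraction of non-null values)."""
    ids = [r["PRIMARY_ID"] for r in enriched_rows]
    assert len(ids) == len(set(ids)), "uncontrolled duplication"
    assert set(ids) == set(input_ids), "experiments lost or added"
    columns = {c for r in enriched_rows for c in r}
    n = len(enriched_rows)
    return {c: sum(r.get(c) is not None for r in enriched_rows) / n
            for c in columns}
```

The returned per-column fractions are what feed the completeness heatmaps by centre/assay.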
- Export the final deliverables:
  - `EpiATLAS_experiment_metadata_extended.csv` — enriched, merged metadata;
  - `experiment_metadata_extended_long.csv` — unfiltered, uncleaned long table containing all information gathered in this effort (not only EpiATLAS reprocessed data).
## Notes
- The notebook is iterative — cached JSONs and locally persisted intermediate CSVs speed repeated development.
- Any new provenance source should add a loader in `utils.py`, a small mapping into the merge logic, and accompanying tests for uniqueness/coalescing.
- The first runs will query external APIs (ENCODE, EpiRR). To speed up iterative development, the notebook creates and reuses `epirr_experiment_data.json` and `encode_roadmap_antibody_dicts.json`.
- `utils.py` contains the HTTP retry logic and the TSV/JSON/XML flattening utilities. If you add new provenance sources, extend `utils.py` and the notebook mapping tables.
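The retry pattern in `utils.py` is, in spirit, something like the following sketch (function name and parameters are illustrative, not the actual API):

```python
import time


def with_retries(fn, attempts=3, backoff=0.5):
    """Wrap a flaky callable (e.g., an HTTP request) with retries.

    Retries on any exception, sleeping with exponential backoff between
    attempts, and re-raises after the final failure.
    """
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(backoff * 2 ** attempt)
    return wrapper
```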
If you run into problems or have questions, please open an issue.
(README created with generative AI and corrected by Quirin Manz.)