This repository collects, harmonizes and enriches experiment metadata for EpiATLAS. It combines multiple provenance sources (EpiRR, ENCODE, local DEEP TSVs and local Excel sheets) to produce a single, analysis-ready experiment metadata table used by downstream pipelines.
## Quick summary
- Purpose: Resolve and merge antibody & experiment metadata from ENCODE, EpiRR and local sources.
- Primary outputs: `EpiATLAS_experiment_metadata_extended.csv` (enriched metadata) and the unfiltered long table `experiment_metadata_extended_long.csv`, which contains all information gathered in this effort (not only EpiATLAS reprocessed data).
- Primary entry point: `collate_experiment_metadata.ipynb` — the main workflow, implemented as a Jupyter notebook.
- `utils.py` — helper functions for HTTP requests, ENCODE/EpiRR lookups, DEEP TSV parsing, XML/JSON flattening, and dataframe utilities.
- `EpiATLAS_experiment_metadata.csv` — input table from SFTP that drives lookups (place your updated input here before running).
- `environment.yml` — conda/micromamba environment specification for reproducible installs.
- Local DEEP TSV files under `deep_experiments/`, following the repository's expected naming conventions.
- Optional local Excel sheets (e.g., CREST exports) for additional antibody provenance.
- `EpiATLAS_experiment_metadata_extended.csv` — enriched, merged experiment metadata (primary deliverable).
- `experiment_metadata_extended_long.csv` — unfiltered, uncleaned long table containing all information gathered in this effort (not only EpiATLAS reprocessed data).
- `encode_roadmap_antibody_dicts.json` — cached per-experiment antibody extracts from the ENCODE portal.
- `epirr_experiment_data.json` — cached EpiRR experiment responses.
- Load the input `EpiATLAS_experiment_metadata.csv` into a Polars DataFrame and validate basic assumptions (no entirely-null columns, expected key columns present).
- Download the IHEC metadata spec from GitHub and materialize a mapping of expected columns per assay.
- Query EpiRR for experiment details (caching responses to `epirr_experiment_data.json`) and normalize returned keys to match the EpiATLAS schema.
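The lookup-with-caching step above can be sketched as a small generic helper (the name `cached_lookup` and its signature are illustrative, not the actual API in `utils.py`):

```python
import json
from pathlib import Path


def cached_lookup(ids, cache_path, fetch_one):
    """Fetch metadata per ID, reusing a JSON cache on disk.

    fetch_one is called only for IDs missing from the cache; results are
    persisted so repeated notebook runs skip the network entirely.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    for _id in ids:
        if _id not in cache:
            cache[_id] = fetch_one(_id)  # e.g., an HTTP GET to the EpiRR API
    path.write_text(json.dumps(cache))
    return cache
```

In the notebook, `fetch_one` would wrap the actual EpiRR HTTP request; the same pattern backs `encode_roadmap_antibody_dicts.json`.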
- Discover and read local experiment metadata files (EGAX XML/JSON, DEEP TSVs, CREST Excel, EpiHK files, etc.), flattening nested JSON/XML into tables and adding a file/`PRIMARY_ID` provenance column.
- Coerce and normalize assay and experiment type labels (e.g., map Bisulfite-Seq → WGBS, normalize RNA-Seq subtypes) so joins are consistent.
- Merge EpiRR-derived records with local files using `PRIMARY_ID` and `experiment_type` where available; fall back to merging by `epirr_id_without_version` + `assay_type` when `experiment_type` is missing.
- Enrich merged rows with provenance-specific lookups:
  - fetch per-experiment antibody metadata from ENCODE / Roadmap (cached to `encode_roadmap_antibody_dicts.json`) and join on `PRIMARY_ID`;
  - ingest local DEEP, CREST, EpiHK, GIS and manual CEMT metadata and coalesce them into the master table.
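The two-level merge key can be sketched with plain dicts for clarity (the notebook operates on Polars DataFrames; only the column names are taken from the text above):

```python
def merge_with_fallback(epirr_rows, local_rows):
    """Join EpiRR-derived rows to local metadata rows.

    Prefer the (PRIMARY_ID, experiment_type) key; fall back to
    (epirr_id_without_version, assay_type) when experiment_type is missing.
    """
    # Index local rows by both candidate keys.
    primary = {(r["PRIMARY_ID"], r["experiment_type"]): r
               for r in local_rows if r.get("experiment_type")}
    fallback = {(r["epirr_id_without_version"], r["assay_type"]): r
                for r in local_rows}
    merged = []
    for row in epirr_rows:
        if row.get("experiment_type"):
            match = primary.get((row["PRIMARY_ID"], row["experiment_type"]))
        else:
            match = fallback.get((row["epirr_id_without_version"],
                                  row["assay_type"]))
        # EpiRR-derived values win on conflicts; local columns fill the rest.
        merged.append({**(match or {}), **row})
    return merged
```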
- Resolve duplicated and near-duplicate column names (case and minor spelling variants) — group and coalesce known groups and use fuzzy matching (Levenshtein) to find candidates for manual review.
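The near-duplicate detection can be sketched with the standard library (the notebook uses Levenshtein distance; `difflib.SequenceMatcher` is a stdlib stand-in with the same intent, and the threshold here is an assumption):

```python
from difflib import SequenceMatcher


def near_duplicate_columns(columns, threshold=0.85):
    """Flag column-name pairs that differ only by case or minor spelling.

    Returns candidate pairs for manual review, not automatic merges.
    """
    candidates = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            same_ignoring_case = a.lower() == b.lower()
            similar = SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
            if same_ignoring_case or similar:
                candidates.append((a, b))
    return candidates
```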
- Coalesce column groups so each logical metadata field has a single canonical column; enforce that at most one non-null value exists per canonical group per row.
- Collapse and aggregate enrichment columns into list-valued fields where multiple provenance sources supply different values; drop placeholder/null tokens and normalize empty lists to null.
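The list-valued aggregation might look like this (the placeholder-token set is an assumption; the notebook's actual token list may differ):

```python
# Placeholder values treated as "no information" (illustrative set).
NULL_TOKENS = {"", "NA", "N/A", "none", "unknown", None}


def aggregate_values(values):
    """Collapse values from multiple provenance sources into a deduplicated
    list, dropping placeholder/null tokens; empty results become None."""
    cleaned = []
    for v in values:
        if v in NULL_TOKENS or (isinstance(v, str) and v.strip() in NULL_TOKENS):
            continue
        if v not in cleaned:  # preserve first-seen order, drop duplicates
            cleaned.append(v)
    return cleaned or None
```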
- Validate that the enriched dataset still maps 1:1 to the input EpiATLAS experiments (no accidental loss or uncontrolled duplication), compute per-column completeness statistics, and produce diagnostic plots (heatmaps of completeness by centre/assay).
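The 1:1 validation and completeness profiling can be sketched as follows (plain-Python sketch; the notebook computes the same statistics on a Polars DataFrame):

```python
def validate_and_profile(input_ids, enriched_rows):
    """Check the enriched table still maps 1:1 onto the input experiments,
    then return per-column completeness (fraction of non-null values)."""
    ids = [r["PRIMARY_ID"] for r in enriched_rows]
    assert len(ids) == len(set(ids)), "uncontrolled duplication"
    assert set(ids) == set(input_ids), "experiments lost or added"
    columns = {c for r in enriched_rows for c in r}
    n = len(enriched_rows)
    return {c: sum(r.get(c) is not None for r in enriched_rows) / n
            for c in columns}
```

The returned per-column fractions are what feed the completeness heatmaps by centre/assay.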
- Export the final deliverables:
  - `EpiATLAS_experiment_metadata_extended.csv` — enriched, merged metadata;
  - `experiment_metadata_extended_long.csv` — unfiltered, uncleaned long table containing all information gathered in this effort (not only EpiATLAS reprocessed data).
## Notes
- The notebook is iterative — cached JSONs and locally persisted intermediate CSVs speed repeated development.
- Any new provenance source should add a loader in `utils.py`, a small mapping into the merge logic, and accompanying tests for uniqueness/coalescing.
- The first runs will query external APIs (ENCODE, EpiRR). To speed up iterative development, the notebook creates and reuses `epirr_experiment_data.json` and `encode_roadmap_antibody_dicts.json`.
- `utils.py` contains the HTTP retry logic and the TSV/JSON/XML flattening utilities. If you add new provenance sources, extend `utils.py` and the notebook mapping tables.
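The retry pattern in `utils.py` is, in spirit, something like the following sketch (function name and parameters are illustrative, not the actual API):

```python
import time


def with_retries(fn, attempts=3, backoff=0.5):
    """Wrap a flaky callable (e.g., an HTTP request) with retries.

    Retries on any exception, sleeping with exponential backoff between
    attempts, and re-raises after the final failure.
    """
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(backoff * 2 ** attempt)
    return wrapper
```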
If you run into problems or have questions, please open an issue.
(README created with generative AI and corrected by Quirin Manz.)