IHEC/EpiATLAS_experiment_metadata
EpiATLAS — Experiment Metadata Collation

This repository collects, harmonizes and enriches experiment metadata for EpiATLAS. It combines multiple provenance sources (EpiRR, ENCODE, local DEEP TSVs and local Excel sheets) to produce a single, analysis-ready experiment metadata table used by downstream pipelines.

Quick summary

Additional Inputs

  • Local DEEP TSV files under deep_experiments/ following the repository's expected naming conventions.
  • Optional local Excel sheets (e.g., CREST exports) for additional antibody provenance.

Outputs

How it works (high level)

  1. Load the input EpiATLAS_experiment_metadata.csv into a Polars DataFrame and validate basic assumptions (no entirely-null columns, expected key columns present).
  2. Download the IHEC metadata spec from GitHub and materialize a mapping of expected columns per assay.
  3. Query EpiRR for experiment details (caching responses to epirr_experiment_data.json) and normalize returned keys to match the EpiATLAS schema.
  4. Discover and read local experiment metadata files (EGAX XML/JSON, DEEP TSVs, CREST Excel, EpiHK files, etc.), flattening nested JSON/XML into tables and adding a file/PRIMARY_ID provenance column.
  5. Coerce and normalize assay and experiment type labels (e.g., map Bisulfite-Seq → WGBS, normalize RNA-Seq subtypes) so joins are consistent.
  6. Merge EpiRR-derived records with local files using PRIMARY_ID and experiment_type where available; fall back to merging by epirr_id_without_version + assay_type when experiment_type is missing.
  7. Enrich merged rows with provenance-specific lookups:
  • fetch per-experiment antibody metadata from ENCODE / Roadmap (cached to encode_roadmap_antibody_dicts.json) and join on PRIMARY_ID;
  • ingest local DEEP, CREST, EpiHK, GIS and manual CEMT metadata and coalesce into the master table.
  8. Resolve duplicated and near-duplicate column names (case and minor spelling variants): group and coalesce known groups, and use fuzzy matching (Levenshtein distance) to find candidates for manual review.
  9. Coalesce column groups so each logical metadata field has a single canonical column; enforce that at most one non-null value exists per canonical group per row.
  10. Collapse and aggregate enrichment columns into list-valued fields where multiple provenance sources supply different values; drop placeholder/null tokens and normalize empty lists to null.
  11. Validate that the enriched dataset still maps 1:1 to the input EpiATLAS experiments (no accidental loss or uncontrolled duplication), compute per-column completeness statistics, and produce diagnostic plots (heatmaps of completeness by centre/assay).
  12. Export the final deliverables.
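
The basic-assumption checks from step 1 can be sketched without Polars, using only the standard library. The column names below are assumptions for illustration, not the actual schema:

```python
import csv
import io

# Hypothetical key columns; the real schema lives in EpiATLAS_experiment_metadata.csv.
REQUIRED_KEYS = {"epirr_id", "experiment_type"}

def validate_metadata(csv_text: str) -> list[dict]:
    """Check step 1's basic assumptions: expected key columns are present
    and no column is entirely null/empty."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("empty metadata table")
    missing = REQUIRED_KEYS - rows[0].keys()
    if missing:
        raise ValueError(f"missing key columns: {sorted(missing)}")
    all_null = [col for col in rows[0]
                if all(not (row[col] or "").strip() for row in rows)]
    if all_null:
        raise ValueError(f"entirely-null columns: {all_null}")
    return rows
```

In the notebook the same checks run against a Polars DataFrame; the logic is identical.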
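
The two-stage join in step 6 can be illustrated with plain dictionaries. Only `PRIMARY_ID`, `experiment_type`, `epirr_id_without_version`, and `assay_type` come from the pipeline; every other key here is hypothetical:

```python
def merge_records(epirr_rows: list[dict], local_rows: list[dict]) -> list[dict]:
    """Step 6 sketch: join on (PRIMARY_ID, experiment_type) when the
    experiment type is known; otherwise fall back to
    (epirr_id_without_version, assay_type). EpiRR values win on conflicts."""
    by_primary = {(r.get("PRIMARY_ID"), r.get("experiment_type")): r
                  for r in local_rows}
    by_fallback = {(r.get("epirr_id_without_version"), r.get("assay_type")): r
                   for r in local_rows}
    merged = []
    for row in epirr_rows:
        if row.get("experiment_type"):
            match = by_primary.get((row.get("PRIMARY_ID"), row["experiment_type"]))
        else:
            match = by_fallback.get((row.get("epirr_id_without_version"),
                                     row.get("assay_type")))
        # unmatched rows pass through unchanged
        merged.append({**(match or {}), **row})
    return merged
```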
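
The Levenshtein-based candidate search in step 8 might look like the following; the edit-distance threshold is an assumption:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def near_duplicate_columns(columns: list[str], max_dist: int = 2) -> list[tuple]:
    """Flag case-insensitive column-name pairs within a small edit distance
    as candidates for manual review (threshold is an assumption)."""
    pairs = []
    cols = sorted(columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if levenshtein(a.lower(), b.lower()) <= max_dist:
                pairs.append((a, b))
    return pairs
```

Flagged pairs are not merged automatically; they feed the manual-review step before coalescing.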
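
A minimal sketch of steps 9 and 10, coalescing canonical column groups and aggregating list-valued enrichment fields. The placeholder-token set is an assumption:

```python
# Assumed placeholder tokens treated as null; the notebook's actual list may differ.
NULL_TOKENS = {"", "na", "n/a", "none", "null", "-"}

def coalesce_group(row: dict, group: list[str]):
    """Step 9: collapse a group of variant columns into one canonical value,
    enforcing at most one non-null value per group per row."""
    values = {row[col] for col in group
              if col in row and str(row[col]).strip().lower() not in NULL_TOKENS}
    if len(values) > 1:
        raise ValueError(f"conflicting values for {group}: {values}")
    return values.pop() if values else None

def aggregate_list(values: list):
    """Step 10: keep distinct non-placeholder values; normalize empty to None."""
    kept = sorted({v for v in values
                   if str(v).strip().lower() not in NULL_TOKENS})
    return kept or None
```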

Notes:

  • The notebook is iterative — cached JSONs and locally persisted intermediate CSVs speed repeated development.
  • To add a new provenance source, write a loader in utils.py, add a small mapping to the merge logic, and include tests for uniqueness/coalescing.
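
The cache-then-reuse pattern described above can be sketched as follows; `cached_fetch` is a hypothetical helper, not the notebook's actual API:

```python
import json
import pathlib

def cached_fetch(cache_path: str, fetch):
    """Return the cached JSON payload if present; otherwise call fetch(),
    persist the result, and return it. Mirrors how the notebook reuses
    epirr_experiment_data.json across development runs (sketch only)."""
    path = pathlib.Path(cache_path)
    if path.exists():
        return json.loads(path.read_text())
    data = fetch()
    path.write_text(json.dumps(data))
    return data
```

Deleting the cache file forces a fresh fetch on the next run.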

Notes on caches and development

  • The first run queries external APIs (ENCODE, EpiRR). To speed up iterative development, the notebook creates and reuses epirr_experiment_data.json and encode_roadmap_antibody_dicts.json.
  • utils.py contains the HTTP retry logic and the TSV / JSON/XML flattening utilities. If you add new provenance sources, extend utils.py and the notebook mapping tables.
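
A minimal retry helper in the spirit of the utils.py HTTP logic might look like this; the name and signature are assumptions, not the actual utils.py API:

```python
import time

def with_retries(fn, attempts: int = 3, backoff: float = 0.5):
    """Call fn() up to `attempts` times, sleeping with exponential backoff
    between failures; re-raise the last error if every attempt fails."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # real code should catch narrower errors
            last_error = exc
            time.sleep(backoff * (2 ** attempt))
    raise last_error
```

Production code would also distinguish retryable errors (timeouts, 5xx) from permanent ones (4xx).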

Contact

Questions or bug reports: please open an issue.


(README created with generative AI and corrected by Quirin Manz.)
