assert ranks for MIxS slots (in the context of classes) #901

Open
turbomam opened this issue Feb 19, 2025 · 3 comments


@turbomam

turbomam commented Feb 19, 2025

I generated some support files, especially

in https://github.com/microbiomedata/external-metadata-awareness/tree/main/notebooks

They probably won't stay there indefinitely

mixs_slot_rank_template.tsv is structured with the assumption that ranks would be asserted in slot_usages. @jfy133 has also proposed a mechanism by which ranks would have meaningful semantics for the millions, thousands, etc places. Presumably that would make more sense to assert on slots globally, not in slot_usage.

Or just start with the slot ordering that NCBI or EBI use in their templates? See https://www.ncbi.nlm.nih.gov/biosample/docs/packages/?format=xml

@turbomam

turbomam commented Feb 19, 2025

rank is a metaslot where the implementation is the responsibility of the client application (like DH or LinkML documentation pages)

grouping of "like" slots

  • single axis of similarity?

other supporting LinkML features:

  • subsets
  • slot_groups (used by DH!)

@turbomam

turbomam commented Feb 19, 2025

The NMDC submission-schema uses ranks and slot_groups

  analysis_type:
    name: analysis_type
    description: Select all the data types associated or available for this biosample
    title: analysis/data type
    examples:
    - value: metagenomics; metabolomics; metaproteomics
    from_schema: https://example.com/nmdc_submission_schema
    see_also:
    - MIxS:investigation_type
    rank: 3
    domain_of:
    - Biosample
    slot_group: Sample ID
    range: AnalysisTypeEnum
    recommended: true
    multivalued: true
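
As a rough sketch of how a client application might consume these two metaslots, the snippet below groups slots by `slot_group` and orders them by `rank` within each group. It works on plain dictionaries rather than any LinkML API. Only `analysis_type` comes from the schema excerpt above; the other slot entries and the `order_slots` helper are made up for illustration.

```python
from collections import defaultdict

# Only analysis_type reflects the excerpt above; the other entries are
# hypothetical placeholders so the ordering has something to sort.
slots = {
    "analysis_type": {"rank": 3, "slot_group": "Sample ID"},
    "example_slot_a": {"rank": 1, "slot_group": "Sample ID"},
    "example_slot_b": {"rank": 2, "slot_group": "Other group"},
    "example_slot_c": {"slot_group": "Other group"},  # no rank asserted
}


def order_slots(slot_defs: dict) -> dict:
    """Return {slot_group: [slot names ordered by rank]}.

    Slots without a rank sort after ranked slots; ties break by name.
    """
    grouped = defaultdict(list)
    for name, meta in slot_defs.items():
        group = meta.get("slot_group", "ungrouped")
        rank = meta.get("rank", float("inf"))
        grouped[group].append((rank, name))
    return {group: [name for _, name in sorted(items)]
            for group, items in grouped.items()}


ordered = order_slots(slots)
```

How unranked slots are placed is a client decision; putting them last is just one option.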

@Woolly-at-EBI

Below is a snippet from ena_checklists_simplified.json to show the structure. The entire file is attached.
(This is data extracted from an internal ENA XML file that contains all the checklists, field_groups, fields, values, etc. Publicly, the individual XML files can be seen at
https://www.ebi.ac.uk/ena/browser/checklists and downloaded individually, e.g.
curl -s 'https://www.ebi.ac.uk/ena/browser/api/xml/ERC000012?download=true')

In the ena_checklists_simplified.json the structure is:
"checklists"/{checklist_id}/"description"
"checklists"/{checklist_id}/"name"
"checklists"/{checklist_id}/"checklists_source" # I made this one up to make it easy for people to parse out just the "GSC MIxS" checklists.
"checklists"/{checklist_id}/"ordered_field" # a list of the field_names ordered as they appear in our checklists (to my knowledge!), which was the original driver of this mini-task
"checklists"/{checklist_id}/"field"/{field_name}/"requirement" # mandatory|recommended|optional
"checklists"/{checklist_id}/"field"/{field_name}/"field_group"
"field_group" # list of all field groups that are used at least once. I talked about them on 25th Feb., but only shared a subset during the meeting; this is all of them. Some are a little artificial, e.g. for now-obsolete technical reasons we could only have about 100 field_names per field_group, so some had to effectively be split into subsets. Cleaning this up is on my to-do list for later this year.

Below is a snippet of the JSON file to show some real data, and all the field_group names (across all the checklists)
{
"checklists": {
"ERC000038": {
"description": "Shellfish contextual information associated with molecular data. The checklist has been developed in collaboration with EMBRIC Project partners.",
"name": "ENA Shellfish Checklist",
"checklists_source": "ENA",
"field": {
"Latitude Start": {
"requirement": "mandatory",
"field_group": "Marine Event"
},
"Longitude Start": {
"requirement": "mandatory",
"field_group": "Marine Event"
},
"Protocol Label": {
"requirement": "mandatory",
"field_group": "Marine Sample"
},
...
"ordered_field": [
"Latitude Start",
"Longitude Start",
"Protocol Label",
....
},
"field_group": [
"Associated host information",
"Collection event information",
"Environmental information",
"General collection event information",
"Host association",
"Host inoculation",
"Human surveillance data",
"Infraspecies information",
"Marine Environment",
"Marine Event",
"Marine Sample",
"Marine Sampling",
"Organism characteristics",
"Organism characteristics: aquatic specific",
"Organism characteristics: ecosystem",
"Organism characteristics: genetic",
"Part and developmental stage of organism",
"Pathogen testing",
"Pointer to physical material",
"Reference",
"Serology detection",
"Virus isolate information",
"bioreactor",
"building related",
"concentration measurement",
"default",
"demography",
"experimental factor and block",
"food and agriculture",
"food and agriculture: farm",
"growth medium",
"host description",
"host details",
"host disorder",
"internal environment",
"investigation and results",
"investigation experiment design",
"link",
"local environment conditions",
"local environment conditions imposed",
"local environment conditions: soil",
"local environment history",
"non-sample terms",
"non-sample terms: study or project",
"other",
"sample collection",
"sample collection: core sample properties",
"sample collection: methods, storage and transport",
"sample collection: site related",
"sample processing",
"treatment",
"unusual properties"
]

ena_checklists_simplified.json
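
For anyone wanting to turn this file into candidate ranks, here is a minimal sketch that follows the structure described above. It derives a 1-based rank from each field's position in `ordered_field` and lists fields per `field_group` in that same order. The inline `sample` dictionary only mirrors the three fields shown in the snippet; the helper names are hypothetical, and using list position as the rank is an assumption rather than anything ENA defines.

```python
# A tiny stand-in for the attached file, limited to the fields shown in the
# snippet above. With the real file: data = json.load(open(path)).
sample = {
    "checklists": {
        "ERC000038": {
            "name": "ENA Shellfish Checklist",
            "checklists_source": "ENA",
            "field": {
                "Latitude Start": {"requirement": "mandatory", "field_group": "Marine Event"},
                "Longitude Start": {"requirement": "mandatory", "field_group": "Marine Event"},
                "Protocol Label": {"requirement": "mandatory", "field_group": "Marine Sample"},
            },
            "ordered_field": ["Latitude Start", "Longitude Start", "Protocol Label"],
        }
    }
}


def checklists_from(data: dict, source: str) -> dict:
    """Keep only checklists whose checklists_source matches, e.g. 'GSC MIxS'."""
    return {cid: c for cid, c in data["checklists"].items()
            if c.get("checklists_source") == source}


def field_ranks(checklist: dict) -> dict:
    """Assumed rank: 1-based position of each field in ordered_field."""
    return {name: i for i, name in enumerate(checklist["ordered_field"], start=1)}


def fields_by_group(checklist: dict) -> dict:
    """Map field_group -> field names, preserving ordered_field order."""
    groups: dict = {}
    for name in checklist["ordered_field"]:
        group = checklist["field"][name]["field_group"]
        groups.setdefault(group, []).append(name)
    return groups


shellfish = sample["checklists"]["ERC000038"]
ranks = field_ranks(shellfish)
groups = fields_by_group(shellfish)
```

Calling `checklists_from(data, "GSC MIxS")` on the full file would restrict the same processing to the MIxS-derived checklists.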
