document current linkml2schemasheets-template setup and output #900

Open
turbomam opened this issue Feb 18, 2025 · 2 comments
turbomam commented Feb 18, 2025

There have been questions about what metaslots or "attributes" should be asserted on MIxS slots/terms. The linkml2schemasheets-template command line tool at least helps discover what metaslots have been used in the past.

For a list of all metaslots that could be used on a LinkML slot, see

@turbomam turbomam self-assigned this Feb 18, 2025

turbomam commented Feb 18, 2025

@cmungall and I discussed this briefly. He prefers this kind of discovery, as opposed to a statistical linter (which we had also considered for MIxS).


Schema Sheet Reports

Report Generation

The schema sheet reports are generated using the linkml2schemasheets-template command from the schema-sheets Python package. Two targets in the project.Makefile run it:

  • assets/mixs_derived_class_term_schemasheet.tsv
  • assets/mixs-schemasheets-concise.tsv

Perhaps we should remove one of those targets, or clarify why it's helpful to have both.

Debug and Log Output

The report generation process creates TSV output, a log file, and a debug file.

Log File

A few types of logging messages are generated:

From the schemasheets.generate_populate logger:

  • "Redefining" messages noting that is_a and mixins can have both class and slot ranges

From the root logger:

  • Imports processing
  • Skipping of:
    • Enums
    • PermissibleValues
    • Prefixes
    • Subsets
    • Types
  • Various "TODO" items (of questionable actionability)

Debug File (YAML)

  • Blacklisted metaslots: attributes, instantiates, name, slot_usage, slots
  • Tracking of skipped items:
    • blacklist_skipped: None reported
    • untemplateable_skipped: None reported
    • untemplateables: No entries

Report Details

  • Both reports are generated using the same configuration but save different log and debug files
  • File timestamps:
    • mixs_derived_class_term_schemasheet.tsv: Feb 13, 2024 (637,581 bytes)
    • mixs-schemasheets-concise-global-slots.tsv: Oct 8, 2024 (264,831 bytes)

Report Structure

mixs_derived_class_term_schemasheet.tsv

  • Contains four header rows, following the schemasheets convention
  • The first header row is usually the most useful one

mixs-schemasheets-concise-global-slots.tsv

  • Column 1: Slot names
  • Column 2: Classes using slot with slot_usage modifications
  • Column 3: Pipe-delimited examples
  • Column 4: Structured pattern
  • Columns 5 and 6: Annotation values
  • Columns 7+: Metaslots
    • whose range can be expressed as a string or pipe-delimited list of strings

Note: This file has been pre-processed to remove class-centric rows and class-centric columns for easier slot analysis.
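
For anyone who wants to analyze the concise report programmatically, here is a rough sketch of reading it with Python's csv module and splitting the pipe-delimited cells. The column headers and sample rows below are hypothetical, not copied from the real file, and `MULTIVALUED_COLUMNS` is an assumed name for the columns that hold pipe-delimited lists.

```python
import csv
import io

# Hypothetical excerpt; the real report's headers and values may differ.
SAMPLE_TSV = (
    "slot\tclasses\texamples\tstructured_pattern\n"
    "depth\tSoil|Water\t10 m|0.5 m\t{float} {unit}\n"
    "lat_lon\t\t50.58 -6.40\t{lat} {lon}\n"
)

# Assumption: these columns hold pipe-delimited lists.
MULTIVALUED_COLUMNS = {"classes", "examples"}


def read_slot_rows(handle):
    """Yield one dict per slot row, with multivalued cells split into lists."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        parsed = {}
        for column, value in row.items():
            if column in MULTIVALUED_COLUMNS:
                # Empty cells become empty lists rather than [""].
                parsed[column] = [part for part in value.split("|") if part]
            else:
                parsed[column] = value
        yield parsed


rows = list(read_slot_rows(io.StringIO(SAMPLE_TSV)))
print(rows[0]["examples"])
```

To run this against the real file, replace the `io.StringIO` handle with `open(path, newline="")` and adjust the column names to match the first header row.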

Metaslot Usage

Metaslots Used for MIxS Terms/Slots

  1. aliases
  2. annotations
    • These are key-value pairs. There are no constraints on what keys can be used, which makes annotations risky in terms of keeping the schema normalized.
    • By convention, the MIxS schema only uses the two annotation keys described below.
  3. comments
  4. description
    • We do not currently have a style guide for descriptions (or any other metaslot!)
    • The descriptions are very heterogeneous and frequently contain material that should really go in examples, pattern, etc.
  5. examples
    • Note that multiple examples can (and probably should) be provided.
    • Nothing automatically checks whether these examples actually satisfy the constraints of the slot.
    • LinkML provides a linkml-run-examples command that can check the schema against a directory of valid examples and a directory of invalid examples. That has been very useful for managing the nmdc-schema and is recommended for MIxS.
  6. in_subset
  7. keywords
    • @turbomam asserted keywords via natural language processing when the MIxS 6.2 schema was inferred from a spreadsheet.
    • There's no mechanism right now for applying keywords from a standardized vocabulary to new MIxS slots/terms.
  8. multivalued
  9. pattern
  10. range
  11. recommended
  12. required
  13. slot_uri
  14. structured_pattern
  15. title
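
Since nothing currently checks examples against slot constraints, here is a minimal sketch of what such a check could look like for the pattern metaslot alone. The slot definitions are hypothetical stand-ins shaped like a LinkML `slots` section, the function name `failing_examples` is made up for illustration, and the sketch assumes unanchored (search-style) regex matching. It is not a substitute for linkml-run-examples, which validates whole data files.

```python
import re

# Hypothetical slot definitions, shaped like the `slots` section of a schema.
slots = {
    "samp_size": {
        "pattern": r"^\d+(\.\d+)? \S+$",
        "examples": [{"value": "5 liter"}, {"value": "five liters"}],
    },
    "project_name": {
        # No pattern asserted, so there is nothing to check against.
        "examples": [{"value": "Forest soil metagenome"}],
    },
}


def failing_examples(slots):
    """Return {slot_name: [example values that do not match the slot's pattern]}."""
    failures = {}
    for name, definition in slots.items():
        pattern = definition.get("pattern")
        if pattern is None:
            continue
        bad = [
            example["value"]
            for example in definition.get("examples", [])
            if re.search(pattern, example["value"]) is None
        ]
        if bad:
            failures[name] = bad
    return failures


print(failing_examples(slots))
```

A check like this would only cover pattern; range, enum membership and multivalued constraints would need a real validator.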

Annotation Keys Used for MIxS Terms/Slots

Via the annotations metaslot

  • Expected_value
    • These annotations should be reformulated as ranges, patterns, structured_patterns, comments etc.
  • Preferred_unit
    • MIxS has traditionally allowed or even encouraged multiple different units for each slot. LinkML does provide a good mechanism for asserting a single unit for slots, but not multiple different units. For now, this will probably have to remain as an annotation, but this could be an area of future normalization.

Metaslots only used by classes in MIxS

  • class_uri
  • is_a
  • mixin
  • mixins
  • owner
  • tree_root

Notes on Specific Metaslots

  • aliases column may be present in some reports but should never be asserted (different from aliases metaslot)
  • domain_of and owner: Inferred (for classes) by the LinkML schema loader, not asserted
  • from_schema: Inferred by the LinkML schema loader, not asserted
  • inlined and inlined_as_list are not applicable to Extensions or Checklists. They are only used for slots like soil_data which are only used by the MixsCompliantData class, which is only present in the schema to support the data validation infrastructure.
  • Many terms currently use string_serialization, but we shouldn't continue that practice. Use pattern or structured_pattern instead.
  • Never assert both a pattern and a structured_pattern. The curly-bracket enclosed elements of structured_patterns must be defined in the settings section of the schema. structured_patterns are currently expanded into regular expression patterns with the gen-linkml tool.
    • We probably shouldn't include expanded patterns in the source of truth, like src/mixs/schema/mixs.yaml, since that might give the impression that editing those patterns would be productive.
      • There are probably a few directly-asserted patterns in the schema now. Is that confusing?
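
To illustrate why editing an expanded pattern by hand isn't productive, here is a simplified sketch of how curly-bracket elements could be substituted from settings. This is not LinkML's actual implementation, just the general idea; the settings values and the function name `expand_structured_pattern` are assumptions for illustration.

```python
import re

# Hypothetical entries in the spirit of a schema's `settings` section.
settings = {
    "float": r"[-+]?[0-9]*\.?[0-9]+",
    "unit": r"\S+",
}


def expand_structured_pattern(syntax, settings):
    """Replace each {name} in `syntax` with the regex stored under settings[name]."""

    def substitute(match):
        name = match.group(1)
        if name not in settings:
            raise KeyError(f"{name!r} is not defined in settings")
        return settings[name]

    # A function replacement is inserted literally, so backslashes survive.
    return re.sub(r"\{(\w+)\}", substitute, syntax)


expanded = expand_structured_pattern("^{float} {unit}$", settings)
print(expanded)
```

Because the expanded regex is derived entirely from the structured pattern plus settings, any manual edit to it would be overwritten the next time the expansion is run.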

Metaslot Usage Statistics

Metaslot               Count
slot_uri               770
title                  770
description            767
keywords               661
examples values        574
range                  551
pattern                273
structured_pattern     271
Preferred_unit         236
Expected_value         226
string_serialization   219
multivalued            149
in_subset              98
recommended            55
required               35
comments               12
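
Counts like these can be reproduced from any dict-shaped view of the slots. The sketch below is one way to do it, using hypothetical slot definitions rather than the real MIxS schema; `metaslot_usage` is an illustrative name, and treating empty strings and empty lists as "not populated" is an assumption.

```python
from collections import Counter

# Hypothetical slot definitions standing in for the schema's `slots` section.
slots = {
    "depth": {"title": "depth", "description": "Depth of the sample", "required": True},
    "elev": {"title": "elevation", "description": "", "range": "string"},
}


def metaslot_usage(slots):
    """Count, for each metaslot, how many slots have a non-empty value for it."""
    usage = Counter()
    for definition in slots.values():
        for metaslot, value in definition.items():
            # Assumption: None and empty containers/strings count as unpopulated.
            if value not in (None, "", [], {}):
                usage[metaslot] += 1
    return usage


for metaslot, count in metaslot_usage(slots).most_common():
    print(metaslot, count)
```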

Distribution of Populated Metaslots

Number of metaslots   Count of slots
8                     228
6                     163
7                     161
9                     132
5                     61
10                    16
11                    7
4                     2
Examples

  • rel_air_humidity: Example of a slot with 11 of the 16 fields (14 metaslots plus 2 annotation keys) populated
  • gestation_state: Example of a slot with only 4 metaslots populated


turbomam commented Feb 19, 2025

Finding read-only metaslots:

yq '.slots | to_entries | map(select(.value.readonly) | .key)' meta.yaml | sort
  • definition_uri
  • domain_of
  • from_schema
  • generation_date
  • imported_from
  • is_usage_slot
  • metamodel_version
  • owner
  • source_file
  • source_file_date
  • source_file_size
  • usage_slot_name

https://github.com/linkml/linkml-model/blob/main/linkml_model/model/schema/meta.yaml
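
For anyone without yq, the same filter can be sketched in Python. The dict below is a hypothetical stand-in for the `slots` section of meta.yaml (the readonly values are invented placeholder text); loading the real file would additionally need a YAML parser such as PyYAML's `yaml.safe_load`.

```python
# Hypothetical stand-in for the `slots` section of meta.yaml.
meta_slots = {
    "owner": {"readonly": "placeholder explanation"},
    "description": {"range": "string"},
    "from_schema": {"readonly": "placeholder explanation"},
}


def readonly_slot_names(slots):
    """Return the sorted names of slots that have a truthy `readonly` value."""
    return sorted(
        name for name, definition in slots.items() if definition.get("readonly")
    )


print(readonly_slot_names(meta_slots))
```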
