Enum derivations pipeline with auto-generated specs by Sigfried · Pull Request #291 · linkml/dm-bip

Sigfried · 2026-03-26T15:40:37Z

Summary

Auto-generate enum_derivations specs from existing value_mappings specs and inferred source enums via new generate_enum_specs.py
New generate-enum-specs pipeline step wired into pipeline.Makefile (opt-in via DM_ENUM_DERIVATIONS=true)
Self-contained toy_data_w_enums/ test directory with configs for both original and enum pipelines
Full developer reference at docs/pipeline-steps.md comparing both pipelines

Local fork dependencies

Requires unreleased fixes in schema-automator, linkml, and linkml-map for int/string enum handling. Setup: bash scripts/setup-enum-forks.sh && uv sync. See Local fork changes for details.

Merge strategy

This branch includes [tool.uv.sources] pointing to local forks. When upstream releases incorporate the fixes:

Remove [tool.uv.sources] from pyproject.toml
Pin release versions in [project.dependencies]
Merge to main

Test plan

Run make pipeline CONFIG=toy_data_w_enums/config-orig-valmaps.mk — original pipeline still works
Run make pipeline CONFIG=toy_data_w_enums/config-enums.mk — enum pipeline produces correct output
Compare mapped output between both pipelines (should be identical except None→null for unmapped enum values)
Review generated specs in output/ToyEnums/enums/enum-specs/ for correctness
Review generated target schema in output/ToyEnums/enums/enum-target-schema.yaml
Seven Bridges enclave setup (Corey)
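
The output comparison in the test plan could be sketched roughly as below. This is an illustration only, not the project's test code: `normalize` and `rows_equivalent` are hypothetical helper names, the sample rows are made up, and it assumes the one expected difference is the literal string `None` in one output versus a real null in the other.

```python
def normalize(value):
    """Treat the literal string "None" and a real null as the same value."""
    return None if value in (None, "None") else value


def rows_equivalent(orig_rows, enum_rows):
    """Compare two lists of row dicts, ignoring the None-vs-null difference."""
    if len(orig_rows) != len(enum_rows):
        return False
    for a, b in zip(orig_rows, enum_rows):
        if a.keys() != b.keys():
            return False
        if any(normalize(a[k]) != normalize(b[k]) for k in a):
            return False
    return True


# Made-up sample rows standing in for the two pipelines' mapped output.
orig = [{"id": "1", "sex": "OMOP:8507"}, {"id": "2", "sex": "None"}]
enum = [{"id": "1", "sex": "OMOP:8507"}, {"id": "2", "sex": None}]
print(rows_equivalent(orig, enum))
```

Any other difference between the two outputs would make the check fail, which is the behavior the test plan asks for.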

🤖 Generated with Claude Code

Captures our understanding of the task, current state of enum handling in the pipeline, and open questions for the team before implementation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Explain default_range: string, expand enum_derivations key features with plain-language descriptions, clarify where source enums come from, and simplify the target schema question. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reframe task as exploratory (test if LinkML-Map handles enum derivations), explain why pre_cleaned path is the right test case (human-readable values vs coded integers), and simplify plan into concrete steps. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Created minimal test in toy_data/enum_test/ with enum-enabled source schema, target schema with enums, and a spec using enum_derivations. LinkML-Map correctly maps Male→OMOP:8507, Female→OMOP:8532. Key finding: every source enum needs a derivation (use mirror_source: true for passthrough). Updated planning doc with results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
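
To make the finding above concrete, here is a plain-Python sketch of what an enum derivation does conceptually. It does not call LinkML-Map. The key names `populated_from` and `permissible_value_derivations` reflect my understanding of the LinkML-Map data model and should be treated as assumptions, and `derive_enum_value`, `source_sex_enum`, and `source_race_enum` are hypothetical names used only for illustration.

```python
def derive_enum_value(source_value, derivation):
    """Apply one enum derivation (represented as a plain dict) to a source value."""
    if derivation.get("mirror_source"):
        # Passthrough: the target enum mirrors the source enum unchanged.
        return source_value
    pv_derivations = derivation.get("permissible_value_derivations", {})
    for target_pv, pv_derivation in pv_derivations.items():
        if pv_derivation.get("populated_from") == source_value:
            return target_pv
    return None  # no derivation for this source value


# Mapped enum: Male -> OMOP:8507, Female -> OMOP:8532 (values from the commit message).
sex_derivation = {
    "populated_from": "source_sex_enum",  # hypothetical source enum name
    "permissible_value_derivations": {
        "OMOP:8507": {"populated_from": "Male"},
        "OMOP:8532": {"populated_from": "Female"},
    },
}

# Passthrough enum: every source enum needs a derivation, even if it changes nothing.
passthrough = {"populated_from": "source_race_enum", "mirror_source": True}

print(derive_enum_value("Male", sex_derivation))
print(derive_enum_value("Asian", passthrough))
```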

Documents current pipeline and future enum derivations pipeline in table format with linked files, manual/curation steps, and notes. Includes instructions at top for completing after context refresh. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Change generate_toy_data.py smoking_status from mixed int/string values ([1, 2, "Former", "Never", "Unknown"]) to all-text values (["Current", "Former", "Never", "Unknown"]). This fixes linkml-validate failures where bare numeric TSV values were parsed as integers, not matching string enum permissible values. Flesh out docs/pipeline-steps.md: separate In/Out on distinct rows, add line-specific Makefile links, add real data pointers to RTI NHLBI-BDC-DMC-HV repo, expand future pipeline table with all data columns, and document the root cause and fix for the validation error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
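
The validation failure described above can be reproduced in miniature. In this sketch `parse_numeric` is a hypothetical stand-in for schema-unaware numeric coercion, not the actual loader code, and the permissible values are the mixed set quoted in the commit message.

```python
def parse_numeric(cell):
    """Schema-unaware coercion: turn numeric-looking strings into numbers."""
    try:
        return int(cell)
    except ValueError:
        try:
            return float(cell)
        except ValueError:
            return cell


# String permissible values, as an enum with text PVs would define them.
permissible_values = {"1", "2", "Former", "Never", "Unknown"}

cells = ["1", "Former", "2"]  # raw TSV cells
parsed = [parse_numeric(c) for c in cells]

# The integers 1 and 2 no longer match the string PVs "1" and "2".
failures = [v for v in parsed if v not in permissible_values]
print(failures)  # [1, 2]
```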

Corey confirmed the mixed int/string smoking_status values (1, 2, "Former", "Never", "Unknown") are intentional, matching real dbGaP data patterns. A schema-automator fix for mixed types is in progress. Update docs to document this as a known issue awaiting upstream fix rather than a data generation bug. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Re-ran ToyFromRaw pipeline after reverting generate_toy_data.py to restore output files to their pre-change state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

bypassed .gitignore to add output/EnumTest for dev testing

Move target_sex_enum into toy_data/target-schema.yaml (shared) and delete toy_data/enum_test/target-schema.yaml. Update enum_test config to point at the shared schema. EnumTest pipeline verified working. Simplify docs/pipeline-steps.md to focus on toy data only — removed pre-cleaned and real data columns per current scope. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- pipeline-steps.md: Add copy-pasteable commands for every step, document enum test pipeline using raw data path, document --infer-enum-from-integers flag, document int/string type mismatch blocker
- issue-211-planning.md: Replace stale "Why pre_cleaned" section (we now use raw data), document completed work (enum derivations, --infer-enum-from-integers, pipeline wiring), add int/string blocker with question for Corey, update remaining questions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add "How map_data.py works" section with ASCII flowchart showing the full transform pipeline from schema loading through TsvLoader to ObjectTransformer.map_object and chunked output
- Expand int/string blocker section with root cause (_parse_numeric in TSV loader), code references, why integer PVs can't help, and link to linkml-int-enum-repro/ minimal reproduction
- Currently broken: integer-coded enums fail both validation and mapping due to _parse_numeric converting all numeric TSV values to Python ints before schema-aware code runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Explains the bug, expected vs actual output, root cause (_parse_numeric in TSV loader), and proposed fix (make the loader schema-aware). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
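
A schema-aware loader, as proposed in that README, might look something like the following sketch. It is not the fix that landed in the forks: `coerce_row`, the `slot_ranges` dict, and the column names are hypothetical, and a real implementation would read the ranges from the LinkML schema rather than from a hand-written dict.

```python
def coerce_row(row, slot_ranges):
    """Coerce TSV cells according to each column's declared range instead of guessing."""
    out = {}
    for column, cell in row.items():
        rng = slot_ranges.get(column, "string")
        if cell == "":
            out[column] = None
        elif rng == "integer":
            out[column] = int(cell)
        elif rng == "float":
            out[column] = float(cell)
        else:
            # Strings and enum-ranged slots keep the raw text, so "1" stays "1".
            out[column] = cell
    return out


# Hypothetical ranges: one integer slot and one enum-ranged slot.
slot_ranges = {"age": "integer", "smoking_status": "smoking_status_enum"}
row = {"age": "42", "smoking_status": "1"}
print(coerce_row(row, slot_ranges))
```

The point of the design is that coercion happens per column with knowledge of the declared range, so a coded enum value is never turned into a Python int before validation or mapping sees it.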

Copies the from_raw pipeline setup (raw data, specs, target schema, config) into a standalone directory. Currently uses value_mappings (identical to from_raw); enum_derivations changes will be layered on next. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Uses editable installs of local forks of schema-automator, linkml, and linkml-map to test unreleased features (--infer-enum-from-integers, int/string enum fixes). Not suitable for merging to main until upstream releases incorporate these changes.

Changes:
- pyproject.toml/uv.lock: editable deps pointing at local forks
- .gitignore: output/ un-ignored, local clone dirs added
- pipeline.Makefile: DM_INFER_ENUM_FROM_INTEGERS variable
- map_data.py: DataLoader accepts schema_path for type coercion
- toy_data/enum_test: updated config and specs for enum derivations
- new-pipeline-plan.md: plan for generate_enum_specs.py tool

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Consolidate issue-211-planning.md and new-pipeline-plan.md into a single document. Adds local fork commit inventory, enum_derivations YAML syntax reference, expanded passthrough/unreferenced enum handling, and comments strategy. Removes resolved questions and narrative. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…anning pipeline-steps.md: Restructure around toy_data_w_enums with nested-list format comparing original (value_mappings) and enum-focused pipelines. Add generate-enum-specs as step 2a, inline local fork notes at relevant steps, remove obsolete BLOCKER notes, add config examples. issue-211-planning.md: Replace notes-to-claude block and duplicated local fork section with pointer to pipeline-steps.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…d sections Renumber steps 1-5, name after Makefile targets. Add overview table comparing value_mappings and enum_derivations pipelines. Each step gets formatted CLI commands, parameter/config tables, and input/output docs. Rewrite map_data.py algorithm with SchemaView, blocks, entity discovery, and transformation operations explained with code snippets. Link generate_enum_specs algorithm to issue-211-planning.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Create config-orig-valmaps.mk (original pipeline) and config-enums.mk (enum inference + derivation generation). Separate output dirs to avoid collisions. Rename target-schema.yaml to target-schema-orig-valmaps.yaml. Revert incorrect PyCharm renames in tests/ and toy_data/ that don't use toy_data_w_enums paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New script reads source schema (with inferred enums) and existing specs (with value_mappings), generates new specs with enum_derivations and a target schema with enum definitions. Handles deduplication, disambiguation, passthrough enums, unreferenced enums, and nested object_derivations. Pipeline wiring: generate-enum-specs Makefile target runs after schema-create and before map-data when DM_ENUM_DERIVATIONS is set. Mapping step uses generated specs and target schema automatically. Verified: full enum pipeline produces identical output to value_mappings pipeline (except expected None→null for unmapped enum values). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
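
The core conversion that commit describes, turning a slot's value_mappings into an enum derivation, can be sketched as below. This is not generate_enum_specs.py: it leaves out deduplication, disambiguation, passthrough and unreferenced enums, and nested object_derivations. The function name and `source_sex_enum` are hypothetical, and the output keys (`populated_from`, `permissible_value_derivations`, `sources`) are my assumption about the LinkML-Map spec layout.

```python
def value_mappings_to_enum_derivation(source_enum, value_mappings):
    """Build an enum-derivation dict from a {source_value: target_value} mapping."""
    sources_by_target = {}
    for source_value, target_value in value_mappings.items():
        sources_by_target.setdefault(target_value, []).append(str(source_value))

    pv_derivations = {}
    for target, sources in sources_by_target.items():
        if len(sources) == 1:
            pv_derivations[target] = {"populated_from": sources[0]}
        else:
            # Several source values collapse onto one target value.
            pv_derivations[target] = {"sources": sources}

    return {
        "populated_from": source_enum,
        "permissible_value_derivations": pv_derivations,
    }


derivation = value_mappings_to_enum_derivation(
    "source_sex_enum",  # hypothetical source enum name
    {"Male": "OMOP:8507", "Female": "OMOP:8532"},
)
print(derivation)
```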

Fill in step 3 enum_derivations column with links to input files and generated outputs. Add source file links on CLI lines for prepare_input, generate_enum_specs, and map_data. Fix typos (pipline, tranform). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove Step 0 section, merge config descriptions into the intro with both make commands up front. Drop row 0 from the overview table. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Change pyproject.toml [tool.uv.sources] to expect all three forks (schema-automator, linkml, linkml-map) as sibling directories of dm-bip. Add scripts/setup-enum-forks.sh to clone them with correct branches and fetch upstream tags for linkml (needed for version resolution). Update pipeline-steps.md with setup/cleanup instructions for the forks. Narrow requires-python to <3.13 to avoid resolution issues with the linkml fork's dynamic versioning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Point to pipeline-steps.md, generate_enum_specs.py, and setup script. Brief description of what the enum pipeline does and how to run it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

amc-corey-cox

@Sigfried I hope this doesn't offend, but it looks like this PR got pulled off-course by the LLM. The goal of #211 is to prove that enum_derivations work end-to-end in the pipeline, with the proof automated in tests/.

generate_enum_specs.py is a useful tool but it's a separate concern — split it into its own PR and we'll get this core part in separately.

There's a lot of material here that doesn't belong in the repo: planning docs, an embedded reproduction project, an AI-generated reference doc, a fork-cloning shell script. These bury the actual work and make the PR hard to review.

The dependency situation ([tool.uv.sources] pointing at local filesystem paths, unpinned deps with TODO comments) and the .gitignore regression are merge blockers — please see #290 for how I did it there.

Generally, you should also strip any descriptive comments the LLM is throwing in - that's just noise.

…urces

- Remove generate_enum_specs.py (splitting to separate PR)
- Remove issue-211-planning.md, linkml-int-enum-repro/, setup script, enum_test dir
- Switch pyproject.toml from local filesystem paths to git URL sources
- Restore output/ to .gitignore, remove local clone entries
- Remove DM_ENUM_DERIVATIONS and generate-enum-specs from pipeline.Makefile
- Restore direct DM_MAP_TARGET_SCHEMA/DM_TRANS_SPEC_DIR usage in map-data target
- Point config-enums.mk at committed specs (with_enum_derivations/) and target-schema-enums.yaml
- Point config-orig-valmaps.mk at with_value_mappings/ subdir
- Strip generated comments from enum derivation spec YAML files
- Rewrite test_from_enum_pipeline.py for enum pipeline with enum-specific assertions
- Update docs/pipeline-steps.md and README.md for new structure

Note: uv sync does not yet work with the git URL sources due to linkml's uv-dynamic-versioning fallback producing version 0.0.0, which fails transitive dependency constraints. See PR comment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Sigfried · 2026-03-26T22:42:25Z

@ccox-work Cleaned up most of the review feedback in 2a6c980:

Removed generate_enum_specs.py, issue-211-planning.md, linkml-int-enum-repro/, setup script, enum_test/
Switched [tool.uv.sources] from local filesystem paths to git URL sources (pinned to commit revs)
Restored output/ to .gitignore, removed local clone entries
Removed DM_ENUM_DERIVATIONS flag and generate-enum-specs target from pipeline.Makefile; map-data now uses DM_MAP_TARGET_SCHEMA and DM_TRANS_SPEC_DIR directly
Config files point at committed specs (with_enum_derivations/ and with_value_mappings/ subdirs)
Stripped generated comments from enum spec YAMLs
Rewrote test_from_enum_pipeline.py with enum-specific assertions
Updated docs and README

Blocker: `uv sync` fails with git URL sources

The linkml fork's pyproject.toml uses uv-dynamic-versioning with fallback-version = "0.0.0". When uv builds it from a git URL, it can't resolve git tags for versioning and falls back to 0.0.0. This causes a transitive dependency resolution failure:

schema-automator depends on linkml>=1.9.1,<2.0.0
linkml (from git source) resolves to version 0.0.0
→ no solution

Even after that's resolved with override-dependencies = ["linkml>=0"], the same pattern cascades to linkml-runtime (schema-automator also requires linkml-runtime>=1.9.2,<2.0.0).

This didn't happen with the local editable installs because uv could read the git history directly and compute a proper version.
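
The resolution failure comes down to a simple range check. The sketch below is a deliberately simplified illustration, not how uv resolves versions: `parse_version` and `satisfies` are hypothetical helpers that handle only plain `X.Y.Z` versions, whereas a real resolver implements full PEP 440 ordering (pre-releases, dev segments, and so on). The version 1.10.0 is only an example of a value inside the range.

```python
def parse_version(text):
    """Parse a plain X.Y.Z version into a tuple of ints (no pre/dev segments)."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, lower_inclusive, upper_exclusive):
    """Check lower_inclusive <= version < upper_exclusive."""
    v = parse_version(version)
    return parse_version(lower_inclusive) <= v < parse_version(upper_exclusive)


# schema-automator's constraint on linkml: >=1.9.1,<2.0.0
print(satisfies("0.0.0", "1.9.1", "2.0.0"))   # False: the fallback version is rejected
print(satisfies("1.10.0", "1.9.1", "2.0.0"))  # True: a properly derived version would pass
```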

Options I see:

Tag the fork branches (e.g., v1.10.0-sa-loader) so uv-dynamic-versioning produces a valid version from the git URL
Change the fork's fallback-version from "0.0.0" to something like "1.10.0.dev0"
Use a different pinning approach you may know about from your experience with #290 (Replace map_data.py with linkml-map CLI (#275))

Happy to go whichever direction you prefer.

amc-corey-cox · 2026-03-27T15:14:53Z

Use PEP 440 direct references in [project.dependencies] instead of [tool.uv.sources]. This sidesteps version resolution entirely:

[project]
dependencies = [
    "linkml @ git+https://github.com/Sigfried/linkml.git@<commit-or-branch>",
    "schema-automator @ git+https://github.com/Sigfried/schema-automator.git@<commit-or-branch>",
    "linkml-map @ git+https://github.com/Sigfried/linkml-map.git@<commit-or-branch>",
]

Remove the [tool.uv.sources] section and any override-dependencies. See #290's pyproject.toml for the pattern.

amc-corey-cox · 2026-03-27T15:18:40Z

This is definitely more complicated for your situation. You may have to make a test branch in schema-automator or linkml-map, or both, with the dependencies for linkml from your branch there in order to push through this. I'm not really sure... but that is what I would try.

All three PRs (schema-automator #188, linkml, linkml-map) are merged upstream but not yet released. Use PEP 440 direct references to upstream commit hashes — no more [tool.uv.sources] or override-dependencies. Also fix test_mapping_uses_enum_derivations to unwrap the dict output format, and update docs to reflect upstream status. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Sigfried · 2026-04-14T15:59:53Z

@amc-corey-cox, is there anything you're waiting for me to do on this? I think I addressed your previous comments. At this point main may have changed in ways that require more conflicts to be resolved.

Sigfried and others added 28 commits March 10, 2026 10:02

Add planning doc for issue #211 (enum derivations)

5b31532

Clarify planning doc based on review feedback

26c24c2

adding ToyFromRaw output (despite .gitignore) for review during development

1959f72

Restore ToyFromRaw output to match original toy data

db72d77

updated schema-automator to v0.5.4-rc2

797e23d

testing two options: enums inferred/not and infer-mixed-types yes/no

1e1fe7a

fixed formatting

9b9affc

Add README for linkml-int-enum-repro

a2e8ba2

working on planning doc

6420712

Move make commands and config intro to top of pipeline-steps.md

60b1ee2

Add enum pipeline section to README

f2fb281

amc-corey-cox requested changes Mar 26, 2026

Sigfried and others added 4 commits March 27, 2026 11:57

Address PR review feedback: clarify enum_derivations vs value_mappings, add note to pipeline-steps.md

7e7fb5d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge branch 'main' into 211-enum-derivations-with-unreleased-linkml-stuff

8716db9

fixed lint error

cd2bea9

Sigfried requested a review from amc-corey-cox March 27, 2026 17:10

Sigfried wants to merge 33 commits into main from 211-enum-derivations-with-unreleased-linkml-stuff
