Enum derivations pipeline with auto-generated specs #291

Draft
Sigfried wants to merge 33 commits into main from 211-enum-derivations-with-unreleased-linkml-stuff

Conversation

@Sigfried
Collaborator

Summary

  • Auto-generate enum_derivations specs from existing value_mappings specs and inferred source enums via new generate_enum_specs.py
  • New generate-enum-specs pipeline step wired into pipeline.Makefile (opt-in via DM_ENUM_DERIVATIONS=true)
  • Self-contained toy_data_w_enums/ test directory with configs for both original and enum pipelines
  • Full developer reference at docs/pipeline-steps.md comparing both pipelines

Local fork dependencies

Requires unreleased fixes in schema-automator, linkml, and linkml-map for int/string enum handling. Setup: bash scripts/setup-enum-forks.sh && uv sync. See Local fork changes for details.

Merge strategy

This branch includes [tool.uv.sources] pointing to local forks. When upstream releases incorporate the fixes:

  1. Remove [tool.uv.sources] from pyproject.toml
  2. Pin release versions in [project.dependencies]
  3. Merge to main

Test plan

  • Run make pipeline CONFIG=toy_data_w_enums/config-orig-valmaps.mk — original pipeline still works
  • Run make pipeline CONFIG=toy_data_w_enums/config-enums.mk — enum pipeline produces correct output
  • Compare mapped output between both pipelines (should be identical except None→null for unmapped enum values)
  • Review generated specs in output/ToyEnums/enums/enum-specs/ for correctness
  • Review generated target schema in output/ToyEnums/enums/enum-target-schema.yaml
  • Seven Bridges enclave setup (Corey)
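
The output comparison in the test plan could be scripted along these lines. This is a minimal sketch and not part of the PR: `normalize` and `diff_records` are hypothetical helpers, the rows are made-up stand-ins for mapped output, and the spellings treated as "missing" are an assumption.

```python
from typing import Any

# Spellings treated as "no value". This tuple is an assumption for illustration.
MISSING = ("", "None", "null")


def normalize(value: Any) -> Any:
    """Collapse the different spellings of a missing value into None."""
    if value is None or (isinstance(value, str) and value in MISSING):
        return None
    return value


def diff_records(orig: list[dict], enum: list[dict]) -> list[tuple]:
    """Return (row, field, orig_value, enum_value) for every real difference."""
    if len(orig) != len(enum):
        raise ValueError(f"row count differs: {len(orig)} vs {len(enum)}")
    diffs = []
    for row, (a, b) in enumerate(zip(orig, enum)):
        for field in sorted(set(a) | set(b)):
            va, vb = normalize(a.get(field)), normalize(b.get(field))
            if va != vb:
                diffs.append((row, field, va, vb))
    return diffs


# Made-up rows: the pipelines differ only in how an unmapped value is written.
orig_rows = [{"sex": "OMOP:8507", "smoking_status": "None"}]
enum_rows = [{"sex": "OMOP:8507", "smoking_status": None}]
print(diff_records(orig_rows, enum_rows))  # prints []
```

An empty list means the two outputs agree once the None/null difference is ignored; any remaining tuple points at a row and field worth inspecting by hand.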

🤖 Generated with Claude Code

Sigfried and others added 28 commits March 10, 2026 10:02
Captures our understanding of the task, current state of enum handling
in the pipeline, and open questions for the team before implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explain default_range: string, expand enum_derivations key features
with plain-language descriptions, clarify where source enums come from,
and simplify the target schema question.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reframe task as exploratory (test if LinkML-Map handles enum derivations),
explain why pre_cleaned path is the right test case (human-readable values
vs coded integers), and simplify plan into concrete steps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Created minimal test in toy_data/enum_test/ with enum-enabled source
schema, target schema with enums, and a spec using enum_derivations.
LinkML-Map correctly maps Male→OMOP:8507, Female→OMOP:8532. Key
finding: every source enum needs a derivation (use mirror_source: true
for passthrough). Updated planning doc with results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
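
The behaviour reported in the commit above (explicit value mapping, plus passthrough via mirror_source) can be pictured with a small sketch. This is a conceptual illustration only, not LinkML-Map's implementation: `derive_enum_value` is a hypothetical function, and returning None for unmapped values is an assumption based on the PR summary.

```python
from typing import Optional

# Permissible-value mapping using the two values reported in the commit message.
SEX_MAPPING = {"Male": "OMOP:8507", "Female": "OMOP:8532"}


def derive_enum_value(
    value: Optional[str],
    mapping: Optional[dict[str, str]] = None,
    mirror_source: bool = False,
) -> Optional[str]:
    """Map one source enum value to its target value.

    With mirror_source=True the value passes through unchanged; otherwise
    values missing from the mapping become None.
    """
    if value is None or mirror_source:
        return value
    return (mapping or {}).get(value)


print(derive_enum_value("Male", SEX_MAPPING))          # OMOP:8507
print(derive_enum_value("Unknown", SEX_MAPPING))       # None
print(derive_enum_value("Never", mirror_source=True))  # Never
```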
Documents current pipeline and future enum derivations pipeline in
table format with linked files, manual/curation steps, and notes.
Includes instructions at top for completing after context refresh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change generate_toy_data.py smoking_status from mixed int/string values
([1, 2, "Former", "Never", "Unknown"]) to all-text values (["Current",
"Former", "Never", "Unknown"]). This fixes linkml-validate failures where
bare numeric TSV values were parsed as integers, not matching string enum
permissible values.

Flesh out docs/pipeline-steps.md: separate In/Out on distinct rows,
add line-specific Makefile links, add real data pointers to RTI
NHLBI-BDC-DMC-HV repo, expand future pipeline table with all data
columns, and document the root cause and fix for the validation error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Corey confirmed the mixed int/string smoking_status values (1, 2,
"Former", "Never", "Unknown") are intentional, matching real dbGaP
data patterns. A schema-automator fix for mixed types is in progress.

Update docs to document this as a known issue awaiting upstream fix
rather than a data generation bug.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Re-ran ToyFromRaw pipeline after reverting generate_toy_data.py to
restore output files to their pre-change state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bypassed .gitignore to add output/EnumTest for dev testing
Move target_sex_enum into toy_data/target-schema.yaml (shared) and
delete toy_data/enum_test/target-schema.yaml. Update enum_test config
to point at the shared schema. EnumTest pipeline verified working.

Simplify docs/pipeline-steps.md to focus on toy data only — removed
pre-cleaned and real data columns per current scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- pipeline-steps.md: Add copy-pasteable commands for every step, document
  enum test pipeline using raw data path, document --infer-enum-from-integers
  flag, document int/string type mismatch blocker
- issue-211-planning.md: Replace stale "Why pre_cleaned" section (we now use
  raw data), document completed work (enum derivations, --infer-enum-from-integers,
  pipeline wiring), add int/string blocker with question for Corey, update
  remaining questions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add "How map_data.py works" section with ASCII flowchart showing
  the full transform pipeline from schema loading through TsvLoader
  to ObjectTransformer.map_object and chunked output
- Expand int/string blocker section with root cause (_parse_numeric
  in TSV loader), code references, why integer PVs can't help, and
  link to linkml-int-enum-repro/ minimal reproduction
- Currently broken: integer-coded enums fail both validation and
  mapping due to _parse_numeric converting all numeric TSV values
  to Python ints before schema-aware code runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explains the bug, expected vs actual output, root cause
(_parse_numeric in TSV loader), and proposed fix (make the
loader schema-aware).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
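
The root cause and the proposed fix described above can be illustrated as follows. This is a sketch under stated assumptions: `parse_numeric` is a hypothetical stand-in for the loader's `_parse_numeric` (not a copy of it), `coerce` and the enum name `smoking_status_enum` are invented for illustration, and the permissible values are the mixed int/string set mentioned earlier in this PR.

```python
from typing import Union

# Permissible values of a string enum, including the numeric-looking codes.
SMOKING_STATUS_VALUES = {"1", "2", "Former", "Never", "Unknown"}


def parse_numeric(text: str) -> Union[int, float, str]:
    """Schema-blind parsing: any numeric-looking string becomes a number."""
    try:
        return int(text)
    except ValueError:
        try:
            return float(text)
        except ValueError:
            return text


def coerce(text: str, declared_range: str) -> Union[int, float, str]:
    """Schema-aware parsing: convert only when the slot's range is numeric."""
    if declared_range == "integer":
        return int(text)
    if declared_range == "float":
        return float(text)
    return text  # strings and enums keep their text form


# Schema-blind: "1" becomes the int 1, which is not a permissible value.
print(parse_numeric("1") in SMOKING_STATUS_VALUES)                  # False
# Schema-aware: the enum-typed slot keeps "1" as text, so it matches.
print(coerce("1", "smoking_status_enum") in SMOKING_STATUS_VALUES)  # True
```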
Copies the from_raw pipeline setup (raw data, specs, target schema, config)
into a standalone directory. Currently uses value_mappings (identical to
from_raw); enum_derivations changes will be layered on next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Uses editable installs of local forks of schema-automator, linkml, and
linkml-map to test unreleased features (--infer-enum-from-integers,
int/string enum fixes). Not suitable for merging to main until upstream
releases incorporate these changes.

Changes:
- pyproject.toml/uv.lock: editable deps pointing at local forks
- .gitignore: output/ un-ignored, local clone dirs added
- pipeline.Makefile: DM_INFER_ENUM_FROM_INTEGERS variable
- map_data.py: DataLoader accepts schema_path for type coercion
- toy_data/enum_test: updated config and specs for enum derivations
- new-pipeline-plan.md: plan for generate_enum_specs.py tool

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consolidate issue-211-planning.md and new-pipeline-plan.md into a single
document. Adds local fork commit inventory, enum_derivations YAML syntax
reference, expanded passthrough/unreferenced enum handling, and comments
strategy. Removes resolved questions and narrative.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…anning

pipeline-steps.md: Restructure around toy_data_w_enums with nested-list
format comparing original (value_mappings) and enum-focused pipelines.
Add generate-enum-specs as step 2a, inline local fork notes at relevant
steps, remove obsolete BLOCKER notes, add config examples.

issue-211-planning.md: Replace notes-to-claude block and duplicated local
fork section with pointer to pipeline-steps.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d sections

Renumber steps 1-5, name after Makefile targets. Add overview table
comparing value_mappings and enum_derivations pipelines. Each step gets
formatted CLI commands, parameter/config tables, and input/output docs.
Rewrite map_data.py algorithm with SchemaView, blocks, entity discovery,
and transformation operations explained with code snippets. Link
generate_enum_specs algorithm to issue-211-planning.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Create config-orig-valmaps.mk (original pipeline) and config-enums.mk
(enum inference + derivation generation). Separate output dirs to avoid
collisions. Rename target-schema.yaml to target-schema-orig-valmaps.yaml.
Revert incorrect PyCharm renames in tests/ and toy_data/ that don't use
toy_data_w_enums paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New script reads source schema (with inferred enums) and existing specs
(with value_mappings), generates new specs with enum_derivations and a
target schema with enum definitions. Handles deduplication, disambiguation,
passthrough enums, unreferenced enums, and nested object_derivations.

Pipeline wiring: generate-enum-specs Makefile target runs after
schema-create and before map-data when DM_ENUM_DERIVATIONS is set.
Mapping step uses generated specs and target schema automatically.

Verified: full enum pipeline produces identical output to value_mappings
pipeline (except expected None→null for unmapped enum values).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
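
The core conversion this commit describes (value_mappings in, enum_derivations out) might look roughly like the sketch below. It is not the removed generate_enum_specs.py: `value_mappings_to_enum_derivation` and the enum name `sex_enum` are hypothetical, the output keys (`populated_from`, `sources`, `permissible_value_derivations`) are assumed to follow linkml-map's enum_derivations syntax, and deduplication, disambiguation, and passthrough handling are left out.

```python
from collections import defaultdict


def value_mappings_to_enum_derivation(
    source_enum: str, value_mappings: dict[str, str]
) -> dict:
    """Turn {source_value: target_value} into an enum-derivation-shaped dict."""
    # Group source values by target so many-to-one mappings stay together.
    by_target: dict[str, list[str]] = defaultdict(list)
    for source_value, target_value in value_mappings.items():
        by_target[target_value].append(source_value)

    pv_derivations = {}
    for target_value, source_values in by_target.items():
        if len(source_values) == 1:
            pv_derivations[target_value] = {"populated_from": source_values[0]}
        else:
            pv_derivations[target_value] = {"sources": source_values}

    return {
        "populated_from": source_enum,
        "permissible_value_derivations": pv_derivations,
    }


derivation = value_mappings_to_enum_derivation(
    "sex_enum", {"Male": "OMOP:8507", "Female": "OMOP:8532"}
)
print(derivation["permissible_value_derivations"]["OMOP:8507"])
```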
Fill in step 3 enum_derivations column with links to input files and
generated outputs. Add source file links on CLI lines for prepare_input,
generate_enum_specs, and map_data. Fix typos (pipline, tranform).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove Step 0 section, merge config descriptions into the intro with
both make commands up front. Drop row 0 from the overview table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change pyproject.toml [tool.uv.sources] to expect all three forks
(schema-automator, linkml, linkml-map) as sibling directories of dm-bip.
Add scripts/setup-enum-forks.sh to clone them with correct branches and
fetch upstream tags for linkml (needed for version resolution).

Update pipeline-steps.md with setup/cleanup instructions for the forks.
Narrow requires-python to <3.13 to avoid resolution issues with the
linkml fork's dynamic versioning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Point to pipeline-steps.md, generate_enum_specs.py, and setup script.
Brief description of what the enum pipeline does and how to run it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@amc-corey-cox amc-corey-cox left a comment

@Sigfried I hope this doesn't offend, but it looks like this PR got pulled off-course by the LLM. The goal of #211 is to prove that enum_derivations work end-to-end in the pipeline, with that proof automated in tests/.

generate_enum_specs.py is a useful tool but it's a separate concern — split it into its own PR and we'll get this core part in separately.

There's a lot of material here that doesn't belong in the repo: planning docs, an embedded reproduction project, an AI-generated reference doc, a fork-cloning shell script. These bury the actual work and make the PR hard to review.

The dependency situation ([tool.uv.sources] pointing at local filesystem paths, unpinned deps with TODO comments) and the .gitignore regression are merge blockers — please see #290 for how I did it there.

Generally, you should also strip any descriptive comments the LLM is throwing in - that's just noise.

Comment thread docs/pipeline-steps.md
Comment thread linkml-int-enum-repro/README.md Outdated
Comment thread scripts/setup-enum-forks.sh Outdated
Comment thread src/dm_bip/map_data/map_data.py
Comment thread toy_data/enum_test/specs/person-spec.yaml Outdated
Comment thread src/dm_bip/generate_enum_specs.py Outdated
Comment thread .gitignore Outdated
Comment thread pyproject.toml Outdated
Comment thread pyproject.toml Outdated
Comment thread src/dm_bip/generate_enum_specs.py Outdated
…urces

- Remove generate_enum_specs.py (splitting to separate PR)
- Remove issue-211-planning.md, linkml-int-enum-repro/, setup script, enum_test dir
- Switch pyproject.toml from local filesystem paths to git URL sources
- Restore output/ to .gitignore, remove local clone entries
- Remove DM_ENUM_DERIVATIONS and generate-enum-specs from pipeline.Makefile
- Restore direct DM_MAP_TARGET_SCHEMA/DM_TRANS_SPEC_DIR usage in map-data target
- Point config-enums.mk at committed specs (with_enum_derivations/) and target-schema-enums.yaml
- Point config-orig-valmaps.mk at with_value_mappings/ subdir
- Strip generated comments from enum derivation spec YAML files
- Rewrite test_from_enum_pipeline.py for enum pipeline with enum-specific assertions
- Update docs/pipeline-steps.md and README.md for new structure

Note: uv sync does not yet work with the git URL sources due to
linkml's uv-dynamic-versioning fallback producing version 0.0.0,
which fails transitive dependency constraints. See PR comment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Sigfried
Collaborator Author

@ccox-work Cleaned up most of the review feedback in 2a6c980:

  • Removed generate_enum_specs.py, issue-211-planning.md, linkml-int-enum-repro/, setup script, enum_test/
  • Switched [tool.uv.sources] from local filesystem paths to git URL sources (pinned to commit revs)
  • Restored output/ to .gitignore, removed local clone entries
  • Removed DM_ENUM_DERIVATIONS flag and generate-enum-specs target from pipeline.Makefile; map-data now uses DM_MAP_TARGET_SCHEMA and DM_TRANS_SPEC_DIR directly
  • Config files point at committed specs (with_enum_derivations/ and with_value_mappings/ subdirs)
  • Stripped generated comments from enum spec YAMLs
  • Rewrote test_from_enum_pipeline.py with enum-specific assertions
  • Updated docs and README

Blocker: uv sync fails with git URL sources

The linkml fork's pyproject.toml uses uv-dynamic-versioning with fallback-version = "0.0.0". When uv builds it from a git URL, it can't resolve git tags for versioning and falls back to 0.0.0. This causes a transitive dependency resolution failure:

schema-automator depends on linkml>=1.9.1,<2.0.0
linkml (from git source) resolves to version 0.0.0
→ no solution

Even after that's resolved with override-dependencies = ["linkml>=0"], the same pattern cascades to linkml-runtime (schema-automator also requires linkml-runtime>=1.9.2,<2.0.0).

This didn't happen with the local editable installs because uv could read the git history directly and compute a proper version.
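
The resolution failure above comes down to a range check that the fallback version cannot pass. The sketch below is a deliberately simplified illustration: `parse_version` and `satisfies` are hypothetical helpers that handle only plain dotted numeric versions, whereas real resolvers follow the full PEP 440 rules.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Split a plain dotted version such as '1.9.1' into comparable integers."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version: str, lower: str, upper: str) -> bool:
    """True when lower <= version < upper."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# schema-automator's constraint on linkml, as quoted above: >=1.9.1,<2.0.0
for candidate in ("0.0.0", "1.10.0"):
    ok = satisfies(candidate, "1.9.1", "2.0.0")
    print(f"linkml {candidate} satisfies >=1.9.1,<2.0.0: {ok}")
```

The fallback 0.0.0 fails the check, while a version in the 1.10 range passes it, which is why options 1 and 2 below both aim to make the fork report a version inside the constraint.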

Options I see:

  1. Tag the fork branches (e.g., v1.10.0-sa-loader) so uv-dynamic-versioning produces a valid version from the git URL
  2. Change the fork's fallback-version from "0.0.0" to something like "1.10.0.dev0"
  3. Use a different pinning approach you may know about from your experience with #290 (Replace map_data.py with linkml-map CLI (#275))

Happy to go whichever direction you prefer.

@amc-corey-cox
Collaborator

Use PEP 440 direct references in [project.dependencies] instead of [tool.uv.sources]. This sidesteps version resolution entirely:

[project]
dependencies = [
  "linkml @ git+https://github.com/Sigfried/linkml.git@<commit-or-branch>",
  "schema-automator @ git+https://github.com/Sigfried/schema-automator.git@<commit-or-branch>",
  "linkml-map @ git+https://github.com/Sigfried/linkml-map.git@<commit-or-branch>",
]

Remove the [tool.uv.sources] section and any override-dependencies. See #290's pyproject.toml for the pattern.
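
For reference, a direct reference has the shape `name @ url`, with the pinned revision after the final `@` of the URL. The sketch below splits such a string into its parts. It is an illustration only: `DirectReference` and `parse_direct_reference` are hypothetical names, the `<commit-or-branch>` placeholder is kept as written above, and URLs that contain `@` elsewhere (for example `git@` in SSH URLs) are not handled.

```python
from typing import NamedTuple, Optional


class DirectReference(NamedTuple):
    name: str
    url: str
    revision: Optional[str]


def parse_direct_reference(requirement: str) -> DirectReference:
    """Split 'name @ url@rev' into package name, URL, and revision."""
    name, sep, target = requirement.partition("@")
    if not sep or not name.strip() or not target.strip():
        raise ValueError(f"not a direct reference: {requirement!r}")
    target = target.strip()
    url, rev_sep, revision = target.rpartition("@")
    if not rev_sep:
        # No trailing "@<rev>": rpartition puts the whole string in `revision`.
        url, revision = target, None
    return DirectReference(name.strip(), url, revision)


ref = parse_direct_reference(
    "linkml @ git+https://github.com/Sigfried/linkml.git@<commit-or-branch>"
)
print(ref.name, ref.url, ref.revision)
```

A check like this could flag dependencies that are still plain version specifiers, or direct references left unpinned, before the branch is merged.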

@amc-corey-cox
Collaborator

This is definitely more complicated for your situation. You may have to make a test branch in schema-automator or linkml-map, or both, with the dependencies for linkml from your branch there in order to push through this. I'm not really sure... but that is what I would try.

Sigfried and others added 4 commits March 27, 2026 11:57
…s, add note to pipeline-steps.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All three PRs (schema-automator #188, linkml, linkml-map) are merged
upstream but not yet released. Use PEP 440 direct references to upstream
commit hashes — no more [tool.uv.sources] or override-dependencies.

Also fix test_mapping_uses_enum_derivations to unwrap the dict output
format, and update docs to reflect upstream status.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Sigfried Sigfried requested a review from amc-corey-cox March 27, 2026 17:10
@Sigfried
Collaborator Author

@amc-corey-cox, is there anything you're waiting for me to do on this? I think I addressed your previous comments. At this point, main may have changed enough that there are more conflicts to resolve.
