Enum derivations pipeline with auto-generated specs#291
Enum derivations pipeline with auto-generated specs#291
Conversation
Captures our understanding of the task, current state of enum handling in the pipeline, and open questions for the team before implementation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explain default_range: string, expand enum_derivations key features with plain-language descriptions, clarify where source enums come from, and simplify the target schema question. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reframe task as exploratory (test if LinkML-Map handles enum derivations), explain why pre_cleaned path is the right test case (human-readable values vs coded integers), and simplify plan into concrete steps. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Created minimal test in toy_data/enum_test/ with enum-enabled source schema, target schema with enums, and a spec using enum_derivations. LinkML-Map correctly maps Male→OMOP:8507, Female→OMOP:8532. Key finding: every source enum needs a derivation (use mirror_source: true for passthrough). Updated planning doc with results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents current pipeline and future enum derivations pipeline in table format with linked files, manual/curation steps, and notes. Includes instructions at top for completing after context refresh. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change generate_toy_data.py smoking_status from mixed int/string values ([1, 2, "Former", "Never", "Unknown"]) to all-text values (["Current", "Former", "Never", "Unknown"]). This fixes linkml-validate failures where bare numeric TSV values were parsed as integers, not matching string enum permissible values. Flesh out docs/pipeline-steps.md: separate In/Out on distinct rows, add line-specific Makefile links, add real data pointers to RTI NHLBI-BDC-DMC-HV repo, expand future pipeline table with all data columns, and document the root cause and fix for the validation error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Corey confirmed the mixed int/string smoking_status values (1, 2, "Former", "Never", "Unknown") are intentional, matching real dbGaP data patterns. A schema-automator fix for mixed types is in progress. Update docs to document this as a known issue awaiting upstream fix rather than a data generation bug. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Re-ran ToyFromRaw pipeline after reverting generate_toy_data.py to restore output files to their pre-change state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bypassed .gitignore to add output/EnumTest for dev testing
Move target_sex_enum into toy_data/target-schema.yaml (shared) and delete toy_data/enum_test/target-schema.yaml. Update enum_test config to point at the shared schema. EnumTest pipeline verified working. Simplify docs/pipeline-steps.md to focus on toy data only — removed pre-cleaned and real data columns per current scope. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- pipeline-steps.md: Add copy-pasteable commands for every step, document enum test pipeline using raw data path, document --infer-enum-from-integers flag, document int/string type mismatch blocker - issue-211-planning.md: Replace stale "Why pre_cleaned" section (we now use raw data), document completed work (enum derivations, --infer-enum-from-integers, pipeline wiring), add int/string blocker with question for Corey, update remaining questions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add "How map_data.py works" section with ASCII flowchart showing the full transform pipeline from schema loading through TsvLoader to ObjectTransformer.map_object and chunked output - Expand int/string blocker section with root cause (_parse_numeric in TSV loader), code references, why integer PVs can't help, and link to linkml-int-enum-repro/ minimal reproduction - Currently broken: integer-coded enums fail both validation and mapping due to _parse_numeric converting all numeric TSV values to Python ints before schema-aware code runs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explains the bug, expected vs actual output, root cause (_parse_numeric in TSV loader), and proposed fix (make the loader schema-aware). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copies the from_raw pipeline setup (raw data, specs, target schema, config) into a standalone directory. Currently uses value_mappings (identical to from_raw); enum_derivations changes will be layered on next. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Uses editable installs of local forks of schema-automator, linkml, and linkml-map to test unreleased features (--infer-enum-from-integers, int/string enum fixes). Not suitable for merging to main until upstream releases incorporate these changes. Changes: - pyproject.toml/uv.lock: editable deps pointing at local forks - .gitignore: output/ un-ignored, local clone dirs added - pipeline.Makefile: DM_INFER_ENUM_FROM_INTEGERS variable - map_data.py: DataLoader accepts schema_path for type coercion - toy_data/enum_test: updated config and specs for enum derivations - new-pipeline-plan.md: plan for generate_enum_specs.py tool Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consolidate issue-211-planning.md and new-pipeline-plan.md into a single document. Adds local fork commit inventory, enum_derivations YAML syntax reference, expanded passthrough/unreferenced enum handling, and comments strategy. Removes resolved questions and narrative. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…anning pipeline-steps.md: Restructure around toy_data_w_enums with nested-list format comparing original (value_mappings) and enum-focused pipelines. Add generate-enum-specs as step 2a, inline local fork notes at relevant steps, remove obsolete BLOCKER notes, add config examples. issue-211-planning.md: Replace notes-to-claude block and duplicated local fork section with pointer to pipeline-steps.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d sections Renumber steps 1-5, name after Makefile targets. Add overview table comparing value_mappings and enum_derivations pipelines. Each step gets formatted CLI commands, parameter/config tables, and input/output docs. Rewrite map_data.py algorithm with SchemaView, blocks, entity discovery, and transformation operations explained with code snippets. Link generate_enum_specs algorithm to issue-211-planning.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Create config-orig-valmaps.mk (original pipeline) and config-enums.mk (enum inference + derivation generation). Separate output dirs to avoid collisions. Rename target-schema.yaml to target-schema-orig-valmaps.yaml. Revert incorrect PyCharm renames in tests/ and toy_data/ that don't use toy_data_w_enums paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New script reads source schema (with inferred enums) and existing specs (with value_mappings), generates new specs with enum_derivations and a target schema with enum definitions. Handles deduplication, disambiguation, passthrough enums, unreferenced enums, and nested object_derivations. Pipeline wiring: generate-enum-specs Makefile target runs after schema-create and before map-data when DM_ENUM_DERIVATIONS is set. Mapping step uses generated specs and target schema automatically. Verified: full enum pipeline produces identical output to value_mappings pipeline (except expected None→null for unmapped enum values). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fill in step 3 enum_derivations column with links to input files and generated outputs. Add source file links on CLI lines for prepare_input, generate_enum_specs, and map_data. Fix typos (pipline, tranform). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove Step 0 section, merge config descriptions into the intro with both make commands up front. Drop row 0 from the overview table. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change pyproject.toml [tool.uv.sources] to expect all three forks (schema-automator, linkml, linkml-map) as sibling directories of dm-bip. Add scripts/setup-enum-forks.sh to clone them with correct branches and fetch upstream tags for linkml (needed for version resolution). Update pipeline-steps.md with setup/cleanup instructions for the forks. Narrow requires-python to <3.13 to avoid resolution issues with the linkml fork's dynamic versioning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Point to pipeline-steps.md, generate_enum_specs.py, and setup script. Brief description of what the enum pipeline does and how to run it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
amc-corey-cox
left a comment
There was a problem hiding this comment.
@Sigfried I hope this doesn't offend but it looks like this PR got pulled off-course by the LLM. The goal of #211 is proving enum_derivations work end-to-end in the pipeline with proof automated in tests/.
generate_enum_specs.py is a useful tool but it's a separate concern — split it into its own PR and we'll get this core part in separately.
There's a lot of material here that doesn't belong in the repo: planning docs, an embedded reproduction project, an AI-generated reference doc, a fork-cloning shell script. These bury the actual work and make the PR hard to review.
The dependency situation ([tool.uv.sources] pointing at local filesystem paths, unpinned deps with TODO comments) and the .gitignore regression are merge blockers — please see #290 for how I did it there.
Generally, you should also strip any descriptive comments the LLM is throwing in - that's just noise.
…urces - Remove generate_enum_specs.py (splitting to separate PR) - Remove issue-211-planning.md, linkml-int-enum-repro/, setup script, enum_test dir - Switch pyproject.toml from local filesystem paths to git URL sources - Restore output/ to .gitignore, remove local clone entries - Remove DM_ENUM_DERIVATIONS and generate-enum-specs from pipeline.Makefile - Restore direct DM_MAP_TARGET_SCHEMA/DM_TRANS_SPEC_DIR usage in map-data target - Point config-enums.mk at committed specs (with_enum_derivations/) and target-schema-enums.yaml - Point config-orig-valmaps.mk at with_value_mappings/ subdir - Strip generated comments from enum derivation spec YAML files - Rewrite test_from_enum_pipeline.py for enum pipeline with enum-specific assertions - Update docs/pipeline-steps.md and README.md for new structure Note: uv sync does not yet work with the git URL sources due to linkml's uv-dynamic-versioning fallback producing version 0.0.0, which fails transitive dependency constraints. See PR comment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@ccox-work Cleaned up most of the review feedback in 2a6c980:
Blocker:
|
|
Use PEP 440 direct references in [project.dependencies]
linkml @ git+https://github.com/Sigfried/linkml.git@<commit-or-branch>
schema-automator @ git+https://github.com/Sigfried/schema-automator.git@<commit-or-branch>
linkml-map @ git+https://github.com/Sigfried/linkml-map.git@<commit-or-branch>Remove the |
|
This is definitely more complicated for your situation. You may have to make a test branch in schema-automator or linkml-map, or both, with the dependencies for linkml from your branch there in order to push through this. I'm not really sure... but that is what I would try. |
…s, add note to pipeline-steps.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All three PRs (schema-automator #188, linkml, linkml-map) are merged upstream but not yet released. Use PEP 440 direct references to upstream commit hashes — no more [tool.uv.sources] or override-dependencies. Also fix test_mapping_uses_enum_derivations to unwrap the dict output format, and update docs to reflect upstream status. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@amc-corey-cox, is there anything you're waiting for me to do on this? I think I addressed your previous comments. At this point main may have changed in ways that require more conflicts to be resolved |
Summary
enum_derivationsspecs from existingvalue_mappingsspecs and inferred source enums via newgenerate_enum_specs.pygenerate-enum-specspipeline step wired intopipeline.Makefile(opt-in viaDM_ENUM_DERIVATIONS=true)toy_data_w_enums/test directory with configs for both original and enum pipelinesdocs/pipeline-steps.mdcomparing both pipelinesLocal fork dependencies
Requires unreleased fixes in schema-automator, linkml, and linkml-map for int/string enum handling. Setup:
bash scripts/setup-enum-forks.sh && uv sync. See Local fork changes for details.Merge strategy
This branch includes
[tool.uv.sources]pointing to local forks. When upstream releases incorporate the fixes:[tool.uv.sources]frompyproject.toml[project.dependencies]Test plan
make pipeline CONFIG=toy_data_w_enums/config-orig-valmaps.mk— original pipeline still worksmake pipeline CONFIG=toy_data_w_enums/config-enums.mk— enum pipeline produces correct outputNone→nullfor unmapped enum values)output/ToyEnums/enums/enum-specs/for correctnessoutput/ToyEnums/enums/enum-target-schema.yaml🤖 Generated with Claude Code