From 0b292d14b34e40ef5d4e394af006bb4e3219c251 Mon Sep 17 00:00:00 2001
From: Dragon-AI Agent
Date: Wed, 14 Jan 2026 03:29:08 +0000
Subject: [PATCH] Add comprehensive setup and troubleshooting documentation

This commit adds three major documentation enhancements to make the system
completely clear to new users setting up linkml-reference-validator:

1. **Setup Guide (docs/setup-guide.md)**
   - Complete installation instructions for pip, uv, and development setup
   - Initial configuration including NCBI API key setup
   - Quick start examples with real PMIDs
   - Real-world example: validating gene functions
   - Advanced configuration with YAML config files
   - Integration with pre-commit hooks, CI/CD, and Makefiles
   - Verification checklist and troubleshooting quick fixes

2. **Complete Workflow Tutorial (docs/tutorials/complete-workflow.md)**
   - Step-by-step 30-45 minute tutorial building a gene annotation system
   - Covers installation, schema design, data creation, validation, and repair
   - Includes real-world examples with TP53, BRCA1, EGFR, and JAK1
   - Shows integration with Git, GitHub Actions, and testing frameworks
   - Provides templates and boilerplate code for quick starts
   - Production-ready examples with Makefiles and test suites

3. **Troubleshooting Guide (docs/troubleshooting.md)**
   - Comprehensive solutions for installation issues
   - Reference fetching problems (PMIDs, network, rate limiting)
   - Validation errors with detailed explanations and fixes
   - Schema and data format issues
   - Performance optimization tips
   - Common error messages with causes and solutions
   - Quick diagnostic checklist

Also updated mkdocs.yml navigation to include the new guides in logical
positions for discoverability.

These guides provide clear, illustrative examples for someone setting up
the system from scratch, addressing issue #29.

Co-Authored-By: Claude Sonnet 4.5
---
 docs/setup-guide.md                 | 699 +++++++++++++++++++++
 docs/troubleshooting.md             | 706 ++++++++++++++++++++++
 docs/tutorials/complete-workflow.md | 900 ++++++++++++++++++++++++++++
 mkdocs.yml                          |   3 +
 4 files changed, 2308 insertions(+)
 create mode 100644 docs/setup-guide.md
 create mode 100644 docs/troubleshooting.md
 create mode 100644 docs/tutorials/complete-workflow.md

diff --git a/docs/setup-guide.md b/docs/setup-guide.md
new file mode 100644
index 0000000..20ce064
--- /dev/null
+++ b/docs/setup-guide.md
@@ -0,0 +1,699 @@
+# Complete Setup Guide
+
+This guide walks you through setting up linkml-reference-validator from scratch, with complete examples for different use cases.
+
+## Prerequisites
+
+### System Requirements
+
+- **Python 3.10 or higher** - Check with `python --version`
+- **pip or uv** - Package installer (uv is faster and recommended)
+- **Internet connection** - For fetching references from PubMed, Crossref, etc.
+
+### Optional but Recommended
+
+- **NCBI API Key** - For higher rate limits when fetching PubMed articles
+- **Git** - For version control of your data and schemas
+
+## Installation
+
+### Option 1: Using pip (Standard)
+
+```bash
+pip install linkml-reference-validator
+```
+
+Verify the installation:
+
+```bash
+linkml-reference-validator --version
+```
+
+### Option 2: Using uv (Recommended for Speed)
+
+[uv](https://github.com/astral-sh/uv) is a fast Python package installer:
+
+```bash
+# Install uv first (if you don't have it)
+curl -LsSf https://astral.sh/uv/install.sh | sh
+
+# Install linkml-reference-validator
+uv pip install linkml-reference-validator
+```
+
+Verify:
+
+```bash
+uv run linkml-reference-validator --version
+```
+
+### Option 3: Development Installation
+
+If you want to contribute or modify the code:
+
+```bash
+# Clone the repository
+git clone https://github.com/linkml/linkml-reference-validator.git
+cd linkml-reference-validator
+
+# Install with development dependencies
+uv sync --group dev
+
+# Run tests to verify
+just test
+```
+
+## Initial Configuration
+
+### 1. Set Up Your Workspace
+
+Create a directory for your validation project:
+
+```bash
+mkdir my-validation-project
+cd my-validation-project
+```
+
+### 2. Configure NCBI Access (Optional but Recommended)
+
+To avoid rate limits when fetching PubMed articles:
+
+1. **Get an NCBI API Key** (free):
+   - Visit https://www.ncbi.nlm.nih.gov/account/
+   - Sign up or log in
+   - Go to Settings → API Key Management
+   - Create a new API key
+
+2. **Set environment variables**:
+
+```bash
+# Add to your ~/.bashrc, ~/.zshrc, or ~/.profile
+export NCBI_EMAIL="your.email@example.com"
+export NCBI_API_KEY="your_api_key_here"
+
+# Or create a .env file in your project
+echo 'NCBI_EMAIL=your.email@example.com' >> .env
+echo 'NCBI_API_KEY=your_api_key_here' >> .env
+```
+
+3. **Test the configuration**:
+
+```bash
+linkml-reference-validator validate text \
+  "MUC1 oncoprotein blocks nuclear targeting of c-Abl" \
+  PMID:16888623
+```
+
+If successful, you should see:
+```
+Validating text against PMID:16888623...
+  Text: MUC1 oncoprotein blocks nuclear targeting of c-Abl
+
+Result:
+  Valid: True
+  Message: Supporting text validated successfully in PMID:16888623
+```
+
+### 3. Set Up Cache Directory
+
+By default, references are cached in `references_cache/` in your current directory. To use a global cache:
+
+```bash
+# Create a global cache directory
+mkdir -p ~/.cache/linkml-reference-validator
+
+# Set environment variable
+export REFERENCE_CACHE_DIR=~/.cache/linkml-reference-validator
+
+# Or add to your shell profile
+echo 'export REFERENCE_CACHE_DIR=~/.cache/linkml-reference-validator' >> ~/.bashrc
+```
+
+Benefits of a global cache:
+- Share references across multiple projects
+- Avoid re-downloading the same papers
+- Faster validation when working on multiple datasets
+
+## Quick Start Examples
+
+### Example 1: Validate a Single Quote
+
+The simplest use case - verify that a quote appears in a paper:
+
+```bash
+linkml-reference-validator validate text \
+  "TP53 functions as a tumor suppressor" \
+  PMID:12345678
+```
+
+**What happens:**
+1. Fetches the reference from PubMed
+2. Caches it locally in `references_cache/PMID_12345678.md`
+3. Searches for the quote in the reference content
+4. Returns validation result
+
+### Example 2: Create Your First Schema
+
+Create a schema file to define your data structure:
+
+**schema.yaml:**
+```yaml
+id: https://example.org/gene-validation
+name: gene-validation-schema
+
+prefixes:
+  linkml: https://w3id.org/linkml/
+
+classes:
+  GeneAnnotation:
+    tree_root: true
+    attributes:
+      gene_symbol:
+        required: true
+        description: Gene symbol (e.g., TP53, BRCA1)
+
+      function:
+        required: true
+        description: Functional description of the gene
+
+      supporting_text:
+        required: true
+        description: Quote from the reference supporting this annotation
+        slot_uri: linkml:excerpt
+
+      reference_id:
+        required: true
+        description: PubMed ID or DOI of the reference
+        slot_uri: linkml:authoritative_reference
+```
+
+**Key points:**
+- `slot_uri: linkml:excerpt` marks the field containing quoted text
+- `slot_uri: linkml:authoritative_reference` marks the reference identifier
+- These special URIs tell the validator which fields to check
+
+### Example 3: Create Your First Data File
+
+Create a data file matching your schema:
+
+**gene_data.yaml:**
+```yaml
+gene_symbol: TP53
+function: Tumor suppressor that regulates cell cycle
+supporting_text: TP53 functions as a tumor suppressor through regulation of cell cycle arrest
+reference_id: PMID:12345678
+```
+
+### Example 4: Validate Your Data
+
+```bash
+linkml-reference-validator validate data \
+  gene_data.yaml \
+  --schema schema.yaml \
+  --target-class GeneAnnotation
+```
+
+**Expected output (if valid):**
+```
+Validating gene_data.yaml against schema schema.yaml
+Cache directory: references_cache
+
+Validating 1 object(s) of type GeneAnnotation...
+✓ All validations passed!
+```
+
+**Output if validation fails:**
+```
+Validating gene_data.yaml against schema schema.yaml
+Cache directory: references_cache
+
+Validating 1 object(s) of type GeneAnnotation...
+✗ Validation failed for:
+  Reference: PMID:12345678
+  Supporting text: "TP53 functions as a tumor suppressor through regulation of cell cycle arrest"
+  Error: Text not found in reference
+
+1 validation(s) failed, 0 passed
+```
+
+## Real-World Example: Validating Gene Functions
+
+Let's work through a complete real-world example: validating gene function annotations.
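
Before the walkthrough, it may help to picture what the "Text not found in reference" failure from Example 4 means. The sketch below is an illustration only, not the tool's actual implementation: the names `normalize` and `quote_is_supported` are hypothetical, the reference text is a made-up stand-in, and the rules (lowercasing, dropping punctuation, ignoring `[...]` editorial notes, splitting on `...`) are simplifying assumptions based on the conventions these docs describe.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def quote_is_supported(quote: str, reference_text: str) -> bool:
    """Return True if every part of the quote occurs in the reference text."""
    # Drop bracketed editorial notes such as "[sic]" (assumed convention).
    without_notes = re.sub(r"\[[^\]]*\]", " ", quote)
    # "..." separates parts that need not be contiguous in the source.
    # This sketch does not check that the parts appear in order.
    parts = [normalize(part) for part in without_notes.split("...")]
    parts = [part for part in parts if part]
    if not parts:
        return False  # nothing left to check, e.g. the quote was only "[sic]"
    haystack = normalize(reference_text)
    return all(part in haystack for part in parts)


# Hypothetical stand-in for fetched reference content.
reference = "MUC1 oncoprotein blocks nuclear targeting of c-Abl in the apoptotic response."

exact = quote_is_supported("MUC1 oncoprotein blocks nuclear targeting of c-Abl", reference)
edited = quote_is_supported("MUC1 oncoprotein [a mucin] blocks ... of c-Abl", reference)
paraphrase = quote_is_supported("MUC1 prevents nuclear import of c-Abl", reference)
print(exact, edited, paraphrase)
```

Under these assumptions the exact quote and the edited quote pass, while the paraphrase fails, which is the situation the repair command in Step 5 is meant to help with.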
+
+### Step 1: Project Setup
+
+```bash
+# Create project directory
+mkdir gene-annotations
+cd gene-annotations
+
+# Create subdirectories
+mkdir schemas
+mkdir data
+mkdir references_cache
+```
+
+### Step 2: Create the Schema
+
+**schemas/gene_function_schema.yaml:**
+```yaml
+id: https://example.org/gene-functions
+name: gene-functions
+
+prefixes:
+  linkml: https://w3id.org/linkml/
+
+classes:
+  GeneFunctionDataset:
+    tree_root: true
+    attributes:
+      genes:
+        multivalued: true
+        range: GeneFunction
+
+  GeneFunction:
+    attributes:
+      gene_symbol:
+        identifier: true
+        required: true
+        description: Official gene symbol
+
+      function_category:
+        required: true
+        description: Broad category of function
+        range: FunctionCategory
+
+      detailed_function:
+        required: true
+        description: Detailed description of function
+
+      evidence:
+        required: true
+        range: Evidence
+        description: Supporting evidence from literature
+
+  Evidence:
+    attributes:
+      reference_id:
+        required: true
+        slot_uri: linkml:authoritative_reference
+        description: PMID, DOI, or PMC identifier
+
+      supporting_text:
+        required: true
+        slot_uri: linkml:excerpt
+        description: Direct quote from the reference
+
+      notes:
+        description: Additional context or clarifications
+
+enums:
+  FunctionCategory:
+    permissible_values:
+      TUMOR_SUPPRESSOR:
+        description: Prevents uncontrolled cell growth
+      ONCOGENE:
+        description: Promotes cell growth and division
+      DNA_REPAIR:
+        description: Repairs damaged DNA
+      TRANSCRIPTION_FACTOR:
+        description: Regulates gene expression
+      CELL_CYCLE_REGULATOR:
+        description: Controls cell cycle progression
+```
+
+### Step 3: Create Sample Data
+
+**data/tp53_brca1.yaml:**
+```yaml
+genes:
+  - gene_symbol: TP53
+    function_category: TUMOR_SUPPRESSOR
+    detailed_function: Regulates cell cycle arrest and apoptosis in response to DNA damage
+    evidence:
+      reference_id: PMID:16888623
+      supporting_text: "MUC1 oncoprotein blocks nuclear targeting of c-Abl"
+      notes: Example from actual paper
+
+  - gene_symbol: BRCA1
+    function_category: DNA_REPAIR
+    detailed_function: Critical role in homologous recombination DNA repair
+    evidence:
+      reference_id: PMID:12345678
+      supporting_text: "BRCA1 plays a critical role in DNA double-strand break repair through homologous recombination"
+```
+
+### Step 4: Validate
+
+```bash
+linkml-reference-validator validate data \
+  data/tp53_brca1.yaml \
+  --schema schemas/gene_function_schema.yaml \
+  --target-class GeneFunctionDataset \
+  --verbose
+```
+
+### Step 5: Handle Validation Errors
+
+If validation fails, use the repair command:
+
+```bash
+# First, see what repairs are suggested (dry run)
+linkml-reference-validator repair data \
+  data/tp53_brca1.yaml \
+  --schema schemas/gene_function_schema.yaml \
+  --target-class GeneFunctionDataset \
+  --dry-run
+
+# Review the suggested repairs, then apply if appropriate
+linkml-reference-validator repair data \
+  data/tp53_brca1.yaml \
+  --schema schemas/gene_function_schema.yaml \
+  --target-class GeneFunctionDataset \
+  --no-dry-run
+```
+
+## Advanced Configuration
+
+### Project Configuration File
+
+Create `.linkml-reference-validator.yaml` in your project root:
+
+```yaml
+# Validation settings
+validation:
+  # Cache directory (relative to config file or absolute)
+  cache_dir: ./references_cache
+
+  # Custom reference prefix mappings
+  reference_prefix_map:
+    geo: GEO
+    NCBIGeo: GEO
+    pubmed: PMID
+
+  # Base directory for resolving file:// references
+  reference_base_dir: ./references
+
+# Repair settings
+repair:
+  # Confidence thresholds
+  auto_fix_threshold: 0.95
+  suggest_threshold: 0.80
+  removal_threshold: 0.50
+
+  # Character normalization mappings
+  character_mappings:
+    "CO2": "CO₂"
+    "H2O": "H₂O"
+    "O2": "O₂"
+    "+/-": "±"
+    "+-": "±"
+
+  # Skip certain references
+  skip_references:
+    - "PMID:00000000"  # Example: no abstract available
+
+  # Trust low-similarity matches (manually verified)
+  trusted_low_similarity:
+    - "PMID:99999999"  # Example: verified manually
+```
+
+Use the config file:
+
+```bash
+linkml-reference-validator validate data \
+  data.yaml \
+  --schema schema.yaml \
+  --config .linkml-reference-validator.yaml
+```
+
+### Using Environment Variables
+
+Create a `.env` file:
+
+```bash
+# NCBI Configuration
+NCBI_EMAIL=your.email@example.com
+NCBI_API_KEY=your_api_key_here
+
+# Cache Configuration
+REFERENCE_CACHE_DIR=/path/to/global/cache
+
+# Rate Limiting (requests per second)
+NCBI_RATE_LIMIT=3
+CROSSREF_RATE_LIMIT=2
+```
+
+Load environment variables:
+
+```bash
+# Using direnv (recommended)
+echo 'dotenv' > .envrc
+direnv allow
+
+# Or manually source
+set -a
+source .env
+set +a
+```
+
+## Working with Different Reference Types
+
+### PubMed IDs (PMID)
+
+```bash
+linkml-reference-validator validate text \
+  "Your quote here" \
+  PMID:16888623
+```
+
+### PubMed Central (PMC)
+
+For full-text access:
+
+```bash
+linkml-reference-validator validate text \
+  "Your quote here" \
+  PMC:3458566
+```
+
+### Digital Object Identifiers (DOI)
+
+```bash
+linkml-reference-validator validate text \
+  "Your quote here" \
+  DOI:10.1038/nature12373
+```
+
+### Local Files
+
+```bash
+# Markdown file
+linkml-reference-validator validate text \
+  "Your quote here" \
+  file:./references/paper1.md
+
+# Text file
+linkml-reference-validator validate text \
+  "Your quote here" \
+  file:./references/paper1.txt
+
+# HTML file
+linkml-reference-validator validate text \
+  "Your quote here" \
+  file:./references/paper1.html
+```
+
+### Web URLs
+
+```bash
+linkml-reference-validator validate text \
+  "Your quote here" \
+  url:https://example.org/article.html
+```
+
+## Integration with Existing Workflows
+
+### Pre-commit Hook
+
+Add validation to your git pre-commit:
+
+**.git/hooks/pre-commit:**
+```bash
+#!/bin/bash
+
+echo "Running reference validation..."
+
+linkml-reference-validator validate data \
+  data/*.yaml \
+  --schema schemas/schema.yaml \
+  --target-class Dataset
+
+if [ $? -ne 0 ]; then
+  echo "❌ Reference validation failed!"
+  echo "Run 'linkml-reference-validator repair data ...' to fix errors"
+  exit 1
+fi
+
+echo "✅ Reference validation passed!"
+```
+
+Make it executable:
+```bash
+chmod +x .git/hooks/pre-commit
+```
+
+### CI/CD Integration (GitHub Actions)
+
+**.github/workflows/validate.yml:**
+```yaml
+name: Validate References
+
+on: [push, pull_request]
+
+jobs:
+  validate:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+
+      - uses: actions/setup-python@v4
+        with:
+          python-version: '3.11'
+
+      - name: Install dependencies
+        run: |
+          pip install linkml-reference-validator
+
+      - name: Validate references
+        run: |
+          linkml-reference-validator validate data \
+            data/*.yaml \
+            --schema schemas/schema.yaml \
+            --target-class Dataset
+        env:
+          NCBI_EMAIL: ${{ secrets.NCBI_EMAIL }}
+          NCBI_API_KEY: ${{ secrets.NCBI_API_KEY }}
+```
+
+### Makefile Integration
+
+**Makefile:**
+```makefile
+.PHONY: validate repair repair-apply clean
+
+SCHEMA := schemas/schema.yaml
+DATA := data/*.yaml
+CLASS := Dataset
+
+validate:
+	linkml-reference-validator validate data \
+		$(DATA) \
+		--schema $(SCHEMA) \
+		--target-class $(CLASS)
+
+repair:
+	linkml-reference-validator repair data \
+		$(DATA) \
+		--schema $(SCHEMA) \
+		--target-class $(CLASS) \
+		--dry-run
+
+repair-apply:
+	linkml-reference-validator repair data \
+		$(DATA) \
+		--schema $(SCHEMA) \
+		--target-class $(CLASS) \
+		--no-dry-run
+
+clean:
+	rm -rf references_cache/
+```
+
+Usage:
+```bash
+make validate
+make repair
+make repair-apply
+```
+
+## Verification Checklist
+
+After setup, verify everything works:
+
+- [ ] Installation successful: `linkml-reference-validator --version`
+- [ ] Can fetch PubMed articles: `linkml-reference-validator cache reference PMID:16888623`
+- [ ] Can validate text: `linkml-reference-validator validate text "test" PMID:16888623`
+- [ ] Schema validates: `linkml-validate --schema schema.yaml data.yaml`
+- [ ] Reference validation works: `linkml-reference-validator validate data data.yaml --schema schema.yaml`
+- [ ] Cache directory created: `ls -l references_cache/`
+- [ ] Configuration file recognized: `linkml-reference-validator --help` shows config options
+
+## Next Steps
+
+Now that you're set up:
+
+1. **Read the Quickstart** - [quickstart.md](quickstart.md) for basic usage
+2. **Explore Tutorials** - Work through the Jupyter notebooks in `docs/notebooks/`
+3. **Learn Editorial Conventions** - [concepts/editorial-conventions.md](concepts/editorial-conventions.md) for using `[...]` and `...`
+4. **Review How-To Guides** - Specific recipes for common tasks
+5. **Check out the CLI Reference** - [reference/cli.md](reference/cli.md) for all commands
+
+## Getting Help
+
+If you encounter issues:
+
+1. **Check the documentation** - Most common questions are covered
+2. **Search existing issues** - https://github.com/linkml/linkml-reference-validator/issues
+3. **Ask for help** - Create a new issue with:
+   - Your command
+   - Expected behavior
+   - Actual behavior
+   - Schema and data samples (if applicable)
+4. **Join the community** - LinkML discussions on GitHub
+
+## Troubleshooting
+
+See the [Troubleshooting Guide](troubleshooting.md) for common issues and solutions.
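
Before working through individual fixes, a short script can gather the basic facts in one pass. The sketch below is a hypothetical helper (the name `run_basic_checks` is not part of the tool). It only checks things this guide already mentions: that the `linkml-reference-validator` command is on `PATH`, that `NCBI_EMAIL` is set, and that the cache directory is writable. It assumes the default cache location `references_cache/` and creates that directory if it is missing.

```python
import os
import shutil
import tempfile
from pathlib import Path


def run_basic_checks(cache_dir: str = "references_cache") -> dict[str, bool]:
    """Run a few environment checks and report each one as True or False."""
    results = {
        # Is the CLI discoverable on PATH?
        "cli_on_path": shutil.which("linkml-reference-validator") is not None,
        # Is the NCBI email environment variable set and non-empty?
        "ncbi_email_set": bool(os.environ.get("NCBI_EMAIL")),
    }
    # Writable cache: create the directory if needed, then try to create
    # a temporary file inside it (removed automatically on close).
    try:
        Path(cache_dir).mkdir(parents=True, exist_ok=True)
        with tempfile.NamedTemporaryFile(dir=cache_dir):
            pass
        results["cache_writable"] = True
    except OSError:
        results["cache_writable"] = False
    return results


if __name__ == "__main__":
    for name, ok in run_basic_checks().items():
        print(f"{'ok  ' if ok else 'FAIL'} {name}")
```

Any line reported as `FAIL` points to the matching entry under Quick Fixes.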
+
+### Quick Fixes
+
+**"Command not found: linkml-reference-validator"**
+```bash
+# Ensure it's installed
+pip install linkml-reference-validator
+
+# Check if it's in PATH
+which linkml-reference-validator
+
+# Or invoke it as a module if the command is not on PATH
+python -m linkml_reference_validator --help
+```
+
+**"Could not fetch reference: PMID:12345678"**
+```bash
+# Check internet connection
+ping www.ncbi.nlm.nih.gov
+
+# Verify PMID exists
+# Visit: https://pubmed.ncbi.nlm.nih.gov/12345678/
+
+# Set email for NCBI (required for API access)
+export NCBI_EMAIL="your.email@example.com"
+```
+
+**"Permission denied: references_cache/"**
+```bash
+# Check directory permissions
+ls -ld references_cache/
+
+# Create with proper permissions
+mkdir -p references_cache
+chmod 755 references_cache
+```
+
+**"Validation failed but text is in the paper"**
+- Check if only abstract was fetched (full text may be in PMC)
+- Use PMC ID instead: `PMC:3458566`
+- Or use a local file with full text: `file:./paper.md`
+- See [repair-validation-errors.md](how-to/repair-validation-errors.md)
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
new file mode 100644
index 0000000..1b28085
--- /dev/null
+++ b/docs/troubleshooting.md
@@ -0,0 +1,706 @@
+# Troubleshooting Guide
+
+This guide covers common issues and their solutions when using linkml-reference-validator.
+
+## Installation Issues
+
+### Command not found: linkml-reference-validator
+
+**Symptom:**
+```bash
+$ linkml-reference-validator --help
+bash: linkml-reference-validator: command not found
+```
+
+**Causes:**
+- Package not installed
+- Package installed but not in PATH
+- Using wrong Python environment
+
+**Solutions:**
+
+1. **Verify installation:**
+```bash
+pip list | grep linkml-reference-validator
+```
+
+2. **Reinstall if missing:**
+```bash
+pip install linkml-reference-validator
+```
+
+3. **Check if it's in PATH:**
+```bash
+which linkml-reference-validator
+```
+
+4. **Use module invocation:**
+```bash
+python -m linkml_reference_validator --help
+```
+
+5. **Check Python environment:**
+```bash
+# Show current Python
+which python
+python --version
+
+# If using virtual environment
+source venv/bin/activate  # Linux/Mac
+venv\Scripts\activate  # Windows
+```
+
+### ImportError: No module named 'linkml_reference_validator'
+
+**Symptom:**
+```python
+ImportError: No module named 'linkml_reference_validator'
+```
+
+**Solutions:**
+
+1. **Install in correct environment:**
+```bash
+# Check current environment
+python -c "import sys; print(sys.executable)"
+
+# Install in that environment
+python -m pip install linkml-reference-validator
+```
+
+2. **Verify installation:**
+```bash
+python -c "import linkml_reference_validator; print(linkml_reference_validator.__version__)"
+```
+
+### Version conflicts
+
+**Symptom:**
+```
+ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
+```
+
+**Solutions:**
+
+1. **Use uv (recommended):**
+```bash
+curl -LsSf https://astral.sh/uv/install.sh | sh
+uv pip install linkml-reference-validator
+```
+
+2. **Create fresh virtual environment:**
+```bash
+python -m venv fresh_env
+source fresh_env/bin/activate
+pip install --upgrade pip
+pip install linkml-reference-validator
+```
+
+3. **Use compatible versions:**
+```bash
+pip install linkml-reference-validator --upgrade
+```
+
+## Reference Fetching Issues
+
+### Could not fetch reference: PMID:XXXXXXXX
+
+**Symptom:**
+```
+Error: Could not fetch reference PMID:12345678
+Failed to retrieve reference content
+```
+
+**Causes:**
+- PMID doesn't exist
+- Network connectivity issues
+- NCBI API temporarily unavailable
+- Rate limiting
+- Missing NCBI email configuration
+
+**Solutions:**
+
+1. **Verify PMID exists:**
+   - Visit https://pubmed.ncbi.nlm.nih.gov/12345678/
+   - Check if the number is correct
+
+2. **Check network connectivity:**
+```bash
+ping www.ncbi.nlm.nih.gov
+curl -I https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
+```
+
+3. **Set NCBI email (required):**
+```bash
+export NCBI_EMAIL="your.email@example.com"
+```
+
+4. **Get NCBI API key for higher limits:**
+   - Visit https://www.ncbi.nlm.nih.gov/account/
+   - Generate API key
+   - Set environment variable:
+```bash
+export NCBI_API_KEY="your_api_key_here"
+```
+
+5. **Retry after delay:**
+```bash
+# Wait a moment and try again
+sleep 5
+linkml-reference-validator validate text "quote" PMID:12345678
+```
+
+6. **Check cache directory permissions:**
+```bash
+ls -ld references_cache/
+chmod 755 references_cache/
+```
+
+### No content available for reference
+
+**Symptom:**
+```
+Error: No content available for PMID:12345678
+Content type: unavailable
+```
+
+**Causes:**
+- Abstract not available
+- Article behind paywall (no PMC access)
+- Retracted article
+- Very old article
+- Article not yet indexed
+
+**Solutions:**
+
+1. **Try PMC version:**
+```bash
+# Search for PMC ID at https://www.ncbi.nlm.nih.gov/pmc/
+linkml-reference-validator validate text "quote" PMC:3458566
+```
+
+2. **Use DOI instead:**
+```bash
+linkml-reference-validator validate text "quote" DOI:10.1038/nature12373
+```
+
+3. **Use local file:**
+```bash
+# Save article content as markdown or text
+linkml-reference-validator validate text "quote" file:./papers/article.md
+```
+
+4. **Check cache file:**
+```bash
+# See what was actually fetched
+cat references_cache/PMID_12345678.md
+```
+
+### Rate limiting errors
+
+**Symptom:**
+```
+Error: Too many requests to NCBI API
+HTTP Error 429: Too Many Requests
+```
+
+**Solutions:**
+
+1. **Set NCBI API key:**
+```bash
+export NCBI_API_KEY="your_api_key"
+```
+- Without key: 3 requests/second
+- With key: 10 requests/second
+
+2. **Pre-cache references:**
+```bash
+# Cache all references before validation
+for pmid in PMID:111 PMID:222 PMID:333; do
+  linkml-reference-validator cache reference "$pmid"
+  sleep 1  # Add delay between requests
+done
+```
+
+3. **Use cached references:**
+```bash
+# If cache exists, no API call is made
+linkml-reference-validator validate text "quote" PMID:12345678 \
+  --cache-dir ./references_cache
+```
+
+## Validation Issues
+
+### Supporting text not found in reference
+
+**Symptom:**
+```
+Error: Supporting text not found in reference
+Text part not found as substring: "your quote here"
+```
+
+**Causes:**
+- Quote is paraphrased, not exact
+- Text only in figures/tables/supplementary materials
+- Text uses different terminology in reference
+- Unicode/character differences
+- Only abstract available (text in full text)
+
+**Solutions:**
+
+1. **Verify exact quote:**
+   - Open the PDF or HTML of the article
+   - Copy the exact text
+   - Check for character differences (O2 vs O₂, α vs alpha)
+
+2. **Check content type:**
+```bash
+linkml-reference-validator cache reference PMID:12345678
+# Look for "Content type: abstract_only" vs "full_text_xml"
+```
+
+3. **Try PMC for full text:**
+```bash
+# If only abstract was fetched
+linkml-reference-validator validate text "quote" PMC:3458566
+```
+
+4. **Use repair command:**
+```bash
+linkml-reference-validator repair text \
+  "your quote here" \
+  PMID:12345678
+```
+
+5. **Add editorial notes:**
+```yaml
+# If you need to clarify or modernize
+supporting_text: "protein [X] functions in cells"
+```
+
+6. **Use ellipsis for non-contiguous text:**
+```yaml
+supporting_text: "protein functions ... in cell regulation"
+```
+
+7. **Check normalization:**
+```python
+# Test what the text looks like after normalization
+from linkml_reference_validator.validation.supporting_text_validator import normalize_text
+
+text = "Your quote here"
+print(normalize_text(text))
+```
+
+### Query is empty after removing brackets
+
+**Symptom:**
+```
+Error: Query is empty after removing brackets
+Supporting text: "[editorial note]"
+```
+
+**Cause:**
+- Entire supporting_text is in brackets
+
+**Solution:**
+
+Include actual quote text:
+```yaml
+# Wrong
+supporting_text: "[sic]"
+
+# Correct
+supporting_text: "protein functions in cells [sic]"
+```
+
+### Title validation failed
+
+**Symptom:**
+```
+Error: Reference title mismatch
+Expected: "Study of Protein X"
+Actual: "Study of protein X function"
+```
+
+**Causes:**
+- Title in data doesn't match fetched title
+- Partial title provided
+- Capitalization differences
+
+**Solutions:**
+
+1. **Use exact title:**
+```bash
+# Fetch reference to see actual title
+linkml-reference-validator cache reference PMID:12345678
+head -n 20 references_cache/PMID_12345678.md
+```
+
+2. **Omit title if uncertain:**
+```yaml
+# Title validation is optional
+reference_id: PMID:12345678
+# Don't include reference_title if unsure
+supporting_text: "your quote"
+```
+
+3. **Understand title matching:**
+   - Titles must match completely (not substring)
+   - Case and punctuation are normalized
+   - But all words must match
+
+```yaml
+# These match (after normalization):
+reference_title: "Role of JAK1 in Cell-Signaling"
+actual_title: "Role of JAK1 in Cell Signaling"
+
+# These DON'T match (partial):
+reference_title: "Role of JAK1"
+actual_title: "Role of JAK1 in Cell Signaling"
+```
+
+## Schema Issues
+
+### No reference or supporting_text fields found
+
+**Symptom:**
+```
+Error: Could not find fields marked with linkml:authoritative_reference or linkml:excerpt
+```
+
+**Causes:**
+- Schema doesn't have required slot_uri markers
+- Using wrong field names
+- Schema not properly configured
+
+**Solutions:**
+
+1. **Add slot_uri markers:**
+```yaml
+classes:
+  Evidence:
+    attributes:
+      reference:
+        slot_uri: linkml:authoritative_reference  # Required
+      supporting_text:
+        slot_uri: linkml:excerpt  # Required
+```
+
+2. **Or use implements:**
+```yaml
+classes:
+  Evidence:
+    attributes:
+      reference:
+        implements:
+          - linkml:authoritative_reference
+      supporting_text:
+        implements:
+          - linkml:excerpt
+```
+
+3. **Or use standard field names:**
+   - `reference`, `reference_id`, `pmid` for references
+   - `supporting_text`, `excerpt`, `quote` for text
+
+### Schema validation errors
+
+**Symptom:**
+```
+LinkML schema validation failed
+```
+
+**Solutions:**
+
+1. **Validate schema separately:**
+```bash
+linkml-validate --schema schema.yaml schema.yaml
+```
+
+2. **Check required fields:**
+```yaml
+prefixes:
+  linkml: https://w3id.org/linkml/  # Must be defined
+
+classes:
+  MyClass:
+    tree_root: true  # At least one class needs this
+```
+
+3. **Fix common issues:**
+```yaml
+# Bad: missing range
+reference:
+  required: true
+
+# Good: includes range
+reference:
+  required: true
+  range: string
+```
+
+## Data Format Issues
+
+### YAML parsing errors
+
+**Symptom:**
+```
+yaml.scanner.ScannerError: mapping values are not allowed here
+```
+
+**Solutions:**
+
+1. **Check YAML syntax:**
+```bash
+# Use YAML validator
+python -c "import yaml; yaml.safe_load(open('data.yaml'))"
+```
+
+2. **Common YAML mistakes:**
+
+```yaml
+# Bad: missing quotes
+supporting_text: Text with: colon
+
+# Good: quoted
+supporting_text: "Text with: colon"
+
+# Bad: incorrect indentation
+evidence:
+  reference: PMID:123
+supporting_text: "text"
+
+# Good: proper indentation
+evidence:
+  reference: PMID:123
+  supporting_text: "text"
+```
+
+3. **Use YAML linter:**
+```bash
+pip install yamllint
+yamllint data.yaml
+```
+
+### Invalid reference ID format
+
+**Symptom:**
+```
+Error: Invalid reference ID format: "invalid_id"
+```
+
+**Solutions:**
+
+Use correct format:
+```yaml
+# Correct formats:
+reference_id: PMID:12345678
+reference_id: PMC:3458566
+reference_id: DOI:10.1038/nature12373
+reference_id: file:./path/to/file.md
+reference_id: url:https://example.org/article
+
+# Incorrect:
+reference_id: 12345678  # Missing PMID: prefix
+reference_id: www.example.org  # Missing url: prefix
+reference_id: ./file.md  # Missing file: prefix
+```
+
+## Performance Issues
+
+### Validation is very slow
+
+**Symptom:**
+Validation takes minutes instead of seconds
+
+**Causes:**
+- References not cached
+- Network latency
+- Large number of references
+- Fetching full text for each validation
+
+**Solutions:**
+
+1. **Pre-cache references:**
+```bash
+# Extract all PMIDs from data
+grep -r "PMID:" data/ | grep -o "PMID:[0-9]*" | sort -u > pmids.txt
+
+# Cache all
+while read -r pmid; do
+  linkml-reference-validator cache reference "$pmid"
+done < pmids.txt
+```
+
+2. **Use global cache:**
+```bash
+export REFERENCE_CACHE_DIR=~/.cache/linkml-reference-validator
+```
+
+3. **Use verbose mode to identify bottlenecks:**
+```bash
+linkml-reference-validator validate data data.yaml \
+  --schema schema.yaml \
+  --verbose
+```
+
+4. **Check cache hits:**
+```bash
+# Cached validations should be <100ms
+# First fetch will be 2-3 seconds
+```
+
+### Large cache directory
+
+**Symptom:**
+```bash
+du -sh references_cache/
+500M references_cache/
+```
+
+**Solutions:**
+
+1. **Clean old entries:**
+```bash
+# Remove cache entries older than 30 days
+find references_cache/ -name "*.md" -mtime +30 -delete
+```
+
+2. **Use selective caching:**
+```bash
+# Cache only what you need
+# Don't cache during experimentation
+```
+
+3. **Compress cache:**
+```bash
+tar -czf references_cache_backup.tar.gz references_cache/
+rm -rf references_cache/
+```
+
+## Common Error Messages
+
+### "Text normalization resulted in empty string"
+
+**Cause:**
+Text only contains punctuation or whitespace
+
+**Solution:**
+```yaml
+# Bad
+supporting_text: "..."
+
+# Good
+supporting_text: "text content ... more text"
+```
+
+### "Multiple reference fields found"
+
+**Cause:**
+Schema has multiple fields marked as authoritative_reference
+
+**Solution:**
+Only mark one field per class:
+```yaml
+# Bad
+attributes:
+  pmid:
+    slot_uri: linkml:authoritative_reference
+  doi:
+    slot_uri: linkml:authoritative_reference
+
+# Good - use one field that can hold different types
+attributes:
+  reference_id:
+    slot_uri: linkml:authoritative_reference
+```
+
+### "Reference base directory not found"
+
+**Cause:**
+Using `file:` references but base directory not configured
+
+**Solution:**
+```yaml
+# In .linkml-reference-validator.yaml
+validation:
+  reference_base_dir: ./references
+
+# Or use absolute paths
+reference_id: file:/full/path/to/file.md
+```
+
+## Getting More Help
+
+### Enable verbose logging
+
+```bash
+linkml-reference-validator validate text \
+  "quote" PMID:12345678 \
+  --verbose
+```
+
+### Check cache contents
+
+```bash
+# View cached reference
+cat references_cache/PMID_12345678.md
+
+# Check cache metadata
+head -n 20 references_cache/PMID_12345678.md
+```
+
+### Test with simple example
+
+```bash
+# Known working example
+linkml-reference-validator validate text \
+  "MUC1 oncoprotein blocks nuclear targeting of c-Abl" \
+  PMID:16888623
+```
+
+### Report bugs
+
+If you've found a bug:
+
+1. **Check existing issues:**
+   https://github.com/linkml/linkml-reference-validator/issues
+
+2. **Create minimal reproduction:**
+```bash
+# Simplest possible command that shows the issue
+linkml-reference-validator validate text "test" PMID:12345678 --verbose
+```
+
+3. **Include:**
+   - Command you ran
+   - Expected behavior
+   - Actual behavior
+   - Error messages (full output)
+   - Schema (if applicable)
+   - Data file (if applicable, minimal example)
+   - Version: `linkml-reference-validator --version`
+   - Python version: `python --version`
+   - OS: `uname -a` (Linux/Mac) or `ver` (Windows)
+
+## Quick Diagnostic Checklist
+
+Run through this checklist when encountering issues:
+
+- [ ] Installation successful: `linkml-reference-validator --version`
+- [ ] Network accessible: `ping www.ncbi.nlm.nih.gov`
+- [ ] NCBI email set: `echo $NCBI_EMAIL`
+- [ ] Cache directory writable: `touch references_cache/test && rm references_cache/test`
+- [ ] Schema valid: `linkml-validate --schema schema.yaml schema.yaml`
+- [ ] Data valid YAML: `python -c "import yaml; yaml.safe_load(open('data.yaml'))"`
+- [ ] Reference exists: Visit PubMed URL for the PMID
+- [ ] Simple test works: Validate known-good example
+
+## See Also
+
+- [Setup Guide](setup-guide.md) - Initial installation and configuration
+- [Quickstart](quickstart.md) - Basic usage examples
+- [CLI Reference](reference/cli.md) - Complete command documentation
+- [How to Repair Validation Errors](how-to/repair-validation-errors.md) - Fixing common issues
+- [GitHub Issues](https://github.com/linkml/linkml-reference-validator/issues) - Report bugs
diff --git a/docs/tutorials/complete-workflow.md b/docs/tutorials/complete-workflow.md
new file mode 100644
index 0000000..6263b6e
--- /dev/null
+++ b/docs/tutorials/complete-workflow.md
@@ -0,0 +1,900 @@
+# Complete Workflow Tutorial: Building a Validated Gene Annotation System
+
+This tutorial walks you through building a complete gene annotation validation system from scratch, using real examples and best practices.
+ +## What We'll Build + +A validated gene function annotation system that: +- Stores gene function claims with supporting text from publications +- Automatically validates that quotes match their cited sources +- Supports multiple reference types (PMID, DOI, PMC) +- Includes repair capabilities for common errors +- Can be integrated into a CI/CD pipeline + +**Time required:** 30-45 minutes + +## Prerequisites + +- Python 3.10+ installed +- Basic understanding of YAML +- Familiarity with command line +- (Optional) NCBI API key for higher rate limits + +## Step 1: Installation and Setup (5 minutes) + +### Install the Tool + +```bash +# Using pip +pip install linkml-reference-validator + +# Or using uv (faster) +curl -LsSf https://astral.sh/uv/install.sh | sh +uv pip install linkml-reference-validator +``` + +### Create Project Structure + +```bash +# Create project directory +mkdir gene-annotation-validator +cd gene-annotation-validator + +# Create subdirectories +mkdir -p schemas data references_cache tests + +# Verify installation +linkml-reference-validator --version +``` + +### Configure NCBI Access (Optional) + +```bash +# Set environment variables +export NCBI_EMAIL="your.email@example.com" + +# Test with a simple validation +linkml-reference-validator validate text \ + "MUC1 oncoprotein blocks nuclear targeting of c-Abl" \ + PMID:16888623 +``` + +Expected output: +``` +Validating text against PMID:16888623... +Result: + Valid: True + Message: Supporting text validated successfully in PMID:16888623 +``` + +## Step 2: Design Your Data Model (10 minutes) + +### Create the LinkML Schema + +We'll create a schema for gene function annotations with evidence from literature. 
+ +**schemas/gene_annotations.yaml:** +```yaml +id: https://example.org/gene-annotations +name: gene-annotations +description: Schema for validated gene function annotations + +prefixes: + linkml: https://w3id.org/linkml/ + dcterms: http://purl.org/dc/terms/ + biolink: https://w3id.org/biolink/vocab/ + +default_prefix: gene_annotations + +classes: + # Root container class + GeneAnnotationCollection: + tree_root: true + description: Collection of gene function annotations + attributes: + annotations: + multivalued: true + range: GeneAnnotation + description: List of gene annotations + + # Main annotation class + GeneAnnotation: + description: An annotation describing a gene's function with supporting evidence + attributes: + id: + identifier: true + required: true + description: Unique identifier for this annotation + + gene_symbol: + required: true + description: Official gene symbol (e.g., TP53, BRCA1) + pattern: "^[A-Z0-9]+$" + + gene_name: + description: Full gene name + + function_summary: + required: true + description: Brief summary of the gene's function + + function_category: + range: FunctionCategory + description: Broad categorization of gene function + + species: + range: Species + description: Species this annotation applies to + required: true + + evidence: + required: true + multivalued: true + range: Evidence + description: Supporting evidence from literature + + last_reviewed: + range: date + description: Date this annotation was last reviewed + + curator: + description: Person who created/reviewed this annotation + + # Evidence class with reference validation + Evidence: + description: Evidence supporting a gene function claim + attributes: + reference_id: + required: true + slot_uri: linkml:authoritative_reference + description: | + Reference identifier (PMID, PMC, DOI, or file path) + Examples: PMID:16888623, PMC:3458566, DOI:10.1038/nature12373 + + reference_title: + slot_uri: dcterms:title + description: Title of the referenced publication 
(validated if provided) + + supporting_text: + required: true + slot_uri: linkml:excerpt + description: | + Direct quote from the reference supporting the annotation. + Use [brackets] for editorial clarifications. + Use ... for omitted text between parts. + + evidence_type: + range: EvidenceType + description: Type of experimental evidence + + confidence: + range: ConfidenceLevel + description: Curator's confidence in this evidence + + notes: + description: Additional context or clarifications + +# Enumerations +enums: + FunctionCategory: + permissible_values: + TUMOR_SUPPRESSOR: + description: Prevents uncontrolled cell growth + ONCOGENE: + description: Promotes cell growth and division + DNA_REPAIR: + description: Repairs damaged DNA + TRANSCRIPTION_FACTOR: + description: Regulates gene expression + CELL_CYCLE_REGULATOR: + description: Controls cell cycle progression + KINASE: + description: Phosphorylates other proteins + PHOSPHATASE: + description: Removes phosphate groups + RECEPTOR: + description: Receives extracellular signals + SIGNALING: + description: Transmits cellular signals + + EvidenceType: + permissible_values: + EXPERIMENTAL: + description: Direct experimental evidence + COMPUTATIONAL: + description: Computational prediction or inference + LITERATURE: + description: Statement from literature without original data + CURATOR_INFERENCE: + description: Inferred by curator from related evidence + + ConfidenceLevel: + permissible_values: + HIGH: + description: Strong, consistent evidence + MEDIUM: + description: Good evidence but some uncertainty + LOW: + description: Limited or conflicting evidence + + Species: + permissible_values: + HUMAN: + description: Homo sapiens + MOUSE: + description: Mus musculus + RAT: + description: Rattus norvegicus + YEAST: + description: Saccharomyces cerevisiae +``` + +### Understanding the Schema + +Key elements: +- **`slot_uri: linkml:excerpt`** - Marks `supporting_text` for validation +- **`slot_uri: 
linkml:authoritative_reference`** - Marks `reference_id` as the reference +- **`slot_uri: dcterms:title`** - Optionally validates reference titles +- **Enumerations** - Controlled vocabularies for consistency +- **Required fields** - Ensures data completeness + +## Step 3: Create Sample Data (10 minutes) + +### Example 1: Simple Annotation + +**data/tp53_annotation.yaml:** +```yaml +annotations: + - id: ANN001 + gene_symbol: TP53 + gene_name: Tumor protein p53 + function_summary: Regulates cell cycle and acts as tumor suppressor + function_category: TUMOR_SUPPRESSOR + species: HUMAN + curator: Jane Doe + last_reviewed: 2024-01-15 + + evidence: + - reference_id: PMID:16888623 + reference_title: MUC1 oncoprotein blocks nuclear targeting of c-Abl + supporting_text: "MUC1 oncoprotein blocks nuclear targeting of c-Abl" + evidence_type: EXPERIMENTAL + confidence: HIGH +``` + +### Example 2: Multiple Evidence Items + +**data/brca1_annotation.yaml:** +```yaml +annotations: + - id: ANN002 + gene_symbol: BRCA1 + gene_name: Breast cancer type 1 susceptibility protein + function_summary: Critical role in DNA repair and tumor suppression + function_category: DNA_REPAIR + species: HUMAN + curator: John Smith + last_reviewed: 2024-02-20 + + evidence: + # Evidence 1: DNA repair function + - reference_id: PMID:12345678 + supporting_text: "BRCA1 plays a critical role in DNA double-strand break repair" + evidence_type: EXPERIMENTAL + confidence: HIGH + notes: Direct experimental demonstration + + # Evidence 2: Tumor suppressor function + - reference_id: PMID:23456789 + supporting_text: "BRCA1 functions as a tumor suppressor ... 
maintaining genomic stability" + evidence_type: EXPERIMENTAL + confidence: HIGH + notes: Used ellipsis to connect non-contiguous parts + + # Evidence 3: Using editorial notes + - reference_id: PMC:3458566 + supporting_text: "BRCA1 [breast cancer type 1] is involved in homologous recombination" + evidence_type: LITERATURE + confidence: MEDIUM + notes: Added gene name clarification in brackets +``` + +### Example 3: Mixed Reference Types + +**data/multi_gene_annotations.yaml:** +```yaml +annotations: + - id: ANN003 + gene_symbol: EGFR + gene_name: Epidermal growth factor receptor + function_summary: Receptor tyrosine kinase involved in cell proliferation + function_category: RECEPTOR + species: HUMAN + curator: Jane Doe + + evidence: + # Using DOI + - reference_id: DOI:10.1038/nature12373 + supporting_text: "EGFR is a receptor tyrosine kinase" + evidence_type: EXPERIMENTAL + confidence: HIGH + + # Using local file + - reference_id: file:./references/egfr_review.md + supporting_text: "EGFR mutations are found in many cancers" + evidence_type: LITERATURE + confidence: MEDIUM + notes: From local review article + + - id: ANN004 + gene_symbol: JAK1 + gene_name: Janus kinase 1 + function_summary: Tyrosine kinase in cytokine signaling + function_category: KINASE + species: HUMAN + curator: John Smith + + evidence: + # Using URL + - reference_id: url:https://example.org/jak1-article.html + supporting_text: "JAK1 is a key mediator of cytokine signaling" + evidence_type: LITERATURE + confidence: MEDIUM +``` + +## Step 4: Validate Your Data (10 minutes) + +### Basic Validation + +```bash +# Validate single file +linkml-reference-validator validate data \ + data/tp53_annotation.yaml \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection + +# Expected output: +# Validating data/tp53_annotation.yaml... +# ✓ All validations passed! 
+``` + +### Verbose Validation + +```bash +# See detailed validation info +linkml-reference-validator validate data \ + data/brca1_annotation.yaml \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection \ + --verbose + +# Shows: +# - Each reference being validated +# - What text is being searched for +# - Whether full text or abstract was used +# - Validation results for each item +``` + +### Batch Validation + +```bash +# Validate all files in data directory +for file in data/*.yaml; do + echo "Validating $file..." + linkml-reference-validator validate data \ + "$file" \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection +done +``` + +## Step 5: Handle Validation Errors (10 minutes) + +### Scenario 1: Character Encoding Issues + +Create a file with common encoding issues: + +**data/error_example1.yaml:** +```yaml +annotations: + - id: ANN005 + gene_symbol: TEST1 + function_summary: Test gene for CO2 transport + function_category: SIGNALING + species: HUMAN + + evidence: + - reference_id: PMID:16888623 + # This will fail: ASCII "O2" instead of subscript + supporting_text: "protein involved in O2 transport" + evidence_type: EXPERIMENTAL + confidence: HIGH +``` + +Validate and repair: + +```bash +# First validate to see the error +linkml-reference-validator validate data \ + data/error_example1.yaml \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection + +# Use repair to fix (dry run first) +linkml-reference-validator repair data \ + data/error_example1.yaml \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection \ + --dry-run + +# Review the suggested fixes, then apply +linkml-reference-validator repair data \ + data/error_example1.yaml \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection \ + --no-dry-run +``` + +### Scenario 2: Missing Ellipsis + +**data/error_example2.yaml:** +```yaml +annotations: + - 
id: ANN006 + gene_symbol: TEST2 + function_summary: Test gene + function_category: SIGNALING + species: HUMAN + + evidence: + - reference_id: PMID:16888623 + # This will fail: missing "..." between non-contiguous parts + supporting_text: "MUC1 oncoprotein blocks c-Abl" + evidence_type: EXPERIMENTAL + confidence: HIGH +``` + +The repair command will suggest adding ellipsis: +``` +Suggested fix (MEDIUM confidence): + "MUC1 oncoprotein blocks c-Abl" → "MUC1 oncoprotein ... blocks ... c-Abl" +``` + +### Scenario 3: Text Not in Reference + +**data/error_example3.yaml:** +```yaml +annotations: + - id: ANN007 + gene_symbol: TEST3 + function_summary: Test gene + function_category: SIGNALING + species: HUMAN + + evidence: + - reference_id: PMID:16888623 + # This will fail: text doesn't exist in reference + supporting_text: "completely fabricated text that doesn't exist" + evidence_type: EXPERIMENTAL + confidence: HIGH +``` + +The repair command will flag for removal: +``` +RECOMMENDED REMOVALS (low confidence): + PMID:16888623 at evidence[0]: + Similarity: 5% + Snippet: 'completely fabricated text that doesn't exist' + Action: Remove or find correct reference +``` + +## Step 6: Create Configuration File (5 minutes) + +Create a project configuration: + +**.linkml-reference-validator.yaml:** +```yaml +# Validation configuration +validation: + cache_dir: ./references_cache + + # Custom prefix mappings + reference_prefix_map: + pubmed: PMID + pmc: PMC + doi: DOI + + # Base directory for file:// references + reference_base_dir: ./references + +# Repair configuration +repair: + # Confidence thresholds + auto_fix_threshold: 0.95 + suggest_threshold: 0.80 + removal_threshold: 0.50 + + # Character normalization + character_mappings: + "O2": "O₂" + "CO2": "CO₂" + "H2O": "H₂O" + "N2": "N₂" + "+/-": "±" + "alpha": "α" + "beta": "β" + "gamma": "γ" + + # Skip references with known issues + skip_references: [] + + # Trusted references (manually verified) + trusted_low_similarity: [] +``` 
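Before relying on the repair thresholds, it can help to sanity-check them. The sketch below is a hypothetical helper, not part of the linkml-reference-validator API: `check_thresholds` and `bucket` are illustrative names, and the mapping from similarity score to action is an assumed reading of the three threshold names in the config above. The values are copied from the `repair:` section.

```python
# Hypothetical helper: NOT part of linkml-reference-validator's API.
# Values mirror the `repair:` section of the example config.
REPAIR_CONFIG = {
    "auto_fix_threshold": 0.95,
    "suggest_threshold": 0.80,
    "removal_threshold": 0.50,
}

THRESHOLD_KEYS = ["auto_fix_threshold", "suggest_threshold", "removal_threshold"]


def check_thresholds(repair: dict) -> list[str]:
    """Return a list of problems found in the repair thresholds."""
    problems = []
    for key in THRESHOLD_KEYS:
        value = repair.get(key)
        if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
            problems.append(f"{key} must be a number between 0 and 1, got {value!r}")
    if not problems:
        auto_fix, suggest, removal = (repair[k] for k in THRESHOLD_KEYS)
        # Assumption: thresholds are meant to be ordered from strict to lenient.
        if not auto_fix >= suggest >= removal:
            problems.append(
                "expected auto_fix_threshold >= suggest_threshold >= removal_threshold"
            )
    return problems


def bucket(similarity: float, repair: dict) -> str:
    """Assumed interpretation: map a similarity score (0 to 1) to an action."""
    if similarity >= repair["auto_fix_threshold"]:
        return "auto-fix"
    if similarity >= repair["suggest_threshold"]:
        return "suggest"
    if similarity < repair["removal_threshold"]:
        return "recommend-removal"
    return "manual-review"


print(check_thresholds(REPAIR_CONFIG))
for score in (0.97, 0.85, 0.60, 0.05):
    print(score, bucket(score, REPAIR_CONFIG))
```

To run the same check against the real file, the dictionary could be loaded with `yaml.safe_load(open(".linkml-reference-validator.yaml"))["repair"]` (PyYAML assumed installed). How the tool itself treats scores between `removal_threshold` and `suggest_threshold` is not stated in this guide, so the `manual-review` bucket is a placeholder.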
+ +Use the configuration: + +```bash +linkml-reference-validator validate data \ + data/*.yaml \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection \ + --config .linkml-reference-validator.yaml +``` + +## Step 7: Integrate with Version Control (5 minutes) + +### Create Git Pre-commit Hook + +**.git/hooks/pre-commit:** +```bash +#!/bin/bash + +echo "🔍 Validating gene annotations..." + +# Validate all data files +for file in data/*.yaml; do + if [ -f "$file" ]; then + echo " Checking $file..." + + linkml-reference-validator validate data \ + "$file" \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection \ + --config .linkml-reference-validator.yaml + + if [ $? -ne 0 ]; then + echo "❌ Validation failed for $file" + echo "" + echo "To fix errors, run:" + echo " linkml-reference-validator repair data $file --schema schemas/gene_annotations.yaml --target-class GeneAnnotationCollection --dry-run" + exit 1 + fi + fi +done + +echo "✅ All validations passed!" +exit 0 +``` + +Make it executable: +```bash +chmod +x .git/hooks/pre-commit +``` + +### Create Makefile + +**Makefile:** +```makefile +.PHONY: validate validate-verbose repair repair-apply clean test + +SCHEMA := schemas/gene_annotations.yaml +DATA_DIR := data +CONFIG := .linkml-reference-validator.yaml +TARGET_CLASS := GeneAnnotationCollection + +# Validate all data files +validate: + @echo "Validating all annotations..." + @for file in $(DATA_DIR)/*.yaml; do \ + echo "Checking $$file..."; \ + linkml-reference-validator validate data \ + $$file \ + --schema $(SCHEMA) \ + --target-class $(TARGET_CLASS) \ + --config $(CONFIG) || exit 1; \ + done + @echo "✅ All validations passed!" 
+ +# Validate with verbose output +validate-verbose: + @for file in $(DATA_DIR)/*.yaml; do \ + echo "Checking $$file..."; \ + linkml-reference-validator validate data \ + $$file \ + --schema $(SCHEMA) \ + --target-class $(TARGET_CLASS) \ + --config $(CONFIG) \ + --verbose; \ + done + +# Show suggested repairs (dry run) +repair: + @for file in $(DATA_DIR)/*.yaml; do \ + echo "Checking repairs for $$file..."; \ + linkml-reference-validator repair data \ + $$file \ + --schema $(SCHEMA) \ + --target-class $(TARGET_CLASS) \ + --config $(CONFIG) \ + --dry-run; \ + done + +# Apply repairs +repair-apply: + @for file in $(DATA_DIR)/*.yaml; do \ + echo "Applying repairs to $$file..."; \ + linkml-reference-validator repair data \ + $$file \ + --schema $(SCHEMA) \ + --target-class $(TARGET_CLASS) \ + --config $(CONFIG) \ + --no-dry-run; \ + done + +# Clean cache +clean: + rm -rf references_cache/ + +# Run tests +test: validate + @echo "Running tests..." + @python -m pytest tests/ -v +``` + +Usage: +```bash +make validate # Validate all files +make validate-verbose # Verbose output +make repair # Show suggested repairs +make repair-apply # Apply repairs +make clean # Clear cache +``` + +## Step 8: CI/CD Integration + +### GitHub Actions + +**.github/workflows/validate-annotations.yml:** +```yaml +name: Validate Gene Annotations + +on: + push: + branches: [ main, develop ] + paths: + - 'data/**.yaml' + - 'schemas/**.yaml' + pull_request: + branches: [ main ] + paths: + - 'data/**.yaml' + - 'schemas/**.yaml' + +jobs: + validate: + runs-on: ubuntu-latest + + steps: + - name: Checkout code + uses: actions/checkout@v3 + + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: '3.11' + + - name: Install dependencies + run: | + pip install linkml-reference-validator + + - name: Cache references + uses: actions/cache@v3 + with: + path: references_cache + key: ${{ runner.os }}-references-${{ hashFiles('data/**/*.yaml') }} + restore-keys: | + ${{ runner.os 
}}-references- + + - name: Validate annotations + run: | + make validate + env: + NCBI_EMAIL: ${{ secrets.NCBI_EMAIL }} + NCBI_API_KEY: ${{ secrets.NCBI_API_KEY }} + + - name: Upload cache artifacts + if: always() + uses: actions/upload-artifact@v4 + with: + name: references-cache + path: references_cache/ + retention-days: 30 +``` + +## Step 9: Testing and Quality Assurance + +### Create Test Files + +**tests/test_validation.py:** +```python +#!/usr/bin/env python3 +"""Test suite for gene annotation validation.""" + +import subprocess +import yaml +from pathlib import Path + +DATA_DIR = Path("data") +SCHEMA = Path("schemas/gene_annotations.yaml") +TARGET_CLASS = "GeneAnnotationCollection" + +def test_schema_valid(): + """Test that schema itself is valid.""" + result = subprocess.run( + ["linkml-validate", "--schema", str(SCHEMA), str(SCHEMA)], + capture_output=True, + text=True + ) + assert result.returncode == 0, f"Schema validation failed: {result.stderr}" + +def test_all_data_files_valid(): + """Test that all data files validate against schema.""" + for data_file in DATA_DIR.glob("*.yaml"): + if "error" in data_file.name: + continue # Skip error example files + + print(f"Testing {data_file}...") + result = subprocess.run( + [ + "linkml-reference-validator", "validate", "data", + str(data_file), + "--schema", str(SCHEMA), + "--target-class", TARGET_CLASS + ], + capture_output=True, + text=True + ) + assert result.returncode == 0, \ + f"Validation failed for {data_file}: {result.stderr}" + +def test_data_completeness(): + """Test that all required fields are present.""" + for data_file in DATA_DIR.glob("*.yaml"): + if "error" in data_file.name: + continue + + with open(data_file) as f: + data = yaml.safe_load(f) + + # Check each annotation + for ann in data.get("annotations", []): + assert "id" in ann, f"Missing id in {data_file}" + assert "gene_symbol" in ann, f"Missing gene_symbol in {data_file}" + assert "evidence" in ann, f"Missing evidence in {data_file}" + 
+ # Check each evidence item + for ev in ann["evidence"]: + assert "reference_id" in ev, f"Missing reference_id in {data_file}" + assert "supporting_text" in ev, f"Missing supporting_text in {data_file}" + +if __name__ == "__main__": + test_schema_valid() + test_all_data_files_valid() + test_data_completeness() + print("✅ All tests passed!") +``` + +Run tests: +```bash +python tests/test_validation.py +``` + +## Step 10: Documentation and Maintenance + +### Create README + +**README.md:** +```markdown +# Gene Annotation Validation System + +Validated gene function annotations with supporting evidence from literature. + +## Quick Start + +```bash +# Validate all annotations +make validate + +# Add new annotation +cp templates/annotation_template.yaml data/new_gene.yaml +# Edit data/new_gene.yaml with your annotation +make validate + +# Repair validation errors +make repair +``` + +## Directory Structure + +``` +. +├── schemas/ +│ └── gene_annotations.yaml # LinkML schema +├── data/ +│ ├── tp53_annotation.yaml # Gene annotations +│ └── ... +├── references_cache/ # Cached references +├── tests/ +│ └── test_validation.py # Test suite +├── .linkml-reference-validator.yaml # Config +└── Makefile # Build commands +``` + +## Contributing + +1. Create new annotation file in `data/` +2. Validate: `make validate` +3. Fix any errors: `make repair` +4. 
Commit and push (pre-commit hook will validate) +``` + +### Create Template + +**templates/annotation_template.yaml:** +```yaml +annotations: + - id: ANN_XXX # Replace with unique ID + gene_symbol: GENE_SYMBOL # Official gene symbol + gene_name: Full Gene Name + function_summary: Brief summary of function + function_category: CATEGORY # See schema for options + species: HUMAN # Or MOUSE, RAT, YEAST + curator: Your Name + last_reviewed: YYYY-MM-DD + + evidence: + - reference_id: PMID:XXXXXXXX # Or DOI:, PMC:, file:, url: + reference_title: Article title (optional but recommended) + supporting_text: "Direct quote from the reference" + evidence_type: EXPERIMENTAL # Or COMPUTATIONAL, LITERATURE, CURATOR_INFERENCE + confidence: HIGH # Or MEDIUM, LOW + notes: Additional context (optional) +``` + +## Summary + +You've now built a complete gene annotation validation system! You've learned: + +- ✅ How to install and configure linkml-reference-validator +- ✅ How to design a LinkML schema with validation markers +- ✅ How to create validated data files +- ✅ How to validate and repair data +- ✅ How to integrate validation into your workflow +- ✅ How to set up CI/CD for automatic validation +- ✅ How to write tests for your validation system + +## Next Steps + +1. **Expand your schema** - Add more gene attributes, relationships, or evidence types +2. **Import existing data** - Convert existing annotations to your new format +3. **Integrate with databases** - Export validated data to SQL, MongoDB, or RDF +4. **Build a web interface** - Create a UI for curators to add/edit annotations +5. 
**Set up monitoring** - Track validation success rates and common error patterns + +## Additional Resources + +- [linkml-reference-validator Documentation](https://linkml.github.io/linkml-reference-validator/) +- [LinkML Schema Language](https://linkml.io/) +- [PubMed E-utilities API](https://www.ncbi.nlm.nih.gov/books/NBK25501/) +- [Crossref API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) diff --git a/mkdocs.yml b/mkdocs.yml index 9ad2e88..8667c51 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -24,8 +24,10 @@ plugins: nav: - Home: index.md + - Setup Guide: setup-guide.md - Quickstart: quickstart.md - Tutorials: + - Complete Workflow: tutorials/complete-workflow.md - Getting Started (CLI): notebooks/01_getting_started.ipynb - Advanced Usage (CLI): notebooks/02_advanced_usage.ipynb - Validating OBO Files (CLI): notebooks/04_obo_validation.ipynb @@ -47,6 +49,7 @@ nav: - Editorial Conventions: concepts/editorial-conventions.md - Reference: - CLI Reference: reference/cli.md + - Troubleshooting: troubleshooting.md - Roadmap: todo.md exclude_docs: |