diff --git a/docs/setup-guide.md b/docs/setup-guide.md new file mode 100644 index 0000000..20ce064 --- /dev/null +++ b/docs/setup-guide.md @@ -0,0 +1,699 @@ +# Complete Setup Guide + +This guide walks you through setting up linkml-reference-validator from scratch, with complete examples for different use cases. + +## Prerequisites + +### System Requirements + +- **Python 3.10 or higher** - Check with `python --version` +- **pip or uv** - Package installer (uv is faster and recommended) +- **Internet connection** - For fetching references from PubMed, Crossref, etc. + +### Optional but Recommended + +- **NCBI API Key** - For higher rate limits when fetching PubMed articles +- **Git** - For version control of your data and schemas + +## Installation + +### Option 1: Using pip (Standard) + +```bash +pip install linkml-reference-validator +``` + +Verify the installation: + +```bash +linkml-reference-validator --version +``` + +### Option 2: Using uv (Recommended for Speed) + +[uv](https://github.com/astral-sh/uv) is a fast Python package installer: + +```bash +# Install uv first (if you don't have it) +curl -LsSf https://astral.sh/uv/install.sh | sh + +# Install linkml-reference-validator +uv pip install linkml-reference-validator +``` + +Verify: + +```bash +uv run linkml-reference-validator --version +``` + +### Option 3: Development Installation + +If you want to contribute or modify the code: + +```bash +# Clone the repository +git clone https://github.com/linkml/linkml-reference-validator.git +cd linkml-reference-validator + +# Install with development dependencies +uv sync --group dev + +# Run tests to verify +just test +``` + +## Initial Configuration + +### 1. Set Up Your Workspace + +Create a directory for your validation project: + +```bash +mkdir my-validation-project +cd my-validation-project +``` + +### 2. Configure NCBI Access (Optional but Recommended) + +To avoid rate limits when fetching PubMed articles: + +1. 
**Get an NCBI API Key** (free): + - Visit https://www.ncbi.nlm.nih.gov/account/ + - Sign up or log in + - Go to Settings → API Key Management + - Create a new API key + +2. **Set environment variables**: + +```bash +# Add to your ~/.bashrc, ~/.zshrc, or ~/.profile +export NCBI_EMAIL="your.email@example.com" +export NCBI_API_KEY="your_api_key_here" + +# Or create a .env file in your project +echo 'NCBI_EMAIL=your.email@example.com' >> .env +echo 'NCBI_API_KEY=your_api_key_here' >> .env +``` + +3. **Test the configuration**: + +```bash +linkml-reference-validator validate text \ + "MUC1 oncoprotein blocks nuclear targeting of c-Abl" \ + PMID:16888623 +``` + +If successful, you should see: +``` +Validating text against PMID:16888623... + Text: MUC1 oncoprotein blocks nuclear targeting of c-Abl + +Result: + Valid: True + Message: Supporting text validated successfully in PMID:16888623 +``` + +### 3. Set Up Cache Directory + +By default, references are cached in `references_cache/` in your current directory. To use a global cache: + +```bash +# Create a global cache directory +mkdir -p ~/.cache/linkml-reference-validator + +# Set environment variable +export REFERENCE_CACHE_DIR=~/.cache/linkml-reference-validator + +# Or add to your shell profile +echo 'export REFERENCE_CACHE_DIR=~/.cache/linkml-reference-validator' >> ~/.bashrc +``` + +Benefits of a global cache: +- Share references across multiple projects +- Avoid re-downloading the same papers +- Faster validation when working on multiple datasets + +## Quick Start Examples + +### Example 1: Validate a Single Quote + +The simplest use case - verify that a quote appears in a paper: + +```bash +linkml-reference-validator validate text \ + "TP53 functions as a tumor suppressor" \ + PMID:12345678 +``` + +**What happens:** +1. Fetches the reference from PubMed +2. Caches it locally in `references_cache/PMID_12345678.md` +3. Searches for the quote in the reference content +4. 
Returns validation result + +### Example 2: Create Your First Schema + +Create a schema file to define your data structure: + +**schema.yaml:** +```yaml +id: https://example.org/gene-validation +name: gene-validation-schema + +prefixes: + linkml: https://w3id.org/linkml/ + +classes: + GeneAnnotation: + tree_root: true + attributes: + gene_symbol: + required: true + description: Gene symbol (e.g., TP53, BRCA1) + + function: + required: true + description: Functional description of the gene + + supporting_text: + required: true + description: Quote from the reference supporting this annotation + slot_uri: linkml:excerpt + + reference_id: + required: true + description: PubMed ID or DOI of the reference + slot_uri: linkml:authoritative_reference +``` + +**Key points:** +- `slot_uri: linkml:excerpt` marks the field containing quoted text +- `slot_uri: linkml:authoritative_reference` marks the reference identifier +- These special URIs tell the validator which fields to check + +### Example 3: Create Your First Data File + +Create a data file matching your schema: + +**gene_data.yaml:** +```yaml +gene_symbol: TP53 +function: Tumor suppressor that regulates cell cycle +supporting_text: TP53 functions as a tumor suppressor through regulation of cell cycle arrest +reference_id: PMID:12345678 +``` + +### Example 4: Validate Your Data + +```bash +linkml-reference-validator validate data \ + gene_data.yaml \ + --schema schema.yaml \ + --target-class GeneAnnotation +``` + +**Expected output (if valid):** +``` +Validating gene_data.yaml against schema schema.yaml +Cache directory: references_cache + +Validating 1 object(s) of type GeneAnnotation... +✓ All validations passed! +``` + +**Output if validation fails:** +``` +Validating gene_data.yaml against schema schema.yaml +Cache directory: references_cache + +Validating 1 object(s) of type GeneAnnotation... 
+✗ Validation failed for: + Reference: PMID:12345678 + Supporting text: "TP53 functions as a tumor suppressor through regulation of cell cycle arrest" + Error: Text not found in reference + +1 validation(s) failed, 0 passed +``` + +## Real-World Example: Validating Gene Functions + +Let's work through a complete real-world example: validating gene function annotations. + +### Step 1: Project Setup + +```bash +# Create project directory +mkdir gene-annotations +cd gene-annotations + +# Create subdirectories +mkdir schemas +mkdir data +mkdir references_cache +``` + +### Step 2: Create the Schema + +**schemas/gene_function_schema.yaml:** +```yaml +id: https://example.org/gene-functions +name: gene-functions + +prefixes: + linkml: https://w3id.org/linkml/ + +classes: + GeneFunctionDataset: + tree_root: true + attributes: + genes: + multivalued: true + range: GeneFunction + + GeneFunction: + attributes: + gene_symbol: + identifier: true + required: true + description: Official gene symbol + + function_category: + required: true + description: Broad category of function + range: FunctionCategory + + detailed_function: + required: true + description: Detailed description of function + + evidence: + required: true + range: Evidence + description: Supporting evidence from literature + + Evidence: + attributes: + reference_id: + required: true + slot_uri: linkml:authoritative_reference + description: PMID, DOI, or PMC identifier + + supporting_text: + required: true + slot_uri: linkml:excerpt + description: Direct quote from the reference + + notes: + description: Additional context or clarifications + +enums: + FunctionCategory: + permissible_values: + TUMOR_SUPPRESSOR: + description: Prevents uncontrolled cell growth + ONCOGENE: + description: Promotes cell growth and division + DNA_REPAIR: + description: Repairs damaged DNA + TRANSCRIPTION_FACTOR: + description: Regulates gene expression + CELL_CYCLE_REGULATOR: + description: Controls cell cycle progression +``` + +### 
Step 3: Create Sample Data + +**data/tp53_brca1.yaml:** +```yaml +genes: + - gene_symbol: TP53 + function_category: TUMOR_SUPPRESSOR + detailed_function: Regulates cell cycle arrest and apoptosis in response to DNA damage + evidence: + reference_id: PMID:16888623 + supporting_text: "MUC1 oncoprotein blocks nuclear targeting of c-Abl" + notes: Example from actual paper + + - gene_symbol: BRCA1 + function_category: DNA_REPAIR + detailed_function: Critical role in homologous recombination DNA repair + evidence: + reference_id: PMID:12345678 + supporting_text: "BRCA1 plays a critical role in DNA double-strand break repair through homologous recombination" +``` + +### Step 4: Validate + +```bash +linkml-reference-validator validate data \ + data/tp53_brca1.yaml \ + --schema schemas/gene_function_schema.yaml \ + --target-class GeneFunctionDataset \ + --verbose +``` + +### Step 5: Handle Validation Errors + +If validation fails, use the repair command: + +```bash +# First, see what repairs are suggested (dry run) +linkml-reference-validator repair data \ + data/tp53_brca1.yaml \ + --schema schemas/gene_function_schema.yaml \ + --target-class GeneFunctionDataset \ + --dry-run + +# Review the suggested repairs, then apply if appropriate +linkml-reference-validator repair data \ + data/tp53_brca1.yaml \ + --schema schemas/gene_function_schema.yaml \ + --target-class GeneFunctionDataset \ + --no-dry-run +``` + +## Advanced Configuration + +### Project Configuration File + +Create `.linkml-reference-validator.yaml` in your project root: + +```yaml +# Validation settings +validation: + # Cache directory (relative to config file or absolute) + cache_dir: ./references_cache + + # Custom reference prefix mappings + reference_prefix_map: + geo: GEO + NCBIGeo: GEO + pubmed: PMID + + # Base directory for resolving file:// references + reference_base_dir: ./references + +# Repair settings +repair: + # Confidence thresholds + auto_fix_threshold: 0.95 + suggest_threshold: 0.80 + 
removal_threshold: 0.50 + + # Character normalization mappings + character_mappings: + "CO2": "CO₂" + "H2O": "H₂O" + "O2": "O₂" + "+/-": "±" + "+-": "±" + + # Skip certain references + skip_references: + - "PMID:00000000" # Example: no abstract available + + # Trust low-similarity matches (manually verified) + trusted_low_similarity: + - "PMID:99999999" # Example: verified manually +``` + +Use the config file: + +```bash +linkml-reference-validator validate data \ + data.yaml \ + --schema schema.yaml \ + --config .linkml-reference-validator.yaml +``` + +### Using Environment Variables + +Create a `.env` file: + +```bash +# NCBI Configuration +NCBI_EMAIL=your.email@example.com +NCBI_API_KEY=your_api_key_here + +# Cache Configuration +REFERENCE_CACHE_DIR=/path/to/global/cache + +# Rate Limiting (requests per second) +NCBI_RATE_LIMIT=3 +CROSSREF_RATE_LIMIT=2 +``` + +Load environment variables: + +```bash +# Using direnv (recommended) +echo 'dotenv' > .envrc +direnv allow + +# Or manually source +set -a +source .env +set +a +``` + +## Working with Different Reference Types + +### PubMed IDs (PMID) + +```bash +linkml-reference-validator validate text \ + "Your quote here" \ + PMID:16888623 +``` + +### PubMed Central (PMC) + +For full-text access: + +```bash +linkml-reference-validator validate text \ + "Your quote here" \ + PMC:3458566 +``` + +### Digital Object Identifiers (DOI) + +```bash +linkml-reference-validator validate text \ + "Your quote here" \ + DOI:10.1038/nature12373 +``` + +### Local Files + +```bash +# Markdown file +linkml-reference-validator validate text \ + "Your quote here" \ + file:./references/paper1.md + +# Text file +linkml-reference-validator validate text \ + "Your quote here" \ + file:./references/paper1.txt + +# HTML file +linkml-reference-validator validate text \ + "Your quote here" \ + file:./references/paper1.html +``` + +### Web URLs + +```bash +linkml-reference-validator validate text \ + "Your quote here" \ + 
url:https://example.org/article.html +``` + +## Integration with Existing Workflows + +### Pre-commit Hook + +Add validation to your git pre-commit: + +**.git/hooks/pre-commit:** +```bash +#!/bin/bash + +echo "Running reference validation..." + +linkml-reference-validator validate data \ + data/*.yaml \ + --schema schemas/schema.yaml \ + --target-class Dataset + +if [ $? -ne 0 ]; then + echo "❌ Reference validation failed!" + echo "Run 'linkml-reference-validator repair data ...' to fix errors" + exit 1 +fi + +echo "✅ Reference validation passed!" +``` + +Make it executable: +```bash +chmod +x .git/hooks/pre-commit +``` + +### CI/CD Integration (GitHub Actions) + +**.github/workflows/validate.yml:** +```yaml +name: Validate References + +on: [push, pull_request] + +jobs: + validate: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + + - uses: actions/setup-python@v4 + with: + python-version: '3.11' + + - name: Install dependencies + run: | + pip install linkml-reference-validator + + - name: Validate references + run: | + linkml-reference-validator validate data \ + data/*.yaml \ + --schema schemas/schema.yaml \ + --target-class Dataset + env: + NCBI_EMAIL: ${{ secrets.NCBI_EMAIL }} + NCBI_API_KEY: ${{ secrets.NCBI_API_KEY }} +``` + +### Makefile Integration + +**Makefile:** +```makefile +.PHONY: validate repair clean + +SCHEMA := schemas/schema.yaml +DATA := data/*.yaml +CLASS := Dataset + +validate: + linkml-reference-validator validate data \ + $(DATA) \ + --schema $(SCHEMA) \ + --target-class $(CLASS) + +repair: + linkml-reference-validator repair data \ + $(DATA) \ + --schema $(SCHEMA) \ + --target-class $(CLASS) \ + --dry-run + +repair-apply: + linkml-reference-validator repair data \ + $(DATA) \ + --schema $(SCHEMA) \ + --target-class $(CLASS) \ + --no-dry-run + +clean: + rm -rf references_cache/ +``` + +Usage: +```bash +make validate +make repair +make repair-apply +``` + +## Verification Checklist + +After setup, verify everything works: + 
+- [ ] Installation successful: `linkml-reference-validator --version` +- [ ] Can fetch PubMed articles: `linkml-reference-validator cache reference PMID:16888623` +- [ ] Can validate text: `linkml-reference-validator validate text "test" PMID:16888623` +- [ ] Schema validates: `linkml-validate --schema schema.yaml data.yaml` +- [ ] Reference validation works: `linkml-reference-validator validate data data.yaml --schema schema.yaml` +- [ ] Cache directory created: `ls -l references_cache/` +- [ ] Configuration file recognized: `linkml-reference-validator --help` shows config options + +## Next Steps + +Now that you're set up: + +1. **Read the Quickstart** - [quickstart.md](quickstart.md) for basic usage +2. **Explore Tutorials** - Work through the Jupyter notebooks in `docs/notebooks/` +3. **Learn Editorial Conventions** - [concepts/editorial-conventions.md](concepts/editorial-conventions.md) for using `[...]` and `...` +4. **Review How-To Guides** - Specific recipes for common tasks +5. **Check out the CLI Reference** - [reference/cli.md](reference/cli.md) for all commands + +## Getting Help + +If you encounter issues: + +1. **Check the documentation** - Most common questions are covered +2. **Search existing issues** - https://github.com/linkml/linkml-reference-validator/issues +3. **Ask for help** - Create a new issue with: + - Your command + - Expected behavior + - Actual behavior + - Schema and data samples (if applicable) +4. **Join the community** - LinkML discussions on GitHub + +## Troubleshooting + +See the [Troubleshooting Guide](troubleshooting.md) for common issues and solutions. 
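Before working through individual fixes, the basic environment checks from this guide (command on PATH, `NCBI_EMAIL` set, cache directory writable) can be scripted. The following is a minimal sketch using only the Python standard library. `check_setup` and its parameters are hypothetical names chosen for illustration and are not part of linkml-reference-validator.

```python
import os
import shutil
import tempfile
from pathlib import Path


def check_setup(command="linkml-reference-validator",
                cache_dir="references_cache",
                env=None):
    """Return a dict mapping each environment check to True or False.

    Hypothetical helper: it only inspects the local environment and
    never calls the validator or the network.
    """
    env = os.environ if env is None else env
    cache = Path(cache_dir)
    results = {
        # Is the CLI entry point discoverable on PATH?
        "command_on_path": shutil.which(command) is not None,
        # Is a non-empty NCBI_EMAIL present in the environment?
        "ncbi_email_set": bool(env.get("NCBI_EMAIL")),
        # Does the cache directory exist?
        "cache_dir_exists": cache.is_dir(),
        "cache_dir_writable": False,
    }
    if results["cache_dir_exists"]:
        try:
            # Creating (and auto-deleting) a temp file proves write access.
            with tempfile.NamedTemporaryFile(dir=cache):
                results["cache_dir_writable"] = True
        except OSError:
            pass
    return results


if __name__ == "__main__":
    for name, ok in check_setup().items():
        print(f"[{'x' if ok else ' '}] {name}")
```

Run from the project root, the script prints one checklist line per check. A failed `command_on_path` or `cache_dir_writable` check points to the matching entry under Quick Fixes.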
+ +### Quick Fixes + +**"Command not found: linkml-reference-validator"** +```bash +# Ensure it's installed +pip install linkml-reference-validator + +# Check if it's in PATH +which linkml-reference-validator + +# Or invoke it as a module if the command is not on PATH +python -m linkml_reference_validator --help +``` + +**"Could not fetch reference: PMID:12345678"** +```bash +# Check internet connection +ping www.ncbi.nlm.nih.gov + +# Verify PMID exists +# Visit: https://pubmed.ncbi.nlm.nih.gov/12345678/ + +# Set email for NCBI (required for API access) +export NCBI_EMAIL="your.email@example.com" +``` + +**"Permission denied: references_cache/"** +```bash +# Check directory permissions +ls -ld references_cache/ + +# Create with proper permissions +mkdir -p references_cache +chmod 755 references_cache +``` + +**"Validation failed but text is in the paper"** +- Check if only abstract was fetched (full text may be in PMC) +- Use PMC ID instead: `PMC:3458566` +- Or use a local file with full text: `file:./paper.md` +- See [repair-validation-errors.md](how-to/repair-validation-errors.md) diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md new file mode 100644 index 0000000..1b28085 --- /dev/null +++ b/docs/troubleshooting.md @@ -0,0 +1,706 @@ +# Troubleshooting Guide + +This guide covers common issues and their solutions when using linkml-reference-validator. + +## Installation Issues + +### Command not found: linkml-reference-validator + +**Symptom:** +```bash +$ linkml-reference-validator --help +bash: linkml-reference-validator: command not found +``` + +**Causes:** +- Package not installed +- Package installed but not in PATH +- Using wrong Python environment + +**Solutions:** + +1. **Verify installation:** +```bash +pip list | grep linkml-reference-validator +``` + +2. **Reinstall if missing:** +```bash +pip install linkml-reference-validator +``` + +3. **Check if it's in PATH:** +```bash +which linkml-reference-validator +``` + +4. 
**Use module invocation:** +```bash +python -m linkml_reference_validator --help +``` + +5. **Check Python environment:** +```bash +# Show current Python +which python +python --version + +# If using virtual environment +source venv/bin/activate # Linux/Mac +venv\Scripts\activate # Windows +``` + +### ImportError: No module named 'linkml_reference_validator' + +**Symptom:** +```python +ImportError: No module named 'linkml_reference_validator' +``` + +**Solutions:** + +1. **Install in correct environment:** +```bash +# Check current environment +python -c "import sys; print(sys.executable)" + +# Install in that environment +python -m pip install linkml-reference-validator +``` + +2. **Verify installation:** +```bash +python -c "import linkml_reference_validator; print(linkml_reference_validator.__version__)" +``` + +### Version conflicts + +**Symptom:** +``` +ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. +``` + +**Solutions:** + +1. **Use uv (recommended):** +```bash +curl -LsSf https://astral.sh/uv/install.sh | sh +uv pip install linkml-reference-validator +``` + +2. **Create fresh virtual environment:** +```bash +python -m venv fresh_env +source fresh_env/bin/activate +pip install --upgrade pip +pip install linkml-reference-validator +``` + +3. **Upgrade to the latest release:** +```bash +pip install linkml-reference-validator --upgrade +``` + +## Reference Fetching Issues + +### Could not fetch reference: PMID:XXXXXXXX + +**Symptom:** +``` +Error: Could not fetch reference PMID:12345678 +Failed to retrieve reference content +``` + +**Causes:** +- PMID doesn't exist +- Network connectivity issues +- NCBI API temporarily unavailable +- Rate limiting +- Missing NCBI email configuration + +**Solutions:** + +1. **Verify PMID exists:** + - Visit https://pubmed.ncbi.nlm.nih.gov/12345678/ + - Check if the number is correct + +2. 
**Check network connectivity:** +```bash +ping www.ncbi.nlm.nih.gov +curl -I https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi +``` + +3. **Set NCBI email (required):** +```bash +export NCBI_EMAIL="your.email@example.com" +``` + +4. **Get NCBI API key for higher limits:** + - Visit https://www.ncbi.nlm.nih.gov/account/ + - Generate API key + - Set environment variable: +```bash +export NCBI_API_KEY="your_api_key_here" +``` + +5. **Retry after delay:** +```bash +# Wait a moment and try again +sleep 5 +linkml-reference-validator validate text "quote" PMID:12345678 +``` + +6. **Check cache directory permissions:** +```bash +ls -ld references_cache/ +chmod 755 references_cache/ +``` + +### No content available for reference + +**Symptom:** +``` +Error: No content available for PMID:12345678 +Content type: unavailable +``` + +**Causes:** +- Abstract not available +- Article behind paywall (no PMC access) +- Retracted article +- Very old article +- Article not yet indexed + +**Solutions:** + +1. **Try PMC version:** +```bash +# Search for PMC ID at https://www.ncbi.nlm.nih.gov/pmc/ +linkml-reference-validator validate text "quote" PMC:3458566 +``` + +2. **Use DOI instead:** +```bash +linkml-reference-validator validate text "quote" DOI:10.1038/nature12373 +``` + +3. **Use local file:** +```bash +# Save article content as markdown or text +linkml-reference-validator validate text "quote" file:./papers/article.md +``` + +4. **Check cache file:** +```bash +# See what was actually fetched +cat references_cache/PMID_12345678.md +``` + +### Rate limiting errors + +**Symptom:** +``` +Error: Too many requests to NCBI API +HTTP Error 429: Too Many Requests +``` + +**Solutions:** + +1. **Set NCBI API key:** +```bash +export NCBI_API_KEY="your_api_key" +``` +Without key: 3 requests/second +With key: 10 requests/second + +2. 
**Pre-cache references:** +```bash +# Cache all references before validation +for pmid in PMID:111 PMID:222 PMID:333; do + linkml-reference-validator cache reference $pmid + sleep 1 # Add delay between requests +done +``` + +3. **Use cached references:** +```bash +# If cache exists, no API call is made +linkml-reference-validator validate text "quote" PMID:12345678 \ + --cache-dir ./references_cache +``` + +## Validation Issues + +### Supporting text not found in reference + +**Symptom:** +``` +Error: Supporting text not found in reference +Text part not found as substring: "your quote here" +``` + +**Causes:** +- Quote is paraphrased, not exact +- Text only in figures/tables/supplementary materials +- Text uses different terminology in reference +- Unicode/character differences +- Only abstract available (text in full text) + +**Solutions:** + +1. **Verify exact quote:** + - Open the PDF or HTML of the article + - Copy the exact text + - Check for character differences (O2 vs O₂, α vs alpha) + +2. **Check content type:** +```bash +linkml-reference-validator cache reference PMID:12345678 +# Look for "Content type: abstract_only" vs "full_text_xml" +``` + +3. **Try PMC for full text:** +```bash +# If only abstract was fetched +linkml-reference-validator validate text "quote" PMC:3458566 +``` + +4. **Use repair command:** +```bash +linkml-reference-validator repair text \ + "your quote here" \ + PMID:12345678 +``` + +5. **Add editorial notes:** +```yaml +# If you need to clarify or modernize +supporting_text: "protein [X] functions in cells" +``` + +6. **Use ellipsis for non-contiguous text:** +```yaml +supporting_text: "protein functions ... in cell regulation" +``` + +7. 
**Check normalization:** +```python +# Test what the text looks like after normalization +from linkml_reference_validator.validation.supporting_text_validator import normalize_text + +text = "Your quote here" +print(normalize_text(text)) +``` + +### Query is empty after removing brackets + +**Symptom:** +``` +Error: Query is empty after removing brackets +Supporting text: "[editorial note]" +``` + +**Cause:** +- Entire supporting_text is in brackets + +**Solution:** + +Include actual quote text: +```yaml +# Wrong +supporting_text: "[sic]" + +# Correct +supporting_text: "protein functions in cells [sic]" +``` + +### Title validation failed + +**Symptom:** +``` +Error: Reference title mismatch +Expected: "Study of Protein X" +Actual: "Study of protein X function" +``` + +**Causes:** +- Title in data doesn't match fetched title +- Partial title provided +- Capitalization differences + +**Solutions:** + +1. **Use exact title:** +```bash +# Fetch reference to see actual title +linkml-reference-validator cache reference PMID:12345678 +cat references_cache/PMID_12345678.md | head -20 +``` + +2. **Omit title if uncertain:** +```yaml +# Title validation is optional +reference_id: PMID:12345678 +# Don't include reference_title if unsure +supporting_text: "your quote" +``` + +3. 
**Understand title matching:** + - Titles must match completely (not substring) + - Case and punctuation are normalized + - But all words must match + +```yaml +# These match (after normalization): +reference_title: "Role of JAK1 in Cell-Signaling" +actual_title: "Role of JAK1 in Cell Signaling" + +# These DON'T match (partial): +reference_title: "Role of JAK1" +actual_title: "Role of JAK1 in Cell Signaling" +``` + +## Schema Issues + +### No reference or supporting_text fields found + +**Symptom:** +``` +Error: Could not find fields marked with linkml:authoritative_reference or linkml:excerpt +``` + +**Causes:** +- Schema doesn't have required slot_uri markers +- Using wrong field names +- Schema not properly configured + +**Solutions:** + +1. **Add slot_uri markers:** +```yaml +classes: + Evidence: + attributes: + reference: + slot_uri: linkml:authoritative_reference # Required + supporting_text: + slot_uri: linkml:excerpt # Required +``` + +2. **Or use implements:** +```yaml +classes: + Evidence: + attributes: + reference: + implements: + - linkml:authoritative_reference + supporting_text: + implements: + - linkml:excerpt +``` + +3. **Or use standard field names:** + - `reference`, `reference_id`, `pmid` for references + - `supporting_text`, `excerpt`, `quote` for text + +### Schema validation errors + +**Symptom:** +``` +LinkML schema validation failed +``` + +**Solutions:** + +1. **Validate schema separately:** +```bash +linkml-validate --schema schema.yaml schema.yaml +``` + +2. **Check required fields:** +```yaml +prefixes: + linkml: https://w3id.org/linkml/ # Must be defined + +classes: + MyClass: + tree_root: true # At least one class needs this +``` + +3. 
**Fix common issues:** +```yaml +# Bad: missing range +reference: + required: true + +# Good: includes range +reference: + required: true + range: string +``` + +## Data Format Issues + +### YAML parsing errors + +**Symptom:** +``` +yaml.scanner.ScannerError: mapping values are not allowed here +``` + +**Solutions:** + +1. **Check YAML syntax:** +```bash +# Use YAML validator +python -c "import yaml; yaml.safe_load(open('data.yaml'))" +``` + +2. **Common YAML mistakes:** + +```yaml +# Bad: missing quotes +supporting_text: Text with: colon + +# Good: quoted +supporting_text: "Text with: colon" + +# Bad: incorrect indentation +evidence: + reference: PMID:123 +supporting_text: "text" + +# Good: proper indentation +evidence: + reference: PMID:123 + supporting_text: "text" +``` + +3. **Use YAML linter:** +```bash +pip install yamllint +yamllint data.yaml +``` + +### Invalid reference ID format + +**Symptom:** +``` +Error: Invalid reference ID format: "invalid_id" +``` + +**Solutions:** + +Use correct format: +```yaml +# Correct formats: +reference_id: PMID:12345678 +reference_id: PMC:3458566 +reference_id: DOI:10.1038/nature12373 +reference_id: file:./path/to/file.md +reference_id: url:https://example.org/article + +# Incorrect: +reference_id: 12345678 # Missing PMID: prefix +reference_id: www.example.org # Missing url: prefix +reference_id: ./file.md # Missing file: prefix +``` + +## Performance Issues + +### Validation is very slow + +**Symptom:** +Validation takes minutes instead of seconds + +**Causes:** +- References not cached +- Network latency +- Large number of references +- Fetching full text for each validation + +**Solutions:** + +1. **Pre-cache references:** +```bash +# Extract all PMIDs from data +grep -r "PMID:" data/ | grep -o "PMID:[0-9]*" | sort -u > pmids.txt + +# Cache all +while read pmid; do + linkml-reference-validator cache reference "$pmid" +done < pmids.txt +``` + +2. 
**Use global cache:** +```bash +export REFERENCE_CACHE_DIR=~/.cache/linkml-reference-validator +``` + +3. **Use verbose mode to identify bottlenecks:** +```bash +linkml-reference-validator validate data data.yaml \ + --schema schema.yaml \ + --verbose +``` + +4. **Check cache hits:** +```bash +# Cached validations should be <100ms +# First fetch will be 2-3 seconds +``` + +### Large cache directory + +**Symptom:** +```bash +du -sh references_cache/ +500M references_cache/ +``` + +**Solutions:** + +1. **Clean old entries:** +```bash +# Remove cache entries older than 30 days +find references_cache/ -name "*.md" -mtime +30 -delete +``` + +2. **Use selective caching:** +```bash +# Cache only what you need +# Don't cache during experimentation +``` + +3. **Compress cache:** +```bash +tar -czf references_cache_backup.tar.gz references_cache/ +rm -rf references_cache/ +``` + +## Common Error Messages + +### "Text normalization resulted in empty string" + +**Cause:** +Text only contains punctuation or whitespace + +**Solution:** +```yaml +# Bad +supporting_text: "..." + +# Good +supporting_text: "text content ... 
more text" +``` + +### "Multiple reference fields found" + +**Cause:** +Schema has multiple fields marked as authoritative_reference + +**Solution:** +Only mark one field per class: +```yaml +# Bad +attributes: + pmid: + slot_uri: linkml:authoritative_reference + doi: + slot_uri: linkml:authoritative_reference + +# Good - use one field that can hold different types +attributes: + reference_id: + slot_uri: linkml:authoritative_reference +``` + +### "Reference base directory not found" + +**Cause:** +Using `file:` references but base directory not configured + +**Solution:** +```yaml +# In .linkml-reference-validator.yaml +validation: + reference_base_dir: ./references + +# Or use absolute paths +reference_id: file:/full/path/to/file.md +``` + +## Getting More Help + +### Enable verbose logging + +```bash +linkml-reference-validator validate text \ + "quote" PMID:12345678 \ + --verbose +``` + +### Check cache contents + +```bash +# View cached reference +cat references_cache/PMID_12345678.md + +# Check cache metadata +head -n 20 references_cache/PMID_12345678.md +``` + +### Test with simple example + +```bash +# Known working example +linkml-reference-validator validate text \ + "MUC1 oncoprotein blocks nuclear targeting of c-Abl" \ + PMID:16888623 +``` + +### Report bugs + +If you've found a bug: + +1. **Check existing issues:** + https://github.com/linkml/linkml-reference-validator/issues + +2. **Create minimal reproduction:** +```bash +# Simplest possible command that shows the issue +linkml-reference-validator validate text "test" PMID:12345678 --verbose +``` + +3. 
**Include:** + - Command you ran + - Expected behavior + - Actual behavior + - Error messages (full output) + - Schema (if applicable) + - Data file (if applicable, minimal example) + - Version: `linkml-reference-validator --version` + - Python version: `python --version` + - OS: `uname -a` (Linux/Mac) or `ver` (Windows) + +## Quick Diagnostic Checklist + +Run through this checklist when encountering issues: + +- [ ] Installation successful: `linkml-reference-validator --version` +- [ ] Network accessible: `ping www.ncbi.nlm.nih.gov` +- [ ] NCBI email set: `echo $NCBI_EMAIL` +- [ ] Cache directory writable: `touch references_cache/test && rm references_cache/test` +- [ ] Schema valid: `linkml-validate --schema schema.yaml schema.yaml` +- [ ] Data valid YAML: `python -c "import yaml; yaml.safe_load(open('data.yaml'))"` +- [ ] Reference exists: Visit PubMed URL for the PMID +- [ ] Simple test works: Validate known-good example + +## See Also + +- [Setup Guide](setup-guide.md) - Initial installation and configuration +- [Quickstart](quickstart.md) - Basic usage examples +- [CLI Reference](reference/cli.md) - Complete command documentation +- [How to Repair Validation Errors](how-to/repair-validation-errors.md) - Fixing common issues +- [GitHub Issues](https://github.com/linkml/linkml-reference-validator/issues) - Report bugs diff --git a/docs/tutorials/complete-workflow.md b/docs/tutorials/complete-workflow.md new file mode 100644 index 0000000..6263b6e --- /dev/null +++ b/docs/tutorials/complete-workflow.md @@ -0,0 +1,900 @@ +# Complete Workflow Tutorial: Building a Validated Gene Annotation System + +This tutorial walks you through building a complete gene annotation validation system from scratch, using real examples and best practices. 
+ +## What We'll Build + +A validated gene function annotation system that: +- Stores gene function claims with supporting text from publications +- Automatically validates that quotes match their cited sources +- Supports multiple reference types (PMID, DOI, PMC) +- Includes repair capabilities for common errors +- Can be integrated into a CI/CD pipeline + +**Time required:** 30-45 minutes + +## Prerequisites + +- Python 3.10+ installed +- Basic understanding of YAML +- Familiarity with command line +- (Optional) NCBI API key for higher rate limits + +## Step 1: Installation and Setup (5 minutes) + +### Install the Tool + +```bash +# Using pip +pip install linkml-reference-validator + +# Or using uv (faster) +curl -LsSf https://astral.sh/uv/install.sh | sh +uv pip install linkml-reference-validator +``` + +### Create Project Structure + +```bash +# Create project directory +mkdir gene-annotation-validator +cd gene-annotation-validator + +# Create subdirectories +mkdir -p schemas data references_cache tests + +# Verify installation +linkml-reference-validator --version +``` + +### Configure NCBI Access (Optional) + +```bash +# Set environment variables +export NCBI_EMAIL="your.email@example.com" + +# Test with a simple validation +linkml-reference-validator validate text \ + "MUC1 oncoprotein blocks nuclear targeting of c-Abl" \ + PMID:16888623 +``` + +Expected output: +``` +Validating text against PMID:16888623... +Result: + Valid: True + Message: Supporting text validated successfully in PMID:16888623 +``` + +## Step 2: Design Your Data Model (10 minutes) + +### Create the LinkML Schema + +We'll create a schema for gene function annotations with evidence from literature. 
+ +**schemas/gene_annotations.yaml:** +```yaml +id: https://example.org/gene-annotations +name: gene-annotations +description: Schema for validated gene function annotations + +prefixes: + linkml: https://w3id.org/linkml/ + dcterms: http://purl.org/dc/terms/ + biolink: https://w3id.org/biolink/vocab/ + +default_prefix: gene_annotations + +# Needed so that built-in types such as `date` resolve +imports: + - linkml:types + +classes: + # Root container class + GeneAnnotationCollection: + tree_root: true + description: Collection of gene function annotations + attributes: + annotations: + multivalued: true + range: GeneAnnotation + description: List of gene annotations + + # Main annotation class + GeneAnnotation: + description: An annotation describing a gene's function with supporting evidence + attributes: + id: + identifier: true + required: true + description: Unique identifier for this annotation + + gene_symbol: + required: true + description: Official gene symbol (e.g., TP53, BRCA1) + pattern: "^[A-Z0-9]+$" + + gene_name: + description: Full gene name + + function_summary: + required: true + description: Brief summary of the gene's function + + function_category: + range: FunctionCategory + description: Broad categorization of gene function + + species: + range: Species + description: Species this annotation applies to + required: true + + evidence: + required: true + multivalued: true + range: Evidence + description: Supporting evidence from literature + + last_reviewed: + range: date + description: Date this annotation was last reviewed + + curator: + description: Person who created/reviewed this annotation + + # Evidence class with reference validation + Evidence: + description: Evidence supporting a gene function claim + attributes: + reference_id: + required: true + slot_uri: linkml:authoritative_reference + description: | + Reference identifier (PMID, PMC, DOI, or file path) + Examples: PMID:16888623, PMC:3458566, DOI:10.1038/nature12373 + + reference_title: + slot_uri: dcterms:title + description: Title of the referenced publication 
(validated if provided) + + supporting_text: + required: true + slot_uri: linkml:excerpt + description: | + Direct quote from the reference supporting the annotation. + Use [brackets] for editorial clarifications. + Use ... for omitted text between parts. + + evidence_type: + range: EvidenceType + description: Type of experimental evidence + + confidence: + range: ConfidenceLevel + description: Curator's confidence in this evidence + + notes: + description: Additional context or clarifications + +# Enumerations +enums: + FunctionCategory: + permissible_values: + TUMOR_SUPPRESSOR: + description: Prevents uncontrolled cell growth + ONCOGENE: + description: Promotes cell growth and division + DNA_REPAIR: + description: Repairs damaged DNA + TRANSCRIPTION_FACTOR: + description: Regulates gene expression + CELL_CYCLE_REGULATOR: + description: Controls cell cycle progression + KINASE: + description: Phosphorylates other proteins + PHOSPHATASE: + description: Removes phosphate groups + RECEPTOR: + description: Receives extracellular signals + SIGNALING: + description: Transmits cellular signals + + EvidenceType: + permissible_values: + EXPERIMENTAL: + description: Direct experimental evidence + COMPUTATIONAL: + description: Computational prediction or inference + LITERATURE: + description: Statement from literature without original data + CURATOR_INFERENCE: + description: Inferred by curator from related evidence + + ConfidenceLevel: + permissible_values: + HIGH: + description: Strong, consistent evidence + MEDIUM: + description: Good evidence but some uncertainty + LOW: + description: Limited or conflicting evidence + + Species: + permissible_values: + HUMAN: + description: Homo sapiens + MOUSE: + description: Mus musculus + RAT: + description: Rattus norvegicus + YEAST: + description: Saccharomyces cerevisiae +``` + +### Understanding the Schema + +Key elements: +- **`slot_uri: linkml:excerpt`** - Marks `supporting_text` for validation +- **`slot_uri: 
linkml:authoritative_reference`** - Marks `reference_id` as the reference +- **`slot_uri: dcterms:title`** - Optionally validates reference titles +- **Enumerations** - Controlled vocabularies for consistency +- **Required fields** - Ensures data completeness + +## Step 3: Create Sample Data (10 minutes) + +### Example 1: Simple Annotation + +**data/tp53_annotation.yaml:** +```yaml +annotations: + - id: ANN001 + gene_symbol: TP53 + gene_name: Tumor protein p53 + function_summary: Regulates cell cycle and acts as tumor suppressor + function_category: TUMOR_SUPPRESSOR + species: HUMAN + curator: Jane Doe + last_reviewed: 2024-01-15 + + evidence: + - reference_id: PMID:16888623 + reference_title: MUC1 oncoprotein blocks nuclear targeting of c-Abl + supporting_text: "MUC1 oncoprotein blocks nuclear targeting of c-Abl" + evidence_type: EXPERIMENTAL + confidence: HIGH +``` + +### Example 2: Multiple Evidence Items + +**data/brca1_annotation.yaml:** +```yaml +annotations: + - id: ANN002 + gene_symbol: BRCA1 + gene_name: Breast cancer type 1 susceptibility protein + function_summary: Critical role in DNA repair and tumor suppression + function_category: DNA_REPAIR + species: HUMAN + curator: John Smith + last_reviewed: 2024-02-20 + + evidence: + # Evidence 1: DNA repair function + - reference_id: PMID:12345678 + supporting_text: "BRCA1 plays a critical role in DNA double-strand break repair" + evidence_type: EXPERIMENTAL + confidence: HIGH + notes: Direct experimental demonstration + + # Evidence 2: Tumor suppressor function + - reference_id: PMID:23456789 + supporting_text: "BRCA1 functions as a tumor suppressor ... 
maintaining genomic stability" + evidence_type: EXPERIMENTAL + confidence: HIGH + notes: Used ellipsis to connect non-contiguous parts + + # Evidence 3: Using editorial notes + - reference_id: PMC:3458566 + supporting_text: "BRCA1 [breast cancer type 1] is involved in homologous recombination" + evidence_type: LITERATURE + confidence: MEDIUM + notes: Added gene name clarification in brackets +``` + +### Example 3: Mixed Reference Types + +**data/multi_gene_annotations.yaml:** +```yaml +annotations: + - id: ANN003 + gene_symbol: EGFR + gene_name: Epidermal growth factor receptor + function_summary: Receptor tyrosine kinase involved in cell proliferation + function_category: RECEPTOR + species: HUMAN + curator: Jane Doe + + evidence: + # Using DOI + - reference_id: DOI:10.1038/nature12373 + supporting_text: "EGFR is a receptor tyrosine kinase" + evidence_type: EXPERIMENTAL + confidence: HIGH + + # Using local file + - reference_id: file:./references/egfr_review.md + supporting_text: "EGFR mutations are found in many cancers" + evidence_type: LITERATURE + confidence: MEDIUM + notes: From local review article + + - id: ANN004 + gene_symbol: JAK1 + gene_name: Janus kinase 1 + function_summary: Tyrosine kinase in cytokine signaling + function_category: KINASE + species: HUMAN + curator: John Smith + + evidence: + # Using URL + - reference_id: url:https://example.org/jak1-article.html + supporting_text: "JAK1 is a key mediator of cytokine signaling" + evidence_type: LITERATURE + confidence: MEDIUM +``` + +## Step 4: Validate Your Data (10 minutes) + +### Basic Validation + +```bash +# Validate single file +linkml-reference-validator validate data \ + data/tp53_annotation.yaml \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection + +# Expected output: +# Validating data/tp53_annotation.yaml... +# ✓ All validations passed! 
+``` + +### Verbose Validation + +```bash +# See detailed validation info +linkml-reference-validator validate data \ + data/brca1_annotation.yaml \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection \ + --verbose + +# Shows: +# - Each reference being validated +# - What text is being searched for +# - Whether full text or abstract was used +# - Validation results for each item +``` + +### Batch Validation + +```bash +# Validate all files in data directory +for file in data/*.yaml; do + echo "Validating $file..." + linkml-reference-validator validate data \ + "$file" \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection +done +``` + +## Step 5: Handle Validation Errors (10 minutes) + +### Scenario 1: Character Encoding Issues + +Create a file with common encoding issues: + +**data/error_example1.yaml:** +```yaml +annotations: + - id: ANN005 + gene_symbol: TEST1 + function_summary: Test gene for CO2 transport + function_category: SIGNALING + species: HUMAN + + evidence: + - reference_id: PMID:16888623 + # This will fail: ASCII "O2" instead of subscript + supporting_text: "protein involved in O2 transport" + evidence_type: EXPERIMENTAL + confidence: HIGH +``` + +Validate and repair: + +```bash +# First validate to see the error +linkml-reference-validator validate data \ + data/error_example1.yaml \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection + +# Use repair to fix (dry run first) +linkml-reference-validator repair data \ + data/error_example1.yaml \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection \ + --dry-run + +# Review the suggested fixes, then apply +linkml-reference-validator repair data \ + data/error_example1.yaml \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection \ + --no-dry-run +``` + +### Scenario 2: Missing Ellipsis + +**data/error_example2.yaml:** +```yaml +annotations: + - 
id: ANN006 + gene_symbol: TEST2 + function_summary: Test gene + function_category: SIGNALING + species: HUMAN + + evidence: + - reference_id: PMID:16888623 + # This will fail: missing "..." between non-contiguous parts + supporting_text: "MUC1 oncoprotein blocks c-Abl" + evidence_type: EXPERIMENTAL + confidence: HIGH +``` + +The repair command will suggest adding ellipsis: +``` +Suggested fix (MEDIUM confidence): + "MUC1 oncoprotein blocks c-Abl" → "MUC1 oncoprotein ... blocks ... c-Abl" +``` + +### Scenario 3: Text Not in Reference + +**data/error_example3.yaml:** +```yaml +annotations: + - id: ANN007 + gene_symbol: TEST3 + function_summary: Test gene + function_category: SIGNALING + species: HUMAN + + evidence: + - reference_id: PMID:16888623 + # This will fail: text doesn't exist in reference + supporting_text: "completely fabricated text that doesn't exist" + evidence_type: EXPERIMENTAL + confidence: HIGH +``` + +The repair command will flag for removal: +``` +RECOMMENDED REMOVALS (low confidence): + PMID:16888623 at evidence[0]: + Similarity: 5% + Snippet: 'completely fabricated text that doesn't exist' + Action: Remove or find correct reference +``` + +## Step 6: Create Configuration File (5 minutes) + +Create a project configuration: + +**.linkml-reference-validator.yaml:** +```yaml +# Validation configuration +validation: + cache_dir: ./references_cache + + # Custom prefix mappings + reference_prefix_map: + pubmed: PMID + pmc: PMC + doi: DOI + + # Base directory for file:// references + reference_base_dir: ./references + +# Repair configuration +repair: + # Confidence thresholds + auto_fix_threshold: 0.95 + suggest_threshold: 0.80 + removal_threshold: 0.50 + + # Character normalization + character_mappings: + "O2": "O₂" + "CO2": "CO₂" + "H2O": "H₂O" + "N2": "N₂" + "+/-": "±" + "alpha": "α" + "beta": "β" + "gamma": "γ" + + # Skip references with known issues + skip_references: [] + + # Trusted references (manually verified) + trusted_low_similarity: [] +``` 
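
To build intuition for the `repair` settings above, here is a minimal Python sketch of how three thresholds and a character-mapping table *could* be applied. It is an illustration only, not the tool's actual implementation: the function names `classify_repair` and `normalize`, the action labels, and the treatment of scores between `removal_threshold` and `suggest_threshold` as "manual review" are all assumptions made for this example.

```python
# Hypothetical sketch only: mirrors the values from the config file above,
# but the logic is an assumption, not linkml-reference-validator's own code.

AUTO_FIX_THRESHOLD = 0.95
SUGGEST_THRESHOLD = 0.80
REMOVAL_THRESHOLD = 0.50

# Subset of the `character_mappings` table from the config.
CHARACTER_MAPPINGS = {
    "O2": "O₂",
    "CO2": "CO₂",
    "H2O": "H₂O",
    "N2": "N₂",
    "+/-": "±",
}


def classify_repair(similarity: float) -> str:
    """Map a similarity score in [0, 1] to an (assumed) repair action."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    if similarity >= AUTO_FIX_THRESHOLD:
        return "auto_fix"
    if similarity >= SUGGEST_THRESHOLD:
        return "suggest"
    if similarity < REMOVAL_THRESHOLD:
        return "recommend_removal"
    # Assumption: the gap between the removal and suggest thresholds
    # is left for a human to decide.
    return "manual_review"


def normalize(text: str) -> str:
    """Replace ASCII spellings with their Unicode forms, longest key first."""
    for plain in sorted(CHARACTER_MAPPINGS, key=len, reverse=True):
        text = text.replace(plain, CHARACTER_MAPPINGS[plain])
    return text


if __name__ == "__main__":
    for score in (0.97, 0.85, 0.60, 0.05):
        print(f"{score:.2f} -> {classify_repair(score)}")
    print(normalize("protein involved in O2 transport"))
```

Applying the longest keys first means `CO2` is handled before `O2`; a plain substring replacement like this would still rewrite unintended matches (for example inside longer identifiers), which is one reason to preview repairs with `--dry-run` before applying them.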
+ +Use the configuration: + +```bash +linkml-reference-validator validate data \ + data/*.yaml \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection \ + --config .linkml-reference-validator.yaml +``` + +## Step 7: Integrate with Version Control (5 minutes) + +### Create Git Pre-commit Hook + +**.git/hooks/pre-commit:** +```bash +#!/bin/bash + +echo "🔍 Validating gene annotations..." + +# Validate all data files +for file in data/*.yaml; do + if [ -f "$file" ]; then + echo " Checking $file..." + + linkml-reference-validator validate data \ + "$file" \ + --schema schemas/gene_annotations.yaml \ + --target-class GeneAnnotationCollection \ + --config .linkml-reference-validator.yaml + + if [ $? -ne 0 ]; then + echo "❌ Validation failed for $file" + echo "" + echo "To fix errors, run:" + echo " linkml-reference-validator repair data $file --schema schemas/gene_annotations.yaml --dry-run" + exit 1 + fi + fi +done + +echo "✅ All validations passed!" +exit 0 +``` + +Make it executable: +```bash +chmod +x .git/hooks/pre-commit +``` + +### Create Makefile + +**Makefile:** +```makefile +.PHONY: validate validate-verbose repair repair-apply clean test + +SCHEMA := schemas/gene_annotations.yaml +DATA_DIR := data +CONFIG := .linkml-reference-validator.yaml +TARGET_CLASS := GeneAnnotationCollection + +# Validate all data files +validate: + @echo "Validating all annotations..." + @for file in $(DATA_DIR)/*.yaml; do \ + echo "Checking $$file..."; \ + linkml-reference-validator validate data \ + $$file \ + --schema $(SCHEMA) \ + --target-class $(TARGET_CLASS) \ + --config $(CONFIG) || exit 1; \ + done + @echo "✅ All validations passed!" 
+ +# Validate with verbose output +validate-verbose: + @for file in $(DATA_DIR)/*.yaml; do \ + echo "Checking $$file..."; \ + linkml-reference-validator validate data \ + $$file \ + --schema $(SCHEMA) \ + --target-class $(TARGET_CLASS) \ + --config $(CONFIG) \ + --verbose; \ + done + +# Show suggested repairs (dry run) +repair: + @for file in $(DATA_DIR)/*.yaml; do \ + echo "Checking repairs for $$file..."; \ + linkml-reference-validator repair data \ + $$file \ + --schema $(SCHEMA) \ + --target-class $(TARGET_CLASS) \ + --config $(CONFIG) \ + --dry-run; \ + done + +# Apply repairs +repair-apply: + @for file in $(DATA_DIR)/*.yaml; do \ + echo "Applying repairs to $$file..."; \ + linkml-reference-validator repair data \ + $$file \ + --schema $(SCHEMA) \ + --target-class $(TARGET_CLASS) \ + --config $(CONFIG) \ + --no-dry-run; \ + done + +# Clean cache +clean: + rm -rf references_cache/ + +# Run tests +test: validate + @echo "Running tests..." + @python -m pytest tests/ -v +``` + +Usage: +```bash +make validate # Validate all files +make validate-verbose # Verbose output +make repair # Show suggested repairs +make repair-apply # Apply repairs +make clean # Clear cache +``` + +## Step 8: CI/CD Integration + +### GitHub Actions + +**.github/workflows/validate-annotations.yml:** +```yaml +name: Validate Gene Annotations + +on: + push: + branches: [ main, develop ] + paths: + - 'data/**.yaml' + - 'schemas/**.yaml' + pull_request: + branches: [ main ] + paths: + - 'data/**.yaml' + - 'schemas/**.yaml' + +jobs: + validate: + runs-on: ubuntu-latest + + steps: + - name: Checkout code + uses: actions/checkout@v3 + + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: '3.11' + + - name: Install dependencies + run: | + pip install linkml-reference-validator + + - name: Cache references + uses: actions/cache@v3 + with: + path: references_cache + key: ${{ runner.os }}-references-${{ hashFiles('data/**/*.yaml') }} + restore-keys: | + ${{ runner.os 
}}-references- + + - name: Validate annotations + run: | + make validate + env: + NCBI_EMAIL: ${{ secrets.NCBI_EMAIL }} + NCBI_API_KEY: ${{ secrets.NCBI_API_KEY }} + + - name: Upload cache artifacts + if: always() + uses: actions/upload-artifact@v3 + with: + name: references-cache + path: references_cache/ + retention-days: 30 +``` + +## Step 9: Testing and Quality Assurance + +### Create Test Files + +**tests/test_validation.py:** +```python +#!/usr/bin/env python3 +"""Test suite for gene annotation validation.""" + +import subprocess +import yaml +from pathlib import Path + +DATA_DIR = Path("data") +SCHEMA = Path("schemas/gene_annotations.yaml") +TARGET_CLASS = "GeneAnnotationCollection" + +def test_schema_valid(): + """Test that schema itself is valid.""" + result = subprocess.run( + ["linkml-validate", "--schema", str(SCHEMA), str(SCHEMA)], + capture_output=True, + text=True + ) + assert result.returncode == 0, f"Schema validation failed: {result.stderr}" + +def test_all_data_files_valid(): + """Test that all data files validate against schema.""" + for data_file in DATA_DIR.glob("*.yaml"): + if "error" in data_file.name: + continue # Skip error example files + + print(f"Testing {data_file}...") + result = subprocess.run( + [ + "linkml-reference-validator", "validate", "data", + str(data_file), + "--schema", str(SCHEMA), + "--target-class", TARGET_CLASS + ], + capture_output=True, + text=True + ) + assert result.returncode == 0, \ + f"Validation failed for {data_file}: {result.stderr}" + +def test_data_completeness(): + """Test that all required fields are present.""" + for data_file in DATA_DIR.glob("*.yaml"): + if "error" in data_file.name: + continue + + with open(data_file) as f: + data = yaml.safe_load(f) + + # Check each annotation + for ann in data.get("annotations", []): + assert "id" in ann, f"Missing id in {data_file}" + assert "gene_symbol" in ann, f"Missing gene_symbol in {data_file}" + assert "evidence" in ann, f"Missing evidence in {data_file}" + 
+ # Check each evidence item + for ev in ann["evidence"]: + assert "reference_id" in ev, f"Missing reference_id in {data_file}" + assert "supporting_text" in ev, f"Missing supporting_text in {data_file}" + +if __name__ == "__main__": + test_schema_valid() + test_all_data_files_valid() + test_data_completeness() + print("✅ All tests passed!") +``` + +Run tests: +```bash +python tests/test_validation.py +``` + +## Step 10: Documentation and Maintenance + +### Create README + +**README.md:** +```markdown +# Gene Annotation Validation System + +Validated gene function annotations with supporting evidence from literature. + +## Quick Start + +```bash +# Validate all annotations +make validate + +# Add new annotation +cp templates/annotation_template.yaml data/new_gene.yaml +# Edit data/new_gene.yaml with your annotation +make validate + +# Repair validation errors +make repair +``` + +## Directory Structure + +``` +. +├── schemas/ +│ └── gene_annotations.yaml # LinkML schema +├── data/ +│ ├── tp53_annotation.yaml # Gene annotations +│ └── ... +├── references_cache/ # Cached references +├── tests/ +│ └── test_validation.py # Test suite +├── .linkml-reference-validator.yaml # Config +└── Makefile # Build commands +``` + +## Contributing + +1. Create new annotation file in `data/` +2. Validate: `make validate` +3. Fix any errors: `make repair` +4. 
Commit and push (pre-commit hook will validate) +``` + +### Create Template + +**templates/annotation_template.yaml:** +```yaml +annotations: + - id: ANN_XXX # Replace with unique ID + gene_symbol: GENE_SYMBOL # Official gene symbol + gene_name: Full Gene Name + function_summary: Brief summary of function + function_category: CATEGORY # See schema for options + species: HUMAN # Or MOUSE, RAT, YEAST + curator: Your Name + last_reviewed: YYYY-MM-DD + + evidence: + - reference_id: PMID:XXXXXXXX # Or DOI:, PMC:, file:, url: + reference_title: Article title (optional but recommended) + supporting_text: "Direct quote from the reference" + evidence_type: EXPERIMENTAL # Or COMPUTATIONAL, LITERATURE, CURATOR_INFERENCE + confidence: HIGH # Or MEDIUM, LOW + notes: Additional context (optional) +``` + +## Summary + +You've now built a complete gene annotation validation system! You've learned: + +- ✅ How to install and configure linkml-reference-validator +- ✅ How to design a LinkML schema with validation markers +- ✅ How to create validated data files +- ✅ How to validate and repair data +- ✅ How to integrate validation into your workflow +- ✅ How to set up CI/CD for automatic validation +- ✅ How to write tests for your validation system + +## Next Steps + +1. **Expand your schema** - Add more gene attributes, relationships, or evidence types +2. **Import existing data** - Convert existing annotations to your new format +3. **Integrate with databases** - Export validated data to SQL, MongoDB, or RDF +4. **Build a web interface** - Create a UI for curators to add/edit annotations +5. 
**Set up monitoring** - Track validation success rates and common error patterns + +## Additional Resources + +- [linkml-reference-validator Documentation](https://linkml.github.io/linkml-reference-validator/) +- [LinkML Schema Language](https://linkml.io/) +- [PubMed E-utilities API](https://www.ncbi.nlm.nih.gov/books/NBK25501/) +- [Crossref API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) diff --git a/mkdocs.yml b/mkdocs.yml index 9ad2e88..8667c51 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -24,8 +24,10 @@ plugins: nav: - Home: index.md + - Setup Guide: setup-guide.md - Quickstart: quickstart.md - Tutorials: + - Complete Workflow: tutorials/complete-workflow.md - Getting Started (CLI): notebooks/01_getting_started.ipynb - Advanced Usage (CLI): notebooks/02_advanced_usage.ipynb - Validating OBO Files (CLI): notebooks/04_obo_validation.ipynb @@ -47,6 +49,7 @@ nav: - Editorial Conventions: concepts/editorial-conventions.md - Reference: - CLI Reference: reference/cli.md + - Troubleshooting: troubleshooting.md - Roadmap: todo.md exclude_docs: |