diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index aa6c6f89..8c829a36 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -1,220 +1,208 @@
-# Hyrax - An extensible Framework for Machine Learning in Astronomy
-
-**ALWAYS follow these instructions first and only fallback to additional search and context gathering if the information here is incomplete or found to be in error.**
-
-Hyrax is a Python-based tool for hunting rare and anomalous sources in large astronomical imaging surveys. It supports downloading cutouts, building latent representations, interactive visualization, and anomaly detection using PyTorch models.
-
-## Design Goals and North Stars
-
-**CRITICAL: Always keep these design principles in mind when making changes to Hyrax.**
-
-### 1. Low Code Interface
-- **Minimize user-facing APIs**: Hyrax prioritizes configuration-driven workflows over complex programmatic APIs
-- **Avoid API proliferation**: Don't create new user-facing APIs that we'll need to maintain indefinitely
-- **Favor declarative over imperative**: Users should configure what they want, not how to get it
-- **CLI-first approach**: The `hyrax` CLI tool with verb-based commands is the primary user interface
-- **Configuration over code**: Use TOML configuration files extensively to control behavior
-
-### 2. Make Easy Things Easy, Hard Things Possible
-- **Default workflows should "just work"**: Common use cases should require minimal configuration
-- **Progressive complexity**: Simple tasks should be simple; advanced features available when needed
-- **Sensible defaults**: Default configurations in `hyrax_default_config.toml` should handle common scenarios
-- **Extensibility without complexity**: Advanced users can extend with custom models, datasets, and verbs
-- **Clear extension points**: Well-documented base classes (`Verb`, model base classes, dataset classes)
-
-### 3. Support Reproducibility
-- **Configuration as documentation**: Config files serve as complete records of how experiments were run
-- **Version everything**: Track model versions, data versions, and configuration versions
-- **Manifest files**: Maintain manifests of downloaded data and processed results
-- **Deterministic defaults**: Random seeds and other sources of variability should be configurable
-- **MLflow integration**: Log experiments systematically for comparison and reproduction
-- **ONNX export**: Support model serialization for long-term reproducibility
-
-### 4. Smooth and Legible Migration When APIs Change
-- **Clear deprecation warnings**: When changing APIs, provide helpful deprecation messages
-- **Migration guides in documentation**: Document breaking changes with before/after examples
-- **Backward compatibility when possible**: Maintain compatibility or provide clear upgrade paths
-- **Version pinning guidance**: Help users understand which versions work together
-- **Config schema validation**: Use Pydantic schemas to validate configurations and provide helpful error messages
-- **Changelog discipline**: Maintain comprehensive changelog with breaking change notifications
-
-## Working Effectively
-
-### Bootstrap and Setup - NEVER CANCEL these commands
-- Create virtual environment: `conda create -n hyrax python=3.10 && conda activate hyrax`
-- Clone repository: `git clone https://github.com/lincc-frameworks/hyrax.git`
-- **CRITICAL**: Install dependencies using `.setup_dev.sh` script:
-  - `cd hyrax && echo 'y' | bash .setup_dev.sh` -- NEVER CANCEL: Takes 5-15 minutes depending on network. Set timeout to 20+ minutes.
-  - Script installs package with `pip install -e .'[dev]'` and sets up pre-commit hooks
-  - **Note**: Script prompts for system install if no virtual environment detected - respond 'y' to proceed
-  - **Alternative manual installation** if script fails due to network issues:
-    - `python -m pip install --upgrade pip` first
-    - `python -m pip install -e .'[dev]'` -- NEVER CANCEL: Takes 5-15 minutes. Set timeout to 20+ minutes.
-    - `python -m pip install pre-commit && pre-commit install`
-    - `conda install pandoc` (for documentation)
-  - **Network Issues**: Installation may fail with ReadTimeoutError due to PyPI connectivity. Retry installation multiple times if needed.
-
-### Build and Test Commands - NEVER CANCEL these commands
-- **Run tests**: `python -m pytest -m "not slow"` -- NEVER CANCEL: Takes 2-5 minutes. Set timeout to 10+ minutes.
-- **Run tests with coverage**: `python -m pytest --cov=hyrax --cov-report=xml -m "not slow"` -- NEVER CANCEL: Takes 3-6 minutes. Set timeout to 10+ minutes.
-- **Run slow tests**: `python -m pytest -m "slow"` -- NEVER CANCEL: Takes 10-20 minutes. Set timeout to 30+ minutes.
-- **Run all tests**: `python -m pytest` -- NEVER CANCEL: Takes 15-25 minutes. Set timeout to 45+ minutes.
-- **Run parallel tests**: `python -m pytest -n auto` (uses multiple cores)
-
-### CLI Usage and Functionality
-- **Main CLI entry point**: `hyrax` command (defined in pyproject.toml as `hyrax = "hyrax_cli.main:main"`)
-- **Check version**: `hyrax --version`
-- **Get help**: `hyrax --help`
-- **Available verbs/commands**:
-  - **Core operations**: `train`, `infer`, `download`, `prepare`
-  - **Analysis**: `umap`, `visualize`, `lookup`
-  - **Vector DB**: `save_to_database`, `database_connection`
-  - **Utilities**: `rebuild_manifest`
-- **Verb-specific help**: `hyrax --help` (e.g., `hyrax train --help`)
-- **Configuration**: Use `--runtime-config path/to/config.toml` or `-c path/to/config.toml`
-- **Verb implementation**: All verbs are classes in `src/hyrax/verbs/` that inherit from `Verb` base class
-
-### Development and Code Quality - NEVER CANCEL these commands
-- **Pre-commit checks**: `pre-commit run --all-files` -- NEVER CANCEL: Takes 3-8 minutes. Set timeout to 15+ minutes.
-- **Linting with ruff**: `ruff check src/ tests/` -- Takes 10-30 seconds.
-- **Format with ruff**: `ruff format src/ tests/` -- Takes 10-30 seconds.
-- **Build documentation**: `sphinx-build -M html ./docs ./_readthedocs` -- NEVER CANCEL: Takes 2-4 minutes. Set timeout to 10+ minutes.
-
-## Validation and Testing
-
-### CRITICAL: Always run these validation steps after making changes
-1. **NEVER CANCEL**: Lint and format code: `ruff check src/ tests/ && ruff format src/ tests/`
-2. **NEVER CANCEL**: Run unit tests: `python -m pytest -m "not slow"` (timeout: 10+ minutes)
-3. **NEVER CANCEL**: Run pre-commit hooks: `pre-commit run --all-files` (timeout: 15+ minutes)
-
-### Manual Validation Scenarios
-After making changes, ALWAYS test these scenarios:
-1. **CLI functionality**: Run `hyrax --help` and `hyrax --version` to ensure CLI works
-2. **Import test**: `python -c "import hyrax; h = hyrax.Hyrax(); print('Success')"`
-3. **Configuration loading**: Verify config loads with `hyrax.Hyrax()` constructor
-4. **Verb functionality**: Test relevant verbs like `hyrax train --help` if modifying training code
-
-### Test Categories and Markers
-- **Fast tests**: `python -m pytest -m "not slow"` (default test suite)
-- **Slow tests**: `python -m pytest -m "slow"` (integration and E2E tests)
-- **E2E tests**: Full end-to-end workflows testing models and datasets
-- **Test datasets**: Uses built-in datasets like `HyraxCifarDataset`, `HSCDataSet`
-- **Test models**: Primarily tests `HyraxAutoencoder` model
-- **Parallel testing**: Use `-n auto` for multiprocessing
-
-### Timeout Values and Timing Expectations
-- **NEVER CANCEL**: Package installation: 5-15 minutes (timeout: 20+ minutes)
-- **NEVER CANCEL**: Unit tests: 2-5 minutes (timeout: 10+ minutes)
-- **NEVER CANCEL**: Full test suite: 15-25 minutes (timeout: 45+ minutes)
-- **NEVER CANCEL**: Pre-commit hooks: 3-8 minutes (timeout: 15+ minutes)
-- **NEVER CANCEL**: Documentation build: 2-4 minutes (timeout: 10+ minutes)
-- Code formatting/linting: 10-30 seconds
-
-### Network and Installation Issues
-- **PyPI Connectivity**: May encounter ReadTimeoutError when installing packages
-- **Retry Strategy**: If installation fails, wait 1-2 minutes and retry the same command
-- **Alternative mirrors**: Consider using `--index-url` with alternative PyPI mirrors if persistent issues
-- **Dependency conflicts**: The package has complex ML dependencies (PyTorch, etc.) which may cause conflicts
-
-## Repository Structure and Navigation
-
-### Key Directories
-- `src/hyrax/`: Main package source code
-- `src/hyrax_cli/`: CLI entry point (`main.py`)
-- `src/hyrax/verbs/`: Command implementations (train, infer, download, etc.)
-- `src/hyrax/data_sets/`: Dataset implementations
-- `src/hyrax/models/`: Model definitions
-- `src/hyrax/vector_dbs/`: Vector database implementations (ChromaDB, Qdrant)
-- `tests/hyrax/`: Unit tests
-- `docs/`: Documentation source files
-- `benchmarks/`: Performance benchmarks
-- `example_notebooks/`: Example Jupyter notebooks
-
-### Important Files
-- `pyproject.toml`: Project configuration, dependencies, scripts
-- `src/hyrax/hyrax_default_config.toml`: Default configuration template
-- `.setup_dev.sh`: Development environment setup script
-- `.pre-commit-config.yaml`: Pre-commit hook configuration
-- `.github/workflows/`: CI/CD pipeline definitions
-
-### Configuration System
-- Default config: `src/hyrax/hyrax_default_config.toml`
-- Users can override with custom config files via `--runtime-config`
-- Config sections: `[general]`, `[model]`, `[train]`, `[data_set]`, `[download]`, etc.
-
-## Common Tasks and Workflows
-
-### Adding New Features
-1. **ALWAYS** run full validation first: `python -m pytest -m "not slow"`
-2. Make changes in appropriate `src/hyrax/` subdirectory
-3. Add tests in `tests/hyrax/` following existing patterns
-4. **ALWAYS** run: `ruff format src/ tests/ && ruff check src/ tests/`
-5. **ALWAYS** run: `python -m pytest -m "not slow"` (timeout: 10+ minutes)
-6. **ALWAYS** run: `pre-commit run --all-files` (timeout: 15+ minutes)
-
-### Working with Models
-- Models defined in `src/hyrax/models/`
-- Built-in models: `HyraxAutoencoder`, `HyraxCNN`
-- Model registry system automatically discovers models
-- General model configuration in `[model]` section of config files
-- Configurations for specific models in `[model.]` sections
-- Training via `hyrax train` command
-- Export to ONNX format supported
-
-### Working with Data
-- Data loaders in `src/hyrax/data_sets/`
-- Built-in datasets: `HSCDataSet`, `HyraxCifarDataset`, `LSSTDataset`, `FitsImageDataSet`
-- Dataset splits: train/validation/test controlled by config
-- Configuration in `[data_set]` section
-- Default data directory: `./data/`
-- Sample data includes HSC1k dataset for testing
-
-### Working with Vector Databases
-- Implementations in `src/hyrax/vector_dbs/`
-- Supported: ChromaDB, Qdrant
-- Commands: `save_to_database`, `database_connection`
-- Configuration in `[vector_db]` section
-
-## Notebook Development
-- Jupyter integration via `holoviews`, `bokeh` for visualizations
-- Interactive visualization via `hyrax visualize` verb
-- Pre-executed examples in `docs/pre_executed/`
-
-## CI/CD and GitHub Workflows
-- Main workflows in `.github/workflows/`
-- **Testing**: `testing-and-coverage.yml` runs on PRs and main branch
-- **Smoke test**: `smoke-test.yml` runs daily
-- **Documentation**: `build-documentation.yml` builds docs
-- **Benchmarks**: ASV benchmarks via `asv-*.yml` workflows
-- **Pre-commit**: Automated via `pre-commit-ci.yml`
-
-## Troubleshooting
-- **Import errors**: Ensure `pip install -e .'[dev]'` completed successfully
-- **Network timeouts during install**: Retry installation multiple times, may require 3-5 attempts due to PyPI connectivity issues
-- **ReadTimeoutError**: Common during installation - wait 1-2 minutes and retry the same pip command
-- **CLI not found**: Verify installation with `pip list | grep hyrax`
-- **Tests failing**: Check if in virtual environment and dependencies installed
-- **Pre-commit issues**: Run `pre-commit install` if hooks not working
-- **Permission issues**: Use `--user` flag with pip if encountering permission errors
-- **Virtual environment**: Always use conda/venv to avoid system Python conflicts
-
-## Performance Notes
-- Vector database operations can be slow with large datasets
-- Benchmarks available in `benchmarks/` directory (run with `asv` tool)
-- Use `--timeout` parameters appropriately for long-running operations
-- ChromaDB performance degrades with vectors >10,000 elements
-- UMAP fitting limited to 1024 samples by default for performance
-- Benchmark tests include timing for CLI help commands, object construction, and vector DB operations
-
-## Common Command Reference
+# GitHub Copilot Instructions for Hyrax
+
+**🔗 For comprehensive project information, see [HYRAX_GUIDE.md](../HYRAX_GUIDE.md) in the repository root**
+
+Hyrax is a low-code Python framework for machine learning in astronomy. This file provides GitHub Copilot-specific guidance.
+
+## Quick Reference
+
+**Project essentials:**
+- Python 3.9+ with PyTorch, TOML config, CLI-first (`hyrax` command with verbs)
+- Workflows: Data download → Training → Inference → Visualization → Vector search
+- Plugin architecture: Models, datasets, and verbs auto-register via decorators
+- Configuration: TOML files with Pydantic validation, hierarchical merging
+
+**For detailed information on:**
+- Design principles and architectural conventions → [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#design-principles)
+- Repository structure and key files → [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#repository-structure)
+- Configuration system → [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#configuration-system)
+- Plugin architecture (models, datasets, verbs) → [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#plugin-architecture-via-registries)
+- Adding new components → [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#adding-new-components)
+- Data flow through system → [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#data-flow)
+
+## Critical Guidelines for GitHub Copilot
+
+### Always Follow These Instructions First
+
+Trust these instructions and only search for additional context if information here or in [HYRAX_GUIDE.md](../HYRAX_GUIDE.md) is incomplete or incorrect.
+
+### Command Execution - Long-Running Operations
+
+**CRITICAL: Never cancel these commands.** Allow sufficient time for completion:
+
+| Operation | Duration | Required Timeout |
+|-----------|----------|------------------|
+| `bash .setup_dev.sh` | 5-15 min | 20+ minutes |
+| `pip install -e .'[dev]'` | 5-15 min | 20+ minutes |
+| `pytest -m "not slow"` | 2-5 min | 10+ minutes |
+| `pytest` (all tests) | 15-25 min | 45+ minutes |
+| `pytest -m slow` | 10-20 min | 30+ minutes |
+| `pre-commit run --all-files` | 3-8 min | 15+ minutes |
+| `sphinx-build` (docs) | 2-4 min | 10+ minutes |
+
+**Network issues:** Installation commands may encounter `ReadTimeoutError` from PyPI. If this occurs:
+1. Wait 1-2 minutes
+2. Retry the exact same command
+3. May require 3-5 attempts to succeed
+
+### Development Setup
+
 ```bash
-# Full development setup
+# Environment setup
 conda create -n hyrax python=3.10 && conda activate hyrax
 git clone https://github.com/lincc-frameworks/hyrax.git && cd hyrax
+
+# Recommended: Automated setup script
 echo 'y' | bash .setup_dev.sh
+# Installs with pip install -e .'[dev]' and sets up pre-commit hooks
+# Prompts for system install if no venv - respond 'y'
-# Quick validation workflow
-ruff check src/ tests/ && ruff format src/ tests/
-python -m pytest -m "not slow"
-pre-commit run --all-files
-```
\ No newline at end of file
+# Alternative: Manual installation
+pip install -e .'[dev]' && pre-commit install
+```
+
+### Essential Commands
+
+See [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#essential-commands) for full command reference.
+
+```bash
+# Testing
+pytest -m "not slow"  # Fast tests (2-5 min)
+pytest -n auto -m "not slow"  # Parallel fast tests
+pytest -m slow  # Slow/E2E tests (10-20 min)
+pytest  # All tests (15-25 min)
+
+# Code quality
+ruff format . && ruff check --fix .  # Format and lint (30 sec)
+pre-commit run --all-files  # All checks (3-8 min)
+
+# CLI
+hyrax --help  # List verbs
+hyrax --help  # Verb-specific help
+hyrax -c config.toml  # Run with config
+
+# Documentation
+sphinx-build -M html ./docs ./_readthedocs -T -E -d ./docs/_build/doctrees
+```
+### Validation After Changes
+
+**CRITICAL: Always run these validation steps:**
+1. Format and lint: `ruff format . && ruff check --fix .` (30 seconds)
+2. Fast tests: `pytest -m "not slow"` (2-5 min, NEVER CANCEL)
+3. Pre-commit: `pre-commit run --all-files` (3-8 min, NEVER CANCEL)
+
+**Manual validation scenarios:**
+1. CLI: `hyrax --help` and `hyrax --version`
+2. Import: `python -c "import hyrax; h = hyrax.Hyrax(); print('Success')"`
+3. Config loading: Verify `hyrax.Hyrax()` constructor works
+4. Relevant verbs: Test with `hyrax --help`
+
+## Key Implementation Details
+
+### Configuration System Pitfalls
+
+- **Use ConfigDict, not dict**: ConfigDict catches missing defaults at runtime
+- **All keys need defaults**: Add to `src/hyrax/hyrax_default_config.toml`
+- **Config is immutable**: No runtime mutations allowed after creation
+- **Pydantic validation**: Use schemas in `src/hyrax/config_schemas/` for validation
+
+### Model Interface Requirements
+
+Models MUST implement (see [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#plugin-architecture-via-registries)):
+- `forward()`: Forward pass through model
+- `train_step()`: Single training step
+- `prepare_inputs()`: Data preparation (replaces deprecated `to_tensor()`)
+
+Use `@hyrax_model` decorator for auto-registration and shape inference.
+
+### Testing Requirements
+
+- Mark long tests: `@pytest.mark.slow` (>5 min)
+- Fast tests in pre-commit and CI (<5 min total)
+- Always run fast tests after changes: `pytest -m "not slow"`
+- Test fixtures in `tests/hyrax/conftest.py`
+- Sample data via Pooch from Zenodo DOIs
+
+### Pre-commit Hooks Include
+
+- ruff linting and formatting
+- pytest fast tests (not slow)
+- sphinx documentation build
+- jupyter notebook conversion
+- Custom hook: prevents note-to-self comments
+
+## Repository Structure
+
+See [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#repository-structure) for complete details.
+
+**Quick navigation:**
+```
+src/hyrax/
+  ├── hyrax.py  # Main orchestration class
+  ├── config_utils.py  # ConfigManager, ConfigDict
+  ├── plugin_utils.py  # Dynamic plugin loading
+  ├── train.py, pytorch_ignite.py  # Training infrastructure
+  ├── hyrax_default_config.toml  # Default configuration
+  ├── models/model_registry.py  # @hyrax_model decorator
+  ├── data_sets/data_set_registry.py  # Dataset registration
+  ├── verbs/verb_registry.py  # @hyrax_verb decorator
+  ├── config_schemas/  # Pydantic validation
+  └── vector_dbs/  # ChromaDB, Qdrant
+
+src/hyrax_cli/main.py  # CLI entry point
+tests/hyrax/conftest.py, test_e2e.py  # Test fixtures, E2E tests
+.github/workflows/  # CI/CD pipelines
+```
+
+## Important Conventions
+
+See [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#code-style-and-conventions) for complete list.
+
+1. **Immutable Config**: ConfigDict prevents mutations; all keys need defaults
+2. **Timestamped Results**: Verbs create unique directories (`YYYYMMDD-HHMMSS--`)
+3. **Automatic Registration**: Use decorators (`@hyrax_model`, `@hyrax_verb`) or `__init_subclass__`
+4. **Batch Indexing**: Inference includes `batch_index.npy` for ordered retrieval
+5. **Transform Stacking**: `HyraxImageDataset._update_transform()` composes transforms
+6. **External Plugins**: Config detects `name = "pkg.Class"`, auto-loads `pkg/default_config.toml`
+
+## Common Workflows
+
+### Adding New Model
+See [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#adding-a-new-model) for details.
+1. Subclass `torch.nn.Module` in `src/hyrax/models/`
+2. Add `@hyrax_model("ModelName")` decorator
+3. Implement: `forward()`, `train_step()`, `prepare_inputs()`
+4. Available via: `hyrax train -c config.toml` (with `model.name = "ModelName"`)
+
+### Adding New Dataset
+See [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#adding-a-new-dataset) for details.
+1. Subclass `HyraxDataset` in `src/hyrax/data_sets/`
+2. Set `_name` class attribute (triggers auto-registration)
+3. Implement: `__len__()`, `__getitem__()`, metadata interface
+4. For images: subclass `HyraxImageDataset` for transform stacking
+
+### Adding New Verb
+See [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#adding-a-new-verb) for details.
+1. Create class in `src/hyrax/verbs/` with `run()` and `run_cli()`
+2. Add `@hyrax_verb("verb_name")` decorator
+3. Implement `setup_parser(parser)` for CLI args
+4. Set `add_parser_kwargs` for help text
+5. Available via: `hyrax verb_name [args]`
+
+## Common Issues
+
+- **Import errors**: Verify `pip install -e .'[dev]'` completed
+- **Network timeouts**: Retry 3-5 times with 1-2 min waits for PyPI connectivity
+- **CLI not found**: Check with `pip list | grep hyrax`
+- **Config key not found**: Add to `hyrax_default_config.toml`
+- **Model not registering**: Ensure `@hyrax_model` decorator present
+- **Verb not in CLI**: Ensure `@hyrax_verb` decorator present
+- **Pre-commit not running**: Run `pre-commit install`
+
+## CI/CD Workflows
+
+- **testing-and-coverage.yml**: Runs on PRs and main (pytest with coverage)
+- **smoke-test.yml**: Daily smoke tests
+- **build-documentation.yml**: Sphinx documentation builds
+- **asv-*.yml**: Performance benchmarks
+- **pre-commit-ci.yml**: Automated pre-commit checks
\ No newline at end of file
diff --git a/CLAUDE.md b/CLAUDE.md
index fa0a36fe..9309517d 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,255 +1,210 @@
 # CLAUDE.md
-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-
-## Project Overview
-Hyrax is designed to be a low-code solution for rapid experimentation with machine learning in astronomy
-
-Hyrax helps scientists/astronomers handle much of the boilerplate code that is often required for a machine learning project in astronomy so that users can focus on their model development and downstream science.
+This file provides Claude Code (claude.ai/code) specific guidance when working with this repository.
+
+**🔗 For comprehensive project information, architecture, and workflows, see [HYRAX_GUIDE.md](./HYRAX_GUIDE.md)**
+
+## Quick Reference
+
+Hyrax is a low-code Python tool for machine learning in astronomy. Key facts:
+- **Tech Stack**: Python 3.9+, PyTorch, TOML configuration, CLI-first design
+- **Main Workflows**: Data download → Training → Inference → Visualization → Vector search
+- **Entry Point**: `hyrax` CLI with verb-based commands (train, infer, umap, visualize, etc.)
+- **Configuration**: TOML files with Pydantic validation, hierarchical merging
+- **Testing**: pytest with `@pytest.mark.slow` for long tests, parallel execution with `-n auto`
+
+**Always refer to [HYRAX_GUIDE.md](./HYRAX_GUIDE.md) for:**
+- Design principles and architectural conventions
+- Repository structure and key files
+- Configuration system details
+- Plugin architecture (models, datasets, verbs)
+- Adding new components
+- Data flow through the system
+
+## Claude-Specific Guidance
+
+### Command Execution Strategy
+
+**CRITICAL: Never cancel long-running commands.** Hyrax has several operations that require extended execution time:
+
+| Command | Typical Duration | Minimum Timeout |
+|---------|------------------|-----------------|
+| `bash .setup_dev.sh` | 5-15 minutes | 20 minutes |
+| `pip install -e .'[dev]'` | 5-15 minutes | 20 minutes |
+| `pytest -m "not slow"` | 2-5 minutes | 10 minutes |
+| `pytest` (all tests) | 15-25 minutes | 45 minutes |
+| `pytest -m slow` | 10-20 minutes | 30 minutes |
+| `pre-commit run --all-files` | 3-8 minutes | 15 minutes |
+| `sphinx-build ...` | 2-4 minutes | 10 minutes |
+| `ruff check/format` | 10-30 seconds | 2 minutes |
+
+**Network Issues**: Installation commands may encounter `ReadTimeoutError` due to PyPI connectivity. If this occurs:
+1. Wait 1-2 minutes
+2. Retry the exact same command
+3. May require 3-5 retry attempts
+
+### Task Delegation with Sub-Agents
+
+Claude Code provides specialized sub-agents via the `task` tool. Use them proactively:
+
+**When to use the `explore` agent:**
+- Questions requiring codebase understanding or synthesis
+- Multi-step searches requiring analysis
+- When you want a summarized answer, not raw grep/glob results
+- Examples: "How does authentication work?", "Where are API endpoints defined?"
+
+**When to use the `task` agent:**
+- Executing commands with verbose output (tests, builds, lints, dependency installs)
+- Returns brief summary on success, full output on failure
+- Keeps main context clean by minimizing successful output
+
+**When to use direct tools (grep/glob):**
+- Simple, targeted single searches where you know what to find
+- Need results immediately in your context
+- Looking for something specific, not discovering something unknown
+
+**Parallel searches** - Call multiple grep/glob in ONE response:
+```python
+# Good: Parallel search calls
+grep(pattern="function handleSubmit", glob="*.ts")
+grep(pattern="interface FormData", glob="*.ts")
+glob(pattern="**/*.tsx")
+```
-Hyrax supports a few primary workflows:
-1. Downloading/Accessing data from specific public data repositories (e.g. HSC, Rubin-LSST)
-2. Training supervised/unsupervised ML algorithms using the above data or other data a user chooses to bring to Hyrax
-3. Performing inference using the above models
-4. Building interactive two and three dimensional latent spaces using the above tools
-4. Building vector databased with inference results for rapid similarity search and outlier detection.
+### Working Patterns
-Hyrax is model-agnostic and extensible, supporting any PyTorch-based algorithm.
+**Initial exploration:**
+1. Use `explore` agent for codebase questions: "What does this module do?"
+2. Use grep/glob for targeted searches: "Find all test files"
+3. View key files identified: config files, main modules
-## Development Setup
+**Making changes:**
+1. Always validate first: run relevant fast tests to establish baseline
+2. Make minimal, surgical changes
+3. Test immediately after changes: `pytest tests/hyrax/test_.py`
+4. Format and lint: `ruff format . && ruff check --fix .`
+5. Run full validation: `pytest -m "not slow"` (NEVER CANCEL, 10+ min timeout)
+6. Run pre-commit: `pre-commit run --all-files` (NEVER CANCEL, 15+ min timeout)
+**Common validation workflow:**
 ```bash
-# Clone and setup environment
-git clone https://github.com/lincc-frameworks/hyrax.git
-conda create -n hyrax python=3.10
-conda activate hyrax
+# Quick format/lint (30 seconds)
+ruff format src/ tests/ && ruff check src/ tests/
-# For developers - installs package in editable mode, dev dependencies, and pre-commit hooks
-bash .setup_dev.sh
+# Fast tests (2-5 minutes, NEVER CANCEL)
+pytest -m "not slow"
-# Manual installation
-pip install -e .  # Runtime dependencies only
-pip install -e .'[dev]'  # Include dev dependencies
+# Pre-commit (3-8 minutes, NEVER CANCEL)
+pre-commit run --all-files
 ```
-## Common Commands
+### Manual Validation After Changes
-### Testing
-```bash
-# Run all tests (excluding slow tests)
-pytest -m "not slow"
+After making code changes, ALWAYS run these validation scenarios:
-# Run tests in parallel
-pytest -n auto -m "not slow"
+1. **CLI functionality**: `hyrax --help` and `hyrax --version` ensure CLI works
+2. **Import test**: `python -c "import hyrax; h = hyrax.Hyrax(); print('Success')"`
+3. **Configuration loading**: Verify config loads correctly
+4. **Verb functionality**: Test relevant verbs like `hyrax train --help`
-# Run with coverage
-pytest -n auto --cov=./src --cov-report=html -m "not slow"
+### Important Notes for Claude Code
-# Run slow tests (includes end-to-end tests)
-pytest -m slow
+**Batch editing**: Use the `edit` tool multiple times in a single response for:
+- Renaming variables across multiple locations in the same file
+- Editing non-overlapping blocks in the same or different files
+- Applying the same pattern across multiple files
-# Run specific test file
-pytest tests/hyrax/test_config_utils.py
+**Configuration system pitfalls**:
+- Use `ConfigDict` instead of regular dict to catch missing defaults at runtime
+- All config keys MUST have defaults in `hyrax_default_config.toml`
+- Config is immutable after creation - no runtime mutations allowed
-# Run specific test function
-pytest tests/hyrax/test_infer.py::test_infer_basic
-```
+**Model interface requirements**:
+- Models MUST implement: `forward()`, `train_step()`, `prepare_inputs()`
+- Note: `to_tensor()` is deprecated, use `prepare_inputs()` instead
+- Use `@hyrax_model` decorator for auto-registration
-### Linting
-```bash
-# Run ruff linting with auto-fix
-ruff check --fix .
+**Testing requirements**:
+- Mark long-running tests with `@pytest.mark.slow`
+- Fast tests (<5 min) run in pre-commit and CI
+- Slow tests (>5 min) run separately
+- Always run fast tests after changes: `pytest -m "not slow"`
-# Format code with ruff
-ruff format .
+**Pre-commit hooks include**: +- ruff linting and formatting +- pytest fast tests (not slow) +- sphinx documentation build +- jupyter notebook conversion +- Custom hook preventing note-to-self comments -# Run pre-commit hooks manually -pre-commit run --all-files -``` +## Key File Locations -### Documentation -```bash -# Build documentation locally -sphinx-build -M html ./docs ./_readthedocs -T -E -d ./docs/_build/doctrees -``` +Reference [HYRAX_GUIDE.md](./HYRAX_GUIDE.md#repository-structure) for full structure. Quick access: -### CLI Usage -```bash -# Show available verbs (commands) -hyrax --help - -# Show version -hyrax --version - -# Run with custom config -hyrax -c path/to/config.toml - -# Common verbs -hyrax train # Train a model -hyrax infer # Run inference to generate latent space -hyrax umap # Dimensionality reduction on inference results -hyrax visualize # Interactive visualization -hyrax save_to_database # Populate vector DB from inference -hyrax lookup # Query vector DB ``` +src/hyrax/ + ├── hyrax.py # Main Hyrax class + ├── config_utils.py # ConfigManager, ConfigDict + ├── plugin_utils.py # get_or_load_class() for dynamic loading + ├── train.py # Training orchestration + ├── pytorch_ignite.py # Dataset, model, dataloader setup + ├── hyrax_default_config.toml # Default configuration + ├── models/model_registry.py # Model registration, @hyrax_model + ├── data_sets/data_set_registry.py # Dataset registration + ├── verbs/verb_registry.py # Verb registration, @hyrax_verb + ├── config_schemas/ # Pydantic validation schemas + └── vector_dbs/ # ChromaDB, Qdrant implementations + +src/hyrax_cli/main.py # CLI entry point + +tests/hyrax/ + ├── conftest.py # Shared test fixtures + └── test_e2e.py # End-to-end integration tests + +.github/ + ├── copilot-instructions.md # GitHub Copilot instructions + └── workflows/ # CI/CD pipelines +``` + +## Common Pitfalls and Solutions -## Architecture Overview +**Pitfall**: Forgetting to activate virtual environment +- 
**Solution**: Always check with `which python` or `pip list | grep hyrax` -### Core Design Pattern: Plugin Architecture via Registries +**Pitfall**: Tests failing due to network issues during fixture download +- **Solution**: Tests use Pooch for reproducible downloads from Zenodo - retry if network fails -Hyrax uses three primary registries for extensibility: +**Pitfall**: Pre-commit hooks not running +- **Solution**: Ensure `pre-commit install` was run after `pip install` -1. **MODEL_REGISTRY** (`models/model_registry.py`): Maps model names to PyTorch nn.Module classes - - The `@hyrax_model` decorator auto-registers models and injects standard interface methods - - Models must implement: `forward()`, `train_step()`, `to_tensor()` - - Automatic shape inference from dataset samples +**Pitfall**: Config key not found errors +- **Solution**: Add missing key to `hyrax_default_config.toml` with sensible default -2. **DATA_SET_REGISTRY** (`data_sets/data_set_registry.py`): Maps dataset names to HyraxDataset classes - - Uses `__init_subclass__` for automatic registration when subclasses are defined - - Base class provides metadata interface, ID generation, catalog access +**Pitfall**: Model not registering +- **Solution**: Ensure `@hyrax_model("ModelName")` decorator is present and file is imported -3. **VERB_REGISTRY** (`verbs/verb_registry.py`): Maps CLI command names to Verb classes - - The `@hyrax_verb` decorator registers verbs - - Verbs can be class-based (with `run()` and `run_cli()` methods) or function-based +**Pitfall**: Verb not appearing in CLI +- **Solution**: Ensure `@hyrax_verb("verb_name")` decorator is present and verb is imported -### External Plugin Support +## Quick Command Reference -External libraries can provide custom models/datasets/verbs by: -1. Setting config values like `name = "external_pkg.model.CustomModel"` -2. Providing a `default_config.toml` file in the package root -3. 
Hyrax's `get_or_load_class()` in `plugin_utils.py` handles dynamic import and config merging
+See [HYRAX_GUIDE.md](./HYRAX_GUIDE.md#essential-commands) for full command reference.
-### Configuration System
+```bash
+# Development setup
+conda create -n hyrax python=3.10 && conda activate hyrax
+cd hyrax && echo 'y' | bash .setup_dev.sh # NEVER CANCEL: 20+ min timeout
-- **TOML-based hierarchical configuration** with strong validation
-- **ConfigManager** merges: `hyrax_default_config.toml` + external library configs + user runtime config
-- **ConfigDict** enforces all keys must have defaults (prevents silent config bugs)
-- Automatic path resolution for relative paths
-- Use `ConfigDict` instead of regular dict in new code to catch missing defaults at runtime
+# Quick validation (run after changes)
+ruff format src/ tests/ && ruff check src/ tests/ # 30 seconds
+pytest -m "not slow" # NEVER CANCEL: 10+ min
+pre-commit run --all-files # NEVER CANCEL: 15+ min
-### Data Flow Through the System
+# Specific tests
+pytest tests/hyrax/test_config_utils.py # Single file
+pytest tests/hyrax/test_infer.py::test_infer_basic # Single test
-```
-1. DOWNLOAD (optional)
- - Catalog (FITS) → Downloader → Cutout images + manifest.fits
- - Stored in config[general][data_dir]
-
-2. PREPROCESSING (implicit in dataset)
- - Dataset loads raw images → applies transforms (crop, tanh, etc.)
- - Split into train/validate/test via SubsetSequentialSampler
- - DataLoader batching with optional caching
-
-3. TRAINING
- - train.py orchestrates: setup_dataset → setup_model → create_trainer
- - Model.train_step() called per batch
- - Checkpoints saved to timestamped results_dir
- - MLflow logs metrics/params
-
-4. LATENT SPACE (Inference)
- - Infer verb: Model.forward(batch) → latent vectors
- - InferenceDataSetWriter saves: batch_.npy files + batch_index.npy
- - Optional: SaveToDatabase → ChromaDB for similarity search
- - Umap verb: reduces latent space to 2D/3D
-
-5. 
VISUALIZATION/SEARCH
- - Visualize: InferenceDataSet reads umap results → Holoviews scatter plot
- - Lookup: Query ChromaDB by ID or vector → k-nearest neighbors
+# CLI verification
+hyrax --help && hyrax --version # Verify CLI works
 ```
-### Key Abstractions
-
-**Hyrax class** (`hyrax.py`): Central orchestration interface that wraps all functionality. Provides both programmatic and CLI access to all verbs via dynamic `__getattr__` that instantiates verb classes on demand.
-
-**HyraxDataset** (`data_sets/`): Base class for all datasets
-- Subclasses automatically register via `__init_subclass__`
-- Must provide metadata interface (fields, catalog data)
-- `HyraxImageDataset` mixin provides transform stacking via `_update_transform()`
-- Built-in datasets: HSCDataSet, LSSTDataset, FitsImageDataSet, HyraxCifarDataSet, InferenceDataSet
-
-**Model Registration**: The `@hyrax_model` decorator provides:
-- Automatic shape inference by sampling the dataset
-- Standardized save/load via PyTorch state_dict
-- Criterion and optimizer loading from config
-- Injection of common interface methods
-
-**Verb Pattern**: Base `Verb` class with `run()` (programmatic) and `run_cli()` (CLI) methods
-- CLI autodiscovery via `all_verbs()` in registry
-- Class-based verbs: Infer, Umap, Visualize, SaveToDatabase, Lookup, DatabaseConnection
-- Function-based verbs: train, download, prepare, rebuild_manifest
-
-**Result Chaining**: Verbs create timestamped result directories (`YYYYMMDD-HHMMSS--`)
-- `find_most_recent_results_dir()` enables automatic chaining between verbs
-- InferenceDataSet preserves original dataset config for metadata access
-
-### Training Infrastructure
-
-- **PyTorch Ignite-based** distributed training (`pytorch_ignite.py`, `train.py`)
-- `setup_dataset()`: Instantiates dataset from config
-- `setup_model()`: Instantiates model, infers shape from dataset
-- `dist_data_loader()`: Creates distributed data loaders with train/validate/test splits
-- `create_trainer()`: 
Training engine with checkpointing, progress bars
-- MLflow integration for experiment tracking
-- TensorboardX for metric logging
-
-### Testing Conventions
-
-- **End-to-end tests** in `test_e2e.py` are parametrized across model/dataset combinations
-- Use `@pytest.mark.slow` for long-running tests (skipped in pre-commit and CI)
-- Test fixtures in `tests/hyrax/conftest.py` provide shared setup
-- Sample data uses Pooch for reproducible downloads from Zenodo DOIs
-- Pre-commit hook runs fast tests only: `pytest -n auto --cov=./src -m 'not slow'`
-
-## Important Architectural Conventions
-
-1. **Immutable Config**: ConfigDict prevents runtime mutations; all keys must have defaults
-2. **Timestamped Results**: Every verb execution creates a unique directory preventing overwrites
-3. **Metadata Preservation**: InferenceDataSet stores original dataset config to maintain catalog access
-4. **Automatic Registration**: Use decorators (`@hyrax_model`, `@hyrax_verb`) or `__init_subclass__` - no manual registration
-5. **Batch Indexing**: Inference results include `batch_index.npy` mapping object_ids → batch files (critical for ordered retrieval)
-6. **Transform Stacking**: HyraxImageDataset uses `_update_transform()` to compose torchvision transforms in sequence
-7. **Distributed Training**: PyTorch Ignite's `idist.auto_dataloader()` abstracts single/multi-GPU execution
-8. 
**External Library Support**: Config system detects `name = "pkg.Class"` and auto-loads `pkg/default_config.toml`
-
-## Code Style
-
-- **Line length**: 110 characters (configured in pyproject.toml)
-- **Python version**: >= 3.9, target 3.9 for compatibility
-- **Linter**: ruff (replaces black, isort, flake8)
-- **Docstrings**: Required for public classes and functions (enforced by ruff D101, D102, D103, D106)
-- **Pre-commit hooks**: Run automatically on commit (includes ruff, pytest, sphinx-build, jupyter nbconvert)
-- **No note-to-self comments**: Custom pre-commit hook prevents placeholder comments from being committed
-
-## Key Files and Modules
-
-- `src/hyrax/hyrax.py`: Main Hyrax orchestration class
-- `src/hyrax/config_utils.py`: Configuration system (ConfigManager, ConfigDict)
-- `src/hyrax/plugin_utils.py`: Dynamic plugin loading (`get_or_load_class`)
-- `src/hyrax/train.py`: Training orchestration with PyTorch Ignite
-- `src/hyrax/pytorch_ignite.py`: Setup functions for datasets, models, data loaders
-- `src/hyrax_cli/main.py`: CLI entry point with auto-discovered verb subparsers
-- `src/hyrax/models/model_registry.py`: Model registration and `@hyrax_model` decorator
-- `src/hyrax/data_sets/data_set_registry.py`: Dataset registration
-- `src/hyrax/verbs/verb_registry.py`: Verb registration and `@hyrax_verb` decorator
-- `src/hyrax/vector_dbs/`: Vector database abstraction (ChromaDB implementation)
-- `tests/hyrax/test_e2e.py`: End-to-end integration tests
-
-## Adding New Components
-
-### Adding a New Model
-1. Subclass `torch.nn.Module` in `src/hyrax/models/`
-2. Add `@hyrax_model` decorator with a unique name
-3. Implement required methods: `forward()`, `train_step()`, `to_tensor()`
-4. Model will auto-register and be available via CLI: `hyrax train -c config.toml` (with `model.name = "YourModelName"`)
-
-### Adding a New Dataset
-1. Subclass `HyraxDataset` in `src/hyrax/data_sets/`
-2. 
Set `_name` class attribute (triggers auto-registration via `__init_subclass__`)
-3. Implement required methods: `__len__()`, `__getitem__()`, metadata interface
-4. For image datasets, subclass `HyraxImageDataset` to get transform stacking
-
-### Adding a New Verb
-1. Create a class in `src/hyrax/verbs/` that implements `run()` and optionally `run_cli()`
-2. Add `@hyrax_verb("verb_name")` decorator
-3. Implement `setup_parser(parser)` class method for CLI argument parsing
-4. Set `add_parser_kwargs` class attribute for help text
-5. Verb will be available via CLI: `hyrax verb_name [args]` and programmatically: `hyrax_instance.verb_name()`
diff --git a/HYRAX_GUIDE.md b/HYRAX_GUIDE.md
new file mode 100644
index 00000000..8066410f
--- /dev/null
+++ b/HYRAX_GUIDE.md
@@ -0,0 +1,319 @@
+# Hyrax Development Guide
+
+This guide provides essential information for working with the Hyrax codebase. It is referenced by both CLAUDE.md and .github/copilot-instructions.md.
+
+## Project Overview
+
+Hyrax is a Python-based tool for hunting rare and anomalous sources in large astronomical imaging surveys. It provides a low-code solution for rapid experimentation with machine learning in astronomy.
+
+### Core Purpose
+Hyrax helps scientists/astronomers handle much of the boilerplate code that is often required for a machine learning project in astronomy so that users can focus on their model development and downstream science.
+
+### Primary Workflows
+1. **Data Access**: Downloading/accessing data from specific public data repositories (e.g. HSC, Rubin-LSST)
+2. **Training**: Training supervised/unsupervised ML algorithms using astronomical data
+3. **Inference**: Performing inference to generate latent representations
+4. **Visualization**: Building interactive 2D and 3D latent spaces
+5. 
**Vector Search**: Building vector databases with inference results for rapid similarity search and outlier detection
+
+### Technology Stack
+- **Language**: Python >= 3.9 (target 3.9 for compatibility)
+- **ML Framework**: PyTorch with PyTorch Ignite for distributed training
+- **Configuration**: TOML-based hierarchical configuration system
+- **CLI**: Verb-based command interface via `hyrax` command
+- **Testing**: pytest with parallel execution support
+- **Linting**: ruff (replaces black, isort, flake8)
+- **Documentation**: Sphinx with ReadTheDocs
+
+## Design Principles
+
+### 1. Low Code Interface
+- Minimize user-facing APIs - prioritize configuration-driven workflows
+- Avoid API proliferation - don't create new APIs we'll need to maintain indefinitely
+- Favor declarative over imperative configuration
+- CLI-first approach with verb-based commands (`hyrax train`, `hyrax infer`, etc.)
+
+### 2. Make Easy Things Easy, Hard Things Possible
+- Default workflows should "just work" with minimal configuration
+- Progressive complexity - simple tasks simple, advanced features available when needed
+- Sensible defaults in `hyrax_default_config.toml`
+- Clear extension points via base classes (`Verb`, model base classes, dataset classes)
+
+### 3. Support Reproducibility
+- Configuration files serve as complete records of experiments
+- Version tracking for models, data, and configurations
+- Manifest files for downloaded data and processed results
+- MLflow integration for systematic experiment logging
+- ONNX export support for long-term reproducibility
+
+### 4. 
Smooth and Legible Migration When APIs Change
+- Clear deprecation warnings with helpful messages
+- Migration guides in documentation with before/after examples
+- Backward compatibility when possible
+- Pydantic schemas for config validation with helpful error messages
+- Comprehensive changelog with breaking change notifications
+
+## Development Setup
+
+### Environment Setup
+```bash
+# Create and activate virtual environment
+conda create -n hyrax python=3.10
+conda activate hyrax
+
+# Clone repository
+git clone https://github.com/lincc-frameworks/hyrax.git
+cd hyrax
+
+# Install for development (recommended)
+bash .setup_dev.sh
+# This script:
+# - Installs package with pip install -e .'[dev]'
+# - Sets up pre-commit hooks
+# - Takes 5-15 minutes depending on network
+# - Prompts for system install if no venv detected - respond 'y'
+
+# Alternative manual installation
+pip install -e .'[dev]' # Install with dev dependencies
+pre-commit install # Set up pre-commit hooks
+```
+
+### Common Issues During Setup
+- **ReadTimeoutError**: Installation may fail due to PyPI connectivity - retry multiple times if needed
+- **Permission errors**: Use `--user` flag with pip if encountering permission errors
+- **Virtual environment**: Always use conda/venv to avoid system Python conflicts
+
+## Essential Commands
+
+### Testing
+```bash
+# Fast tests (default, excludes slow tests)
+pytest -m "not slow" # 2-5 minutes
+pytest -n auto -m "not slow" # Parallel execution
+
+# Tests with coverage
+pytest -n auto --cov=./src --cov-report=html -m "not slow"
+
+# Slow tests (includes end-to-end tests)
+pytest -m slow # 10-20 minutes
+
+# All tests
+pytest # 15-25 minutes
+pytest -n auto # Parallel, faster
+
+# Specific test file or function
+pytest tests/hyrax/test_config_utils.py
+pytest tests/hyrax/test_infer.py::test_infer_basic
+```
+
+### Code Quality
+```bash
+# Linting and formatting
+ruff check --fix . # 10-30 seconds
+ruff format . 
# 10-30 seconds
+
+# Pre-commit hooks (run all checks)
+pre-commit run --all-files # 3-8 minutes
+
+# Build documentation
+sphinx-build -M html ./docs ./_readthedocs -T -E -d ./docs/_build/doctrees
+```
+
+### CLI Usage
+```bash
+# Get help
+hyrax --help # List all verbs/commands
+hyrax --version # Show version
+hyrax <verb> --help # Help for specific verb
+
+# Common verbs
+hyrax train -c config.toml # Train a model
+hyrax infer -c config.toml # Generate latent representations
+hyrax umap -c config.toml # Dimensionality reduction
+hyrax visualize # Interactive visualization
+hyrax save_to_database # Populate vector DB
+hyrax lookup # Query vector DB
+```
+
+## Architecture Overview
+
+### Plugin Architecture via Registries
+
+Hyrax uses three primary registries for extensibility:
+
+1. **MODEL_REGISTRY** (`models/model_registry.py`)
+ - Maps model names to PyTorch nn.Module classes
+ - `@hyrax_model` decorator auto-registers models
+ - Models must implement: `forward()`, `train_step()`, `prepare_inputs()` (formerly `to_tensor()`)
+ - Automatic shape inference from dataset samples
+
+2. **DATA_SET_REGISTRY** (`data_sets/data_set_registry.py`)
+ - Maps dataset names to HyraxDataset classes
+ - Auto-registration via `__init_subclass__` when subclasses defined
+ - Base class provides metadata interface, ID generation, catalog access
+
+3. 
**VERB_REGISTRY** (`verbs/verb_registry.py`)
+ - Maps CLI command names to Verb classes
+ - `@hyrax_verb` decorator registers verbs
+ - Verbs can be class-based (`run()` and `run_cli()` methods) or function-based
+
+### Configuration System
+
+- **TOML-based hierarchical configuration** with strong validation via Pydantic schemas
+- **ConfigManager** merges: `hyrax_default_config.toml` + external library configs + user runtime config
+- **ConfigDict** enforces all keys must have defaults (prevents silent config bugs)
+- Automatic path resolution for relative paths
+- Config sections: `[general]`, `[model]`, `[train]`, `[data_set]`, `[download]`, etc.
+
+### External Plugin Support
+
+External libraries can provide custom models/datasets/verbs:
+1. Set config values like `name = "external_pkg.model.CustomModel"`
+2. Provide a `default_config.toml` file in the package root
+3. Hyrax's `get_or_load_class()` in `plugin_utils.py` handles dynamic import and config merging
+
+### Data Flow
+
+```
+DOWNLOAD (optional)
+ ↓ Catalog (FITS) → Downloader → Cutout images + manifest.fits
+
+PREPROCESSING (implicit in dataset)
+ ↓ Dataset loads raw images → applies transforms → train/validate/test splits
+
+TRAINING
+ ↓ train.py: setup_dataset → setup_model → create_trainer → checkpoints + MLflow logs
+
+INFERENCE
+ ↓ Model.forward(batch) → latent vectors → batch_*.npy files + batch_index.npy
+
+VECTOR DB / VISUALIZATION
+ ↓ ChromaDB for similarity search | UMAP → 2D/3D → Holoviews scatter plot
+```
+
+### Key Abstractions
+
+**Hyrax class** (`hyrax.py`): Central orchestration interface wrapping all functionality. Provides both programmatic and CLI access via dynamic `__getattr__` that instantiates verb classes on demand.
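
The `__getattr__` dispatch described for the Hyrax class can be illustrated with a small, self-contained sketch. This is not Hyrax's actual code: `VERBS`, `register_verb`, `Infer`, and `Orchestrator` are hypothetical stand-ins chosen for illustration, and the real registry and verb classes may differ in both names and behavior.

```python
# Hypothetical sketch of registry-backed verb dispatch; none of these names
# are taken from the Hyrax source.
VERBS = {}


def register_verb(name):
    """Class decorator that records a verb class under a CLI-style name."""
    def decorator(cls):
        VERBS[name] = cls
        return cls
    return decorator


@register_verb("infer")
class Infer:
    """Toy verb: holds the config and reports what it would run."""

    def __init__(self, config):
        self.config = config

    def run(self):
        return f"infer using model {self.config['model']['name']}"


class Orchestrator:
    """Stand-in for a central class that resolves verbs by attribute name."""

    def __init__(self, config):
        self.config = config

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails, so real
        # attributes such as self.config never reach this method.
        try:
            verb_cls = VERBS[name]
        except KeyError:
            raise AttributeError(f"no verb named {name!r}") from None
        # Instantiate the verb on demand and return its bound run() method.
        return verb_cls(self.config).run


h = Orchestrator({"model": {"name": "ExampleAutoencoder"}})
print(h.infer())
```

Raising `AttributeError` for unknown names keeps `hasattr()` and similar introspection working as expected on the orchestrating object.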
+
+**HyraxDataset** (`data_sets/`): Base class for all datasets
+- Subclasses auto-register via `__init_subclass__`
+- Must provide metadata interface (fields, catalog data)
+- `HyraxImageDataset` mixin provides transform stacking via `_update_transform()`
+- Built-in: HSCDataSet, LSSTDataset, FitsImageDataSet, HyraxCifarDataSet, InferenceDataSet
+
+**Model Registration**: `@hyrax_model` decorator provides:
+- Automatic shape inference by sampling dataset
+- Standardized save/load via PyTorch state_dict
+- Criterion and optimizer loading from config
+
+**Verb Pattern**: Base `Verb` class with `run()` (programmatic) and `run_cli()` (CLI) methods
+- CLI autodiscovery via `all_verbs()` in registry
+- Class-based: Infer, Umap, Visualize, SaveToDatabase, Lookup
+- Function-based: train, download, prepare, rebuild_manifest
+
+**Result Chaining**: Verbs create timestamped directories (`YYYYMMDD-HHMMSS--`)
+- `find_most_recent_results_dir()` enables automatic chaining between verbs
+- InferenceDataSet preserves original dataset config for metadata access
+
+### Training Infrastructure
+
+- **PyTorch Ignite-based** distributed training (`pytorch_ignite.py`, `train.py`)
+- `setup_dataset()`: Instantiates dataset from config
+- `setup_model()`: Instantiates model, infers shape from dataset
+- `dist_data_loader()`: Creates distributed data loaders with splits
+- `create_trainer()`: Training engine with checkpointing, progress bars
+- MLflow for experiment tracking, TensorboardX for metric logging
+
+## Repository Structure
+
+### Key Directories
+```
+src/hyrax/ # Main package source code
+ ├── models/ # Model definitions
+ ├── data_sets/ # Dataset implementations
+ ├── verbs/ # Command implementations
+ ├── vector_dbs/ # Vector database implementations (ChromaDB, Qdrant)
+ └── config_schemas/ # Pydantic schemas for configuration validation
+src/hyrax_cli/ # CLI entry point (main.py)
+tests/hyrax/ # Unit and integration tests
+docs/ # Documentation source files
+benchmarks/ # 
Performance benchmarks (ASV)
+example_notebooks/ # Example Jupyter notebooks
+```
+
+### Important Files
+- `pyproject.toml`: Project configuration, dependencies, CLI entry points
+- `src/hyrax/hyrax_default_config.toml`: Default configuration template
+- `.setup_dev.sh`: Development environment setup script
+- `.pre-commit-config.yaml`: Pre-commit hook configuration
+- `.github/workflows/`: CI/CD pipeline definitions
+
+## Code Style and Conventions
+
+- **Line length**: 110 characters (configured in pyproject.toml)
+- **Docstrings**: Required for public classes and functions (enforced by ruff D101-D106)
+- **Pre-commit hooks**: Automatically run on commit (ruff, pytest, sphinx-build, jupyter nbconvert)
+- **No note-to-self comments**: Custom pre-commit hook prevents placeholder comments
+
+### Important Architectural Conventions
+
+1. **Immutable Config**: ConfigDict prevents runtime mutations; all keys must have defaults
+2. **Timestamped Results**: Every verb execution creates unique directory preventing overwrites
+3. **Metadata Preservation**: InferenceDataSet stores original dataset config to maintain catalog access
+4. **Automatic Registration**: Use decorators (`@hyrax_model`, `@hyrax_verb`) or `__init_subclass__` - no manual registration
+5. **Batch Indexing**: Inference results include `batch_index.npy` mapping object_ids → batch files
+6. **Transform Stacking**: HyraxImageDataset uses `_update_transform()` to compose torchvision transforms
+7. **Distributed Training**: PyTorch Ignite's `idist.auto_dataloader()` abstracts single/multi-GPU execution
+8. 
**External Library Support**: Config system detects `name = "pkg.Class"` and auto-loads `pkg/default_config.toml`
+
+## Testing Conventions
+
+- **End-to-end tests** in `test_e2e.py` parametrized across model/dataset combinations
+- **Test markers**: `@pytest.mark.slow` for long-running tests (skipped in pre-commit and CI)
+- **Test fixtures** in `tests/hyrax/conftest.py` provide shared setup
+- **Sample data**: Uses Pooch for reproducible downloads from Zenodo DOIs
+- **Pre-commit**: Runs fast tests only: `pytest -n auto --cov=./src -m 'not slow'`
+
+## CI/CD
+
+- **Testing**: `testing-and-coverage.yml` runs on PRs and main branch
+- **Smoke test**: `smoke-test.yml` runs daily
+- **Documentation**: `build-documentation.yml` builds docs
+- **Benchmarks**: ASV benchmarks via `asv-*.yml` workflows
+- **Pre-commit**: Automated via `pre-commit-ci.yml`
+
+## Adding New Components
+
+### Adding a New Model
+1. Subclass `torch.nn.Module` in `src/hyrax/models/`
+2. Add `@hyrax_model` decorator with unique name
+3. Implement: `forward()`, `train_step()`, `prepare_inputs()`
+4. Available via CLI: `hyrax train -c config.toml` (with `model.name = "YourModelName"`)
+
+### Adding a New Dataset
+1. Subclass `HyraxDataset` in `src/hyrax/data_sets/`
+2. Set `_name` class attribute (triggers auto-registration)
+3. Implement: `__len__()`, `__getitem__()`, metadata interface
+4. For images, subclass `HyraxImageDataset` to get transform stacking
+
+### Adding a New Verb
+1. Create class in `src/hyrax/verbs/` with `run()` and optionally `run_cli()`
+2. Add `@hyrax_verb("verb_name")` decorator
+3. Implement `setup_parser(parser)` class method for CLI argument parsing
+4. Set `add_parser_kwargs` class attribute for help text
+5. 
Available via CLI: `hyrax verb_name [args]`
+
+## Troubleshooting
+
+- **Import errors**: Ensure `pip install -e .'[dev]'` completed successfully
+- **Network timeouts**: Retry installation multiple times (3-5 attempts may be needed)
+- **CLI not found**: Verify with `pip list | grep hyrax`
+- **Tests failing**: Check virtual environment and dependencies
+- **Pre-commit issues**: Run `pre-commit install` if hooks not working
+
+## Performance Notes
+
+- Vector database operations can be slow with large datasets
+- ChromaDB performance degrades with vectors >10,000 elements
+- UMAP fitting limited to 1024 samples by default for performance
+- Benchmarks available in `benchmarks/` directory (run with `asv` tool)