diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index ebe2aa95..8c829a36 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -1,186 +1,208 @@
-# Hyrax - An extensible Framework for Machine Learning in Astronomy
-
-**ALWAYS follow these instructions first and only fallback to additional search and context gathering if the information here is incomplete or found to be in error.**
-
-Hyrax is a Python-based tool for hunting rare and anomalous sources in large astronomical imaging surveys. It supports downloading cutouts, building latent representations, interactive visualization, and anomaly detection using PyTorch models.
-
-## Working Effectively
-
-### Bootstrap and Setup - NEVER CANCEL these commands
-- Create virtual environment: `conda create -n hyrax python=3.10 && conda activate hyrax`
-- Clone repository: `git clone https://github.com/lincc-frameworks/hyrax.git`
-- **CRITICAL**: Install dependencies using `.setup_dev.sh` script:
-  - `cd hyrax && echo 'y' | bash .setup_dev.sh` -- NEVER CANCEL: Takes 5-15 minutes depending on network. Set timeout to 20+ minutes.
-  - Script installs package with `pip install -e .'[dev]'` and sets up pre-commit hooks
-  - **Note**: Script prompts for system install if no virtual environment detected - respond 'y' to proceed
-  - **Alternative manual installation** if script fails due to network issues:
-  - `python -m pip install --upgrade pip` first
-  - `python -m pip install -e .'[dev]'` -- NEVER CANCEL: Takes 5-15 minutes. Set timeout to 20+ minutes.
-  - `python -m pip install pre-commit && pre-commit install`
-  - `conda install pandoc` (for documentation)
-  - **Network Issues**: Installation may fail with ReadTimeoutError due to PyPI connectivity. Retry installation multiple times if needed.
-
-### Build and Test Commands - NEVER CANCEL these commands
-- **Run tests**: `python -m pytest -m "not slow"` -- NEVER CANCEL: Takes 2-5 minutes. Set timeout to 10+ minutes.
-- **Run tests with coverage**: `python -m pytest --cov=hyrax --cov-report=xml -m "not slow"` -- NEVER CANCEL: Takes 3-6 minutes. Set timeout to 10+ minutes.
-- **Run slow tests**: `python -m pytest -m "slow"` -- NEVER CANCEL: Takes 10-20 minutes. Set timeout to 30+ minutes.
-- **Run all tests**: `python -m pytest` -- NEVER CANCEL: Takes 15-25 minutes. Set timeout to 45+ minutes.
-- **Run parallel tests**: `python -m pytest -n auto` (uses multiple cores)
-
-### CLI Usage and Functionality
-- **Main CLI entry point**: `hyrax` command (defined in pyproject.toml as `hyrax = "hyrax_cli.main:main"`)
-- **Check version**: `hyrax --version`
-- **Get help**: `hyrax --help`
-- **Available verbs/commands**:
-  - **Core operations**: `train`, `infer`, `download`, `prepare`
-  - **Analysis**: `umap`, `visualize`, `lookup`
-  - **Vector DB**: `save_to_database`, `database_connection`
-  - **Utilities**: `rebuild_manifest`
-- **Verb-specific help**: `hyrax --help` (e.g., `hyrax train --help`)
-- **Configuration**: Use `--runtime-config path/to/config.toml` or `-c path/to/config.toml`
-- **Verb implementation**: All verbs are classes in `src/hyrax/verbs/` that inherit from `Verb` base class
-
-### Development and Code Quality - NEVER CANCEL these commands
-- **Pre-commit checks**: `pre-commit run --all-files` -- NEVER CANCEL: Takes 3-8 minutes. Set timeout to 15+ minutes.
-- **Linting with ruff**: `ruff check src/ tests/` -- Takes 10-30 seconds.
-- **Format with ruff**: `ruff format src/ tests/` -- Takes 10-30 seconds.
-- **Build documentation**: `sphinx-build -M html ./docs ./_readthedocs` -- NEVER CANCEL: Takes 2-4 minutes. Set timeout to 10+ minutes.
-
-## Validation and Testing
-
-### CRITICAL: Always run these validation steps after making changes
-1. **NEVER CANCEL**: Lint and format code: `ruff check src/ tests/ && ruff format src/ tests/`
-2. **NEVER CANCEL**: Run unit tests: `python -m pytest -m "not slow"` (timeout: 10+ minutes)
-3. **NEVER CANCEL**: Run pre-commit hooks: `pre-commit run --all-files` (timeout: 15+ minutes)
-
-### Manual Validation Scenarios
-After making changes, ALWAYS test these scenarios:
-1. **CLI functionality**: Run `hyrax --help` and `hyrax --version` to ensure CLI works
-2. **Import test**: `python -c "import hyrax; h = hyrax.Hyrax(); print('Success')"`
-3. **Configuration loading**: Verify config loads with `hyrax.Hyrax()` constructor
-4. **Verb functionality**: Test relevant verbs like `hyrax train --help` if modifying training code
-
-### Test Categories and Markers
-- **Fast tests**: `python -m pytest -m "not slow"` (default test suite)
-- **Slow tests**: `python -m pytest -m "slow"` (integration and E2E tests)
-- **E2E tests**: Full end-to-end workflows testing models and datasets
-- **Test datasets**: Uses built-in datasets like `HyraxCifarDataset`, `HSCDataSet`
-- **Test models**: Primarily tests `HyraxAutoencoder` model
-- **Parallel testing**: Use `-n auto` for multiprocessing
-
-### Timeout Values and Timing Expectations
-- **NEVER CANCEL**: Package installation: 5-15 minutes (timeout: 20+ minutes)
-- **NEVER CANCEL**: Unit tests: 2-5 minutes (timeout: 10+ minutes)
-- **NEVER CANCEL**: Full test suite: 15-25 minutes (timeout: 45+ minutes)
-- **NEVER CANCEL**: Pre-commit hooks: 3-8 minutes (timeout: 15+ minutes)
-- **NEVER CANCEL**: Documentation build: 2-4 minutes (timeout: 10+ minutes)
-- Code formatting/linting: 10-30 seconds
-
-### Network and Installation Issues
-- **PyPI Connectivity**: May encounter ReadTimeoutError when installing packages
-- **Retry Strategy**: If installation fails, wait 1-2 minutes and retry the same command
-- **Alternative mirrors**: Consider using `--index-url` with alternative PyPI mirrors if persistent issues
-- **Dependency conflicts**: The package has complex ML dependencies (PyTorch, etc.) which may cause conflicts
-
-## Repository Structure and Navigation
-
-### Key Directories
-- `src/hyrax/`: Main package source code
-- `src/hyrax_cli/`: CLI entry point (`main.py`)
-- `src/hyrax/verbs/`: Command implementations (train, infer, download, etc.)
-- `src/hyrax/data_sets/`: Dataset implementations
-- `src/hyrax/models/`: Model definitions
-- `src/hyrax/vector_dbs/`: Vector database implementations (ChromaDB, Qdrant)
-- `tests/hyrax/`: Unit tests
-- `docs/`: Documentation source files
-- `benchmarks/`: Performance benchmarks
-- `example_notebooks/`: Example Jupyter notebooks
-
-### Important Files
-- `pyproject.toml`: Project configuration, dependencies, scripts
-- `src/hyrax/hyrax_default_config.toml`: Default configuration template
-- `.setup_dev.sh`: Development environment setup script
-- `.pre-commit-config.yaml`: Pre-commit hook configuration
-- `.github/workflows/`: CI/CD pipeline definitions
-
-### Configuration System
-- Default config: `src/hyrax/hyrax_default_config.toml`
-- Users can override with custom config files via `--runtime-config`
-- Config sections: `[general]`, `[model]`, `[train]`, `[data_set]`, `[download]`, etc.
-
-## Common Tasks and Workflows
-
-### Adding New Features
-1. **ALWAYS** run full validation first: `python -m pytest -m "not slow"`
-2. Make changes in appropriate `src/hyrax/` subdirectory
-3. Add tests in `tests/hyrax/` following existing patterns
-4. **ALWAYS** run: `ruff format src/ tests/ && ruff check src/ tests/`
-5. **ALWAYS** run: `python -m pytest -m "not slow"` (timeout: 10+ minutes)
-6. **ALWAYS** run: `pre-commit run --all-files` (timeout: 15+ minutes)
-
-### Working with Models
-- Models defined in `src/hyrax/models/`
-- Built-in models: `HyraxAutoencoder`, `HyraxCNN`
-- Model registry system automatically discovers models
-- General model configuration in `[model]` section of config files
-- Configurations for specific models in `[model.]` sections
-- Training via `hyrax train` command
-- Export to ONNX format supported
-
-### Working with Data
-- Data loaders in `src/hyrax/data_sets/`
-- Built-in datasets: `HSCDataSet`, `HyraxCifarDataset`, `LSSTDataset`, `FitsImageDataSet`
-- Dataset splits: train/validation/test controlled by config
-- Configuration in `[data_set]` section
-- Default data directory: `./data/`
-- Sample data includes HSC1k dataset for testing
-
-### Working with Vector Databases
-- Implementations in `src/hyrax/vector_dbs/`
-- Supported: ChromaDB, Qdrant
-- Commands: `save_to_database`, `database_connection`
-- Configuration in `[vector_db]` section
-
-## Notebook Development
-- Jupyter integration via `holoviews`, `bokeh` for visualizations
-- Interactive visualization via `hyrax visualize` verb
-- Pre-executed examples in `docs/pre_executed/`
-
-## CI/CD and GitHub Workflows
-- Main workflows in `.github/workflows/`
-- **Testing**: `testing-and-coverage.yml` runs on PRs and main branch
-- **Smoke test**: `smoke-test.yml` runs daily
-- **Documentation**: `build-documentation.yml` builds docs
-- **Benchmarks**: ASV benchmarks via `asv-*.yml` workflows
-- **Pre-commit**: Automated via `pre-commit-ci.yml`
-
-## Troubleshooting
-- **Import errors**: Ensure `pip install -e .'[dev]'` completed successfully
-- **Network timeouts during install**: Retry installation multiple times, may require 3-5 attempts due to PyPI connectivity issues
-- **ReadTimeoutError**: Common during installation - wait 1-2 minutes and retry the same pip command
-- **CLI not found**: Verify installation with `pip list | grep hyrax`
-- **Tests failing**: Check if in virtual environment and dependencies installed
-- **Pre-commit issues**: Run `pre-commit install` if hooks not working
-- **Permission issues**: Use `--user` flag with pip if encountering permission errors
-- **Virtual environment**: Always use conda/venv to avoid system Python conflicts
-
-## Performance Notes
-- Vector database operations can be slow with large datasets
-- Benchmarks available in `benchmarks/` directory (run with `asv` tool)
-- Use `--timeout` parameters appropriately for long-running operations
-- ChromaDB performance degrades with vectors >10,000 elements
-- UMAP fitting limited to 1024 samples by default for performance
-- Benchmark tests include timing for CLI help commands, object construction, and vector DB operations
-
-## Common Command Reference
+# GitHub Copilot Instructions for Hyrax
+
+**🔗 For comprehensive project information, see [HYRAX_GUIDE.md](../HYRAX_GUIDE.md) in the repository root**
+
+Hyrax is a low-code Python framework for machine learning in astronomy. This file provides GitHub Copilot-specific guidance.
+
+## Quick Reference
+
+**Project essentials:**
+- Python 3.9+ with PyTorch, TOML config, CLI-first (`hyrax` command with verbs)
+- Workflows: Data download → Training → Inference → Visualization → Vector search
+- Plugin architecture: Models, datasets, and verbs auto-register via decorators
+- Configuration: TOML files with Pydantic validation, hierarchical merging
+
+**For detailed information on:**
+- Design principles and architectural conventions → [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#design-principles)
+- Repository structure and key files → [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#repository-structure)
+- Configuration system → [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#configuration-system)
+- Plugin architecture (models, datasets, verbs) → [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#plugin-architecture-via-registries)
+- Adding new components → [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#adding-new-components)
+- Data flow through system → [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#data-flow)
+
+## Critical Guidelines for GitHub Copilot
+
+### Always Follow These Instructions First
+
+Trust these instructions and only search for additional context if information here or in [HYRAX_GUIDE.md](../HYRAX_GUIDE.md) is incomplete or incorrect.
+
+### Command Execution - Long-Running Operations
+
+**CRITICAL: Never cancel these commands.** Allow sufficient time for completion:
+
+| Operation | Duration | Required Timeout |
+|-----------|----------|------------------|
+| `bash .setup_dev.sh` | 5-15 min | 20+ minutes |
+| `pip install -e .'[dev]'` | 5-15 min | 20+ minutes |
+| `pytest -m "not slow"` | 2-5 min | 10+ minutes |
+| `pytest` (all tests) | 15-25 min | 45+ minutes |
+| `pytest -m slow` | 10-20 min | 30+ minutes |
+| `pre-commit run --all-files` | 3-8 min | 15+ minutes |
+| `sphinx-build` (docs) | 2-4 min | 10+ minutes |
+
+**Network issues:** Installation commands may encounter `ReadTimeoutError` from PyPI. If this occurs:
+1. Wait 1-2 minutes
+2. Retry the exact same command
+3. May require 3-5 attempts to succeed
+
+### Development Setup
+
 ```bash
-# Full development setup
+# Environment setup
 conda create -n hyrax python=3.10 && conda activate hyrax
 git clone https://github.com/lincc-frameworks/hyrax.git && cd hyrax
+
+# Recommended: Automated setup script
 echo 'y' | bash .setup_dev.sh
+# Installs with pip install -e .'[dev]' and sets up pre-commit hooks
+# Prompts for system install if no venv - respond 'y'
 
-# Quick validation workflow
-ruff check src/ tests/ && ruff format src/ tests/
-python -m pytest -m "not slow"
-pre-commit run --all-files
-```
\ No newline at end of file
+# Alternative: Manual installation
+pip install -e .'[dev]' && pre-commit install
+```
+
+### Essential Commands
+
+See [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#essential-commands) for full command reference.
+
+```bash
+# Testing
+pytest -m "not slow"  # Fast tests (2-5 min)
+pytest -n auto -m "not slow"  # Parallel fast tests
+pytest -m slow  # Slow/E2E tests (10-20 min)
+pytest  # All tests (15-25 min)
+
+# Code quality
+ruff format . && ruff check --fix .  # Format and lint (30 sec)
+pre-commit run --all-files  # All checks (3-8 min)
+
+# CLI
+hyrax --help  # List verbs
+hyrax --help  # Verb-specific help
+hyrax -c config.toml  # Run with config
+
+# Documentation
+sphinx-build -M html ./docs ./_readthedocs -T -E -d ./docs/_build/doctrees
+```
+### Validation After Changes
+
+**CRITICAL: Always run these validation steps:**
+1. Format and lint: `ruff format . && ruff check --fix .` (30 seconds)
+2. Fast tests: `pytest -m "not slow"` (2-5 min, NEVER CANCEL)
+3. Pre-commit: `pre-commit run --all-files` (3-8 min, NEVER CANCEL)
+
+**Manual validation scenarios:**
+1. CLI: `hyrax --help` and `hyrax --version`
+2. Import: `python -c "import hyrax; h = hyrax.Hyrax(); print('Success')"`
+3. Config loading: Verify `hyrax.Hyrax()` constructor works
+4. Relevant verbs: Test with `hyrax --help`
+
+## Key Implementation Details
+
+### Configuration System Pitfalls
+
+- **Use ConfigDict, not dict**: ConfigDict catches missing defaults at runtime
+- **All keys need defaults**: Add to `src/hyrax/hyrax_default_config.toml`
+- **Config is immutable**: No runtime mutations allowed after creation
+- **Pydantic validation**: Use schemas in `src/hyrax/config_schemas/` for validation
+
+### Model Interface Requirements
+
+Models MUST implement (see [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#plugin-architecture-via-registries)):
+- `forward()`: Forward pass through model
+- `train_step()`: Single training step
+- `prepare_inputs()`: Data preparation (replaces deprecated `to_tensor()`)
+
+Use `@hyrax_model` decorator for auto-registration and shape inference.
+
+### Testing Requirements
+
+- Mark long tests: `@pytest.mark.slow` (>5 min)
+- Fast tests in pre-commit and CI (<5 min total)
+- Always run fast tests after changes: `pytest -m "not slow"`
+- Test fixtures in `tests/hyrax/conftest.py`
+- Sample data via Pooch from Zenodo DOIs
+
+### Pre-commit Hooks Include
+
+- ruff linting and formatting
+- pytest fast tests (not slow)
+- sphinx documentation build
+- jupyter notebook conversion
+- Custom hook: prevents note-to-self comments
+
+## Repository Structure
+
+See [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#repository-structure) for complete details.
+
+**Quick navigation:**
+```
+src/hyrax/
+  ├── hyrax.py  # Main orchestration class
+  ├── config_utils.py  # ConfigManager, ConfigDict
+  ├── plugin_utils.py  # Dynamic plugin loading
+  ├── train.py, pytorch_ignite.py  # Training infrastructure
+  ├── hyrax_default_config.toml  # Default configuration
+  ├── models/model_registry.py  # @hyrax_model decorator
+  ├── data_sets/data_set_registry.py  # Dataset registration
+  ├── verbs/verb_registry.py  # @hyrax_verb decorator
+  ├── config_schemas/  # Pydantic validation
+  └── vector_dbs/  # ChromaDB, Qdrant
+
+src/hyrax_cli/main.py  # CLI entry point
+tests/hyrax/conftest.py, test_e2e.py  # Test fixtures, E2E tests
+.github/workflows/  # CI/CD pipelines
+```
+
+## Important Conventions
+
+See [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#code-style-and-conventions) for complete list.
+
+1. **Immutable Config**: ConfigDict prevents mutations; all keys need defaults
+2. **Timestamped Results**: Verbs create unique directories (`YYYYMMDD-HHMMSS--`)
+3. **Automatic Registration**: Use decorators (`@hyrax_model`, `@hyrax_verb`) or `__init_subclass__`
+4. **Batch Indexing**: Inference includes `batch_index.npy` for ordered retrieval
+5. **Transform Stacking**: `HyraxImageDataset._update_transform()` composes transforms
+6. **External Plugins**: Config detects `name = "pkg.Class"`, auto-loads `pkg/default_config.toml`
+
+## Common Workflows
+
+### Adding New Model
+See [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#adding-a-new-model) for details.
+1. Subclass `torch.nn.Module` in `src/hyrax/models/`
+2. Add `@hyrax_model("ModelName")` decorator
+3. Implement: `forward()`, `train_step()`, `prepare_inputs()`
+4. Available via: `hyrax train -c config.toml` (with `model.name = "ModelName"`)
+
+### Adding New Dataset
+See [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#adding-a-new-dataset) for details.
+1. Subclass `HyraxDataset` in `src/hyrax/data_sets/`
+2. Set `_name` class attribute (triggers auto-registration)
+3. Implement: `__len__()`, `__getitem__()`, metadata interface
+4. For images: subclass `HyraxImageDataset` for transform stacking
+
+### Adding New Verb
+See [HYRAX_GUIDE.md](../HYRAX_GUIDE.md#adding-a-new-verb) for details.
+1. Create class in `src/hyrax/verbs/` with `run()` and `run_cli()`
+2. Add `@hyrax_verb("verb_name")` decorator
+3. Implement `setup_parser(parser)` for CLI args
+4. Set `add_parser_kwargs` for help text
+5. Available via: `hyrax verb_name [args]`
+
+## Common Issues
+
+- **Import errors**: Verify `pip install -e .'[dev]'` completed
+- **Network timeouts**: Retry 3-5 times with 1-2 min waits for PyPI connectivity
+- **CLI not found**: Check with `pip list | grep hyrax`
+- **Config key not found**: Add to `hyrax_default_config.toml`
+- **Model not registering**: Ensure `@hyrax_model` decorator present
+- **Verb not in CLI**: Ensure `@hyrax_verb` decorator present
+- **Pre-commit not running**: Run `pre-commit install`
+
+## CI/CD Workflows
+
+- **testing-and-coverage.yml**: Runs on PRs and main (pytest with coverage)
+- **smoke-test.yml**: Daily smoke tests
+- **build-documentation.yml**: Sphinx documentation builds
+- **asv-*.yml**: Performance benchmarks
+- **pre-commit-ci.yml**: Automated pre-commit checks
\ No newline at end of file
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 00000000..9309517d
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,210 @@
+# CLAUDE.md
+
+This file provides Claude Code (claude.ai/code) specific guidance when working with this repository.
+
+**🔗 For comprehensive project information, architecture, and workflows, see [HYRAX_GUIDE.md](./HYRAX_GUIDE.md)**
+
+## Quick Reference
+
+Hyrax is a low-code Python tool for machine learning in astronomy. Key facts:
+- **Tech Stack**: Python 3.9+, PyTorch, TOML configuration, CLI-first design
+- **Main Workflows**: Data download → Training → Inference → Visualization → Vector search
+- **Entry Point**: `hyrax` CLI with verb-based commands (train, infer, umap, visualize, etc.)
+- **Configuration**: TOML files with Pydantic validation, hierarchical merging
+- **Testing**: pytest with `@pytest.mark.slow` for long tests, parallel execution with `-n auto`
+
+**Always refer to [HYRAX_GUIDE.md](./HYRAX_GUIDE.md) for:**
+- Design principles and architectural conventions
+- Repository structure and key files
+- Configuration system details
+- Plugin architecture (models, datasets, verbs)
+- Adding new components
+- Data flow through the system
+
+## Claude-Specific Guidance
+
+### Command Execution Strategy
+
+**CRITICAL: Never cancel long-running commands.** Hyrax has several operations that require extended execution time:
+
+| Command | Typical Duration | Minimum Timeout |
+|---------|------------------|-----------------|
+| `bash .setup_dev.sh` | 5-15 minutes | 20 minutes |
+| `pip install -e .'[dev]'` | 5-15 minutes | 20 minutes |
+| `pytest -m "not slow"` | 2-5 minutes | 10 minutes |
+| `pytest` (all tests) | 15-25 minutes | 45 minutes |
+| `pytest -m slow` | 10-20 minutes | 30 minutes |
+| `pre-commit run --all-files` | 3-8 minutes | 15 minutes |
+| `sphinx-build ...` | 2-4 minutes | 10 minutes |
+| `ruff check/format` | 10-30 seconds | 2 minutes |
+
+**Network Issues**: Installation commands may encounter `ReadTimeoutError` due to PyPI connectivity. If this occurs:
+1. Wait 1-2 minutes
+2. Retry the exact same command
+3. May require 3-5 retry attempts
+
+### Task Delegation with Sub-Agents
+
+Claude Code provides specialized sub-agents via the `task` tool. Use them proactively:
+
+**When to use the `explore` agent:**
+- Questions requiring codebase understanding or synthesis
+- Multi-step searches requiring analysis
+- When you want a summarized answer, not raw grep/glob results
+- Examples: "How does authentication work?", "Where are API endpoints defined?"
+
+**When to use the `task` agent:**
+- Executing commands with verbose output (tests, builds, lints, dependency installs)
+- Returns brief summary on success, full output on failure
+- Keeps main context clean by minimizing successful output
+
+**When to use direct tools (grep/glob):**
+- Simple, targeted single searches where you know what to find
+- Need results immediately in your context
+- Looking for something specific, not discovering something unknown
+
+**Parallel searches** - Call multiple grep/glob in ONE response:
+```python
+# Good: Parallel search calls
+grep(pattern="function handleSubmit", glob="*.ts")
+grep(pattern="interface FormData", glob="*.ts")
+glob(pattern="**/*.tsx")
+```
+
+### Working Patterns
+
+**Initial exploration:**
+1. Use `explore` agent for codebase questions: "What does this module do?"
+2. Use grep/glob for targeted searches: "Find all test files"
+3. View key files identified: config files, main modules
+
+**Making changes:**
+1. Always validate first: run relevant fast tests to establish baseline
+2. Make minimal, surgical changes
+3. Test immediately after changes: `pytest tests/hyrax/test_.py`
+4. Format and lint: `ruff format . && ruff check --fix .`
+5. Run full validation: `pytest -m "not slow"` (NEVER CANCEL, 10+ min timeout)
+6. Run pre-commit: `pre-commit run --all-files` (NEVER CANCEL, 15+ min timeout)
+
+**Common validation workflow:**
+```bash
+# Quick format/lint (30 seconds)
+ruff format src/ tests/ && ruff check src/ tests/
+
+# Fast tests (2-5 minutes, NEVER CANCEL)
+pytest -m "not slow"
+
+# Pre-commit (3-8 minutes, NEVER CANCEL)
+pre-commit run --all-files
+```
+
+### Manual Validation After Changes
+
+After making code changes, ALWAYS run these validation scenarios:
+
+1. **CLI functionality**: `hyrax --help` and `hyrax --version` ensure CLI works
+2. **Import test**: `python -c "import hyrax; h = hyrax.Hyrax(); print('Success')"`
+3. **Configuration loading**: Verify config loads correctly
+4. **Verb functionality**: Test relevant verbs like `hyrax train --help`
+
+### Important Notes for Claude Code
+
+**Batch editing**: Use the `edit` tool multiple times in a single response for:
+- Renaming variables across multiple locations in the same file
+- Editing non-overlapping blocks in the same or different files
+- Applying the same pattern across multiple files
+
+**Configuration system pitfalls**:
+- Use `ConfigDict` instead of regular dict to catch missing defaults at runtime
+- All config keys MUST have defaults in `hyrax_default_config.toml`
+- Config is immutable after creation - no runtime mutations allowed
+
+**Model interface requirements**:
+- Models MUST implement: `forward()`, `train_step()`, `prepare_inputs()`
+- Note: `to_tensor()` is deprecated, use `prepare_inputs()` instead
+- Use `@hyrax_model` decorator for auto-registration
+
+**Testing requirements**:
+- Mark long-running tests with `@pytest.mark.slow`
+- Fast tests (<5 min) run in pre-commit and CI
+- Slow tests (>5 min) run separately
+- Always run fast tests after changes: `pytest -m "not slow"`
+
+**Pre-commit hooks include**:
+- ruff linting and formatting
+- pytest fast tests (not slow)
+- sphinx documentation build
+- jupyter notebook conversion
+- Custom hook preventing note-to-self comments
+
+## Key File Locations
+
+Reference [HYRAX_GUIDE.md](./HYRAX_GUIDE.md#repository-structure) for full structure. Quick access:
+
+```
+src/hyrax/
+  ├── hyrax.py  # Main Hyrax class
+  ├── config_utils.py  # ConfigManager, ConfigDict
+  ├── plugin_utils.py  # get_or_load_class() for dynamic loading
+  ├── train.py  # Training orchestration
+  ├── pytorch_ignite.py  # Dataset, model, dataloader setup
+  ├── hyrax_default_config.toml  # Default configuration
+  ├── models/model_registry.py  # Model registration, @hyrax_model
+  ├── data_sets/data_set_registry.py  # Dataset registration
+  ├── verbs/verb_registry.py  # Verb registration, @hyrax_verb
+  ├── config_schemas/  # Pydantic validation schemas
+  └── vector_dbs/  # ChromaDB, Qdrant implementations
+
+src/hyrax_cli/main.py  # CLI entry point
+
+tests/hyrax/
+  ├── conftest.py  # Shared test fixtures
+  └── test_e2e.py  # End-to-end integration tests
+
+.github/
+  ├── copilot-instructions.md  # GitHub Copilot instructions
+  └── workflows/  # CI/CD pipelines
+```
+
+## Common Pitfalls and Solutions
+
+**Pitfall**: Forgetting to activate virtual environment
+- **Solution**: Always check with `which python` or `pip list | grep hyrax`
+
+**Pitfall**: Tests failing due to network issues during fixture download
+- **Solution**: Tests use Pooch for reproducible downloads from Zenodo - retry if network fails
+
+**Pitfall**: Pre-commit hooks not running
+- **Solution**: Ensure `pre-commit install` was run after `pip install`
+
+**Pitfall**: Config key not found errors
+- **Solution**: Add missing key to `hyrax_default_config.toml` with sensible default
+
+**Pitfall**: Model not registering
+- **Solution**: Ensure `@hyrax_model("ModelName")` decorator is present and file is imported
+
+**Pitfall**: Verb not appearing in CLI
+- **Solution**: Ensure `@hyrax_verb("verb_name")` decorator is present and verb is imported
+
+## Quick Command Reference
+
+See [HYRAX_GUIDE.md](./HYRAX_GUIDE.md#essential-commands) for full command reference.
+ +```bash +# Development setup +conda create -n hyrax python=3.10 && conda activate hyrax +cd hyrax && echo 'y' | bash .setup_dev.sh # NEVER CANCEL: 20+ min timeout + +# Quick validation (run after changes) +ruff format src/ tests/ && ruff check src/ tests/ # 30 seconds +pytest -m "not slow" # NEVER CANCEL: 10+ min +pre-commit run --all-files # NEVER CANCEL: 15+ min + +# Specific tests +pytest tests/hyrax/test_config_utils.py # Single file +pytest tests/hyrax/test_infer.py::test_infer_basic # Single test + +# CLI verification +hyrax --help && hyrax --version # Verify CLI works +``` + diff --git a/HYRAX_GUIDE.md b/HYRAX_GUIDE.md new file mode 100644 index 00000000..8066410f --- /dev/null +++ b/HYRAX_GUIDE.md @@ -0,0 +1,319 @@ +# Hyrax Development Guide + +This guide provides essential information for working with the Hyrax codebase. It is referenced by both CLAUDE.md and .github/copilot-instructions.md. + +## Project Overview + +Hyrax is a Python-based tool for hunting rare and anomalous sources in large astronomical imaging surveys. It provides a low-code solution for rapid experimentation with machine learning in astronomy. + +### Core Purpose +Hyrax helps scientists/astronomers handle much of the boilerplate code that is often required for a machine learning project in astronomy so that users can focus on their model development and downstream science. + +### Primary Workflows +1. **Data Access**: Downloading/accessing data from specific public data repositories (e.g. HSC, Rubin-LSST) +2. **Training**: Training supervised/unsupervised ML algorithms using astronomical data +3. **Inference**: Performing inference to generate latent representations +4. **Visualization**: Building interactive 2D and 3D latent spaces +5. 
**Vector Search**: Building vector databases with inference results for rapid similarity search and outlier detection + +### Technology Stack +- **Language**: Python >= 3.9 (target 3.9 for compatibility) +- **ML Framework**: PyTorch with PyTorch Ignite for distributed training +- **Configuration**: TOML-based hierarchical configuration system +- **CLI**: Verb-based command interface via `hyrax` command +- **Testing**: pytest with parallel execution support +- **Linting**: ruff (replaces black, isort, flake8) +- **Documentation**: Sphinx with ReadTheDocs + +## Design Principles + +### 1. Low Code Interface +- Minimize user-facing APIs - prioritize configuration-driven workflows +- Avoid API proliferation - don't create new APIs we'll need to maintain indefinitely +- Favor declarative over imperative configuration +- CLI-first approach with verb-based commands (`hyrax train`, `hyrax infer`, etc.) + +### 2. Make Easy Things Easy, Hard Things Possible +- Default workflows should "just work" with minimal configuration +- Progressive complexity - simple tasks simple, advanced features available when needed +- Sensible defaults in `hyrax_default_config.toml` +- Clear extension points via base classes (`Verb`, model base classes, dataset classes) + +### 3. Support Reproducibility +- Configuration files serve as complete records of experiments +- Version tracking for models, data, and configurations +- Manifest files for downloaded data and processed results +- MLflow integration for systematic experiment logging +- ONNX export support for long-term reproducibility + +### 4. 
Smooth and Legible Migration When APIs Change +- Clear deprecation warnings with helpful messages +- Migration guides in documentation with before/after examples +- Backward compatibility when possible +- Pydantic schemas for config validation with helpful error messages +- Comprehensive changelog with breaking change notifications + +## Development Setup + +### Environment Setup +```bash +# Create and activate virtual environment +conda create -n hyrax python=3.10 +conda activate hyrax + +# Clone repository +git clone https://github.com/lincc-frameworks/hyrax.git +cd hyrax + +# Install for development (recommended) +bash .setup_dev.sh +# This script: +# - Installs package with pip install -e .'[dev]' +# - Sets up pre-commit hooks +# - Takes 5-15 minutes depending on network +# - Prompts for system install if no venv detected - respond 'y' + +# Alternative manual installation +pip install -e .'[dev]' # Install with dev dependencies +pre-commit install # Set up pre-commit hooks +``` + +### Common Issues During Setup +- **ReadTimeoutError**: Installation may fail due to PyPI connectivity - retry multiple times if needed +- **Permission errors**: Use `--user` flag with pip if encountering permission errors +- **Virtual environment**: Always use conda/venv to avoid system Python conflicts + +## Essential Commands + +### Testing +```bash +# Fast tests (default, excludes slow tests) +pytest -m "not slow" # 2-5 minutes +pytest -n auto -m "not slow" # Parallel execution + +# Tests with coverage +pytest -n auto --cov=./src --cov-report=html -m "not slow" + +# Slow tests (includes end-to-end tests) +pytest -m slow # 10-20 minutes + +# All tests +pytest # 15-25 minutes +pytest -n auto # Parallel, faster + +# Specific test file or function +pytest tests/hyrax/test_config_utils.py +pytest tests/hyrax/test_infer.py::test_infer_basic +``` + +### Code Quality +```bash +# Linting and formatting +ruff check --fix . # 10-30 seconds +ruff format . 
# 10-30 seconds + +# Pre-commit hooks (run all checks) +pre-commit run --all-files # 3-8 minutes + +# Build documentation +sphinx-build -M html ./docs ./_readthedocs -T -E -d ./docs/_build/doctrees +``` + +### CLI Usage +```bash +# Get help +hyrax --help # List all verbs/commands +hyrax --version # Show version +hyrax verb_name --help # Help for specific verb + +# Common verbs +hyrax train -c config.toml # Train a model +hyrax infer -c config.toml # Generate latent representations +hyrax umap -c config.toml # Dimensionality reduction +hyrax visualize # Interactive visualization +hyrax save_to_database # Populate vector DB +hyrax lookup # Query vector DB +``` + +## Architecture Overview + +### Plugin Architecture via Registries + +Hyrax uses three primary registries for extensibility: + +1. **MODEL_REGISTRY** (`models/model_registry.py`) + - Maps model names to PyTorch nn.Module classes + - `@hyrax_model` decorator auto-registers models + - Models must implement: `forward()`, `train_step()`, `prepare_inputs()` (formerly `to_tensor()`) + - Automatic shape inference from dataset samples + +2. **DATA_SET_REGISTRY** (`data_sets/data_set_registry.py`) + - Maps dataset names to HyraxDataset classes + - Auto-registration via `__init_subclass__` when subclasses are defined + - Base class provides metadata interface, ID generation, catalog access + +3.
**VERB_REGISTRY** (`verbs/verb_registry.py`) + - Maps CLI command names to Verb classes + - `@hyrax_verb` decorator registers verbs + - Verbs can be class-based (`run()` and `run_cli()` methods) or function-based + +### Configuration System + +- **TOML-based hierarchical configuration** with strong validation via Pydantic schemas +- **ConfigManager** merges: `hyrax_default_config.toml` + external library configs + user runtime config +- **ConfigDict** enforces all keys must have defaults (prevents silent config bugs) +- Automatic path resolution for relative paths +- Config sections: `[general]`, `[model]`, `[train]`, `[data_set]`, `[download]`, etc. + +### External Plugin Support + +External libraries can provide custom models/datasets/verbs: +1. Set config values like `name = "external_pkg.model.CustomModel"` +2. Provide a `default_config.toml` file in the package root +3. Hyrax's `get_or_load_class()` in `plugin_utils.py` handles dynamic import and config merging + +### Data Flow + +``` +DOWNLOAD (optional) + ↓ Catalog (FITS) → Downloader → Cutout images + manifest.fits + +PREPROCESSING (implicit in dataset) + ↓ Dataset loads raw images → applies transforms → train/validate/test splits + +TRAINING + ↓ train.py: setup_dataset → setup_model → create_trainer → checkpoints + MLflow logs + +INFERENCE + ↓ Model.forward(batch) → latent vectors → batch_*.npy files + batch_index.npy + +VECTOR DB / VISUALIZATION + ↓ ChromaDB for similarity search | UMAP → 2D/3D → Holoviews scatter plot +``` + +### Key Abstractions + +**Hyrax class** (`hyrax.py`): Central orchestration interface wrapping all functionality. Provides both programmatic and CLI access via dynamic `__getattr__` that instantiates verb classes on demand. 
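The combination of a decorator-populated registry and on-demand dispatch through `__getattr__` can be sketched in plain Python. This is a toy illustration of the general pattern, not Hyrax's implementation: `register_verb`, `Orchestrator`, and the dictionary-based `VERB_REGISTRY` below are hypothetical stand-ins, and the real `@hyrax_verb` decorator and `Hyrax` class may differ in signature and behavior.

```python
from typing import Callable, Dict, Type

# Hypothetical stand-in for a verb registry; not Hyrax's actual data structure.
VERB_REGISTRY: Dict[str, Type["Verb"]] = {}


def register_verb(name: str) -> Callable[[type], type]:
    """Illustrative decorator that records a verb class under a command name."""

    def decorator(cls: type) -> type:
        VERB_REGISTRY[name] = cls
        return cls

    return decorator


class Verb:
    """Minimal base class: every verb receives the shared config."""

    def __init__(self, config: dict):
        self.config = config

    def run(self, **kwargs):
        raise NotImplementedError


@register_verb("infer")
class Infer(Verb):
    def run(self, **kwargs):
        return f"infer with model={self.config['model']['name']}"


class Orchestrator:
    """Toy analogue of a central interface that exposes verbs as methods."""

    def __init__(self, config: dict):
        self.config = config

    def __getattr__(self, name: str):
        # __getattr__ is only called when normal attribute lookup fails,
        # so real attributes such as `config` never reach this method.
        if name in VERB_REGISTRY:
            # Instantiate the verb on demand and hand back its bound run().
            return VERB_REGISTRY[name](self.config).run
        raise AttributeError(f"{type(self).__name__} has no verb {name!r}")


orchestrator = Orchestrator({"model": {"name": "ExampleModel"}})
result = orchestrator.infer()
```

With this shape, adding a verb needs only a decorated class; the orchestrator picks it up without any edit to its own code, which is the property the registries above are meant to provide.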
+ +**HyraxDataset** (`data_sets/`): Base class for all datasets +- Subclasses auto-register via `__init_subclass__` +- Must provide metadata interface (fields, catalog data) +- `HyraxImageDataset` mixin provides transform stacking via `_update_transform()` +- Built-in: HSCDataSet, LSSTDataset, FitsImageDataSet, HyraxCifarDataSet, InferenceDataSet + +**Model Registration**: `@hyrax_model` decorator provides: +- Automatic shape inference by sampling dataset +- Standardized save/load via PyTorch state_dict +- Criterion and optimizer loading from config + +**Verb Pattern**: Base `Verb` class with `run()` (programmatic) and `run_cli()` (CLI) methods +- CLI autodiscovery via `all_verbs()` in registry +- Class-based: Infer, Umap, Visualize, SaveToDatabase, Lookup +- Function-based: train, download, prepare, rebuild_manifest + +**Result Chaining**: Verbs create timestamped directories (`YYYYMMDD-HHMMSS--`) +- `find_most_recent_results_dir()` enables automatic chaining between verbs +- InferenceDataSet preserves original dataset config for metadata access + +### Training Infrastructure + +- **PyTorch Ignite-based** distributed training (`pytorch_ignite.py`, `train.py`) +- `setup_dataset()`: Instantiates dataset from config +- `setup_model()`: Instantiates model, infers shape from dataset +- `dist_data_loader()`: Creates distributed data loaders with splits +- `create_trainer()`: Training engine with checkpointing, progress bars +- MLflow for experiment tracking, TensorboardX for metric logging + +## Repository Structure + +### Key Directories +``` +src/hyrax/ # Main package source code + ├── models/ # Model definitions + ├── data_sets/ # Dataset implementations + ├── verbs/ # Command implementations + ├── vector_dbs/ # Vector database implementations (ChromaDB, Qdrant) + └── config_schemas/ # Pydantic schemas for configuration validation +src/hyrax_cli/ # CLI entry point (main.py) +tests/hyrax/ # Unit and integration tests +docs/ # Documentation source files +benchmarks/ # 
Performance benchmarks (ASV) +example_notebooks/ # Example Jupyter notebooks +``` + +### Important Files +- `pyproject.toml`: Project configuration, dependencies, CLI entry points +- `src/hyrax/hyrax_default_config.toml`: Default configuration template +- `.setup_dev.sh`: Development environment setup script +- `.pre-commit-config.yaml`: Pre-commit hook configuration +- `.github/workflows/`: CI/CD pipeline definitions + +## Code Style and Conventions + +- **Line length**: 110 characters (configured in pyproject.toml) +- **Docstrings**: Required for public classes and functions (enforced by ruff D101-D106) +- **Pre-commit hooks**: Automatically run on commit (ruff, pytest, sphinx-build, jupyter nbconvert) +- **No note-to-self comments**: Custom pre-commit hook prevents placeholder comments + +### Important Architectural Conventions + +1. **Immutable Config**: ConfigDict prevents runtime mutations; all keys must have defaults +2. **Timestamped Results**: Every verb execution creates unique directory preventing overwrites +3. **Metadata Preservation**: InferenceDataSet stores original dataset config to maintain catalog access +4. **Automatic Registration**: Use decorators (`@hyrax_model`, `@hyrax_verb`) or `__init_subclass__` - no manual registration +5. **Batch Indexing**: Inference results include `batch_index.npy` mapping object_ids → batch files +6. **Transform Stacking**: HyraxImageDataset uses `_update_transform()` to compose torchvision transforms +7. **Distributed Training**: PyTorch Ignite's `idist.auto_dataloader()` abstracts single/multi-GPU execution +8. 
**External Library Support**: Config system detects `name = "pkg.Class"` and auto-loads `pkg/default_config.toml` + +## Testing Conventions + +- **End-to-end tests** in `test_e2e.py` parametrized across model/dataset combinations +- **Test markers**: `@pytest.mark.slow` for long-running tests (skipped in pre-commit and CI) +- **Test fixtures** in `tests/hyrax/conftest.py` provide shared setup +- **Sample data**: Uses Pooch for reproducible downloads from Zenodo DOIs +- **Pre-commit**: Runs fast tests only: `pytest -n auto --cov=./src -m 'not slow'` + +## CI/CD + +- **Testing**: `testing-and-coverage.yml` runs on PRs and main branch +- **Smoke test**: `smoke-test.yml` runs daily +- **Documentation**: `build-documentation.yml` builds docs +- **Benchmarks**: ASV benchmarks via `asv-*.yml` workflows +- **Pre-commit**: Automated via `pre-commit-ci.yml` + +## Adding New Components + +### Adding a New Model +1. Subclass `torch.nn.Module` in `src/hyrax/models/` +2. Add `@hyrax_model` decorator with unique name +3. Implement: `forward()`, `train_step()`, `prepare_inputs()` +4. Available via CLI: `hyrax train -c config.toml` (with `model.name = "YourModelName"`) + +### Adding a New Dataset +1. Subclass `HyraxDataset` in `src/hyrax/data_sets/` +2. Set `_name` class attribute (triggers auto-registration) +3. Implement: `__len__()`, `__getitem__()`, metadata interface +4. For images, subclass `HyraxImageDataset` to get transform stacking + +### Adding a New Verb +1. Create class in `src/hyrax/verbs/` with `run()` and optionally `run_cli()` +2. Add `@hyrax_verb("verb_name")` decorator +3. Implement `setup_parser(parser)` class method for CLI argument parsing +4. Set `add_parser_kwargs` class attribute for help text +5. 
Available via CLI: `hyrax verb_name [args]` + +## Troubleshooting + +- **Import errors**: Ensure `pip install -e .'[dev]'` completed successfully +- **Network timeouts**: Retry installation multiple times (3-5 attempts may be needed) +- **CLI not found**: Verify with `pip list | grep hyrax` +- **Tests failing**: Check virtual environment and dependencies +- **Pre-commit issues**: Run `pre-commit install` if hooks not working + +## Performance Notes + +- Vector database operations can be slow with large datasets +- ChromaDB performance degrades with vectors >10,000 elements +- UMAP fitting limited to 1024 samples by default for performance +- Benchmarks available in `benchmarks/` directory (run with `asv` tool)
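
A sample cap such as the 1024-sample UMAP default is commonly applied by fitting on a random subset and then transforming the full set. The sketch below shows only the index-selection step, using the standard library. It assumes uniform random sampling without replacement; this document does not state how Hyrax chooses its samples, and `subsample_indices` is a hypothetical helper, not a Hyrax function.

```python
import random
from typing import List


def subsample_indices(n_rows: int, max_samples: int = 1024, seed: int = 0) -> List[int]:
    """Return sorted row indices for fitting, capped at `max_samples`."""
    if n_rows <= max_samples:
        # Small inputs are used in full.
        return list(range(n_rows))
    # A seeded generator keeps the selection reproducible across runs.
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), max_samples))


# Example: pick the rows of a 5000-vector inference result to fit on.
indices = subsample_indices(5000)
```

Fixing the seed keeps the selected subset stable between runs, in line with the reproducibility principle stated earlier.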