Canonical reference for AI coding assistants working on Hyrax. Tool-specific files
(CLAUDE.md, .github/copilot-instructions.md) contain only tool-specific overrides
and reference this file for shared guidance. Edit this file for changes that should
apply to all AI assistants; edit tool-specific files only for tool-specific behavior.
Hyrax is a low-code, model-agnostic platform for machine learning on large astronomical imaging surveys. It is built on PyTorch and PyTorch Ignite and handles the boilerplate around downloading cutouts, building latent representations, interactive visualization, and anomaly detection so astronomers can focus on science.
CRITICAL: Always keep these design principles in mind when making changes to Hyrax.
"Configuration OR Code" — Hyrax uses a deliberate three-tier system:
- Invisible — sensible defaults handle it; the user never thinks about it.
- Config value — the user sets a TOML key and Hyrax does the rest.
- Write code — when the config system is not enough, the user writes a class or function and points the config at it via an import path.
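The three tiers can be sketched as a config fragment. This is illustrative: the `model.name` key for an external import path appears later in this file, and selecting a built-in through the same key is an assumption consistent with that description:

```toml
# Tier 1 (invisible): write nothing; defaults from
# hyrax_default_config.toml apply.

# Tier 2 (config value): set a TOML key and Hyrax does the rest.
[model]
name = "HyraxAutoencoder"

# Tier 3 (write code): point the config at your own class
# via a fully qualified import path.
# name = "my_pkg.my_module.MyModel"
```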
Jupyter notebooks are the primary interface. The CLI (`hyrax` command) is the
secondary interface, intended for HPC / Slurm batch jobs. The CLI should be able to do
everything notebooks can. Design Hyrax so that any dataset or model class
(or other customization) can be authored in a notebook first and moved to an external
class later.
Make Easy Things Easy, Hard Things Possible
- Default workflows should "just work": Common use cases should require minimal configuration. Where configuration is necessary, we require it so the user is not surprised (e.g., which ML model is running?)
- Progressive complexity: Simple tasks should be simple; advanced features available when needed
- Sensible defaults: Default configurations in `hyrax_default_config.toml` should handle common scenarios
- Extensibility without complexity: Advanced users can extend with custom models, datasets, and verbs
- Clear extension points: Well-documented base classes (`Verb`, model base classes, dataset classes)
- Avoid adding Verbs: Only add a new verb if specifically told to do so.
- Avoid adding new configs: Too many configs present a harder learning curve for the user. If you must add a config, prefer a default that works for 90% of use cases.
- Remember our users and extenders ARE NOT Software Engineers: Externally facing notebook interfaces and even user-defined (dataset, model) classes need to be written so they can be understood by hyrax users. The vast majority of hyrax users can write python code in a single file or notebook, but don't really understand classes, multi-file projects, or anything more complex.
Results directories are the backbone of reproducibility. Each run creates a
timestamped directory (YYYYMMDD-HHMMSS-<verb>-<uid>) under results/ containing
model weights, config snapshots, and MLflow tracking data.
- Configuration as documentation: Config files serve as complete records of how experiments were run
- Version everything: Track model versions, data versions, and configuration versions
- Manifest files: Maintain manifests of downloaded data and processed results
- Deterministic defaults: Random seeds and other sources of variability should be configurable
- Take your data to go: Items in results directories should be self-contained and easy for a scientific user to examine outside of hyrax.
- ONNX export: Support model serialization for long-term reproducibility
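Concretely, a run's results directory might look like this. The directory-name format and the `runtime_config.toml` snapshot follow the conventions stated in this file; the other file names are illustrative only:

```
results/
└── 20240312-142530-train-a1b2c3/
    ├── runtime_config.toml     # config snapshot for this run
    ├── model_weights.pt        # illustrative name for saved weights
    └── mlruns/                 # MLflow tracking data (layout illustrative)
```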
Target scale: 10M–100M objects on a unix filesystem.
Document current behavior. When migrating away from old patterns, use clear error messages to guide users rather than silently supporting legacy behavior. When writing documentation, prefer compact inspirational examples to demonstrate the breadth of the framework.
Smooth and Legible Migration When APIs Change
- Clear deprecation warnings: When changing APIs, provide helpful deprecation messages
- Error-guided migration: Documentation describes how the current behavior works; errors explain which documentation to follow to move from old to new.
- Backward compatibility when possible: Maintain compatibility or provide clear upgrade path
Leave space for these to be implemented someday by keeping the "right now" invariants but DO NOT IMPLEMENT THE ASPIRATIONAL GOAL
Hyrax will someday support non-pytorch ML frameworks
- Right now we keep all ML tensors in numpy format until the moment PyTorch needs them
- Right now all verb and dataset classes communicate in numpy format over their interfaces
Someday there will be an ecosystem of datasets and models easily selectable by the user
- Right now Dataset classes should work with each other a-la-carte via DataProvider
- Most dataset classes will be external libraries, but not many examples exist presently.
Someday we will support iterable datasets
- Right now Datasets and DataProvider always have a length and map-style access by index, and are presumed to fit in memory.
- Future iterable datasets will have Datasets and DataProvider as a finite and loadable subset of an infinite data stream
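The "right now" invariant above can be sketched as a toy. This is not the real `HyraxDataset` base class, and real Hyrax datasets return numpy arrays over their interfaces; the toy only illustrates finite length plus map-style access:

```python
class ToyMapDataset:
    """Toy illustrating the current invariant: a finite length and
    map-style access by index. Not the real HyraxDataset base class.
    """

    def __init__(self, items):
        self._items = list(items)  # presumed to fit in memory

    def __len__(self):
        return len(self._items)

    def __getitem__(self, idx):
        return self._items[idx]

ds = ToyMapDataset([10, 20, 30])
print(len(ds), ds[1])  # finite length, indexed access
```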
When changing code, ensure that the current assumptions of the change appear to have always been true. Prefer leaving code better than you found it over preserving old assumptions.
Development setup:
- Python ≥ 3.11 (see `requires-python` in `pyproject.toml`)
- Create a conda env: `conda create -n hyrax python=3.11 && conda activate hyrax`
- Clone: `git clone https://github.com/lincc-frameworks/hyrax.git && cd hyrax`
- Run the setup script: `echo 'y' | bash .setup_dev.sh`
  - Installs the package in editable mode with dev extras
  - Installs pre-commit hooks
- Alternative manual install: `pip install -e .'[dev]' && pre-commit install`
```bash
# Fast tests (default suite)
python -m pytest -m "not slow"

# Slow / integration tests
python -m pytest -m "slow"

# All tests
python -m pytest

# Parallel tests
python -m pytest -n auto

# Lint and format (let the linter fix style — do not hand-tune)
ruff check src/ tests/
ruff format src/ tests/

# Pre-commit (runs ruff, mypy stubs, trailing whitespace, etc.)
pre-commit run --all-files

# Build docs
sphinx-build -M html ./docs ./_readthedocs
```

Repository layout:

```
src/hyrax/                  Main package
src/hyrax/models/           Model definitions and MODEL_REGISTRY
src/hyrax/datasets/         Dataset implementations and DATASET_REGISTRY
src/hyrax/verbs/            CLI verb implementations and VERB_REGISTRY
src/hyrax/config_schemas/   Pydantic schemas (experimental, data_request only)
src/hyrax/vector_dbs/       ChromaDB / Qdrant integrations
src/hyrax/downloadCutout/   Cutout downloading utilities
src/hyrax_cli/              CLI entry point (main.py)
tests/hyrax/                Test suite
docs/                       Sphinx documentation sources
example_notebooks/          Jupyter notebook examples
benchmarks/                 ASV performance benchmarks
```
Key files:
| File | Purpose |
|---|---|
| `pyproject.toml` | Project metadata, dependencies, ruff/pytest config |
| `src/hyrax/hyrax_default_config.toml` | Default configuration template |
| `src/hyrax/hyrax.py` | Main `Hyrax` class — config management, verb dispatch |
| `src/hyrax/config_utils.py` | `ConfigManager`, config merging, results directory creation |
| `src/hyrax/plugin_utils.py` | Dynamic class loading for external plugins |
| `.setup_dev.sh` | Development environment bootstrap |
Hyrax discovers components through three registries:
Models:
- Decorator: `@hyrax_model`
- Models must inherit from `torch.nn.Module` and implement `__init__`, `forward`, `train_batch`, and `prepare_inputs`.
- The decorator wires up save/load, optimizer, and criterion handling.
- Built-in: `HyraxAutoencoder`, `HyraxAutoencoderV2`, `HyraxCNN`, `SimCLR`, `ImageDCAE`, `HSCAutoencoder`, `HSCDCAE`, `HyraxLoopback`
- External plugins supported — use a fully qualified import path in the config (e.g. `model.name = "my_pkg.my_module.MyModel"`).
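For instance, a config selecting an external model might look like the fragment below. The `model.name` import-path mechanism is as described above; the `[model.MyModel]` section name and the `latent_dim` key are illustrative assumptions about how per-model sections are spelled:

```toml
[model]
# Registered built-in name, or a fully qualified import path
# for an external plugin:
name = "my_pkg.my_module.MyModel"

# Per-model options live in [model.<ModelName>] sections.
# Section and key below are illustrative assumptions.
[model.MyModel]
latent_dim = 64
```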
Datasets:
- Registration: automatic via `HyraxDataset.__init_subclass__`
- Built-in: `HyraxCifarDataset`, `HSCDataset`, `LSSTDataset`, `DownloadedLSSTDataset`, `FitsImageDataset`, `HyraxRandomDataset`, `HyraxCSVDataset`
- Utility/result classes: `ResultDataset`, `InferenceDataset`
- External plugins supported — same import-path mechanism as models.
Verbs:
- Decorator: `@hyrax_verb`; base class: `Verb`
- New verbs must be class-based: subclass `Verb`, implement `setup_parser`, `run_cli`, and `run`.
- Some legacy verbs (download, prepare, rebuild_manifest) are function-based in `hyrax.py`. Leave these alone; do not add new function-based verbs.
- Verbs are internal only — there is no public plugin system for external verb registration. External extensions register through models and datasets only.
Configuration is TOML-based. Resolution order:
- Explicit file via `--runtime-config` / `-c` (CLI) or the `config_file` parameter (API)
- `hyrax_config.toml` in the current working directory
- Packaged `hyrax_default_config.toml`
ConfigManager deep-merges user config over defaults (including external library
defaults discovered automatically). The runtime config is a plain mutable dict —
code reads and writes it freely at runtime via ConfigManager.set_config().
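The merge semantics can be sketched as follows. This is a self-contained illustration of the described behavior, not Hyrax's actual `ConfigManager` code:

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively merge `overrides` on top of `defaults`.

    Nested tables are merged key by key; scalar values from
    `overrides` win; defaults the user never mentions survive.
    """
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"model": {"name": "HyraxAutoencoder", "epochs": 10}}
user = {"model": {"epochs": 50}}
print(deep_merge(defaults, user))  # epochs overridden, name kept
```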
TOML has no `None`. Hyrax uses `false` as a sentinel meaning "not set / use default
behavior." Code that reads these keys must treat the boolean `False` as `None`.
Pydantic validation exists only for [data_request] (config_schemas/data_request.py)
due to that section's complexity with nested dictionaries. Do not add Pydantic validation
to other config sections — the rest of the config is validated by checking keys against
defaults, not by Pydantic schemas.
Note: ConfigDict appearing in config_schemas/ is Pydantic's ConfigDict, not a
custom Hyrax wrapper. The runtime config itself is an ordinary dict.
User configs carry a top-level config_version = N scalar. Hyrax stamps the
current version into hyrax_default_config.toml and uses
src/hyrax/config_migrations/ to upgrade older user configs forward on
load, before the merge step. Legacy configs without a config_version field
are assumed to be the latest version.
Each migration step lives in its own descriptively-named module (e.g.
001_rename_model_inputs_to_data_request.py) and self-registers via the @migration_step
decorator. CURRENT_CONFIG_VERSION is auto-derived from the highest registered
migration — do not bump it manually.
When you rename or restructure a config key, you must:
- Create `src/hyrax/config_migrations/migrations/00N_description.py` (e.g. `003_move_learning_rate.py`). Decorate the migration function with `@migration_step(from_version=N, key_renames={...})`. Import the decorator and helpers (`rename_table`, `move_key`) from `hyrax.config_migrations.migration_utils`. The module is auto-discovered via `pkgutil` — no import line is needed elsewhere. `CURRENT_CONFIG_VERSION` and `config_version` in `hyrax_default_config.toml` are both stamped automatically at runtime — do not bump either manually.
- Add a unit test to `tests/hyrax/test_config_migrations.py` covering both the "legacy config triggers migration" and "clean current-version config is a no-op" cases.
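The mechanism described above can be illustrated with a self-contained toy. The real decorator lives in `hyrax.config_migrations.migration_utils` and takes more arguments; this sketch only mirrors the described semantics (steps keyed by `from_version`, applied forward in order, current version derived from the highest registered step):

```python
_MIGRATIONS = {}  # from_version -> migration function

def migration_step(from_version):
    """Toy registrar mirroring the described @migration_step behavior."""
    def register(fn):
        _MIGRATIONS[from_version] = fn
        return fn
    return register

def current_config_version():
    # Auto-derived from the highest registered migration; never bumped by hand.
    return max(_MIGRATIONS) + 1

@migration_step(from_version=1)
def rename_model_inputs(config):
    # Mirrors 001_rename_model_inputs_to_data_request
    config["data_request"] = config.pop("model_inputs", {})
    return config

def migrate(config):
    """Apply registered steps in order until the config is current.

    A missing config_version is assumed to be the latest version,
    matching the rule stated above.
    """
    version = config.get("config_version", current_config_version())
    while version in _MIGRATIONS:
        config = _MIGRATIONS[version](config)
        version += 1
    config["config_version"] = version
    return config

old = {"config_version": 1, "model_inputs": {"survey": "HSC"}}
print(migrate(old))  # renamed key, config_version bumped
```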
Configs declaring a config_version higher than the installed Hyrax supports
are refused with a RuntimeError pointing at pip install -U hyrax.
High-level pipeline:
- Download — fetch cutouts from survey services; track progress in a manifest file.
- Prepare — apply transforms, build dataset splits (train / validate / test).
- Train — fit a model; results written to a timestamped results directory.
- Infer — run a trained model over a dataset; save latent representations.
- UMAP — reduce dimensionality of latent vectors for visualization.
- Visualize — interactive exploration in Jupyter (holoviews / bokeh).
- Vector DB — store and query latent vectors (ChromaDB or Qdrant).
Each verb that produces output creates its own timestamped results directory.
Testing:
- File naming: `tests/hyrax/test_<name>.py`
- Markers: `slow` for integration / E2E tests; unmarked tests are fast.
- Default test run: `python -m pytest -m "not slow"`
- Test data: `HyraxCifarDataset` (CIFAR-10 via torchvision), `HyraxRandomDataset`, and Pooch/Zenodo-hosted files for slow tests.
- Parallel execution: `pytest -n auto` (pytest-xdist).
- E2E tests exercise full pipelines (train → infer → umap → visualize).
Code conventions:
- Spelling — use `Dataset` (single word, lowercase 's') for class names and identifiers. In snake_case contexts, use `dataset`.
- Timestamped results dirs — `YYYYMMDD-HHMMSS-<verb>-<uid>` under `results/`. Each run snapshots its config as `runtime_config.toml` inside the directory.
- Batch indexing — data loaders use PyTorch's standard batch dimension (dim 0).
- Transform stacking — the `HyraxImageDataset` mixin stacks torchvision transforms via `Compose`; each `_update_transform` call wraps the existing stack.
- Distributed training — via PyTorch Ignite's `idist` utilities (`auto_model`, `auto_dataloader`). Supports `DataParallel` and `DistributedDataParallel`.
- Note-to-self hook — developers leave a grep-able four-character marker (`xc` repeated twice) for unfinished work. A pre-commit hook blocks commits containing this marker.
- Line length — 110 characters (`ruff` enforces this).
- Manifest files — FITS binary tables tracking download state. These are a known compromise / anti-pattern, not a design goal. They exist because there was no better option at the time. If extending manifest files seems like the right solution, ask the user for clarification first.
Models:
- Models defined in `src/hyrax/models/`
- Built-in models: `HyraxAutoencoder`, `HyraxCNN`
- Model registry system automatically discovers models
- General model configuration in the `[model]` section of config files
- Configurations for specific models in `[model.<ModelName>]` sections
- Training via the `hyrax train` command
- Export to ONNX format supported
Datasets:
- Data loaders in `src/hyrax/datasets/`
- Built-in datasets: `HSCDataset`, `HyraxCifarDataset`, `LSSTDataset`, `FitsImageDataset`
- Dataset splits: train/validation/test controlled by config
- Configuration in the `[data_set]` section
- Default data directory: `./data/`
- Sample data includes the HSC1k dataset for testing
Vector databases:
- Implementations in `src/hyrax/vector_dbs/`
- Supported: ChromaDB, Qdrant
- Commands: `save_to_database`, `database_connection`
- Configuration in the `[vector_db]` section
Visualization:
- Jupyter integration via `holoviews` and `bokeh` for visualizations
- Interactive visualization via the `hyrax visualize` verb
- Pre-executed examples in `docs/pre_executed/`
CI workflows:
- Main workflows in `.github/workflows/`
- Testing: `testing-and-coverage.yml` runs on PRs and the main branch
- Smoke test: `smoke-test.yml` runs daily
- Documentation: `build-documentation.yml` builds docs
- Benchmarks: ASV benchmarks via `asv-*.yml` workflows
- Pre-commit: automated via `pre-commit-ci.yml`
Troubleshooting:
- Import errors: Ensure `pip install -e .'[dev]'` completed successfully
- Network timeouts during install: Retry the installation; it may require 3-5 attempts due to PyPI connectivity issues
- ReadTimeoutError: Common during installation; wait 1-2 minutes and retry the same pip command
- CLI not found: Verify installation with `pip list | grep hyrax`
- Tests failing: Check that you are in the virtual environment and dependencies are installed
- Pre-commit issues: Run `pre-commit install` if hooks are not working
- Permission issues: Use the `--user` flag with pip if encountering permission errors
- Virtual environment: Always use conda/venv to avoid system Python conflicts
Performance notes:
- Vector database operations can be slow with large datasets
- Benchmarks available in `benchmarks/` (run with the `asv` tool)
- Use `--timeout` parameters appropriately for long-running operations
- ChromaDB performance degrades with vectors >10,000 elements
- UMAP fitting limited to 1024 samples by default for performance
- Benchmark tests include timing for CLI help commands, object construction, and vector DB operations