Conversation

@zbloss zbloss commented Oct 1, 2025

This pull request introduces infrastructure and documentation improvements to streamline development, testing, and deployment for the HRM repository. The most significant changes are the addition of a multi-stage GPU-enabled Dockerfile, a GitHub Actions CI workflow for Python linting and tests, and expanded installation and usage instructions in the README.md. There are also updates to dataset and training script usage, and minor code cleanups in the evaluation notebook.

Infrastructure & Deployment

  • Added a multi-stage Dockerfile supporting both FlashAttention 2 and 3 for Ampere and Hopper GPUs, including CUDA 12.6, Python 3.12, and optimized dependency installation using uv; a minimal sketch follows this list. This enables reproducible GPU builds for both development and production.
  • Added .dockerignore to exclude unnecessary files and directories (such as caches, data, checkpoints, notebooks, and test artifacts) from Docker build context, reducing image size and improving build performance.
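
A rough sketch of what such a Dockerfile can look like (the base image tag, stage names, and install commands below are illustrative assumptions, not the exact file added in this PR):

```dockerfile
# Illustrative sketch only; tags, stage names, and commands are assumptions,
# not the exact Dockerfile added in this PR.
FROM nvidia/cuda:12.6.2-devel-ubuntu24.04 AS base
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.12 python3.12-venv git curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*
# Install the uv package manager
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:${PATH}"
WORKDIR /app
COPY pyproject.toml ./
RUN uv sync

# Ampere GPUs: prebuilt FlashAttention 2 wheels
FROM base AS ampere
RUN uv pip install flash-attn --no-build-isolation
COPY . .

# Hopper GPUs: FlashAttention 3, built from source
FROM base AS hopper
RUN git clone https://github.com/Dao-AILab/flash-attention.git \
    && cd flash-attention/hopper \
    && uv run python setup.py install
COPY . .
```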

Continuous Integration

  • Introduced a GitHub Actions workflow (.github/workflows/ci.yml) to automatically lint and test the codebase on Python 3.11, 3.12, and 3.13, ensuring code quality and compatibility across multiple Python versions (see the workflow sketch after this list).
  • Specified the default Python version as 3.12 in .python-version for consistent local and CI environments.
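
For reference, a matrix workflow of this shape typically looks like the following; the lint and test tools named here (ruff, pytest) are assumptions, and the actual ci.yml may use different steps:

```yaml
# Illustrative sketch; the actual .github/workflows/ci.yml may differ.
name: CI

on: [push, pull_request]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: python -m pip install -e . ruff pytest
      - name: Lint
        run: ruff check .
      - name: Test
        run: pytest
```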

Documentation & Usage

  • Expanded the README.md with detailed package structure, installation options (including uv, pip, and Docker), FlashAttention setup instructions, a Python API usage example, and updated commands for dataset preparation, training, and evaluation to use the new scripts/ directory and uv run (see the sketch below). [1] [2] [3] [4] [5]
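
The updated invocation pattern looks roughly like this; the exact script names and arguments live in the README and the ones below are guesses, not verbatim commands:

```bash
# Illustrative commands; actual script names/arguments may differ.
uv sync                                        # install dependencies
uv run scripts/build_sudoku_dataset.py         # dataset preparation
uv run scripts/pretrain.py                     # training
uv run scripts/evaluate.py checkpoint=<ckpt>   # evaluation
```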

Notebook Cleanup

  • Minor import reordering and formatting improvements in arc_eval.ipynb for clarity and consistency. [1] [2]

Issues

Closes #88

zbloss commented Oct 1, 2025

I ran examples/02_train_sudoku_extreme.py to confirm the code still works as expected, and it does.

WandB Results

I was not able to replicate the results in the paper, but I am running on a much smaller GPU, so I had to decrease the batch size and learning rate, which I believe is the core reason for the discrepancy.

alexander-rakhlin commented Oct 2, 2025

@zbloss were you able to run evaluation of their ARC-2 checkpoint? I'm getting errors regarding a size mismatch.

zbloss commented Oct 2, 2025

@zbloss were you able to run evaluation of their ARC-2 checkpoint? I'm getting errors regarding a size mismatch.

I did not try to load the existing checkpoints; I couldn't get them to load before these changes either, and I saw similar errors.

@alexander-rakhlin

I did not try to load the existing checkpoints; I couldn't get them to load before these changes either, and I saw similar errors.

So this checkpoint works after your changes?

zbloss commented Oct 2, 2025

No, it does not work before or with the changes.

zbloss commented Oct 4, 2025

@alexander-rakhlin I have opened a PR to add this model to Hugging Face's Transformers library, with working checkpoints in safetensors format.

I'm waiting on the 🤗 team to review and approve, so you'll have to pull my fork if you want to use it immediately.

huggingface/transformers#41272

Weights are here:
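
If you want to try them before that PR lands, a minimal loading sketch (this assumes my transformers fork is installed, and the model id below is a placeholder, not the actual weights repository):

```python
# Minimal sketch: assumes the transformers fork from PR #41272 is installed.
# The model id is a placeholder, not the actual weights repository.
from transformers import AutoModel

model = AutoModel.from_pretrained("your-namespace/hrm-checkpoint")
print(model.config)
```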

@alexander-rakhlin

@zbloss thank you. I also trained Sudoku and it works just fine, except for invalid puzzles with multiple solutions. Currently, I am training ARC-1. I think I found the reason why their checkpoint fails and will let you know once I verify it.

@alexander-rakhlin

@zbloss
#90 (comment)
