Add Claude skills for correctness validation #13

Merged
MasterJH5574 merged 2 commits into mlc-ai:main from haok1402:0404-claude-skills
Apr 4, 2026

haok1402 (Collaborator) commented Apr 4, 2026

Quick regression check for code changes: run 15 training steps on both the base and feature branches, compare the loss curves, and flag any divergence.

What's in here

  • A Claude skill (.claude/skills/correctness-validation/) that walks the agent through the full workflow: figure out which models are affected, set up data, run both branches via git worktree, and compare the logs.
  • Per-model setup and validation scripts for DeepSeek-V2-Lite (4 GPU) and Qwen3-30B-A3B (16 GPU), plus SLURM-aware launch scripts.
  • compare.py parses stdout logs and checks cross-entropy loss, load-balance loss, and gradient norm step-by-step against a configurable tolerance (default 1e-3).
  • save_interval and save_location in TrainingCfg are now optional (default None). Setting save_location without save_interval lets you load a pretrained checkpoint without ever writing one back, which is what the validation scripts need to run a quick 15-step check without cluttering the checkpoint directory.
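
For readers who want a feel for the step-by-step comparison, here is a minimal sketch, not the actual compare.py. The log line format, the metric names `ce_loss`, `lb_loss`, and `grad_norm`, and the use of an absolute (rather than relative) difference are all assumptions made for illustration; only the three metrics and the 1e-3 default come from the description above.

```python
import re

# Hypothetical log format, e.g.:
#   step 3 | ce_loss 2.3456 | lb_loss 0.0123 | grad_norm 1.2340
# The real stdout format is not shown in this PR.
LINE_RE = re.compile(
    r"step\s+(?P<step>\d+)"
    r".*?ce_loss\s+(?P<ce_loss>[-+.\deE]+)"
    r".*?lb_loss\s+(?P<lb_loss>[-+.\deE]+)"
    r".*?grad_norm\s+(?P<grad_norm>[-+.\deE]+)"
)
METRICS = ("ce_loss", "lb_loss", "grad_norm")


def parse_log(text):
    """Return {step: {metric: value}} for every line that matches LINE_RE."""
    out = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            out[int(m["step"])] = {k: float(m[k]) for k in METRICS}
    return out


def compare(base, feat, tol=1e-3):
    """List (step, metric, base_value, feature_value) entries that disagree.

    Assumption: the tolerance is applied to the absolute difference.
    A step present in only one log is reported with metric "missing".
    """
    mismatches = []
    for step in sorted(base.keys() | feat.keys()):
        if step not in base or step not in feat:
            mismatches.append((step, "missing", None, None))
            continue
        for k in METRICS:
            b, f = base[step][k], feat[step][k]
            if abs(b - f) > tol:
                mismatches.append((step, k, b, f))
    return mismatches
```

An empty list from `compare(parse_log(base_text), parse_log(feature_text))` would mean the two runs agree within tolerance at every step; anything else points at the first step and metric worth inspecting.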

gemini-code-assist bot left a comment

Code Review

This pull request introduces a correctness validation framework to ensure code changes do not negatively impact training metrics. It includes a new Claude skill, model-specific setup and validation scripts for DeepSeek-V2-Lite and Qwen3-30B-A3B, and a utility to compare training logs. Additionally, the training configuration was modified to allow optional checkpointing. Feedback highlights a potential issue in the worktree setup documentation and recommends using explicit error handling instead of assertions for configuration checks.
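
The suggestion to prefer explicit errors over assertions can be illustrated with a small sketch. Python strips `assert` statements when run with `-O`, so a configuration check written as an assertion can silently disappear. The class below is a stand-in that contains only the two fields named in the PR description; the validation rules themselves (a save interval needs a destination and must be positive) and the `writes_checkpoints` helper are assumptions for illustration, not the repository's actual checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingCfg:
    # Illustrative stand-in: the real TrainingCfg has more fields.
    save_interval: Optional[int] = None
    save_location: Optional[str] = None

    def __post_init__(self):
        # Assumed rule: periodic saving needs somewhere to write to.
        # Raising ValueError keeps the check active even under `python -O`,
        # where an equivalent `assert` would be removed.
        if self.save_interval is not None and self.save_location is None:
            raise ValueError("save_interval is set but save_location is None")
        if self.save_interval is not None and self.save_interval <= 0:
            raise ValueError(
                f"save_interval must be positive, got {self.save_interval}"
            )

    @property
    def writes_checkpoints(self) -> bool:
        # Hypothetical helper: save_location alone means load-only.
        return self.save_interval is not None
```

Under these assumptions, `TrainingCfg(save_location="/path/to/ckpt")` is valid and load-only, while `TrainingCfg(save_interval=5)` fails fast with a readable message.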

MasterJH5574 (Member) left a comment

LGTM, good to have the first skill 🎉

@MasterJH5574 MasterJH5574 merged commit 84adf58 into mlc-ai:main Apr 4, 2026
1 check passed
@haok1402 haok1402 deleted the 0404-claude-skills branch April 4, 2026 21:34