Add Claude skills for correctness validation by haok1402 · Pull Request #13 · mlc-ai/Pith-Train

haok1402 · 2026-04-04T19:33:53Z

Quick regression check for code changes — run 15 training steps on both the base and feature branch, compare loss curves, and flag if anything diverged.

What's in here

A Claude skill (.claude/skills/correctness-validation/) that walks the agent through the full workflow: figure out which models are affected, set up data, run both branches via git worktree, and compare the logs.
Per-model setup and validation scripts for DeepSeek-V2-Lite (4 GPU) and Qwen3-30B-A3B (16 GPU), plus SLURM-aware launch scripts.
compare.py parses stdout logs and checks cross-entropy loss, load-balance loss, and gradient norm step-by-step against a configurable tolerance (default 1e-3).
save_interval and save_location in TrainingCfg are now optional (default None). Setting save_location without save_interval lets you load a pretrained checkpoint without ever writing one back — exactly what the validation scripts need to run a quick 15-step check without cluttering the checkpoint directory.
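The step-by-step comparison described above can be sketched roughly as follows. This is a minimal illustration, not the actual compare.py: the log line layout, the metric labels in the regex, the function names, and the use of an absolute (rather than relative) difference are all assumptions made for the example.

```python
import re

# Hypothetical log layout; the real training stdout format may differ.
LINE_RE = re.compile(
    r"step\s+(?P<step>\d+).*?"
    r"cross-entropy-loss\s+(?P<ce>[-+.\deE]+).*?"
    r"load-balance-loss\s+(?P<lb>[-+.\deE]+).*?"
    r"gradient-norm\s+(?P<gn>[-+.\deE]+)"
)


def parse_log(text):
    """Map step -> metrics dict for every line that matches LINE_RE."""
    metrics = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            metrics[int(m["step"])] = {k: float(m[k]) for k in ("ce", "lb", "gn")}
    return metrics


def compare(base, feature, tol=1e-3):
    """Return (step, metric, base_value, feature_value) for each divergence.

    Assumption: the tolerance is applied to the absolute difference.
    """
    mismatches = []
    for step in sorted(base.keys() | feature.keys()):
        if step not in base or step not in feature:
            # A step logged on only one branch is reported as a mismatch.
            mismatches.append((step, "missing", base.get(step), feature.get(step)))
            continue
        for name, b in base[step].items():
            f = feature[step][name]
            if abs(b - f) > tol:
                mismatches.append((step, name, b, f))
    return mismatches
```

An empty list from `compare` means the two runs agree within tolerance at every logged step; any entry identifies the first place to look when a feature branch diverges.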

gemini-code-assist

Code Review

This pull request introduces a correctness validation framework to ensure code changes do not negatively impact training metrics. It includes a new Claude skill, model-specific setup and validation scripts for DeepSeek-V2-Lite and Qwen3-30B-A3B, and a utility to compare training logs. Additionally, the training configuration was modified to allow optional checkpointing. Feedback highlights a potential issue in the worktree setup documentation and recommends using explicit error handling instead of assertions for configuration checks.
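To illustrate the two configuration points raised here (optional checkpointing and explicit errors instead of assertions), a hedged sketch follows. `CheckpointCfg`, `validate`, and `should_save` are hypothetical stand-ins, not the real `TrainingCfg` interface, and the specific validation rules are assumptions chosen for the example.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class CheckpointCfg:
    """Hypothetical stand-in for the checkpoint fields of TrainingCfg."""

    save_interval: Optional[int] = None   # None: never write a checkpoint
    save_location: Optional[Path] = None  # None: nothing to load from or save to

    def validate(self) -> None:
        # Explicit exceptions still fire under `python -O`,
        # which strips assert statements.
        if self.save_interval is not None:
            if self.save_interval <= 0:
                raise ValueError("save_interval must be a positive integer")
            if self.save_location is None:
                raise ValueError("save_interval is set but save_location is missing")

    def should_save(self, step: int) -> bool:
        # With save_interval unset, a short run only loads from
        # save_location and never writes back.
        if self.save_interval is None:
            return False
        return step % self.save_interval == 0
```

Under these assumptions, a validation run configured with only `save_location` passes `validate()` and `should_save` stays false for all 15 steps, so the checkpoint directory is left untouched.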

.claude/skills/correctness-validation/SKILL.md

pithtrain/tasks/pretrain_language_model.py

MasterJH5574

LGTM, good to have the first skill 🎉

haok1402 added 2 commits April 4, 2026 15:29

Revise the interface to skip checkpoint saving if save_interval left …

785edd4

…unspecified to support short validation runs

Add the claude skills for correctness validation

6fe932e

gemini-code-assist bot reviewed Apr 4, 2026

.claude/skills/correctness-validation/SKILL.md (resolved)

pithtrain/tasks/pretrain_language_model.py (resolved)

MasterJH5574 approved these changes Apr 4, 2026

MasterJH5574 merged commit 84adf58 into mlc-ai:main Apr 4, 2026
1 check passed

haok1402 deleted the 0404-claude-skills branch April 4, 2026 21:34

MasterJH5574 merged 2 commits into mlc-ai:main from haok1402:0404-claude-skills
