Conversation

@mesakhcienet (Contributor) commented Oct 20, 2025

Description

Migrate the DeepSeek models' split-batch option to use the NNX module API.
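For context, a split-batch (micro-batch) schedule divides each global batch into smaller chunks, runs the step computation per chunk, and averages the accumulated gradients. The sketch below is purely illustrative pure Python, not MaxText's actual implementation; the function names `split_batch` and `train_step` and the scalar "gradient" are hypothetical stand-ins.

```python
# Illustrative sketch only (NOT MaxText code): the general idea behind a
# split-batch schedule, under the assumption that it means micro-batch
# gradient accumulation.

def split_batch(batch, num_splits):
    """Split a batch (a list of examples) into num_splits micro-batches."""
    size = len(batch) // num_splits
    return [batch[i * size:(i + 1) * size] for i in range(num_splits)]

def train_step(batch, grad_fn, num_splits=2):
    """Accumulate a (toy, scalar) gradient over micro-batches, then average."""
    accumulated = 0.0
    for micro_batch in split_batch(batch, num_splits):
        accumulated += grad_fn(micro_batch)
    return accumulated / num_splits
```

In the real model the per-micro-batch computation is a JAX-traced forward/backward pass over module parameters; the NNX migration changes how those modules are defined, not this high-level schedule.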

Tests

We use XPK to create a TPU cluster and launch the workload.

Environment

Cluster

  • TPU type: v6e-32
  • Number of slices: 4
  • GKE version: 1.31.11-gke.1036000
  • Base image: us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:jax0.7.0-rev1

Image

Build image command:

bash docker_build_dependency_image.sh DEVICE=tpu MODE=stable_stack BASEIMAGE=us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:jax0.7.0-rev1

Test command

Run XPK command:

MODEL_NAME="deepseek3-test"
python xpk.py workload create --cluster $CLUSTER_NAME \
  --base-docker-image mesa_maxtext_base_image \
  --workload=$WORKLOAD \
  --tpu-type=${TPU_TYPE} --num-slices=${NUM_SLICES} --max-restarts=10 \
  --on-demand \
  --script-dir=$MAXTEXT_SCRIPT_DIR --command \
  "python3 -m MaxText.train MaxText/configs/base.yml \
  run_name=runner_direct_${idx} \
  base_output_directory=${BASE_OUTPUT_DIRECTORY} \
  model_name=${MODEL_NAME} \
  dataset_type=synthetic \
  async_checkpointing=false \
  per_device_batch_size=1 \
  metrics_file='metrics.txt' \
  steps=15 use_batch_split_schedule=true"

Log

Training diffs before and after the migration:

  • Before migration (from main branch) : link
  • After migration (train from scratch) : link
  • After migration (training resumed from a checkpoint produced by the main branch before migration, with the steps argument increased from 15 to 40): link

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@gobbleturk (Collaborator) commented:

Manually adding pull ready here to trigger copybara so it's easier for @jesselu-google to review/test.

@gobbleturk (Collaborator) commented:

@jesselu-google has approved, and this LGTM.

@mesakhcienet force-pushed the feat/migrate-deepseek-split-batch-to-nnx branch 2 times, most recently from 88e0d16 to 5473e91 on October 21, 2025 03:08
@bvandermoon (Collaborator) left a comment:

Thanks @mesakhcienet. Generally LGTM. Could you collect before/after profiles from the training you ran so we can compare the HLOs?

@mesakhcienet (Contributor, Author) commented:

> Thanks @mesakhcienet. Generally LGTM. Could you collect before/after profiles from the training you ran so we can compare the HLOs?

Sure, are these the training profiles you mentioned?
https://diff.googleplex.com/#key=4EtN9XxcFDga

Feel free to tell me if any additional information is needed, thank you.

@bvandermoon (Collaborator) commented:

> > Thanks @mesakhcienet. Generally LGTM. Could you collect before/after profiles from the training you ran so we can compare the HLOs?
>
> Sure, are these the training profiles you mentioned? https://diff.googleplex.com/#key=4EtN9XxcFDga
>
> Feel free to tell me if any additional information is needed, thank you.

Thanks @mesakhcienet. The profiles are a bit different from these logs: with the profiles, we can tell whether the HLOs are the same before and after this change. Followed up on this in a separate offline thread.

@mesakhcienet force-pushed the feat/migrate-deepseek-split-batch-to-nnx branch from 5473e91 to 1f87365 on October 22, 2025 06:19
@mesakhcienet force-pushed the feat/migrate-deepseek-split-batch-to-nnx branch from 1f87365 to 7825f79 on October 22, 2025 06:23