
[ENHANCEMENT] Add a minimal "quickstart" example for single-GPU / CPU local runs #3994

@nathon-lee

Description

Background

Megatron-LM and Megatron Core are geared toward large-scale, multi-GPU and multi-node training. While this is essential for production and research at scale, it raises the entry barrier for new contributors and developers who don't have access to large GPU clusters. A minimal, local quickstart that runs on a single GPU (and provides a CPU fallback) would make the project much more accessible: it enables quicker development cycles, easier PR testing for external contributors, and a fast sanity-check path in CI.

Proposed change

  • Add an examples/quickstart/ directory containing:
    • train_small_gpt.py: a simplified training wrapper that runs a tiny GPT-like model (a few layers, a small hidden size) for a configurable number of steps.
    • configs/quickstart_small.yaml: minimal config for model size, dataset (synthetic/random data), optimizer, and runtime settings.
  • Add a doc page docs/get-started/local_quickstart.md with step-by-step instructions for:
    • setting up a Python virtual environment,
    • installing only the minimal dependencies required for the quickstart (torch, numpy, etc.),
    • running the example on a single GPU and on CPU (fallback).
  • Ensure the quickstart does not require TransformerEngine, FP8, or other optional binary dependencies. Use synthetic/random data so no dataset downloads are needed.
  • Optionally, add a lightweight script or Makefile target to run the quickstart easily (e.g., tools/quickstart/run_quickstart.sh).
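To make the proposal concrete, here is a rough sketch of what train_small_gpt.py could look like. The class names, sizes, and hyperparameters below are illustrative placeholders, not a committed design; the only requirements from this issue are that the model is tiny, the data is synthetic, and only torch is needed:

```python
import torch
import torch.nn as nn

class TinyGPTBlock(nn.Module):
    """One pre-norm-free transformer block: self-attention + MLP with residuals."""
    def __init__(self, hidden, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
        )
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.mlp(x))

class TinyGPT(nn.Module):
    """A deliberately tiny GPT-like stack: embedding, a few blocks, LM head."""
    def __init__(self, vocab=256, hidden=64, layers=2, heads=2):
        super().__init__()
        self.vocab = vocab
        self.embed = nn.Embedding(vocab, hidden)
        self.blocks = nn.ModuleList(TinyGPTBlock(hidden, heads) for _ in range(layers))
        self.head = nn.Linear(hidden, vocab)

    def forward(self, tokens):
        x = self.embed(tokens)
        for block in self.blocks:
            x = block(x)
        return self.head(x)

def train(steps=5, device="cpu", batch=4, seq=32):
    """Run a few optimization steps on synthetic tokens; returns the last loss."""
    model = TinyGPT().to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(steps):
        # Synthetic "dataset": random token ids, so no downloads are needed.
        tokens = torch.randint(0, model.vocab, (batch, seq), device=device)
        logits = model(tokens[:, :-1])
        loss = loss_fn(logits.reshape(-1, model.vocab), tokens[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        print(f"step {step}: loss {loss.item():.4f}")
    return loss.item()
```

At these sizes a short run completes in seconds on CPU, which is the property the quickstart (and any CI sanity job) would rely on.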

Acceptance criteria

  • A user can follow docs/get-started/local_quickstart.md and run a full training loop locally on:
    • a single GPU (if available), completing in a short time (e.g., <10 minutes for default steps),
    • or on CPU (with a longer but still reasonable runtime for verification).
  • The example uses synthetic/random input data and does not require external datasets or heavy optional dependencies.
  • The new files are placed under examples/quickstart/ and the docs are linked from the main README.md under "Getting Started".
  • The example is minimal, easy to read, and well-documented; it includes suggested command lines and the expected lightweight output.

Implementation notes / suggestions

  • train_small_gpt.py can be a thin wrapper that either reuses existing transformer modules or defines a small custom torch.nn.Module that mimics the model shapes used by Megatron modules. It should:
    • accept a --device flag (cuda or cpu) and a --steps flag for the number of optimization steps,
    • generate batches from random tensors so no dataset downloads are needed,
    • log a basic metric such as loss to stdout so users can verify training progress.
  • Keep the dependency list in docs minimal: torch, numpy, and pyyaml for configs. Mark transformer-engine and other heavy extras as optional for this quickstart.
  • Add a short CI sanity job (optional) that runs the example on CPU to verify the quickstart remains functional. This job can be lightweight and time-limited.
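The CLI handling described above could look roughly like the following. The flag names match the issue text; the fallback behavior (silently switching to CPU when CUDA is unavailable, rather than erroring) is a suggestion, not a settled decision:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical CLI for train_small_gpt.py; only the two flags from the issue.
    parser = argparse.ArgumentParser(description="Minimal single-GPU/CPU quickstart trainer")
    parser.add_argument("--device", choices=["cuda", "cpu"], default="cpu",
                        help="where to run; falls back to cpu if cuda is unavailable")
    parser.add_argument("--steps", type=int, default=20,
                        help="number of optimization steps")
    return parser.parse_args(argv)

def resolve_device(requested):
    # Degrade gracefully on machines without a GPU (or without torch's CUDA build),
    # so the same command line works everywhere, including a CPU-only CI job.
    try:
        import torch
        if requested == "cuda" and not torch.cuda.is_available():
            print("CUDA not available; falling back to cpu")
            return "cpu"
    except ImportError:
        return "cpu"
    return requested
```

A CPU-only CI sanity job would then just invoke the script with a small `--steps` value and assert a zero exit code.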

Suggested branch name
doc/quickstart-single-gpu

Labels
documentation, enhancement, good-first-issue

Estimated difficulty
Low

I am willing to submit a PR
I’m happy to implement this and submit a PR that adds examples/quickstart/ and docs/get-started/local_quickstart.md. The PR will include the minimal script(s), config, and documentation. If maintainers prefer a slightly different layout, I can adapt the changes to match repo conventions.
