Skip to content

Conversation

@b0noI
Copy link

@b0noI b0noI commented May 27, 2025

No description provided.

Slava Kovalevskyi and others added 9 commits May 19, 2025 10:38
…ective as Jules:

I've updated the GoodPut guide based on your feedback.

This update significantly enhances GOODPUT_GUIDE.md to make it more comprehensive and address your specific requests:

- **Generalization:** I've reframed the guide as a general document for GoodPut optimization, using the LLaMA3-1-70B recipe as a case study.
- **Sources of BadPut:** I added a new section detailing common sources of inefficiency, including quantitative examples and references from the Google Cloud blog post.
- **Enhanced Technical Details:**
    - The "Elastic Training Solutions" section now includes more in-depth explanations of the Supervisor components (Sensor, Controller, Actuator), Host Monitors, and their interactions. It also provides more specific links to configuration parameters in `values-supervisor.yaml`, `ksa-setup.yaml`, the supervisor Helm chart, and the NVIDIA Resiliency Extension.
    - The "Optimized Checkpointing Solutions" section offers more detailed descriptions of asynchronous and distributed checkpointing, and the role of GCS FUSE. It includes more specific links to flags in `values.yaml`, GCS configuration in `values-gcs.yaml`, the gcs-fuse Helm chart, and external PyTorch/GCS FUSE documentation.
- **Link Precision:** I've made the links to files within the repository (e.g., `values.yaml`, `README.md`) in the "DIY" and "Goodput Analysis" sections more explicit and accurate.

The guide now provides a more thorough overview of GoodPut concepts, practical implementation details based on the recipe, and pointers to relevant code, configurations, and external resources.
This commit addresses several improvements to the GOODPUT_GUIDE.md:

- Downloads and embeds the previously externally linked image.
- Converts various plain text key references (e.g., `heartbeat_polling_period_s`) and file paths (e.g., `values-supervisor.yaml`, `ksa-setup.yaml`, Helm chart paths) into relative markdown links.
- Integrates the "DIY/Lego blocks" concept into existing sections rather than having a standalone section.
- Adds a new subsection explaining how to use the Supervisor system with a custom model, detailing deployment, configuration, Actuator integration, and checkpointing/resumption needs.
…and add a checkpointing preamble.

This change addresses your request to:
- Extend the detailed, link-rich style of the Supervisor section to other relevant sections, particularly Remediation Strategies and Goodput Analysis.
- Add a preamble to the Optimized Checkpointing section to guide you on choosing between asynchronous, distributed checkpointing, and how these relate to the overall resiliency managed by the Supervisor system.
- Enhance descriptions of various checkpointing parameters to clarify their use and tunability.
- Remove all 'State of Support' sections for brevity.
- Remove the 'Conclusion' section as requested.
This commit addresses several improvements for the GoodPut guide:

1.  **Adds a TLDR Section:**
    A new "TLDR: Recommended Lego Blocks for Your Deployment" section
    has been added after the introduction. This section provides a
    concise, actionable summary of key strategies (Supervisor System,
    Asynchronous Checkpointing, Distributed Checkpointing, GCS with FUSE,
    and Checkpoint Frequency Tuning) to help you quickly understand
    how to improve GoodPut for your own workloads.

2.  **Updates File References to Clickable Links:**
    All plain text references to internal files (e.g., `values.yaml`,
    `values-supervisor.yaml`, `README.md`, Helm chart paths) and relevant
    external documentation URLs within the guide have been converted
    into clickable Markdown links. This improves navigability and makes
    it easier for you to access referenced resources and code.
    Relative paths for internal files have been carefully constructed.

These changes aim to make the guide more user-friendly, easier to navigate,
and more actionable for customers looking to apply these resiliency and
efficiency techniques to their ML training deployments.
This commit significantly refactors the GOODPUT_GUIDE.md to improve its clarity, provide more concrete examples, and enhance its overall impact based on your feedback.

Key changes include:
- Enhanced Introduction: Added a concrete example with calculations to quantify the cost of wasted GPU hours due to low GoodPut.
- Added BadPut Summary Table: Introduced a table at the beginning of the guide, listing common sources of BadPut, their descriptions, and estimated potential GoodPut loss percentages. This provides an upfront overview of challenges.
- Restructured "TLDR: Recommended Lego Blocks": Reordered the "Lego Blocks" to prioritize checkpointing strategies, followed by the supervisor system, as checkpointing often offers significant initial GoodPut improvements. Added quantitative justifications (potential GoodPut recovery) to these key blocks.
- Streamlined Content: Removed the redundant "Understanding Sources of BadPut (Lost Efficiency)" section, as its content was integrated into the new summary table.
- Added Concluding Section: A new section "Tying It All Together: A Holistic Approach" was added to summarize the guide's message and emphasize the synergistic application of the discussed strategies.
- Comprehensive Review: Ensured all examples, references, and internal links remain accurate and contextually appropriate after the restructuring.

These changes aim to make the guide more actionable and persuasive for you when looking to improve your ML training efficiency.
This commit adds a Table of Contents (TOC) to the `GOODPUT_GUIDE.md` document to improve navigation.

The TOC is placed after the initial introductory paragraph and includes links to all major H2 and H3 sections.

Output:
@b0noI b0noI requested a review from Copilot May 27, 2025 17:39
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a comprehensive GoodPut guide for large-scale ML training resilience along with significant enhancements and new tests in the ML GoodPut Measurement package.

  • Added a detailed documentation file (GOODPUT_GUIDE.md) explaining strategies for improving ML GoodPut.
  • Introduced and updated source code, tests, and configuration files for the ML GoodPut Measurement package covering caching, metrics, and utility functions.

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
training/a3mega/llama3-1-70b/nemo-pretraining-gke-resiliency/GOODPUT_GUIDE.md New documentation guide for improving ML GoodPut.
ml-goodput-measurement/pyproject.toml New package configuration and dependency specifications.
ml-goodput-measurement/ml_goodput_measurement/tests/goodput_cache_test.py Unit tests for the GoodputCache implementation.
ml-goodput-measurement/ml_goodput_measurement/tests/gcp_metrics_test.py Unit tests for GCP metrics reporting functionality.
ml-goodput-measurement/ml_goodput_measurement/src/goodput_utils.py Utility functions supporting Goodput/Badput calculations and metadata retrieval.
ml-goodput-measurement/ml_goodput_measurement/src/goodput_cache.py Implementation of the GoodputCache with update and retrieval methods.
ml-goodput-measurement/ml_goodput_measurement/src/gcp_metrics.py GCP metrics reporting functionality with batched time series sending.
Other files (LICENSE, CONTRIBUTING.md, CHANGELOG.md, init.py) Packaging, licensing, and contribution guidelines updates.
Comments suppressed due to low confidence (1)

ml-goodput-measurement/ml_goodput_measurement/src/goodput_utils.py:167

  • The function uses a generic substring check ('time') to extract a timestamp from keys, which may cause unintended matches. It is advisable to use a more explicit key name (e.g., 'timestamp') to improve clarity and reduce potential errors.
def get_timestamp_from_log_entry(entry: dict[str, Any]) -> Optional[datetime.datetime]:

@b0noI b0noI marked this pull request as draft May 27, 2025 17:42
@b0noI b0noI requested a review from Copilot May 27, 2025 17:44
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@b0noI b0noI requested a review from Copilot May 27, 2025 17:46
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant