2 changes: 2 additions & 0 deletions docs/source/developer_guide/feature_guide/ACL_Graph.md
@@ -55,6 +55,8 @@ Obviously, we can solve this problem by capturing the biggest shape and padding

```

In vLLM, these thresholds are set by `cudagraph_capture_sizes`. The default capture sizes are like `[1,2,4,8,16,24,32,...,max_capture_size]`. You can customize capture sizes to get fine-grained control over performance. For example, we can set `cudagraph_capture_sizes` as `[1,2,4,6,12,18]` when running Qwen3-235B on decode node in large ep.
Contributor

high

This is a helpful addition. However, to prevent potential confusion for developers, I suggest a couple of clarifications:

  1. Confusing Parameter Name: This document is about ACL Graphs for Ascend NPUs, but the parameter is named `cudagraph_capture_sizes`. This is potentially misleading. It would be good to add a note clarifying that this parameter applies to both CUDA and ACL graphs.
  2. Undefined Acronym: The term "ep" in "large ep" is ambiguous. Please spell it out (e.g., "expert parallelism") for clarity.

Given this is developer-facing documentation where clarity is crucial to prevent configuration errors, I've provided a suggestion to improve it.

Suggested change
In vLLM, these thresholds are set by `cudagraph_capture_sizes`. The default capture sizes are like `[1,2,4,8,16,24,32,...,max_capture_size]`. You can customize capture sizes to get fine-grained control over performance. For example, we can set `cudagraph_capture_sizes` as `[1,2,4,6,12,18]` when running Qwen3-235B on decode node in large ep.
In vLLM, these thresholds are set by `cudagraph_capture_sizes`. Note that this parameter applies to both CUDA graphs and Ascend's ACL graphs. The default capture sizes are like `[1,2,4,8,16,24,32,...,max_capture_size]`. You can customize capture sizes to get fine-grained control over performance. For example, we can set `cudagraph_capture_sizes` as `[1,2,4,6,12,18]` when running Qwen3-235B on a decode node in a large expert parallelism (EP) setup.
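
For reference, here is a minimal sketch of how these capture sizes could be passed through vLLM's offline API. The `compilation_config` argument and the `cudagraph_capture_sizes` field exist in recent vLLM releases, but the model name and parallelism settings below are only illustrative assumptions:

```python
from vllm import LLM, SamplingParams

# Sketch: restrict graph capture to the batch sizes expected on a decode node.
# The capture sizes mirror the example in the doc and are illustrative,
# not a tuned recommendation.
llm = LLM(
    model="Qwen/Qwen3-235B-A22B",  # assumed model identifier for illustration
    tensor_parallel_size=4,        # illustrative parallelism setting
    compilation_config={
        "cudagraph_capture_sizes": [1, 2, 4, 6, 12, 18],
    },
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

The same dictionary can presumably be supplied on the command line via `--compilation-config` when launching `vllm serve`; check the installed vLLM version for the exact flag name.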


### Piecewise and Full graph

Due to the increasing complexity of the attention layer in current LLMs, we can't ensure that all types of attention can run in graph mode. In MLA, prefill_tokens and decode_tokens have different calculation methods, so when a batch contains both prefills and decodes, graph mode has difficulty handling this situation.
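
As a rough illustration of the piecewise idea (a sketch only; the `splitting_ops` field and the op names below are assumptions that should be checked against the installed vLLM / vLLM-Ascend version), attention can be excluded from capture so only the surrounding layers run inside the graph:

```python
from vllm import LLM

# Sketch: with piecewise capture, the ops listed in splitting_ops run eagerly,
# so mixed prefill/decode batches are handled outside the graph, while the
# layers between them are captured. Field and op names are assumptions.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # assumed MLA model for illustration
    compilation_config={
        "splitting_ops": [
            "vllm.unified_attention",
            "vllm.unified_attention_with_output",
        ],
        "cudagraph_capture_sizes": [1, 2, 4, 8],
    },
)
```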