

Collaborator

@zzzzwwjj zzzzwwjj commented Oct 23, 2025

What this PR does / why we need it?

Add aclgraph developer guide.

Does this PR introduce any user-facing change?

How was this patch tested?

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a developer guide for ACLGraph. The documentation is a good start, but it contains several grammatical errors and unclear phrasing that could be improved for better readability. I've provided a detailed suggestion to revise the text for clarity and correctness. Addressing these will make the guide much more helpful for developers.

Comment on lines 5 to 43
When in LLM inference, each token requires nearly thousand operator executions, and when CPU launching operators are slower than GPU, it will cause host bound. In severe cases, the GPU will be idle for more than half of the time. To solve this problem, we use graph in LLM inference.

```
eager mode:

cpu: | launch op1 | launch op2 | launch op3 | launch op4 | launch op5 |

gpu: | run op1 |free| run op2 |free| run op3 |free| run op4 |free| run op5 |

| <----- total time -----> |

graph mode:

cpu: | launch graph |

gpu: | run op1 | run op2 | run op3 | run op4 | run op5 |

| <----- total time -----> |

```

## How to use ACLGraph?

ACLGraph is enabled by default in V1 Engine, just set to use V1 Engine is enough.

## How it works?

In short, graph mode works in two steps: **capture and replay**. When engine starts, we will capture all of the ops in model forward and save it as a graph, and when req come in, we just replay the graph on gpus, and waiting for result.

But in reality, graph mode is not that simple.

### Padding and Grading

Due to graph can only replay the ops captured before, without doing tiling and checking graph input, so we need to ensure the consistency of the graph input, but we know that model input's shape depends on the request scheduled by Scheduler, we can't ensure the consistency.

Obviously, we can solve this problem by capturing the biggest shape and padding all of the model input to it. But it will bring a lot of redundant computing and make performance worse. So we can capture multiple graphs with different shape, and pad the model input to the nearest graph, it will greatly reduce redundant computing, but when `max_num_batched_tokens` is very large, the number of graphs that need to be captured will also become very large. But we know that when intensor's shape is large, the computing time will be very long, and graph mode is not necessary in this case. So all of things we need to do is:
1. Set a threshold;
2. When `num_scheduled_tokens` is bigger than the threshold, use `eager_mode`;
3. Capture multiple graphs within a range below the threshold;
Contributor


high

This section contains several grammatical errors, typos, and awkward phrases that make it difficult to read and understand. For a developer guide, clarity is crucial. I've suggested a rewrite of this section to improve readability and correctness. Notably, 'Grading' in the subheading is likely a typo for 'Bucketing' or a similar term.

During LLM inference, each token requires nearly a thousand operator executions. When the CPU is slower at launching operators than the GPU is at executing them, the process can become host-bound. In severe cases, the GPU may be idle for more than half of the time. To solve this problem, we use graph-based execution in LLM inference.

eager mode:

cpu: | launch op1 | launch op2 | launch op3 | launch op4 | launch op5 |

gpu: | run op1 |free| run op2 |free| run op3 |free| run op4 |free| run op5 |

     | <-------------------------- total time ---------------------------> |

graph mode:

cpu: | launch graph |

gpu: | run op1 | run op2 | run op3 | run op4 | run op5 |

     | <---------------- total time -----------------> |

## How to use ACLGraph?

ACLGraph is enabled by default in the V1 Engine. To use it, simply ensure you are using the V1 Engine.
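For illustration, here is a minimal offline-inference sketch. The model name below is only a placeholder, and on recent vLLM releases the V1 engine is already the default, so setting `VLLM_USE_V1` explicitly may be redundant:

```python
import os

# Select the V1 engine explicitly; ACLGraph is then used automatically.
# On recent vLLM releases V1 is already the default, so this may be redundant.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# The model name is only a placeholder for illustration.
# Passing enforce_eager=True instead would disable graph mode entirely.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=False)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```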

## How does it work?

In short, graph mode works in two steps: **capture and replay**. When the engine starts, we capture all operations in the model's forward pass and save them as a graph. When a request arrives, we simply replay the graph on the GPUs and wait for the result.
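To make capture and replay concrete, here is a minimal sketch using the analogous PyTorch CUDA graph API (`torch.cuda.CUDAGraph`). ACLGraph on NPU follows the same capture/replay pattern through the Ascend backend; its exact entry points are not shown here:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
static_input = torch.zeros(8, 1024, device="cuda")

# Warm up once in eager mode so one-time initialization is not captured.
with torch.no_grad():
    model(static_input)

# Capture: record every kernel launched by the forward pass into one graph.
graph = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(graph):
    static_output = model(static_input)

# Replay: copy new data into the captured input buffer, then relaunch the
# whole graph with a single host-side call instead of one launch per op.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()
result = static_output.clone()
```

The key constraint is visible here: the graph can only be replayed with tensors of the exact shape it was captured with, which is what motivates the padding discussion below.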

But in reality, graph mode is not that simple.

### Padding and Bucketing

Since a graph can only replay operations that were captured beforehand, it cannot perform dynamic actions like tiling or input validation. Therefore, we must ensure the consistency of the graph's input. However, the model's input shape depends on the requests scheduled by the Scheduler, so we cannot guarantee this consistency.

One solution is to capture the graph for the largest possible shape and pad all model inputs to match it. However, this introduces significant redundant computation and degrades performance. A better approach is to capture multiple graphs with different shapes and pad the model input to the nearest captured graph shape. This greatly reduces redundant computation. The trade-off is that when `max_num_batched_tokens` is very large, the number of graphs to capture also becomes very large. We also know that for large input tensor shapes, the computation time is long, making graph mode less beneficial. Therefore, the strategy is as follows:
1. Set a threshold.
2. When `num_scheduled_tokens` is larger than the threshold, use `eager_mode`.
3. Capture multiple graphs for a range of sizes below the threshold.
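A rough sketch of that dispatch logic, purely for illustration (the capture sizes, threshold, and function names below are made up and are not the actual vllm-ascend implementation):

```python
import bisect

# Illustrative values only; real sizes come from the compilation config.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]  # captured batch sizes
THRESHOLD = 512  # above this, fall back to eager mode


def select_graph_size(num_scheduled_tokens: int) -> int | None:
    """Return the smallest captured size that fits, or None for eager mode."""
    if num_scheduled_tokens > THRESHOLD:
        return None  # large batches: launch overhead is negligible, run eager
    idx = bisect.bisect_left(CAPTURE_SIZES, num_scheduled_tokens)
    return CAPTURE_SIZES[idx]


def run_forward(num_scheduled_tokens: int) -> None:
    size = select_graph_size(num_scheduled_tokens)
    if size is None:
        print(f"{num_scheduled_tokens} tokens -> eager mode")
    else:
        # Inputs are padded from num_scheduled_tokens up to `size`
        # so they match the shape the graph was captured with.
        print(f"{num_scheduled_tokens} tokens -> pad to {size}, replay graph")


for n in (3, 100, 513):
    run_forward(n)
```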

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 23, 2025
