

Collaborator

@zzzzwwjj zzzzwwjj commented Oct 23, 2025

What this PR does / why we need it?

Add aclgraph developer guide.

Does this PR introduce any user-facing change?

How was this patch tested?

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a developer guide for ACLGraph. The documentation is a good start, but it contains several grammatical errors and unclear phrasing that could be improved for better readability. I've provided a detailed suggestion to revise the text for clarity and correctness. Addressing these will make the guide much more helpful for developers.

Comment on lines 5 to 43
When in LLM inference, each token requires nearly thousand operator executions, and when CPU launching operators are slower than GPU, it will cause host bound. In severe cases, the GPU will be idle for more than half of the time. To solve this problem, we use graph in LLM inference.

```
eager mode:

cpu: | launch op1 | launch op2 | launch op3 | launch op4 | launch op5 |

gpu: | run op1 |free| run op2 |free| run op3 |free| run op4 |free| run op5 |

| <----- total time -----> |

graph mode:

cpu: | launch graph |

gpu: | run op1 | run op2 | run op3 | run op4 | run op5 |

| <----- total time -----> |

```

## How to use ACLGraph?

ACLGraph is enabled by default in V1 Engine, just set to use V1 Engine is enough.

## How it works?

In short, graph mode works in two steps: **capture and replay**. When engine starts, we will capture all of the ops in model forward and save it as a graph, and when req come in, we just replay the graph on gpus, and waiting for result.

But in reality, graph mode is not that simple.

### Padding and Grading

Due to graph can only replay the ops captured before, without doing tiling and checking graph input, so we need to ensure the consistency of the graph input, but we know that model input's shape depends on the request scheduled by Scheduler, we can't ensure the consistency.

Obviously, we can solve this problem by capturing the biggest shape and padding all of the model input to it. But it will bring a lot of redundant computing and make performance worse. So we can capture multiple graphs with different shape, and pad the model input to the nearest graph, it will greatly reduce redundant computing, but when `max_num_batched_tokens` is very large, the number of graphs that need to be captured will also become very large. But we know that when intensor's shape is large, the computing time will be very long, and graph mode is not necessary in this case. So all of things we need to do is:
1. Set a threshold;
2. When `num_scheduled_tokens` is bigger than the threshold, use `eager_mode`;
3. Capture multiple graphs within a range below the threshold;
Contributor


high

This section contains several grammatical errors, typos, and awkward phrases that make it difficult to read and understand. For a developer guide, clarity is crucial. I've suggested a rewrite of this section to improve readability and correctness. Notably, 'Grading' in the subheading is likely a typo for 'Bucketing' or a similar term.

During LLM inference, each token requires nearly a thousand operator executions. When the CPU is slower at launching operators than the GPU is at executing them, the process can become host-bound. In severe cases, the GPU may be idle for more than half of the time. To solve this problem, we use graph-based execution in LLM inference.

eager mode:

cpu: | launch op1 | launch op2 | launch op3 | launch op4 | launch op5 |

gpu: | run op1 |free| run op2 |free| run op3 |free| run op4 |free| run op5 |

     | <-------------------------- total time ---------------------------> |

graph mode:

cpu: | launch graph |

gpu: | run op1 | run op2 | run op3 | run op4 | run op5 |

     | <---------------- total time -----------------> |

## How to use ACLGraph?

ACLGraph is enabled by default in the V1 Engine. To use it, simply ensure you are using the V1 Engine.
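For illustration, here is a minimal offline-inference sketch. The model name below is only a placeholder, and on recent vLLM releases the V1 engine is already the default, so setting `VLLM_USE_V1` explicitly may be redundant:

```python
import os

# Select the V1 engine explicitly; ACLGraph is then used automatically.
# On recent vLLM releases V1 is already the default, so this may be redundant.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# The model name is only a placeholder for illustration.
# Passing enforce_eager=True instead would disable graph mode entirely.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=False)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```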

## How does it work?

In short, graph mode works in two steps: **capture and replay**. When the engine starts, we capture all operations in the model's forward pass and save them as a graph. When a request arrives, we simply replay the graph on the GPUs and wait for the result.
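To make capture and replay concrete, here is a minimal sketch using the analogous PyTorch CUDA graph API (`torch.cuda.CUDAGraph`). ACLGraph on NPU follows the same capture/replay pattern through the Ascend backend; its exact entry points are not shown here:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
static_input = torch.zeros(8, 1024, device="cuda")

# Warm up once in eager mode so one-time initialization is not captured.
with torch.no_grad():
    model(static_input)

# Capture: record every kernel launched by the forward pass into one graph.
graph = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(graph):
    static_output = model(static_input)

# Replay: copy new data into the captured input buffer, then relaunch the
# whole graph with a single host-side call instead of one launch per op.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()
result = static_output.clone()
```

The key constraint is visible here: the graph can only be replayed with tensors of the exact shape it was captured with, which is what motivates the padding discussion below.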

But in reality, graph mode is not that simple.

### Padding and Bucketing

Since a graph can only replay operations that were captured beforehand, it cannot perform dynamic actions like tiling or input validation. Therefore, we must ensure the consistency of the graph's input. However, the model's input shape depends on the requests scheduled by the Scheduler, so we cannot guarantee this consistency.

One solution is to capture the graph for the largest possible shape and pad all model inputs to match it. However, this introduces significant redundant computation and degrades performance. A better approach is to capture multiple graphs with different shapes and pad the model input to the nearest captured graph shape. This greatly reduces redundant computation. The trade-off is that when `max_num_batched_tokens` is very large, the number of graphs to capture also becomes very large. We also know that for large input tensor shapes, the computation time is long, making graph mode less beneficial. Therefore, the strategy is as follows:
1. Set a threshold.
2. When `num_scheduled_tokens` is larger than the threshold, use `eager_mode`.
3. Capture multiple graphs for a range of sizes below the threshold.
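A rough sketch of that dispatch logic, purely for illustration (the capture sizes, threshold, and function names below are made up and are not the actual vllm-ascend implementation):

```python
import bisect

# Illustrative values only; real sizes come from the compilation config.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]  # captured batch sizes
THRESHOLD = 512  # above this, fall back to eager mode


def select_graph_size(num_scheduled_tokens: int) -> int | None:
    """Return the smallest captured size that fits, or None for eager mode."""
    if num_scheduled_tokens > THRESHOLD:
        return None  # large batches: launch overhead is negligible, run eager
    idx = bisect.bisect_left(CAPTURE_SIZES, num_scheduled_tokens)
    return CAPTURE_SIZES[idx]


def run_forward(num_scheduled_tokens: int) -> None:
    size = select_graph_size(num_scheduled_tokens)
    if size is None:
        print(f"{num_scheduled_tokens} tokens -> eager mode")
    else:
        # Inputs are padded from num_scheduled_tokens up to `size`
        # so they match the shape the graph was captured with.
        print(f"{num_scheduled_tokens} tokens -> pad to {size}, replay graph")


for n in (3, 100, 513):
    run_forward(n)
```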

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 23, 2025
