Adopt inductor fusion and define quantization fusion pass #4168
Conversation
Signed-off-by: Icey <[email protected]>
Signed-off-by: wxsIcey <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts; please resolve them before we can evaluate it.
Currently, operator fusion is achieved through pattern matching with Inductor. Using aot-autograd could be future work, but we found that it causes accuracy issues. @whx-sjtu would you be willing to review it?
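For context, here is a minimal, self-contained sketch of the Inductor pattern-matcher mechanism this PR builds on. The toy relu/clamp ops are illustrative only; the PR itself matches npu_add_rms_norm followed by quantization and rewrites it to the fused npu_add_rms_norm_quant kernel.

```python
import torch
from torch._inductor.pattern_matcher import (PatternMatcherPass, fwd_only,
                                              register_replacement)

patterns = PatternMatcherPass()


def search_pattern(x, y):
    # Subgraph to look for in the captured FX graph.
    return torch.relu(x + y)


def replacement(x, y):
    # What the matched subgraph is rewritten to (stands in for a fused kernel).
    return torch.clamp(x + y, min=0.0)


# Example inputs only drive tracing of the pattern; shapes are placeholders.
example_inputs = [torch.empty(8, 8), torch.empty(8, 8)]
register_replacement(search_pattern, replacement, example_inputs, fwd_only,
                     patterns)

# A custom compilation pass can later call `patterns.apply(graph)` on the
# post-grad FX graph to perform the rewrite.
```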
whx-sjtu
left a comment
Nice work. We finally managed to use Inductor's pattern_matcher to fuse our add_rms_norm_quant kernel into the FX graph. The overall idea looks good to me, with some questions about details in the review comments below.
return shape_list

class AscendAdaptor(CompilerInterface):
The name AscendAdaptor is too vague; I suggest a more specific one like AscendCompiler.
Pattern for AddRMSNormQuant fusion.
"""
output = torch.ops.npu.npu_add_rms_norm(rms_norm_input, residual,
                                        rms_norm_weight, 1e-6)
Instead of being fixed to 1e-6, the eps should be defined as a static variable of AddRMSNormQuantPattern, with different eps values corresponding to different pattern objects. Some models might use a different eps such as 1e-5.
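One possible shape for this suggestion (an illustrative sketch, not the PR's final code; `torch.ops.npu.*` requires torch_npu on an Ascend device): carry eps on the pattern object so a separate pattern can be registered per eps value.

```python
import torch


class AddRMSNormQuantPattern:
    """Pattern parameterized by eps (sketch; names follow the PR's excerpt)."""

    def __init__(self, eps: float):
        self.eps = eps

    def get_pattern(self):
        eps = self.eps  # captured so the traced pattern bakes in this value

        def pattern(rms_norm_input, residual, rms_norm_weight):
            # Same subgraph as in the quoted excerpt, but with a configurable eps.
            return torch.ops.npu.npu_add_rms_norm(rms_norm_input, residual,
                                                  rms_norm_weight, eps)

        return pattern


# One pattern object (and one registration) per eps value in use.
patterns_by_eps = {eps: AddRMSNormQuantPattern(eps) for eps in (1e-6, 1e-5)}
```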
def __init__(self, vllm_config):
    super().__init__(vllm_config)
    self.patterns: PatternMatcherPass = PatternMatcherPass(
The name self.patterns is a bit confusing here. It should be named something like self.pattern_match_pass.
arg_dtypes, list) and len(arg_dtypes) > 0 else arg_dtypes
# We found that the kernel npu_add_rms_norm_quant accepts varying data formats
# for different dtypes; therefore, we only provide the solution for bfloat16 here.
return dtype in (torch.bfloat16, )
I don't quite understand this part. Does the data format also influence pattern matching? Maybe we can define patterns separately for bf16 and fp16 to support them both?
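If separate per-dtype patterns turn out to be feasible, the registration loop could look roughly like the sketch below. This assumes npu_add_rms_norm_quant can be made to handle the fp16 data format, which the code comment above says is not currently the case.

```python
import torch

SUPPORTED_DTYPES = (torch.bfloat16, torch.float16)


def make_example_inputs(dtype: torch.dtype, hidden: int = 128):
    # Placeholder shapes; the dtype is what distinguishes the traced patterns.
    return [
        torch.empty(4, hidden, dtype=dtype),   # rms_norm_input
        torch.empty(4, hidden, dtype=dtype),   # residual
        torch.empty(hidden, dtype=dtype),      # rms_norm_weight
    ]


for dtype in SUPPORTED_DTYPES:
    example_inputs = make_example_inputs(dtype)
    # register_replacement(pattern, replacement, example_inputs, fwd_only, pass_)
    # would be called once here with the dtype-specific example inputs.
```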
whx-sjtu
left a comment
I have another question. With the current proposal, can we reuse the ready-made fusion passes defined in vLLM, such as the SequenceParallel fusion pass? I'm not very familiar with vLLM's current fusion-pass stack, so I'm confirming it here. Reusability is what we expect.
This feature is very important for vllm-ascend. I also hope @jgong5 can take some time to review this PR. Thanks.
Thank you for your reply. The current PR aims to define our own compiler backend to implement custom fusion. Reusing fusion passes in vLLM is my next goal; I will submit an RFC once the solution is finalized.
What this PR does / why we need it?
Adopt inductor fusion and define quantization fusion pass
Refer to vllm-project/vllm#23612.
Needs vllm-project/vllm#28623.
Does this PR introduce any user-facing change?
Yes, this PR adds a new additional_config option.
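Enabling the fusion from user code would presumably go through vLLM's additional_config. A hedged usage sketch follows: the key name enable_quantization_fusion is hypothetical and the model name is only a placeholder; see the PR for the actual option.

```python
from vllm import LLM

# Hedged usage sketch: the key name below is hypothetical; check this PR /
# the vllm-ascend docs for the actual option introduced by this change.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",        # placeholder model
    additional_config={
        "enable_quantization_fusion": True,  # hypothetical key name
    },
)
```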
How was this patch tested?