[Frontend] Implement Online LayerNorm with Frontend Fusion and Lowering Pass Support #587
How does this perform in the end-to-end (E2E) integration? Has the E2E performance regression been resolved?
OK, the optimization of this operator still shows a performance decline. I think the affine + vector optimization strategy may not work, so I'm working on another optimization strategy.
In addition, online algorithms are effective at fusing multiple reductions that depend on one another and run over the same dimension, but they offer little benefit when there is only one reduction. I therefore recommend that you try fusing RMSNorm with its subsequent matmul.
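To illustrate the point about dependent reductions: LayerNorm needs the mean and the variance over the same dimension, and the variance depends on the mean. A minimal sketch in plain Python, assuming Welford's update as the online formulation (the PR's actual lowering may differ, and the function names here are hypothetical):

```python
import math

def layernorm_two_pass(x, eps=1e-5):
    """Reference: two separate reductions (mean, then variance)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]

def layernorm_online(x, eps=1e-5):
    """Online variant: mean and variance accumulated in a single pass.

    Uses Welford's update, an assumption for illustration only.
    """
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for count, v in enumerate(x, start=1):
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
    var = m2 / len(x)  # population variance, as LayerNorm uses
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]
```

Both functions should agree up to floating-point rounding. The online version reads the input once for the statistics, which is where the fusion benefit comes from when there are two dependent reductions.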
Thank you for your advice. You are right, I misunderstood the online algorithm. I will try to fuse RMSNorm with its subsequent matmul as you suggested.
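One possible reading of the suggested RMSNorm + matmul fusion, sketched in plain Python: because the RMS scale is a single scalar per row, it can be factored out of the matmul, so the sum of squares and the matmul accumulation share one loop over the hidden dimension. This assumes RMSNorm is defined as `x / sqrt(mean(x^2) + eps) * gain`. The function names and the row-vector layout are hypothetical and not taken from the PR.

```python
import math

def rmsnorm_then_matmul(x, gain, w, eps=1e-6):
    """Unfused reference: normalize first, then multiply by w (n x cols)."""
    n = len(x)
    rms = math.sqrt(sum(v * v for v in x) / n + eps)
    y = [v / rms * g for v, g in zip(x, gain)]
    cols = len(w[0])
    return [sum(y[k] * w[k][j] for k in range(n)) for j in range(cols)]

def rmsnorm_matmul_fused(x, gain, w, eps=1e-6):
    """Fused sketch: one pass over the hidden dimension for both reductions."""
    n = len(x)
    cols = len(w[0])
    sum_sq = 0.0
    acc = [0.0] * cols
    for k in range(n):
        sum_sq += x[k] * x[k]       # reduction 1: sum of squares
        xg = x[k] * gain[k]
        row = w[k]
        for j in range(cols):
            acc[j] += xg * row[j]   # reduction 2: unscaled matmul
    inv_rms = 1.0 / math.sqrt(sum_sq / n + eps)
    return [a * inv_rms for a in acc]  # apply the scalar scale once at the end
```

The two functions should match up to rounding. Whether this pays off in practice depends on vectorizing and parallelizing the fused loop.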
@GuoningHuang I have tried your PR. If you disable the matmul transpose fusion, performance doesn't drop, but it doesn't improve either. This may be related to the fusion strategy issue that @CBalaa mentioned, and to the fact that the fused code hasn't been vectorized or parallelized, so scalar execution shows no significant performance advantage. I'm not sure where the 50x performance gain you mentioned comes from.
I implemented flash attention using MLIR here. You can refer to the code here to add flash attention to the buddy frontend.
OK, thank you for your suggestion!
Thank you! I will try it. |
This PR introduces an optimized implementation of Online LayerNorm.
It provides both frontend operator fusion and the corresponding lowering pass to support end-to-end execution.