Fuse Initializers Graph Transform #24175
base: main
Conversation
The newly added Graph Transform performs the following actions:
- Detect Cast node/s with a single FP16 initializer converting to FP32.
- Convert all such FP16 initializer/s to FP32 initializer/s.
- Fuse the newly created FP32 initializer/s into the corresponding FP32 node/s.
- Remove the FP16-to-FP32 Cast node/s.

Note: For naming purposes, the newly added Graph Transform is called "Fused Initializers Graph Transform" in long form and "FIGT" in short form.

Signed-off-by: Sunny Shukla <[email protected]>
This change helps with the following requirements:
- Ability to turn off the FIGT optimization.
- Ability to re-run Level 1 to Level 3 optimizations only if the FIGT optimization is applied.
- Keep the current flow of graph optimizations untouched.

Signed-off-by: Sunny Shukla <[email protected]>
Regarding to "fp16 initializers -> cast_from_fp16_to_fp32 -> fp32 node/s", I think it is possible to do |
Transformers actually run in a loop until no more graph modifications are made.
I have updated the description of this PR with a new Working section. This particular transform depends on "Insert Cast Transforms" to detect the unsupported nodes and produce the intermediate representation.
To the best of my knowledge, Transformers of a particular Level run in a loop until no more graph modifications are required. This deduction of mine is based on [...]. As of now, our current graph optimizations flow ([...]). If required, I can try to run all graph optimizations in a loop in [...].
I have a concern about the design change: "Insert Copy Nodes" assumes that the partition of nodes to EPs is finalized. We should not re-run partitioning later. I think there are two options:
Ok, I was not aware that "Insert Copy Nodes" assumes that the partition of nodes to EPs is finalized. In that case, yes, I agree, we shouldn't be running the partitioning after "Insert Copy Nodes". Also, keeping this type of optimization under a different level ...
I can move the Level 4 Fusion Optimization to execute before "Insert Copy Nodes" and re-run [...]. In addition, we saw considerable performance gains when we re-ran the Level 1, 2, and 3 optimizations after this fusion optimization was applied. The reason behind re-running the Level 1, Partitioning, Level 2, and Level 3 graph transforms is that, after the fusion, the nodes are now in a format that might be supported by other graph transforms that were skipped before. Hence, some of the transforms not applicable before are now valid and can be applied to create a more optimal graph for execution.
@sunnyshu-intel, could you merge the latest main and resolve the conflicts? Level >= 2 has the assumption that partitioning is done, since those optimizations are provider-specific, so we cannot run partitioning twice. There is probably no need to add a new level. That is because the optimizer could reduce memory usage: previously, it needs the FP16 initializer and also a temp buffer for the FP32 output of the Cast; after the fusion, it only needs memory for the FP32 version of the initializer. I do not think users need to exclude it explicitly. I suggest adding the new optimizer to Level 2; then the workflow is like:
OR
If you want to enable/disable it in testing, ORT has an internal option to disable some optimizers during session creation, like this.
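For reference, a minimal Python sketch of what such an exclusion might look like via a session config entry. The config key `optimization.disable_specified_optimizers` and the transformer name `FuseInitializersTransformer` are assumptions here, not confirmed by this PR; both should be checked against the ORT source the comment refers to.

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Assumption: this key matches ORT's internal "disable specified optimizers" option,
# and "FuseInitializersTransformer" is a hypothetical registered transformer name.
so.add_session_config_entry("optimization.disable_specified_optimizers",
                            "FuseInitializersTransformer")
sess = ort.InferenceSession("model.onnx", so)  # "model.onnx" is a placeholder path
```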
Description
Added a graph transform for mixed-precision graphs when FP16 compute is unavailable. At session creation, this graph transform converts FP16 initializers (which had been turned into FP16-to-FP32 Cast nodes) to FP32 initializers and fuses them with their next FP32 nodes.
Behavior before this change:
"fp16 initializers -> cast_from_fp16_to_fp32 -> fp32 node/s"
Behavior after this change:
"fp16 initializers converted to fp32 initializers then fused with fp32 node/s"
Motivation and Context
This change aims to run FP16 models without repeatedly casting FP16 initializers to FP32 at runtime, by fusing the resulting FP32 initializers with their next nodes when FP16 compute is not available.
Two separate commits for this PR
This PR consists of two separate commits.
Re-Running Level 1 to Level 3 optimizations after Level 4 / FIGT
The idea behind re-running the Level 1, Partitioning, Level 2, and Level 3 graph transforms is that, after the fusion of initializers with their respective nodes, the nodes are now in a format that might be supported by other graph transforms that were previously skipped. Hence, some transformations that previously could not be applied are now valid and can be applied to create a more optimal graph for execution.
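As a rough illustration of this ordering, here is a conceptual Python sketch. Every function is a stub standing in for an internal ORT pass, not a real API, and the review comments above note that partitioning should not actually be repeated once "Insert Copy Nodes" has run.

```python
# Conceptual sketch only: each function is a placeholder for an internal ORT pass.
def apply_level1(graph): return True      # pretend each pass may modify the graph
def partition_to_eps(graph): pass
def apply_level2(graph): return True
def apply_level3(graph): return True
def apply_figt(graph): return True        # FIGT: fuse FP16->FP32 cast initializers
def insert_copy_nodes(graph): pass

def optimize(graph):
    apply_level1(graph)
    partition_to_eps(graph)               # node-to-EP assignment
    apply_level2(graph)
    apply_level3(graph)
    if apply_figt(graph):
        # Per the description above, re-run Level 1, Partitioning, Level 2, and
        # Level 3 so transforms skipped on the pre-fusion graph get another chance.
        # (Reviewers point out partitioning must not be repeated after "Insert Copy Nodes".)
        apply_level1(graph)
        partition_to_eps(graph)
        apply_level2(graph)
        apply_level3(graph)
    insert_copy_nodes(graph)
    return graph

optimize({})  # placeholder "graph" object, just to show the call shape
```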
Documentation
We have not yet added any details about Level 4 to the Graph Optimizations in ONNX Runtime documentation. We might need a bit of guidance on how to update that documentation once the PR is accepted.
Working
Currently, the Fuse Initializers Graph Transform fuses Cast nodes that cast from FP16 to FP32 back into their next/output nodes. Below is an explanation of how this transform works. It depends on `InsertCastTransforms` to produce the intermediate representation, from which it fuses the initializers (that is, Cast nodes with no graph inputs, a single initializer input, and one output) back into the next/output node. After fusion, the link/edge between such a Cast node and the next/output node is removed.
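For readers who want to see the pattern in code, here is a minimal sketch of the detect-convert-fuse-remove steps, written against an onnx `GraphProto` rather than ORT's internal C++ graph. The function and variable names are my own, and the sketch ignores corner cases such as an FP16 initializer with multiple consumers.

```python
import numpy as np
from onnx import TensorProto, numpy_helper

def fuse_fp16_cast_initializers(graph):
    """Sketch of the fusion on an onnx GraphProto (not the ORT implementation):
    fold Cast(fp16 initializer -> fp32) into an fp32 initializer and drop the Cast."""
    inits = {t.name: t for t in graph.initializer}
    dead_casts = []
    for node in graph.node:
        if node.op_type != "Cast" or len(node.input) != 1:
            continue
        to = next((a.i for a in node.attribute if a.name == "to"), None)
        src = inits.get(node.input[0])
        if to != TensorProto.FLOAT or src is None or src.data_type != TensorProto.FLOAT16:
            continue
        # Register the FP32 copy under the Cast's output name so the downstream
        # FP32 node consumes the initializer directly, with no edge rewiring needed.
        fp32 = numpy_helper.from_array(
            numpy_helper.to_array(src).astype(np.float32), name=node.output[0])
        graph.initializer.extend([fp32])
        graph.initializer.remove(src)   # assumes the Cast was the only consumer
        dead_casts.append(node)
    for node in dead_casts:             # finally, remove the now-dead Cast nodes
        graph.node.remove(node)
    return graph
```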