[WebNN EP] Automatically use ml-tensor for outputs #24282
Conversation
### Description

If it would improve performance, this patch moves outputs to MLTensor-backed Tensors.

### Motivation and Context

We are currently performing an extra copy on output tensors located on the CPU when using the WebNN EP (MLTensor -(copy)-> wasm heap -(copy)-> JS). This patch removes this copy by moving the readback to JS instead of wasm. As an extra benefit, we can also start the readbacks and wait for them in parallel.

This change is similar to #23073.
/azp run all

No pipelines are associated with this pull request.

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).
js/web/lib/wasm/wasm-core-impl.ts

```typescript
  | 'ml-tensor'
  | 'ml-tensor-cpu-output';
```
Why can't `'ml-tensor'` be used for outputs?
It can be used for outputs. The issue this PR is trying to solve occurs when the developer asks for `preferredOutputLocation: "cpu"` (or leaves it undefined, since we default to "cpu").
Sorry, I may have missed some context. I'm still not understanding why we need `'ml-tensor-cpu-output'`.
If the developer creates an `InferenceSession` with `preferredOutputLocation: "cpu"` (or leaves it undefined), the output tensor readback from the device has to happen in the wasm code. All wasm code is restricted to the wasm heap, which means we need to copy the contents of the underlying `MLTensor` to the wasm heap (`webnn::DataTransfer::CopyTensor` in C++) and then copy them out of the wasm heap to a `Uint8Array`:

```typescript
new Uint8Array(data.buffer, data.byteOffset, data.byteLength).set(
  wasm.HEAPU8.subarray(dataOffset, dataOffset + data.byteLength),
);
```

It is more efficient to read the `MLTensor` contents back directly into a JS `Uint8Array` (1 copy vs. 2 copies). I added `'ml-tensor-cpu-output'` as a way to use `ml-tensor` as an output behind the scenes (bypassing the `DataTransfer::CopyTensor` code) while still returning a cpu-located Tensor to the developer.
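As an illustration, here is a minimal sketch of what the single-copy path can look like, assuming the WebNN `MLContext.readTensor(tensor, destination)` overload; the function and variable names are illustrative, not the actual EP internals (the `MLContext`/`MLTensor` types come from the WebNN IDL):

```typescript
// Sketch only: read an MLTensor output straight into a JS-owned buffer,
// skipping the intermediate copy through the wasm heap.
async function readOutputToCpu(
  mlContext: MLContext,
  outputTensor: MLTensor,
  byteLength: number,
): Promise<Uint8Array> {
  const data = new Uint8Array(byteLength);
  // readTensor(tensor, destination) resolves once the tensor contents
  // have been written into `data` (a single device -> JS copy).
  await mlContext.readTensor(outputTensor, data);
  return data;
}
```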
I think it can be simplified to always bypass the wasm copy. For example, WebGPU always uploads initializers to the GPU without reading them into wasm memory if the initializer data is in an external data file.
In my understanding, this PR tries to optimize the process of reading CPU data from an MLTensor as a model's output. While I am totally OK with the optimization itself, I have a concern about adding a new value to `SupportedTensorDataLocationForInputOutput`.

The session option `preferredOutputLocation` (of type `DataLocation`) defines how the user prefers the output's location. Specifically:

- setting it to `'cpu'`: we expect the output tensor's data to be a JavaScript TypedArray.
- setting it to `'ml-tensor'`: we expect the output tensor's data to be preserved as an instance of `MLTensor`.

I think these 2 values should cover all cases. `'ml-tensor-cpu-output'` seems the same as `'cpu'` to me, with the only difference being how the data is fetched. It looks like we can just apply the optimization for `'cpu'` with no need to introduce `'ml-tensor-cpu-output'`. Please let me know if this makes sense to you.
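For reference, a rough sketch of how the two documented values are selected through the onnxruntime-web session options (the model path and provider configuration below are illustrative):

```typescript
import * as ort from 'onnxruntime-web';

// 'cpu' (also the default when unset): output data comes back as a JS TypedArray.
// This is the path whose readback this PR optimizes.
const cpuSession = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['webnn'],
  preferredOutputLocation: 'cpu',
});

// 'ml-tensor': output data stays on the device as an MLTensor.
const deviceSession = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['webnn'],
  preferredOutputLocation: 'ml-tensor',
});
```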
One thing that confused me is that the API is not changed, but a new value is added to `SupportedTensorDataLocationForInputOutput` (`DataLocation` does not change). From `common/lib/tensor.ts`:

```typescript
/**
 * represent where the tensor data is stored
 */
export type DataLocation = 'none' | 'cpu' | 'cpu-pinned' | 'texture' | 'gpu-buffer' | 'ml-tensor';
```

How do you expect this change to get to users? If you want to add `ml-tensor-cpu-output` to `DataLocation`, what will be the difference between `"ml-tensor-cpu-output"` and `"cpu"`?

This change should not affect users. If the user runs a model with cpu-located Tensors for the graph outputs (fetches or `preferredOutputLocation`), this PR:
- Automatically moves them to ml-tensors
- Runs the model
- Copies the data back from the ml-tensors (see the readback sketch below)
- Returns cpu-located tensors back to the user
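A hedged sketch of that readback step, showing how the per-output readbacks can be started together and awaited in parallel (the parallelism benefit mentioned in the description); the context and tensor handles here are illustrative, not the actual EP internals:

```typescript
// Sketch only: kick off every output readback, then await them together,
// instead of reading each output back sequentially.
async function readAllOutputs(
  mlContext: MLContext,
  outputs: { tensor: MLTensor; byteLength: number }[],
): Promise<Uint8Array[]> {
  return Promise.all(
    outputs.map(async ({ tensor, byteLength }) => {
      const data = new Uint8Array(byteLength);
      await mlContext.readTensor(tensor, data);
      return data;
    }),
  );
}
```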
I see. Thanks for the clarification.
> I think these 2 values should cover all cases. `'ml-tensor-cpu-output'` seems the same as `'cpu'` to me, with the only difference being how the data is fetched. It looks like we can just apply the optimization for `'cpu'` with no need to introduce `'ml-tensor-cpu-output'`. Please let me know if this makes sense to you.
I have removed `'ml-tensor-cpu-output'`. Since we can't move all `'cpu'` tensors to `'ml-tensor'` (e.g. some could have come from CPU EP fallback nodes) and we still need to tell the C++ code to use `ml-tensor`, the code is more complicated.
> Since we can't move all 'cpu' tensors to 'ml-tensor' (e.g. some could have come from CPU EP fallback nodes)

I checked the change in 11d5966 and agree that this is more complicated. I think it's good to bring `'ml-tensor-cpu-output'` back, but we'd better add some comments for it.

(Sorry for the back and forth. I'm just trying to figure out the cleanest way to make the change.)
No problem. I was also not happy about adding a new pseudo-location. I have reverted the change and added a comment.