
[WebNN EP] Automatically use ml-tensor for outputs #24282

Open
egalli wants to merge 5 commits into main

Conversation

@egalli (Contributor) commented Apr 2, 2025

Description

This patch moves outputs to MLTensor-backed Tensors when doing so would improve performance.

Motivation and Context

We currently perform an extra copy on output tensors located on the CPU when using the WebNN EP (MLTensor -(copy)-> wasm heap -(copy)-> JS). This patch removes that copy by moving the readback to JS instead of wasm. As an extra benefit, we can also start the readbacks and wait for them in parallel.

This change is similar to #23073

@snnn closed this Apr 3, 2025
@snnn reopened this Apr 3, 2025
@guschmue added the ep:WebNN WebNN execution provider label Apr 3, 2025
@snnn (Member) commented Apr 3, 2025

/azp run all


No pipelines are associated with this pull request.

@snnn (Member) commented Apr 3, 2025

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 5 pipeline(s).

Comment on lines 179 to 180
| 'ml-tensor'
| 'ml-tensor-cpu-output';
Contributor

Why can't 'ml-tensor' be used for outputs?

Contributor Author

It can be used for outputs.

The issue this PR is trying to solve occurs when the developer asks for preferredOutputLocation: "cpu" (or leaves it undefined, since we default to "cpu").

Contributor

I am sorry, but I may have missed some context. I still don't understand why we need 'ml-tensor-cpu-output'.

Contributor Author

If the developer creates an InferenceSession with preferredOutputLocation: "cpu" (or leaves it undefined), the output tensor readback from the device has to happen in the wasm code. All wasm code is restricted to the wasm heap, which means we need to copy the content of the underlying MLTensor into the wasm heap (webnn::DataTransfer::CopyTensor in C++) and then copy it out of the wasm heap into a Uint8Array:

    // Second copy: wasm heap -> JS-owned Uint8Array.
    new Uint8Array(data.buffer, data.byteOffset, data.byteLength).set(
      wasm.HEAPU8.subarray(dataOffset, dataOffset + data.byteLength),
    );

It is more efficient to read the MLTensor content back directly into a JS Uint8Array (1 copy vs. 2 copies).

I added 'ml-tensor-cpu-output' as a way to use ml-tensor as an output behind the scenes (bypassing the DataTransfer::CopyTensor code) while still returning a CPU-located Tensor to the developer.
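
For illustration, a minimal sketch of the direct readback path, assuming the WebNN MLContext.readTensor(tensor, outputData) overload and that WebNN type declarations are available; the function and variable names are hypothetical, not the actual onnxruntime-web internals:

    // Hypothetical sketch: one copy, device -> JS typed array, bypassing the
    // wasm heap (and webnn::DataTransfer::CopyTensor) entirely.
    async function readTensorToCpu(
      mlContext: MLContext,
      tensor: MLTensor,
      byteLength: number,
    ): Promise<Uint8Array> {
      const out = new Uint8Array(byteLength);
      // readTensor(tensor, outputData) writes the tensor contents directly
      // into the provided buffer.
      await mlContext.readTensor(tensor, out);
      return out;
    }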

Contributor

I think it can be simplified to always bypass the wasm copy. For example, WebGPU always uploads initializers to the GPU without reading them into WASM memory if the initializer data is in external data.

Contributor

In my understanding, this PR tries to optimize the process of reading CPU data from an MLTensor as a model's output. While I am totally OK with the optimization itself, I have a concern about adding a new value to SupportedTensorDataLocationForInputOutput.

The session option preferredOutputLocation (of type DataLocation) defines where the user prefers the output to be located. Specifically:

  • setting it to cpu: we expect the output tensor's data to be a JavaScript TypedArray.
  • setting it to ml-tensor: we expect the output tensor's data to be preserved as an instance of MLTensor.

I think these 2 values should cover all cases. 'ml-tensor-cpu-output' seems the same as 'cpu' to me; the only difference is how the data is retrieved. It looks like we can just apply the optimization for 'cpu' with no need to introduce 'ml-tensor-cpu-output'. Please let me know if this makes sense to you.
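
For illustration, a minimal usage sketch of those two values, assuming onnxruntime-web's InferenceSession options and the 'webnn' execution provider name; the model path is a placeholder:

    import * as ort from 'onnxruntime-web';

    // 'cpu' (the default): output.data is a TypedArray on the JS side.
    const cpuSession = await ort.InferenceSession.create('./model.onnx', {
      executionProviders: ['webnn'],
      preferredOutputLocation: 'cpu',
    });

    // 'ml-tensor': the output stays on the device as an MLTensor until the
    // caller reads it back explicitly.
    const mlTensorSession = await ort.InferenceSession.create('./model.onnx', {
      executionProviders: ['webnn'],
      preferredOutputLocation: 'ml-tensor',
    });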

Contributor

One thing that confused me is that the API is not changed, but a new value is added to SupportedTensorDataLocationForInputOutput (DataLocation does not change). From common/lib/tensor.ts:

  /**
   * represent where the tensor data is stored
   */
  export type DataLocation = 'none' | 'cpu' | 'cpu-pinned' | 'texture' | 'gpu-buffer' | 'ml-tensor';

How do you expect this change to reach users? If you want to add 'ml-tensor-cpu-output' to DataLocation, what will be the difference between 'ml-tensor-cpu-output' and 'cpu'?

This change should not affect users. If the user runs a model with CPU-located Tensors for the graph outputs (fetches or preferredOutputLocation), this PR (sketched after this list):

  1. Automatically moves them to ml-tensors
  2. Runs the model
  3. Copies the data back from the ml-tensors
  4. Returns CPU-located tensors back to the user
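
A rough pseudocode sketch of that flow, assuming the WebNN MLContext.createTensor and MLContext.readTensor APIs; the helper names and descriptor plumbing are hypothetical, not the actual EP implementation:

    // Hypothetical sketch of the automatic MLTensor-output flow (not the real
    // onnxruntime-web code).
    async function runWithAutoMlTensorOutputs(
      mlContext: MLContext,
      outputDescs: { dataType: MLOperandDataType; shape: number[] }[],
      runModel: (outputs: MLTensor[]) => Promise<void>,
    ): Promise<Uint8Array[]> {
      // 1. Move the CPU-requested outputs to MLTensors behind the scenes.
      const mlTensors = await Promise.all(
        outputDescs.map((d) =>
          mlContext.createTensor({ dataType: d.dataType, shape: d.shape, readable: true }),
        ),
      );
      // 2. Run the model with the MLTensors bound as outputs.
      await runModel(mlTensors);
      // 3. Read the data back from the MLTensors (readbacks run in parallel).
      const buffers = await Promise.all(mlTensors.map((t) => mlContext.readTensor(t)));
      // 4. Hand CPU-located data back to the caller.
      return buffers.map((b) => new Uint8Array(b));
    }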

I see. Thanks for the clarification.

Contributor Author

I think these 2 values should cover all cases. 'ml-tensor-cpu-output' seems the same as 'cpu' to me; the only difference is how the data is retrieved. It looks like we can just apply the optimization for 'cpu' with no need to introduce 'ml-tensor-cpu-output'. Please let me know if this makes sense to you.

I have removed 'ml-tensor-cpu-output'. Since we can't move all 'cpu' tensors to 'ml-tensor' (e.g. some could have come from CPU EP fallback nodes) and we still need to tell the C++ code to use ml-tensor, the code is more complicated.

Contributor

Since we can't move all 'cpu' tensors to 'ml-tensor' (e.g. some could have come from CPU EP fallback nodes)

I checked the change in 11d5966 and agree that this is more complicated. I think it's good to bring 'ml-tensor-cpu-output' back, but we'd better add some comments for it.

(Sorry for the back and forth. I'm just trying to figure out the cleanest way to do this change.)

Contributor Author

No problem. I was also not happy about adding a new pseudo-location.

I have reverted the change and added a comment.
