
[WebNN EP] Automatically use ml-tensor for outputs #24282

Open
egalli wants to merge 5 commits into main

Conversation

@egalli (Contributor) commented Apr 2, 2025

Description

This patch moves outputs to MLTensor-backed Tensors when doing so would improve performance.

Motivation and Context

We currently perform an extra copy on output tensors located on the CPU when using the WebNN EP (MLTensor -(copy)-> wasm heap -(copy)-> JS). This patch removes that copy by moving the readback to JS instead of wasm. As an extra benefit, we can also start the readbacks and wait for them in parallel.

This change is similar to #23073

@snnn closed this Apr 3, 2025
@snnn reopened this Apr 3, 2025
@guschmue added the ep:WebNN WebNN execution provider label Apr 3, 2025
@snnn (Member) commented Apr 3, 2025

/azp run all


No pipelines are associated with this pull request.

@snnn (Member) commented Apr 3, 2025

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 5 pipeline(s).

Comment on lines 179 to 180
| 'ml-tensor'
| 'ml-tensor-cpu-output';
Contributor

Why can't 'ml-tensor' be used for outputs?

Contributor Author

It can be used for outputs.

The issue this PR is trying to solve occurs when the developer asks for preferredOutputLocation: "cpu" (or leaves it undefined, since we default to "cpu").

Contributor

I am sorry, but I may have missed some context. I still don't understand why we need 'ml-tensor-cpu-output'.

Contributor Author

If the developer creates an InferenceSession with preferredOutputLocation: "cpu" (or leaves it undefined), the output tensor readback from the device has to happen in the wasm code. All wasm code is restricted to the wasm heap, which means we need to copy the content of the underlying MLTensor into the wasm heap (webnn::DataTransfer::CopyTensor in C++) and then copy it out of the wasm heap into a Uint8Array:

    // Second copy: wasm heap -> JS-owned Uint8Array.
    new Uint8Array(data.buffer, data.byteOffset, data.byteLength).set(
      wasm.HEAPU8.subarray(dataOffset, dataOffset + data.byteLength),
    );

It is more efficient to read the MLTensor content back directly into a JS Uint8Array (1 copy vs. 2 copies).

I added 'ml-tensor-cpu-output' as a way to use ml-tensor as an output behind the scenes (bypassing the DataTransfer::CopyTensor code) while still returning a CPU-located Tensor to the developer.
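
For illustration, a minimal sketch of the direct readback path, assuming the WebNN MLContext.readTensor(tensor, outputData) overload and that WebNN type declarations are available; the function and variable names are hypothetical, not the actual onnxruntime-web internals:

    // Hypothetical sketch: one copy, device -> JS typed array, bypassing the
    // wasm heap (and webnn::DataTransfer::CopyTensor) entirely.
    async function readTensorToCpu(
      mlContext: MLContext,
      tensor: MLTensor,
      byteLength: number,
    ): Promise<Uint8Array> {
      const out = new Uint8Array(byteLength);
      // readTensor(tensor, outputData) writes the tensor contents directly
      // into the provided buffer.
      await mlContext.readTensor(tensor, out);
      return out;
    }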

Contributor

I think it can be simplified to always bypass the wasm copy. For example, WebGPU always uploads initializers to the GPU without reading them into WASM memory if the initializer data is in external data.

Contributor

In my understanding, this PR tries to optimize the process of reading CPU data from an MLTensor as a model's output. While I am totally OK with the optimization itself, I have a concern about adding a new value to SupportedTensorDataLocationForInputOutput.

The session option preferredOutputLocation (of type DataLocation) defines where the user prefers the output to be located. Specifically:

  • setting it to cpu: we expect the output tensor's data to be a JavaScript TypedArray.
  • setting it to ml-tensor: we expect the output tensor's data to be preserved as an instance of MLTensor.

I think these 2 values should cover all cases. 'ml-tensor-cpu-output' seems the same as 'cpu' to me; the only difference is how the data is retrieved. It looks like we can just apply the optimization for 'cpu' with no need to introduce 'ml-tensor-cpu-output'. Please let me know if this makes sense to you.
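
For illustration, a minimal usage sketch of those two values, assuming onnxruntime-web's InferenceSession options and the 'webnn' execution provider name; the model path is a placeholder:

    import * as ort from 'onnxruntime-web';

    // 'cpu' (the default): output.data is a TypedArray on the JS side.
    const cpuSession = await ort.InferenceSession.create('./model.onnx', {
      executionProviders: ['webnn'],
      preferredOutputLocation: 'cpu',
    });

    // 'ml-tensor': the output stays on the device as an MLTensor until the
    // caller reads it back explicitly.
    const mlTensorSession = await ort.InferenceSession.create('./model.onnx', {
      executionProviders: ['webnn'],
      preferredOutputLocation: 'ml-tensor',
    });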

Contributor

One thing that confused me is that the API is not changed, but a new value is added to SupportedTensorDataLocationForInputOutput (DataLocation does not change). From common/lib/tensor.ts:

  /**
   * represent where the tensor data is stored
   */
  export type DataLocation = 'none' | 'cpu' | 'cpu-pinned' | 'texture' | 'gpu-buffer' | 'ml-tensor';

How do you expect this change to reach users? If you want to add 'ml-tensor-cpu-output' to DataLocation, what will be the difference between 'ml-tensor-cpu-output' and 'cpu'?

This change should not affect users. If the user runs a model with CPU-located Tensors for the graph outputs (fetches or preferredOutputLocation), this PR (sketched after this list):

  1. Automatically moves them to ml-tensors
  2. Runs the model
  3. Copies the data back from the ml-tensors
  4. Returns CPU-located tensors back to the user
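
A rough pseudocode sketch of that flow, assuming the WebNN MLContext.createTensor and MLContext.readTensor APIs; the helper names and descriptor plumbing are hypothetical, not the actual EP implementation:

    // Hypothetical sketch of the automatic MLTensor-output flow (not the real
    // onnxruntime-web code).
    async function runWithAutoMlTensorOutputs(
      mlContext: MLContext,
      outputDescs: { dataType: MLOperandDataType; shape: number[] }[],
      runModel: (outputs: MLTensor[]) => Promise<void>,
    ): Promise<Uint8Array[]> {
      // 1. Move the CPU-requested outputs to MLTensors behind the scenes.
      const mlTensors = await Promise.all(
        outputDescs.map((d) =>
          mlContext.createTensor({ dataType: d.dataType, shape: d.shape, readable: true }),
        ),
      );
      // 2. Run the model with the MLTensors bound as outputs.
      await runModel(mlTensors);
      // 3. Read the data back from the MLTensors (readbacks run in parallel).
      const buffers = await Promise.all(mlTensors.map((t) => mlContext.readTensor(t)));
      // 4. Hand CPU-located data back to the caller.
      return buffers.map((b) => new Uint8Array(b));
    }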

I see. Thanks for the clarification.

Contributor Author

I think these 2 values should cover all cases. 'ml-tensor-cpu-output' seems the same as 'cpu' to me; the only difference is how the data is retrieved. It looks like we can just apply the optimization for 'cpu' with no need to introduce 'ml-tensor-cpu-output'. Please let me know if this makes sense to you.

I have removed 'ml-tensor-cpu-output'. Since we can't move all 'cpu' tensors to 'ml-tensor' (e.g. some could have come from CPU EP fallback nodes) and we still need to tell the C++ code to use ml-tensor, the code is more complicated.

Contributor

Since we can't move all 'cpu' tensors to 'ml-tensor' (e.g. some could have come from CPU EP fallback nodes)

I checked the change in 11d5966 and agree that this is more complicated. I think it's good to bring 'ml-tensor-cpu-output' back, but we'd better add some comments for it.

(Sorry for the back and forth. I'm just trying to figure out the cleanest way to do this change.)

Contributor Author

No problem. I was also not happy about adding a new pseudo-location.

I have reverted the change and added a comment.
