feat: second attempt to support DDS and NonZero op #3388

zewenli98 · 2025-02-11T00:40:49Z

Description

This PR adds a new execution path to support Data Dependent Shape (DDS) and the NonZero op.
Static and dynamic shapes still take the original path; DDS takes the new path, which uses IOutputAllocator.
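
To make "data dependent shape" concrete, here is a minimal pure-Python sketch (not code from this PR; `nonzero_indices` is a hypothetical helper that only mimics what a NonZero op computes on a flat list). Two inputs of the same length produce outputs of different lengths, so the output size cannot be known from the input shape alone, which is why the output buffer has to be sized at run time.

```python
def nonzero_indices(values):
    """Return the indices of the non-zero entries of a flat list (toy NonZero)."""
    return [i for i, v in enumerate(values) if v != 0]


# Both inputs have the same "shape" (length 4) ...
a = [0, 3, 0, 5]
b = [1, 2, 3, 4]

# ... but the output length depends on the values, not on the input shape.
idx_a = nonzero_indices(a)
idx_b = nonzero_indices(b)
print(len(idx_a), len(idx_b))  # 2 4
```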

Fixes #2516

Type of change

New feature (non-breaking change which adds functionality)

Checklist:

My code follows the style guidelines of this project (You can use the linters)
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas and hacks
I have made corresponding changes to the documentation
I have added tests to verify my fix or my feature
New and existing unit tests pass locally with my changes
I have added the relevant labels to my PR so that relevant reviewers are notified

py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py

py/torch_tensorrt/dynamo/_engine_cache.py

py/torch_tensorrt/dynamo/conversion/_ConverterRegistry.py

py/torch_tensorrt/dynamo/conversion/_TRTInterpreter.py

peri044 · 2025-03-03T21:41:19Z

py/torch_tensorrt/dynamo/lowering/passes/remove_num_users_is_0_nodes.py

+        if (
+            node != output_node
+            and len(node.users) == 0
+            and len(node.all_input_nodes) > 0


It's probably better to add an assert checking that it has only one input (print the number in the message if it fails).

I previously reused the code from another lowering pass. It looks like we can directly remove unused ops, right?

TensorRT/py/torch_tensorrt/dynamo/lowering/passes/remove_num_users_is_0_nodes.py

Lines 20 to 28 in eed420a

    if (
        node != output_node
        and len(node.users) == 0
        and len(node.all_input_nodes) > 0
    ):
        gm.graph.erase_node(node)
gm = clean_up_graph_after_modifications(gm)
logger.debug(f"Removed ops that [num_users=0] nodes:\n{gm.graph}")

Do you see any potential issues with that?
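
As a rough illustration of the condition discussed in this thread, the sketch below applies the same three checks to a toy graph. It is not the lowering pass itself: `ToyNode`, `connect`, and `remove_unused_nodes` are hypothetical stand-ins, and the real pass works on a `torch.fx` graph. One thing the toy version makes visible is that a node without inputs (such as a graph input) is never matched, even if nothing uses it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # identity-based equality, so list.remove() targets the exact node
class ToyNode:
    name: str
    inputs: List["ToyNode"] = field(default_factory=list)
    users: List["ToyNode"] = field(default_factory=list)


def connect(src: ToyNode, dst: ToyNode) -> None:
    """Record that dst consumes the output of src."""
    dst.inputs.append(src)
    src.users.append(dst)


def remove_unused_nodes(nodes: List[ToyNode], output_node: ToyNode) -> List[ToyNode]:
    """Drop nodes that are not the output, have no users, and have at least one input."""
    kept = []
    for node in nodes:
        if node is not output_node and len(node.users) == 0 and len(node.inputs) > 0:
            # Detach the dead node from its producers before dropping it.
            for producer in node.inputs:
                producer.users.remove(node)
            continue
        kept.append(node)
    return kept


x, y, dead, out = ToyNode("x"), ToyNode("relu"), ToyNode("unused_op"), ToyNode("output")
connect(x, y)
connect(y, out)
connect(x, dead)  # "unused_op" reads x, but nothing reads "unused_op"

kept = remove_unused_nodes([x, y, dead, out], output_node=out)
print([n.name for n in kept])  # ['x', 'relu', 'output']
```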

peri044 · 2025-03-03T21:47:08Z

py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py

            shape_changed = self.validate_input_shapes(inputs)
            (
                need_cudagraphs_record,
                can_use_pre_allocated_outputs,
                need_cudagraphs_reset,
            ) = self.runtime_states.set_runtime_states(
-                cudagraphs_enabled, self.use_pre_allocated_outputs, shape_changed
+                self.cudagraphs_enabled, self.use_pre_allocated_outputs, shape_changed


Is use_pre_allocated_outputs still valid now that you're adding the OA feature?

I think the OA feature will not affect use_pre_allocated_outputs, because I didn't change the behavior of CG, and use_pre_allocated_outputs has its own context manager as well.

peri044 · 2025-03-03T21:49:03Z

py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py

+                    raise RuntimeError(
+                        "Both CUDA Graphs and OutputAllocator are enabled. Please disable either one."
+                    )
+                if self.use_output_allocator_outputs:


How is use_output_allocator_outputs set? Is it set by the user through the with context manager?

Yes, it is set by the user through the with context manager. If users don't set it, the runtime chooses standard execution or OA according to the converter decorator.
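
The selection described in this reply can be sketched as a small decision function. This is an assumption-laden illustration, not the module's actual code: `select_execution_mode` and its parameter names are hypothetical, and only the error string is taken from the diff above.

```python
def select_execution_mode(
    requires_output_allocator: bool,
    user_enabled_output_allocator: bool,
    cudagraphs_enabled: bool,
) -> str:
    """Pick an execution path from the three flags (toy model of the discussion)."""
    # OA is used if the engine needs it or if the user opted in via the context manager.
    use_output_allocator = requires_output_allocator or user_enabled_output_allocator

    if use_output_allocator and cudagraphs_enabled:
        raise RuntimeError(
            "Both CUDA Graphs and OutputAllocator are enabled. Please disable either one."
        )
    if use_output_allocator:
        return "output_allocator"
    return "cudagraphs" if cudagraphs_enabled else "standard"


print(select_execution_mode(True, False, False))   # output_allocator
print(select_execution_mode(False, True, False))   # output_allocator
print(select_execution_mode(False, False, True))   # cudagraphs
print(select_execution_mode(False, False, False))  # standard
```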

py/torch_tensorrt/runtime/_cudagraphs.py

core/runtime/execute_engine.cpp

narendasan · 2025-03-11T17:36:16Z

core/runtime/execute_engine.cpp

-    if (!cudagraphs_enabled) {
-      // Direct execution uses the caller buffers directly
-      compiled_engine->exec_ctx->enqueueV3(compiled_engine->engine_stream);
+      LOG_DEBUG("Using OutputAllocator in runtime.");


There are two of these messages?

Yes, because there are two cases that use OA:

the engine requires OA;

the engine doesn't require OA, but the user enables OA with the context manager.

core/runtime/runtime.h

narendasan

Overall looks good, just some housekeeping things

py/torch_tensorrt/runtime/_cudagraphs.py

py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py

narendasan · 2025-03-11T17:44:37Z

py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py

@@ -275,6 +293,9 @@ def set_extra_state(self, state: SerializedTorchTensorRTModuleFmt) -> None:
    def set_pre_allocated_outputs(self, enable: bool) -> None:


What is the relationship between this method and using the output allocator?

use_pre_allocated_outputs will be silently ignored for now because it is expected to be used in CG mode. If OA is enabled, no code calls use_pre_allocated_outputs.

If it's not an API intended for users, consider adding a _ prefix to it, like _use_preallocated_outputs

Oh sorry for the confusion. I meant use_pre_allocated_outputs will be ignored for OA mode. It's still useful in CG mode.

For example:
Given a DDS compiled_model, the following two cases will both use OA and ignore the enable_pre_allocated_outputs context manager:

    with torch_tensorrt.runtime.enable_pre_allocated_outputs(compiled_model):
        cg_out = compiled_model(*inputs)

or

    with torch_tensorrt.runtime.enable_output_allocator(compiled_model):
        with torch_tensorrt.runtime.enable_pre_allocated_outputs(compiled_model):
            cg_out = compiled_model(*inputs)

Given a non-DDS compiled_model, the original behavior is kept (i.e., standard execution, with CG on or off).
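
The behavior described above can be modeled with a flag-toggling context manager. The sketch below is only a toy: `ToyModule` and `enable_flag` are hypothetical names and do not reflect how `torch_tensorrt.runtime` implements its context managers. It shows the pre-allocated-outputs flag being ignored when the module needs the output allocator, and the flag being restored on exit.

```python
from contextlib import contextmanager


class ToyModule:
    """Stand-in for a compiled module with the flags discussed in this thread."""

    def __init__(self, requires_output_allocator: bool):
        self.requires_output_allocator = requires_output_allocator
        self.use_output_allocator = False
        self.use_pre_allocated_outputs = False

    def run(self) -> str:
        # OA mode wins; the pre-allocated-outputs flag is silently ignored there.
        if self.requires_output_allocator or self.use_output_allocator:
            return "output_allocator"
        if self.use_pre_allocated_outputs:
            return "standard+pre_allocated_outputs"
        return "standard"


@contextmanager
def enable_flag(module: ToyModule, attr: str):
    """Set a boolean flag for the duration of the block, then restore it."""
    previous = getattr(module, attr)
    setattr(module, attr, True)
    try:
        yield module
    finally:
        setattr(module, attr, previous)


dds_model = ToyModule(requires_output_allocator=True)
non_dds_model = ToyModule(requires_output_allocator=False)

with enable_flag(dds_model, "use_pre_allocated_outputs"):
    print(dds_model.run())      # output_allocator

with enable_flag(non_dds_model, "use_pre_allocated_outputs"):
    print(non_dds_model.run())  # standard+pre_allocated_outputs

print(non_dds_model.run())      # standard
```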

peri044

Minor comments. LGTM

peri044 · 2025-03-11T17:29:07Z

core/runtime/execute_engine.cpp

+        output_profiler_guard =
+            std::make_unique<torch::autograd::profiler::RecordProfile>(compiled_engine->output_profile_path);
+      }
+      if (can_use_pre_allocated_outputs) {


What are your thoughts on whether the pre-allocated outputs feature is needed with this OA? Do they complement each other, or are they unrelated? cc: @keehyuna

use_pre_allocated_outputs will be silently ignored for OA mode in the current implementation, because use_pre_allocated_outputs is expected to be used in CG mode. If OA is enabled, no code calls use_pre_allocated_outputs.

peri044 · 2025-03-11T17:48:43Z

py/torch_tensorrt/runtime/_cudagraphs.py

+                and module.requires_output_allocator
+            ):
+                raise RuntimeError(
+                    "There are converters that require Output Allocator. Please disable CUDA Graphs."


Consider changing this message to "The model contains operations that require a dynamic output allocator at runtime, which is incompatible with CUDA Graph execution. Please disable CUDA Graph mode to ensure successful execution." or something similar, instead of mentioning converters

I combined your suggestion with Naren's:

The model contains submodules that require a dynamic output allocator at runtime, which is incompatible with CUDA Graphs. Please disable CUDA Graphs.
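
A check that raises the agreed message could look roughly like the following. This is a sketch under assumptions: `ToySubmodule` and `check_cudagraphs_compatible` are hypothetical, and only the `requires_output_allocator` attribute name and the message text come from this thread.

```python
class ToySubmodule:
    """Stand-in for a compiled TensorRT submodule."""

    def __init__(self, requires_output_allocator: bool):
        self.requires_output_allocator = requires_output_allocator


def check_cudagraphs_compatible(submodules) -> None:
    """Raise if any submodule needs the output allocator (toy version of the check)."""
    for module in submodules:
        if getattr(module, "requires_output_allocator", False):
            raise RuntimeError(
                "The model contains submodules that require a dynamic output allocator "
                "at runtime, which is incompatible with CUDA Graphs. "
                "Please disable CUDA Graphs."
            )


# Passes silently: no submodule needs the output allocator.
check_cudagraphs_compatible([ToySubmodule(False), ToySubmodule(False)])

# Raises: one submodule has a data-dependent output shape.
try:
    check_cudagraphs_compatible([ToySubmodule(False), ToySubmodule(True)])
except RuntimeError as err:
    print(err)
```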

zewenli98 requested review from narendasan, peri044 and keehyuna February 11, 2025 00:40

zewenli98 self-assigned this Feb 11, 2025

facebook-github-bot added the cla signed label Feb 11, 2025

github-actions bot requested a review from apbose February 11, 2025 00:41

keehyuna reviewed Feb 11, 2025

py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py (outdated)

zewenli98 force-pushed the dds_support2 branch from 98aebfd to 8ef1a87 Compare February 13, 2025 21:35

zewenli98 force-pushed the dds_support2 branch from d55c451 to ad04cf9 Compare February 26, 2025 21:26

github-actions bot added the component: lowering Issues re: The lowering / preprocessing passes label Feb 26, 2025

zewenli98 force-pushed the dds_support2 branch 2 times, most recently from 9a9852f to d718464 Compare February 28, 2025 18:20

zewenli98 mentioned this pull request Mar 1, 2025

Support Data Dependent Shape (DDS) and NonZero op #3364

Closed

narendasan reviewed Mar 3, 2025

py/torch_tensorrt/dynamo/_engine_cache.py

narendasan reviewed Mar 3, 2025

py/torch_tensorrt/dynamo/conversion/_ConverterRegistry.py (outdated)

narendasan reviewed Mar 3, 2025

py/torch_tensorrt/dynamo/conversion/_ConverterRegistry.py (outdated)

narendasan reviewed Mar 3, 2025

py/torch_tensorrt/dynamo/conversion/_TRTInterpreter.py (outdated)

peri044 reviewed Mar 3, 2025

zewenli98 requested review from peri044 and narendasan March 4, 2025 05:34

zewenli98 force-pushed the dds_support2 branch from eed420a to 28b27c5 Compare March 4, 2025 05:38

zewenli98 added 3 commits March 6, 2025 12:11

support dds and nonzero op

26b8647

check output shape to implicitly decide whether network is dds

8666549

fix bug1

7f7385e

zewenli98 added 7 commits March 6, 2025 12:11

disable cuda graph for output allocator mode

aff442c

implement with ctx manager

a7d6b5d

refactor

be4683e

remove sym_size lowering pass

6026ef9

fix bugs from CI

e40a697

resolve comments

107599a

support C++ runtime and add tests

7e1a1ca

zewenli98 force-pushed the dds_support2 branch from 28b27c5 to 7e1a1ca Compare March 11, 2025 00:04

github-actions bot added the component: core Issues re: The core compiler label Mar 11, 2025

minor fixes

7326064