feat: second attempt to support DDS and NonZero op #3388
base: main
Conversation
```python
if (
    node != output_node
    and len(node.users) == 0
    and len(node.all_input_nodes) > 0
```
Probably better to add an assert checking if it has only one input (print the number in the string if it fails).
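A minimal sketch of the suggested check, assuming the `node` variable from the lowering pass quoted below; the exact message wording is hypothetical:

```python
# Hypothetical assert for the lowering pass: fail loudly if a node slated
# for removal has more than one input, and report the actual count.
assert (
    len(node.all_input_nodes) == 1
), f"Expected exactly one input node, got {len(node.all_input_nodes)}"
```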
I previously reused the code from another lowering pass. It looks like we can directly remove unused ops, right?
TensorRT/py/torch_tensorrt/dynamo/lowering/passes/remove_num_users_is_0_nodes.py
Lines 20 to 28 in eed420a
```python
if (
    node != output_node
    and len(node.users) == 0
    and len(node.all_input_nodes) > 0
):
    gm.graph.erase_node(node)

gm = clean_up_graph_after_modifications(gm)
logger.debug(f"Removed ops that [num_users=0] nodes:\n{gm.graph}")
```
Do you think there are any potential issues?
```diff
  shape_changed = self.validate_input_shapes(inputs)
  (
      need_cudagraphs_record,
      can_use_pre_allocated_outputs,
      need_cudagraphs_reset,
  ) = self.runtime_states.set_runtime_states(
-     cudagraphs_enabled, self.use_pre_allocated_outputs, shape_changed
+     self.cudagraphs_enabled, self.use_pre_allocated_outputs, shape_changed
  )
```
Is `use_pre_allocated_outputs` still valid now that you're adding the OA feature?
I think the OA feature will not affect `use_pre_allocated_outputs`, because I didn't change the behavior of CG, and `use_pre_allocated_outputs` has its own context manager as well.
```python
    raise RuntimeError(
        "Both CUDA Graphs and OutputAllocator are enabled. Please disable either one."
    )
if self.use_output_allocator_outputs:
```
How is `use_output_allocator_outputs` set? Is it set by the user via the `with` context manager?
Yes, it is set by the user via the `with` context manager. If users don't set it, the runtime chooses standard execution or OA according to the converter decorator.
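A minimal usage sketch, assuming a compiled module `compiled_model`; the context manager name matches the examples later in this thread:

```python
import torch_tensorrt

# Explicitly opt in to the OutputAllocator path for calls inside the block:
with torch_tensorrt.runtime.enable_output_allocator(compiled_model):
    out = compiled_model(*inputs)

# Without the context manager, the runtime picks standard execution or OA
# based on whether any converter in the module requires an output allocator.
out = compiled_model(*inputs)
```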
core/runtime/execute_engine.cpp
```cpp
if (!cudagraphs_enabled) {
  // Direct execution uses the caller buffers directly
  compiled_engine->exec_ctx->enqueueV3(compiled_engine->engine_stream);
  LOG_DEBUG("Using OutputAllocator in runtime.");
```
There are two of these messages?
Yes, because there are two cases that use OA:
- the engine requires OA;
- the engine doesn't require OA, but the user enables OA via the context manager.
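A hedged sketch of that selection logic in Python-style pseudocode; the flag names follow identifiers quoted in this thread, while `run_output_allocator` and `run_standard_execution` are hypothetical stand-ins for the two execution paths:

```python
# Case 1: a converter in the module requires a dynamic output allocator (DDS).
# Case 2: the user opted in via torch_tensorrt.runtime.enable_output_allocator.
if self.requires_output_allocator or self.use_output_allocator_outputs:
    # OutputAllocator path: TensorRT calls back to allocate outputs whose
    # shapes are only known at enqueue time.
    outputs = run_output_allocator()
else:
    # Standard path (with or without CUDA Graphs) using pre-bound buffers.
    outputs = run_standard_execution()
```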
Overall looks good, just some housekeeping things.
```diff
@@ -275,6 +293,9 @@ def set_extra_state(self, state: SerializedTorchTensorRTModuleFmt) -> None:
     def set_pre_allocated_outputs(self, enable: bool) -> None:
```
What is the relationship between this method and using the output allocator?
`use_pre_allocated_outputs` will be silently ignored for now because it is expected to be used in CG mode. If OA is enabled, no code path calls `use_pre_allocated_outputs`.
If it's not an API intended for users, consider adding a `_` prefix to it, like `_use_preallocated_outputs`.
Oh, sorry for the confusion. I meant that `use_pre_allocated_outputs` will be ignored in OA mode. It's still useful in CG mode.
For example, given a DDS `compiled_model`, the following two cases will use OA and ignore the `enable_pre_allocated_outputs` context manager:
```python
with torch_tensorrt.runtime.enable_pre_allocated_outputs(compiled_model):
    cg_out = compiled_model(*inputs)
```
or
```python
with torch_tensorrt.runtime.enable_output_allocator(compiled_model):
    with torch_tensorrt.runtime.enable_pre_allocated_outputs(compiled_model):
        cg_out = compiled_model(*inputs)
```
Given a NonDDS `compiled_model`, it will keep the original behavior (i.e., standard execution with CG on or off).
Minor comments. LGTM
```cpp
  output_profiler_guard =
      std::make_unique<torch::autograd::profiler::RecordProfile>(compiled_engine->output_profile_path);
}
if (can_use_pre_allocated_outputs) {
```
What are your thoughts on whether the pre-allocated outputs feature is needed with this OA? Do they complement each other, or are they unrelated? cc: @keehyuna
`use_pre_allocated_outputs` will be silently ignored in OA mode in the current implementation, because `use_pre_allocated_outputs` is expected to be used in CG mode. If OA is enabled, no code path calls `use_pre_allocated_outputs`.
```python
    and module.requires_output_allocator
):
    raise RuntimeError(
        "There are converters that require Output Allocator. Please disable CUDA Graphs."
```
Consider changing this message to "The model contains operations that require a dynamic output allocator at runtime, which is incompatible with CUDA Graph execution. Please disable CUDA Graph mode to ensure successful execution." or something else, instead of referring to converters.
I combined your suggestion with Naren's: "The model contains submodules that require a dynamic output allocator at runtime, which is incompatible with CUDA Graphs. Please disable CUDA Graphs."
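A sketch of how the check might read with the updated message; `cudagraphs_enabled` is an assumption standing in for the part of the condition truncated out of the quoted diff above:

```python
# `cudagraphs_enabled` is a hypothetical stand-in for the condition
# elided from the quoted diff; only the second clause is shown there.
if cudagraphs_enabled and module.requires_output_allocator:
    raise RuntimeError(
        "The model contains submodules that require a dynamic output allocator "
        "at runtime, which is incompatible with CUDA Graphs. "
        "Please disable CUDA Graphs."
    )
```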
Description
This PR adds a new path to support Data Dependent Shape (DDS) operations and the NonZero op. Static and dynamic shapes go down the original path; DDS goes down the new path with IOutputAllocator.
Fixes #2516
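A hedged end-to-end sketch of what this enables, assuming the public `torch_tensorrt.compile` API with the dynamo frontend; the model and inputs are illustrative:

```python
import torch
import torch_tensorrt

class NonZeroModel(torch.nn.Module):
    def forward(self, x):
        # torch.nonzero has a data-dependent output shape (DDS): the number
        # of returned rows depends on the values in x, not just its shape.
        return torch.nonzero(x)

model = NonZeroModel().eval().cuda()
inputs = [(torch.rand(8, 8, device="cuda") > 0.5).float()]

# DDS ops are routed through the new IOutputAllocator-based path at runtime;
# static- and dynamic-shape ops keep the original execution path.
compiled_model = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs)
out = compiled_model(*inputs)
```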
Type of change
Checklist: