-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Improve Windows ETW callback registration and fix issues #24877
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Avoid multiple global callback multiple subscription that logs all the sessions. Only the first session to come up registeres it, the last session to go deregisters. Remove LazyInitializaton() as useless. The intent seems to be to retry on failure, but it never retries since it checks for NotInitialized, and even on failure it sets status to Initialized. Re-work callback registration. Right now it stores pointers to callbacks, but a global one repeatedly getting re-created and that creates a race condition. Store the callbacks in the map by a unique key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors ETW callback registration on Windows to ensure proper initialization, single global callback registration, per-session callback keys, and safe deregistration on constructor failure.
- Initialize all
EtwRegistrationManager
fields and replace raw pointers with keyed callbacks stored in maps - Register the global ETW provider callback once and per-session callbacks with unique string keys
- Add an RAII guard (
AutoETWDegistrar
) to clean up registrations if session construction fails
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
File | Description |
---|---|
core/session/provider_bridge_ort.cc | Updated provider interface to accept a callback key and moved callback ownership |
core/session/inference_session.h / inference_session.cc | Added string keys, auto-deregistration guard, and global vs per-session callback logic |
core/providers/qnn/qnn_telemetry.h / qnn_telemetry.cc | Refactored QnnTelemetry to use unordered_map<string, callback> and keyed registration |
core/providers/qnn/qnn_execution_provider.h / qnn_execution_provider.cc | Switched QNN EP to generate and use a callback key for ETW registration |
core/platform/windows/telemetry.h / telemetry.cc | Refactored WindowsTelemetry to use keyed callbacks in an unordered_map |
core/platform/windows/logging/etw_sink.h / etw_sink.cc | Initialized new fields, removed old vector, and replaced callbacks with map |
core/framework/execution_providers.h / execution_providers.cc | Introduced string-keyed ETW callback registration in ExecutionProviders |
Comments suppressed due to low confidence (2)
onnxruntime/core/session/inference_session.h:143
- The static member
callback_ML_ORT_provider_
is no longer used after moving to keyed callbacks; consider removing it to reduce dead code.
static onnxruntime::WindowsTelemetry::EtwInternalCallback callback_ML_ORT_provider_;
onnxruntime/core/session/inference_session.h:144
- [nitpick] Naming is inconsistent with other components that use
callback_etw_key_
; consider renaming this member tocallback_ETWSink_key_
for consistency.
std::string callback_ETWSink_provider_key_; // Session Start Stop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Dmitri!
I know you already have done this, but please also detail what testing was done and how.
Please include comments about looking at the resulting .etl file and verifying the rundown, dynamic LOGS(), and QNN logs (in the case of QNN EP) are in the .etl file
Just good for future reference for anyone modifying this code.
Testing was performed using a usual build test for CPU and QNN providers. |
The WindowsTelemetry obj usually is a global var. Refers to: onnxruntime/core/platform/windows/telemetry.cc:93 in 6ab0185. [](commit_id = 6ab0185, deletion_comment = False) |
- Fixes a possibility of a deadlock between active_sessions_mutex_ and callbacks_mutex_ due to reversed order of acquisition. - Removes access to a global mutex from instance methods, change to atomics. - Moves ETW members to the end of InferenceSession class.
Are there any tests we can add to prevent regression from the new behavior? |
I used the test package from @ivberg, but it requires an extensive setup with WAP and mostly attempts to create many concurrent sessions. I have not reproduced any bugs that are fixed in this PR. |
// Register callback for ETW capture state (rundown) | ||
etw_callback_key_ = "ExecutionProviders_rundown_"; | ||
etw_callback_key_.append(std::to_string(reinterpret_cast<uintptr_t>(this))); | ||
WindowsTelemetry::RegisterInternalCallback( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So each session will register a callback.
And each callback will be invoked for each session.
This is O(N^2)!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each instance of ExecutionProviders would register a callback, this has not changed.
It would log all EPs that were added to the instance.
In theory, each instance can have different set of EPs.
Question is, what should this be? @ivberg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be that each EP logs it's options. Taking a trace with captureState rundown is an unusual diagnostic event so happens quite rarely
Is information on how to do that captured somewhere so the next developer that makes changes can do the same manual steps? |
Perhaps it should use a different provider? One provider can only be registered once per process. Refers to: onnxruntime/core/providers/qnn/qnn_telemetry.cc:47 in e62225b. [](commit_id = e62225b, deletion_comment = False) |
Description
EtwRegistrationManager
. Make sure all fields initialized by a constructorML_Ort_Provider_Etw_Callback
once for all the sessions. The first session registers, the last one to go away removes the callback to Log all sessions. For this we make callbacks ref-counted inside the map they are stored in. This is done to prevent a deadlock whereactive_sessions_mutex_
andcallback_mutex_
are acquired from different threads in a different order.InferenceSession
constructor does not finish.Motivation and Context
This PR is inspired by #24773.
Current code exhibits multiple issues.
EtwRegistrationManager
constructor does not initialize all of the fields including theInitializationStatus
.active_sessions_lock_
andcallback
lock are not acquired/released in the same order by different threads which is a classic deadlock scenario.