watchdog: Support ro.hw_timeout_multiplier on Android#6123
watchdog: Support ro.hw_timeout_multiplier on Android#6123tutan-chromium wants to merge 4 commits into
Conversation
Multiply fatal timer durations by the integer multiplier read from the system property `ro.hw_timeout_multiplier` when building for Android. This keeps Perfetto thread and task-runner timeouts safe from triggering false-positive crashes (wdog_reason: 3) during prolonged suspend/resume cycles or on slow emulated environments. I use direct `__system_property_get` to scale the interval `ms` inside `CreateFatalTimer`, ensuring that we avoid pulling heavy dependencies like libbase into Perfetto base. Non-Android platforms are unaffected. Bug: 504382878
🎨 Perfetto UI Builds
|
…tiplier Add comments explaining that this system property is already used for all sorts of Android watchdogs and that bringing Perfetto watchdog behavior closer to Android's standard behavior reduces surprises. Bug: 6122
There was a problem hiding this comment.
Thanks for the PR!
FYI, we are also investigating watchdog issue on android (http://b/496506220). And to summarise the current status:
- Added a trigger that captures a trace when traced_probes disconnects (#5299).
- @primiano made the watchdog log the actual mono/boot/cpu clocks at crash time (#5749), and gave the process ~60s to flush the trace before crashing instead of dying immediately (#5755), so the captured traces are more useful.
- The traces we got back were all kTraceDidntStop (wdog reason 4).
- And on looking into them, the system was just busy immediately after boot (heavy ftrace load) and the timeout we compute was just too short. mono == boot in every case, so most likely suspend wasn't involved.
- So we raised a PR to floor that timeout at 5 min (#6045).
- This was done last week, waiting for more traces to investigate the watchdog:3 issue.
Happy to know if you any other insight.
| uint32_t GetHwTimeoutMultiplier() { | ||
| static uint32_t multiplier = []() { | ||
| char prop_val[PROP_VALUE_MAX] = {0}; | ||
| if (__system_property_get("ro.hw_timeout_multiplier", prop_val) > 0) { |
| static uint32_t multiplier = []() { | ||
| char prop_val[PROP_VALUE_MAX] = {0}; | ||
| if (__system_property_get("ro.hw_timeout_multiplier", prop_val) > 0) { | ||
| int val = atoi(prop_val); |
| char prop_val[PROP_VALUE_MAX] = {0}; | ||
| if (__system_property_get("ro.hw_timeout_multiplier", prop_val) > 0) { | ||
| int val = atoi(prop_val); | ||
| return val > 1 ? static_cast<uint32_t>(val) : 1u; |
There was a problem hiding this comment.
nit/slight worry: AFAICT by code searching, the value of multiplier is usually small. But just to be defensive shall we clamp it ? Because if some erroneous value (should not happen), the below multiplication can overflow as timeout is also high (5min = 300000).
Also just curious on what are the possible values for your device ?
There was a problem hiding this comment.
It's set to 500, for somewhat related reasons.
There was a problem hiding this comment.
As Filip said, we use 500 for our relatively slow HW. And I added uint32_t overflow check.
Use GetAndroidProp() from android_utils.cc instead of __system_property_get(), and StringToUInt32() from string_utils.h instead of atoi(). Additionally, clamp the final timeout multiplication to prevent overflow of uint32. Bug: 504382878
Thank you so much for the information! In our case, we are likely dealing with an edge case on a relatively slow SoC with potential instabilities in suspend/resume state transitions and monotonic clock drifts (see b/460841705). I reviewed the changes you mentioned, and they will definitely be very useful for us as well. Thanks again! |
| #include <sys/timerfd.h> | ||
| #include <unistd.h> | ||
|
|
||
| #if PERFETTO_BUILDFLAG(PERFETTO_OS_ANDROID) |
There was a problem hiding this comment.
nit: pls place it below along with other perfetto imports and after the stdlib ones.
| if (scaled_ms > std::numeric_limits<uint32_t>::max()) { | ||
| ms = std::numeric_limits<uint32_t>::max(); | ||
| } else { | ||
| ms = static_cast<uint32_t>(scaled_ms); | ||
| } |
There was a problem hiding this comment.
nit optional: to redue the number of lines
ms = static_cast<uint32_t>(
std::min<uint64_t>(scaled_ms, std::numeric_limits<uint32_t>::max()));
Multiply fatal timer durations by the integer multiplier read from the system property
ro.hw_timeout_multiplierwhen building for Android.This keeps Perfetto thread and task-runner timeouts safe from triggering false-positive crashes (wdog_reason: 3) during prolonged suspend/resume cycles or on slow emulated environments.
I use direct
__system_property_getto scale the intervalmsinsideCreateFatalTimer, ensuring that we avoid pulling heavy dependencies like libbase into Perfetto base. Non-Android platforms are unaffected.Bug: 6122