syncd: Add -W flag for platform specific watchdog timeout during initialization#1751
Bojun-Feng wants to merge 10 commits into sonic-net:master
Conversation
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
* Add set_watchdog_timeout() to prevent false "WD exceeded" errors
* Set 150s timeout for voq/chassis-packet/dpu (5x default)
* Set 300s timeout for fabric (10x default)
* Preserve platform-specific timeouts (e.g., nvidia-bluefield)

Signed-off-by: Bojun Feng <bojundf@gmail.com>
Force-pushed ed65157 to a53d9d5
* Add -W flag for extended init timeout, -w for normal timeout
* Transition to normal timeout automatically after APPLY_VIEW
* Shell script uses multipliers (5x voq/dpu, 10x fabric) matching SWSS

Signed-off-by: Bojun Feng <bojundf@gmail.com>
Signed-off-by: Bojun Feng <bojundf@gmail.com>
Force-pushed f43b673 to c3991f8
Updated the PR description with the full mechanics of the new `-W` flag.

One minor issue I want to call out: we have a duplicated 30s default. It now lives once in the shell script and once in the C++ side. It annoyed me for a moment, and I considered just stripping the default value entirely from the C++ side, so we could strictly enforce that the timeout is always passed in explicitly. The catch is I'm not 100% sure syncd is never launched without the init script. For now I am leaving the duplicate defaults as they are.
Signed-off-by: Bojun Feng <bojundf@gmail.com>
Signed-off-by: Bojun Feng <bojundf@gmail.com>
Hi @kcudnik, just checking in — I've addressed all feedback and the CI is green. When you get a chance, could you take another look? Thanks!
Hi @kcudnik, hope you're doing well. Just a quick nudge on this PR—let me know if there's anything else you'd like me to polish up before approval. Thanks!
Hi @deepak-singhal0408, the logic is approved and passing CI. I narrowed the scope of the change. When you have a moment, would you mind testing it to verify the fix on chassis platforms? Once verified, we can get this merged.
@Bojun-Feng, I hope you have done some basic validation? We have virtual chassis support available in sonic-mgmt. Did you get a chance to verify it there? If not, please verify; if yes, I think we can get this merged.
Hi @deepak-singhal0408, thanks for the pointer! I wasn't aware of the virtual chassis support in sonic-mgmt. I'll verify there and report back.
@Bojun-Feng, can you please update the PR with the sairedis record for before and after the fix?
c03ff78
Hi @deepak-singhal0408 @arlakshm, wanted to give an update on validation. I used the virtual chassis setup in sonic-mgmt.

Setup & Logs

Here are some related log snippets:

* sairedis.rec — SAI switch creation timing (ASIC 0)
* syslog — watchdog time span reports
* "WD exceeded" errors

On the VS platform, no "WD exceeded" errors appear either before or after the change. As an additional logic check, I can set the timeout artificially low and confirm the errors do appear, so the watchdog itself still fires.

What VS cannot prove is the real slow-init behavior seen on physical chassis hardware. The extended timeout in this PR is scoped only to the initialization slow paths. If someone with access to chassis hardware can run a quick before/after comparison, that would be the most direct validation for the fix.
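The "set the timeout low" logic check mentioned in this comment can be modeled with a toy sketch. The function name and message format below are assumptions for illustration, not syncd's actual output:

```shell
# Toy model of the logic check: with an artificially low timeout, an
# ordinary operation is flagged, proving the watchdog still fires.
# The message format here is illustrative, not syncd's actual log line.
check_watchdog() {
    elapsed_ms="$1"
    timeout_ms="$2"
    if [ "$elapsed_ms" -gt "$timeout_ms" ]; then
        echo "WD exceeded: ${elapsed_ms}ms > ${timeout_ms}ms"
    else
        echo "ok"
    fi
}

check_watchdog 2000 100    # low timeout: fires
check_watchdog 2000 30000  # default 30s window: ok
```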
Latest push was a rebase on top of current master to pick up a new test file unrelated to this issue. There is no code change to the previously approved fix.
@Bojun-Feng, thanks. Let's verify this and, once done, we can merge the PR.
Why I did it
Fix sonic-net/sonic-buildimage#24803
On chassis platforms (VOQ, fabric, chassis-packet, DPU), syncd generates false-alarm "WD exceeded" errors during initialization. The default 30-second watchdog timeout is too short for these platforms.
Initially, I proposed statically extending the timeout, but as pointed out in the reviews, a permanent extension effectively neutralizes the watchdog's ability to detect hung operations during steady-state operation. We need a way to tolerate slow initialization without compromising ongoing system monitoring.
How I did it
Added a two-timeout setup: a longer timeout covers the initialization phase, and the watchdog automatically drops back to the standard timeout once init is done.
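The two phases can be sketched as a small shell helper. This is hypothetical code, not the actual syncd implementation; only the `-w`/`-W` flag semantics come from this PR:

```shell
# Hypothetical helper illustrating the dual-timeout rule in this PR.
# "init" covers startup until APPLY_VIEW succeeds, after which the
# steady-state (-w) timeout applies again.
watchdog_timeout_for_phase() {
    phase="$1"      # "init" or "steady"
    init_to="$2"    # value passed via -W (may be empty)
    steady_to="$3"  # value passed via -w
    # Backward compatibility: without -W, init falls back to the -w value.
    [ -n "$init_to" ] || init_to="$steady_to"
    if [ "$phase" = "init" ]; then
        echo "$init_to"
    else
        echo "$steady_to"
    fi
}

watchdog_timeout_for_phase init 150 30    # extended window during init
watchdog_timeout_for_phase steady 150 30  # normal window after APPLY_VIEW
```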
* Added a `-W` flag for the init timeout (similar to the existing `-w` flag introduced in [syncd] Add the possibility to overwrite the watchdog warning time span via the command line option. #1243).
* `syncd` starts using the `-W` timeout. Once `APPLY_VIEW` succeeds, `syncd` automatically switches the watchdog back to the `-w` timeout.
* If `-W` isn't provided, it defaults to the value of `-w` for backward compatibility. However, the default `-W` value is extended for chassis platforms: 150s for voq/chassis-packet/dpu (5x the 30s default) and 300s for fabric (10x the default).

Why not the swss way (C++ DB read)
Context: there were timeout adjustments done in SWSS for chassis platforms; they simply read the value from CONFIG_DB:
orchagent/main.cpp#L693-L717
However, syncd is configured via its `-w` command-line option in this repo. Reading CONFIG_DB in C++ at syncd startup would add a new DB dependency and ordering risk, while the value is already available in the init script via sonic-cfggen. This keeps the watchdog setting with the component that consumes it and aligns with existing syncd init patterns.
Also, the nvidia-bluefield platform is already using this approach to change its timeout: syncd/scripts/syncd_init_common.sh#L552.
How to verify it
Deploy on a chassis platform (VOQ/fabric) and verify no false "WD exceeded" errors during syncd initialization.
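As a sanity check of the expected init windows per platform type, the defaults described in this PR (5x for voq/chassis-packet/dpu, 10x for fabric) can be sketched as follows. The function name and structure are assumptions; the real logic lives in syncd_init_common.sh:

```shell
# Sketch of the platform multipliers from this PR: 5x the 30s default for
# voq/chassis-packet/dpu, 10x for fabric, unchanged otherwise.
DEFAULT_WD_TIMEOUT=30

init_timeout_for_platform() {
    case "$1" in
        voq|chassis-packet|dpu) echo $((DEFAULT_WD_TIMEOUT * 5)) ;;   # 150s
        fabric)                 echo $((DEFAULT_WD_TIMEOUT * 10)) ;;  # 300s
        *)                      echo "$DEFAULT_WD_TIMEOUT" ;;         # 30s
    esac
}

init_timeout_for_platform voq     # prints 150
init_timeout_for_platform fabric  # prints 300
```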
Description for the changelog
Implement dual-timeout watchdog in syncd to prevent false alarms during chassis initialization