
syncd: Add -W flag for platform specific watchdog timeout during initialization#1751

Open
Bojun-Feng wants to merge 10 commits into sonic-net:master from Bojun-Feng:bug/chassis_false_timeout_msg

Conversation


@Bojun-Feng Bojun-Feng commented Jan 23, 2026

Why I did it

Fix sonic-net/sonic-buildimage#24803

On chassis platforms (VOQ, fabric, chassis-packet, DPU), syncd generates false-alarm "WD exceeded" errors during initialization. The default 30-second watchdog timeout is too short for these platforms.

Initially, I proposed statically extending the timeout, but as pointed out in the reviews, a permanent extension effectively neutralizes the watchdog's ability to detect hung operations during steady-state operation. We need a way to tolerate slow initialization without compromising ongoing system monitoring.

How I did it

Added a two-timeout setup: a longer timeout for the initialization phase, which automatically drops back to the standard timeout once init is done.

  1. Added an optional -W flag for the init timeout (similar to the existing -w flag introduced in [syncd] Add the possibility to overwrite the watchdog warning time span via the command line option. #1243).
  2. syncd starts with the -W timeout. Once APPLY_VIEW succeeds, syncd automatically switches the watchdog back to the -w timeout.
  3. For normal switches, if -W isn't provided, it defaults to the value of -w for backward compatibility. However, the default -W value is extended for chassis platforms:
    • voq/chassis-packet/dpu: 150s (5x default)
    • fabric: 300s (10x default)
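The platform-to-default mapping in step 3 can be sketched as a small shell helper. This is a sketch only; the function and variable names are illustrative, not the actual ones in syncd_init_common.sh, while the 30 s base and the 5x/10x multipliers come from the description above.

```shell
#!/bin/sh
# Sketch of the platform-specific -W default described above.
# DEFAULT_WD_TIMEOUT_MS mirrors the existing 30 s -w default; the
# multipliers match the 5x/10x values in the PR description.
DEFAULT_WD_TIMEOUT_MS=30000

init_wd_timeout_ms() {
    # $1: switch_type from DEVICE_METADATA (voq, chassis-packet, dpu, fabric, ...)
    case "$1" in
        voq|chassis-packet|dpu) echo $((DEFAULT_WD_TIMEOUT_MS * 5)) ;;   # 150 s
        fabric)                 echo $((DEFAULT_WD_TIMEOUT_MS * 10)) ;;  # 300 s
        *)                      echo "$DEFAULT_WD_TIMEOUT_MS" ;;         # fall back to -w
    esac
}

echo "voq init timeout:    $(init_wd_timeout_ms voq) ms"
echo "fabric init timeout: $(init_wd_timeout_ms fabric) ms"
```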

Why not the swss way (C++ DB read)

Context: similar timeout adjustments were made in SWSS for chassis platforms by reading from CONFIG_DB:
orchagent/main.cpp#L693-L717

However, syncd is configured via its -w command-line option in this repo. Reading CONFIG_DB in C++ at syncd startup would add a new DB dependency and ordering risk, while the value is already available in this script via sonic-cfggen. This keeps the watchdog setting with the component that consumes it and aligns with existing syncd init patterns.

Also, the nvidia-bluefield platform is already using this approach here to change their timeout: syncd/scripts/syncd_init_common.sh#L552.
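The script-side wiring argued for here could look roughly like the sketch below. sonic-cfggen is the real SONiC config tool, but the exact key path, the stub fallback, and the CMD_ARGS handling are assumptions for illustration, not the PR's literal code.

```shell
#!/bin/sh
# Illustrative sketch: read switch_type in the init script and append -W.
# The sonic-cfggen key path and the CMD_ARGS name are assumptions; the
# stub fallback only exists so the sketch runs outside a SONiC device.
get_switch_type() {
    if command -v sonic-cfggen >/dev/null 2>&1; then
        sonic-cfggen -d -v "DEVICE_METADATA['localhost']['switch_type']"
    else
        echo "voq"   # assumed value for demonstration off-device
    fi
}

CMD_ARGS="-w 30000"
case "$(get_switch_type)" in
    voq|chassis-packet|dpu) CMD_ARGS="$CMD_ARGS -W 150000" ;;
    fabric)                 CMD_ARGS="$CMD_ARGS -W 300000" ;;
esac
echo "$CMD_ARGS"
```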

How to verify it

Deploy on a chassis platform (VOQ/fabric) and verify no false "WD exceeded" errors during syncd initialization.
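The check can also be exercised off-device; this self-contained sketch substitutes a sample log for /var/log/syslog (the NOTICE line is copied from the syslog snippets later in this thread).

```shell
#!/bin/sh
# Self-contained version of the verification step: count false watchdog
# alarms (should be 0) in a stand-in for /var/log/syslog.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
vlab-t2-1-1 NOTICE syncd0#syncd: :- threadFunction: time span 668 ms for 'create:SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000'
EOF
# grep -c prints 0 but exits nonzero when nothing matches; || true keeps
# the count without failing the script.
WD_ERRORS=$(grep -c "WD exceeded" "$LOG" || true)
rm -f "$LOG"
echo "WD exceeded count: $WD_ERRORS"
```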

Description for the changelog

Implement a dual-timeout watchdog in syncd to prevent false alarms during chassis initialization

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


* Add set_watchdog_timeout() to prevent false "WD exceeded" errors
* Set 150s timeout for voq/chassis-packet/dpu (5x default)
* Set 300s timeout for fabric (10x default)
* Preserve platform-specific timeouts (e.g., nvidia-bluefield)

Signed-off-by: Bojun Feng <bojundf@gmail.com>
@Bojun-Feng Bojun-Feng force-pushed the bug/chassis_false_timeout_msg branch from ed65157 to a53d9d5 on January 24, 2026 06:11

* Add -W flag for extended init timeout, -w for normal timeout
* Transition to normal timeout automatically after APPLY_VIEW
* Shell script uses multipliers (5x voq/dpu, 10x fabric) matching SWSS

Signed-off-by: Bojun Feng <bojundf@gmail.com>


Signed-off-by: Bojun Feng <bojundf@gmail.com>
@Bojun-Feng Bojun-Feng force-pushed the bug/chassis_false_timeout_msg branch from f43b673 to c3991f8 on February 16, 2026 14:30


@Bojun-Feng Bojun-Feng changed the title syncd: Extend watchdog timeout for chassis platforms syncd: Add -W flag for platform specific watchdog timeout during initialization Feb 17, 2026
@Bojun-Feng
Author

Bojun-Feng commented Feb 17, 2026

Updated the PR description with the full mechanics of the new -W flag and pushed the dual-timeout solution.

One minor issue I want to call out: we have a duplicated 30s default. It now lives once in CommandLineOptions.cpp and again in syncd_init_common.sh. This was already the case before this PR (although the bash file used hard-coded strings instead of explicit variables).

It annoyed me for a moment, and I considered stripping the default value entirely from the C++ side. We could strictly enforce that -w must be passed, so there is a single source of truth for the default value.

The catch is that I'm not 100% sure whether syncd gets invoked in other test harnesses, unit tests, or obscure paths without the bash wrapper. If it does, removing the C++ default might cause crashes. Given that the underlying issue only concerns warning messages and causes no regressions or crashes, I'm not sure the risk is worth it.

For now I am leaving the duplicate defaults as they are. If syncd is never invoked elsewhere and we can guarantee the -w flag is always passed (or if there is a more elegant way to extract the default timeout), please let me know.

Signed-off-by: Bojun Feng <bojundf@gmail.com>


Signed-off-by: Bojun Feng <bojundf@gmail.com>


@Bojun-Feng
Author

Hi @kcudnik, just checking in — I've addressed all feedback and the CI is green. When you get a chance, could you take another look? Thanks!

@Bojun-Feng
Author

Hi @kcudnik, hope you're doing well. Just a quick nudge on this PR—let me know if there's anything else you'd like me to polish up before approval. Thanks!

kcudnik previously approved these changes Mar 9, 2026
anamehra previously approved these changes Mar 10, 2026
@Bojun-Feng
Author

Bojun-Feng commented Mar 10, 2026

Hi @deepak-singhal0408, the logic is approved and passing CI. I narrowed the scope of the -W flag to only cover the slow paths (applyView/inspectAsic) during initialization.

When you have a moment, would you mind testing it to verify the fix on chassis platforms? Once verified, we can get this merged.

@deepak-singhal0408
Contributor

Hi @deepak-singhal0408, the logic is approved and passing CI. I narrowed the scope of the -W flag to only cover the slow paths (applyView/inspectAsic) during initialization.

When you have a moment, would you mind testing it to verify the fix on chassis platforms? Once verified, we can get this merged.

@Bojun-Feng, I hope you have done some basic validation? We have virtual chassis support available in sonic-mgmt. Did you get a chance to verify it there? If not, please verify. If yes, I think we can get this merged.

@Bojun-Feng
Author

Hi @deepak-singhal0408, thanks for the pointer! I wasn't aware of the virtual chassis support in sonic-mgmt. Will verify it and post an update.

mlok-nokia previously approved these changes Mar 10, 2026
@arlakshm
Contributor

@Bojun-Feng, can you please update the PR with sairedis records from before and after the fix?

@Bojun-Feng
Author

Bojun-Feng commented Mar 16, 2026

Hi @deepak-singhal0408 @arlakshm, here is an update on validation. I used the sonic-mgmt virtual chassis testbed (vms-kvm-t2 topology); details below:

Setup & Logs
  • Testbed: 3-node virtual chassis (2 linecards + 1 supervisor), multi-ASIC VOQ
  • DUT: vlab-t2-1-1 (linecard, 2 ASICs: syncd0 + syncd1)
  • switch_type: voq
  • Image: sonic-vs.img.gz built from latest master

Here are some related log snippets:

sairedis.rec — SAI switch creation timing (ASIC 0)

2026-03-16.14:05:53.801384|a|INIT_VIEW
2026-03-16.14:05:53.802675|A|SAI_STATUS_SUCCESS
2026-03-16.14:05:53.804040|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|...
2026-03-16.14:05:54.400751|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID

syslog — watchdog time span reports

vlab-t2-1-1 NOTICE syncd1#syncd: :- threadFunction: time span 691 ms for 'create:SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000'
vlab-t2-1-1 NOTICE syncd0#syncd: :- threadFunction: time span 668 ms for 'create:SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000'

WD exceeded errors

$ grep -c "WD exceeded" /var/log/syslog
0

On the VS platform, create:SAI_OBJECT_TYPE_SWITCH completes in ~600–700 ms, so the original issue (false watchdog alarms caused by >30 s initialization on chassis platforms) does not reproduce organically there.

As an additional logic check, I can set -W to an extremely small value (for example, ~100 ms) and confirm that we see timeout errors around the intended functions. That would validate the mechanism, although it would still not reproduce the original hardware-specific symptom.

What VS cannot prove is the real slow-init behavior seen on physical chassis hardware. The extended timeout in this PR is scoped only to the initialization slow paths (applyView / inspectAsic), as identified during code review with @kcudnik. I could not directly prove on VS that these are the slow paths on physical chassis hardware.

If someone with access to chassis hardware can run a quick before/after comparison, that would be the most direct validation for the fix.


@Bojun-Feng
Author

The latest push was a rebase onto current master to pick up a new test file unrelated to this issue; there are no code changes to the previously approved fix.

@deepak-singhal0408
Contributor

> As an additional logic check, I can set -W to an extremely small value (for example, ~100 ms) and confirm that we see timeout errors around the intended functions. That would validate the mechanism, although it would still not reproduce the original hardware-specific symptom.

@Bojun-Feng, thanks. Let's verify this and, once done, we can merge the PR.
The logic looks fine and should work in general. Once the fix is integrated, it will be verified on a physical chassis.



Development

Successfully merging this pull request may close these issues.

Bug: [Chassis]: ERR syncd0#syncd: :- threadFunction: time span WD exceeded 30673 ms for create:SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000
