Skip to content

Adjust select timeouts during port update handling to allow for faster transceiver DOM polling#758

Open
aditya-nexthop wants to merge 10 commits intosonic-net:masterfrom
nexthop-ai:aditya-shorten-wait-timeout
Open

Adjust select timeouts during port update handling to allow for faster transceiver DOM polling#758
aditya-nexthop wants to merge 10 commits intosonic-net:masterfrom
nexthop-ai:aditya-shorten-wait-timeout

Conversation

@aditya-nexthop
Copy link
Contributor

@aditya-nexthop aditya-nexthop commented Feb 21, 2026

Description

This PR optimizes the DOM (Digital Optical Monitoring) polling loop in xcvrd by improving port update event handling and reducing unnecessary wait times. The changes include:

  1. Refactored port update handling - Extracted port update event processing into a dedicated check_port_update() method for better code organization and reusability
  2. Optimized timeout strategy - Introduced two-tier timeout mechanism:
    • 1000ms timeout when waiting for port updates before DOM polling begins and after when it completes until the next polling cycle
    • 100ms timeout during DOM polling to minimize delays while still handling port events
  3. Improved loop structure - Restructured the main loop to handle port updates before entering the DOM polling phase, preventing unnecessary blocking
  4. Added comprehensive unit tests - Created tests covering various scenarios including multiple ports, timing edge cases, and stop event handling

fixes #759 and builds on #757

Motivation and Context

Before the change, the DOM monitoring loop would wait up to 1 second (SELECT_TIMEOUT_MSECS) for port update events during each iteration of the physical port loop. This caused significant delays in DOM data collection, especially on systems with many ports.

Problem: With the 1-second timeout being called for every physical port, the DOM polling could take an excessive amount of time to complete, delaying DOM polling updates.

Solution: By separating port update handling from DOM polling and using a shorter 100ms timeout during the polling phase, the loop can complete much faster while still being responsive to port change events. The 1-second timeout is only used when explicitly waiting for port updates before starting the next DOM polling cycle.

How Has This Been Tested?

  1. Unit Tests Added - Comprehensive test coverage for the new check_port_update() method including:

    • Scenario with no link change affected ports
    • Link change affected port with past timestamp (should trigger immediate update)
    • Link change affected port with future timestamp (should defer update)
    • Multiple ports with mixed ready/not-ready states
    • Stop event handling during processing
  2. CPU usage profiling: Measured for 10 minutes after restarting xcvrd on a switch fully populated with optical transceivers.

image - After the change image

The CPU usage is slightly higher during active polling as it spends less time waiting (100ms) between interfaces and then the loop spends time waiting for port change updates in 1s chunks.

  1. Measuring actual DOM update times
  • Before change: We are not able to poll every 60 seconds (takes 90 sec+) on a switch fully populated with optical transceivers
$ while true; do sonic-db-dump -n STATE_DB -y -k "TRANSCEIVER_DOM_SENSOR|Ethernet112" | grep last_update_time; sleep 10; done | uniq
      "last_update_time": "Tue Feb 24 19:25:57 2026",
      "last_update_time": "Tue Feb 24 19:27:32 2026",
      "last_update_time": "Tue Feb 24 19:29:06 2026",
      "last_update_time": "Tue Feb 24 19:30:40 2026",
      "last_update_time": "Tue Feb 24 19:32:13 2026",
      "last_update_time": "Tue Feb 24 19:33:47 2026",
  • After change: Updates happen every 60 seconds for a specified interface
$ while true; do sonic-db-dump -n STATE_DB -y -k "TRANSCEIVER_DOM_SENSOR|Ethernet112" | grep last_update_time; sleep 10; done | uniq
      "last_update_time": "Tue Feb 24 19:17:09 2026",
      "last_update_time": "Tue Feb 24 19:18:10 2026",
      "last_update_time": "Tue Feb 24 19:19:10 2026",
      "last_update_time": "Tue Feb 24 19:20:10 2026",
      "last_update_time": "Tue Feb 24 19:21:11 2026",
      "last_update_time": "Tue Feb 24 19:22:11 2026",

Additional Information (Optional)

Key Technical Changes:

  • New constants: PORT_UPDATE_EVENT_SELECT_TIMEOUT_MSECS (1000ms) and PORT_UPDATE_EVENT_SELECT_TIMEOUT_FAST_MSECS (100ms)
  • Modified PortChangeObserver.handle_port_update_event() to accept a configurable timeout parameter
  • The periodic update interval calculation now uses dom_loop_start_time to maintain consistent intervals regardless of loop execution time

Backward Compatibility: This change is fully backward compatible and does not affect the external API or configuration.

Handle port updates before starting the DOM polling loop with a 1 sec timeout.
When looking for port updates during DOM polling, reduce timeout to 100 msec
This allows DOM polling to complete in a reasonable amount of time.

Signed-off-by: aditya-nexthop <aditya@nexthop.ai>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: aditya-nexthop <aditya@nexthop.ai>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@aditya-nexthop aditya-nexthop changed the title WIP: Move port update handling to its own function. Adjust select timeouts during port update handling to allow for faster transceiver DOM polling Feb 24, 2026
@aditya-nexthop aditya-nexthop marked this pull request as ready for review February 24, 2026 20:12
@prgeor prgeor requested review from Copilot and mihirpat1 March 10, 2026 22:36
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves xcvrd DOM polling responsiveness by reducing time spent blocking on port-update events, so full-port DOM polling can complete closer to the intended periodic interval.

Changes:

  • Add a configurable select timeout to PortChangeObserver.handle_port_update_event() so callers can use shorter waits during DOM polling.
  • Refactor DOM loop port-update handling into a reusable check_port_update() helper and introduce a two-tier timeout strategy (slow/fast).
  • Add unit tests for the new check_port_update() behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
sonic-xcvrd/xcvrd/xcvrd_utilities/port_event_helper.py Makes port-update event selection timeout configurable (enables fast/slow modes).
sonic-xcvrd/xcvrd/dom/dom_mgr.py Restructures the DOM update loop and introduces two timeout constants plus check_port_update().
sonic-xcvrd/tests/test_xcvrd.py Adds unit coverage for DomInfoUpdateTask.check_port_update().

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

aditya-nexthop and others added 2 commits March 10, 2026 23:29
Signed-off-by: aditya-nexthop <aditya@nexthop.ai>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: aditya-nexthop <aditya@nexthop.ai>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@aditya-nexthop aditya-nexthop force-pushed the aditya-shorten-wait-timeout branch from 5358dbb to e5c5a2f Compare March 10, 2026 23:30
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mihirpat1
Copy link
Contributor

fixes sonic-net/sonic-buildimage#22625

Signed-off-by: aditya-nexthop <aditya@nexthop.ai>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: aditya-nexthop <aditya@nexthop.ai>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@aditya-nexthop aditya-nexthop requested a review from prgeor March 11, 2026 22:03
Signed-off-by: aditya-nexthop <aditya@nexthop.ai>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: aditya-nexthop <aditya@nexthop.ai>
@aditya-nexthop aditya-nexthop force-pushed the aditya-shorten-wait-timeout branch from ba1e526 to 673d7c7 Compare March 12, 2026 22:02
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

…op start time.

Signed-off-by: aditya-nexthop <aditya@nexthop.ai>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DomInfoUpdateTask polling can take longer than DOM_INFO_UPDATE_PERIOD_SECS due to 1 sec wait after every iteration

5 participants