Flaky system test: Tedge Agent.Workflows.Custom Operation.Trigger Agent Restart #3415

Open
albinsuresh opened this issue Feb 24, 2025 · 5 comments
Assignees: albinsuresh
Labels: bug, flaky test, freq:high

Comments

albinsuresh (Contributor) commented Feb 24, 2025

Describe the bug

Flaky test: Tedge Agent.Workflows.Custom Operation.Trigger Agent Restart

------------------------------------------------------------------------------
Trigger Agent Restart                                                 | FAIL |
Matching messages on topic 'te/device/main///cmd/restart-tedge-agent/robot-1' is less than minimum.
wanted: 1
got: 0

messages:
[]
------------------------------------------------------------------------------

The workflow log indicates the following failure:

=> moving to restart-tedge-agent @ failed

----------------------[ restart-tedge-agent @ failed | time=2025-02-24T07:45:26.240456777Z ]----------------------

State:    {"@version":"90a7644d2d8b71a2ac1d4c408468ed63c16ace780a3321dadf319e3c337de869","logPath":"/var/log/tedge/agent/workflow-restart-tedge-agent-robot-1.log","reason":"tedge-agent not restarted","resumed_at":"1740383125.0","status":"failed","tedge-agent-pid":"MainPID=72"}

Action:   wait for the requester to finalize the command

The following ChannelError(SendError(SendError { kind: Disconnected })) error is captured in the tedge-agent log as well:

Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.357090472Z  INFO tedge_agent::operation_workflows::actor: Processing restart-tedge-agent operation init step with script: /etc/tedge/operations/tedge-agent-pid.sh probe
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.517532782Z  INFO tedge_api::workflow::log::command_log: => moving to restart-tedge-agent @ restart
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.51999118Z  INFO tedge_agent::operation_workflows::actor: Moving restart-tedge-agent operation to restarting state before running: sudo systemctl restart tedge-agent
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.523042905Z  INFO tedge_api::workflow::log::command_log: => moving to restart-tedge-agent @ restarting
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.579858403Z  INFO tedge_agent::operation_workflows::actor: Processing restart-tedge-agent operation restarting step with script: /etc/tedge/operations/tedge-agent-pid.sh test 'MainPID=72'
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.580175285Z  INFO Runtime: Shutting down
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.580407359Z  INFO tedge_agent::http_server::actor: Shutdown
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.58054166Z  INFO Runtime: Actor has finished: Downloader-7
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.580551468Z  INFO Runtime: Actor has finished: MQTT-5
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.580557249Z  INFO Runtime: Actor has finished: LogManager-10
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.58056276Z  INFO Runtime: Actor has finished: FsWatcher-6
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.580567629Z  INFO Runtime: Actor has finished: Uploader-8
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.580573079Z  INFO Runtime: Actor has finished: FileCacheActor-3
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.580578309Z  INFO Runtime: Actor has finished: TedgetoTeConverter-0
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.580583148Z  INFO Runtime: Actor has finished: HttpServer-1
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.580587977Z  INFO Runtime: Actor has finished: Signal-Handler-4
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.580592916Z  INFO Runtime: Actor has finished: RestartManagerActor-11
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.580598075Z  INFO Runtime: Actor has finished: HealthMonitorActor-15
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.580603336Z  INFO Runtime: Actor has finished: ConfigManager-9
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.580608505Z  INFO Runtime: Actor has finished: SoftwareManagerActor-12
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.580614186Z  INFO Runtime: Actor has finished: EntityStoreServer-2
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.58066448Z  INFO Runtime: Actor has finished: Script-13
Feb 24 07:45:15 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:15.587403377Z  INFO mqtt_channel::connection: MQTT connection closed
Feb 24 07:45:25 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:25.806827112Z  INFO tedge_api::workflow::log::command_log: => moving to restart-tedge-agent @ failed
Feb 24 07:45:25 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:25.808244662Z ERROR Runtime: Actor WorkflowActor-14 has finished unsuccessfully: ChannelError(SendError(SendError { kind: Disconnected }))
Feb 24 07:45:25 86e3c8812e23 tedge-agent[72]: 2025-02-24T07:45:25.808268286Z  INFO Runtime: All actors have finished

Failed Instances

albinsuresh added the bug label on Feb 24, 2025
Bravo555 added the flaky test label on Feb 25, 2025
reubenmiller (Contributor) commented:

This will need looking into. It looks to be a legitimate failure where the restart of the agent was not successful. The logs do show that the agent started shutting down, but the validation step, which checks whether the PID has changed, ran while the tedge-agent was still shutting down.

reubenmiller removed the flaky test label on Feb 25, 2025
Bravo555 added the flaky test label on Feb 26, 2025
albinsuresh self-assigned this on Feb 27, 2025
albinsuresh (Contributor, Author) commented Mar 4, 2025

A shutdown that took nearly 10 seconds is the cause of this failure. Here is the relevant excerpt from the `restart-tedge-agent` workflow:

[restart]
background_script = "sudo systemctl restart tedge-agent"
on_exec = "restarting"

[restarting]
script = "/etc/tedge/operations/tedge-agent-pid.sh test ${.payload.tedge-agent-pid}"
timeout_second = 10
on_success = "tedge-agent-restarted"
on_kill = { status = "failed", reason = "tedge-agent not restarted" }

The restarting step expects the restart to complete within 10 seconds. But in this failed instance, the tedge-agent-pid.sh test script of the restarting step was launched at 11:02:15, just as the shutdown was initiated, and then kept waiting:

Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.746644761Z  INFO tedge_agent::operation_workflows::actor: Processing restart-tedge-agent operation restarting step with script: /etc/tedge/operations/tedge-agent-pid.sh test 'MainPID=65'
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750150648Z  INFO Runtime: Shutting down
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750514105Z  INFO Runtime: Actor has finished: MQTT-5
...
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.757794078Z  INFO mqtt_channel::connection: MQTT connection closed

But the restart only completed 10 seconds later, at 11:02:25, because the WorkflowActor did not shut down immediately:

Mar 03 11:02:25 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:25.98133186Z  INFO tedge_api::workflow::log::command_log: => moving to restart-tedge-agent @ failed
Mar 03 11:02:25 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:25.983760921Z ERROR Runtime: Actor WorkflowActor-14 has finished unsuccessfully: ChannelError(SendError(SendError { kind: Disconnected }))
Mar 03 11:02:25 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:25.983778414Z  INFO Runtime: All actors have finished

By then the operation had already been marked as failed, since the 10-second timeout configured for the restarting step had expired and its on_kill handler moved the command to the failed state.

albinsuresh (Contributor, Author) commented Mar 4, 2025

While we still need to investigate why the WorkflowActor did not shut down in time, we can temporarily fix the flakiness by using a timeout_second value that is higher than the runtime's maximum shutdown timeout of 60 seconds, after which any hung actors are force-killed.
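
A rough sketch of that interim mitigation (the 90-second value below is only an illustration; any value above the runtime's 60-second force-kill window would do):

[restarting]
script = "/etc/tedge/operations/tedge-agent-pid.sh test ${.payload.tedge-agent-pid}"
# Illustrative value: larger than the runtime's 60-second force-kill window,
# so a slow WorkflowActor shutdown can no longer trip the step's own timeout.
timeout_second = 90
on_success = "tedge-agent-restarted"
on_kill = { status = "failed", reason = "tedge-agent not restarted" }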

albinsuresh (Contributor, Author) commented Mar 5, 2025

While investigating why the WorkflowActor took 10 seconds to shut down, it was observed that the workflow, after transitioning from the restart to the restarting state, had already started executing the script associated with that restarting state. Since the script was configured to time out after 10 seconds, the actor had to wait that long before it could process the shutdown request. Here is the agent log excerpt from the failed instance (with the relevant lines highlighted with ###):

Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.494610375Z  INFO tedge_agent::operation_workflows::actor: Processing restart-tedge-agent operation init step with script: /etc/tedge/operations/tedge-agent-pid.sh probe
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.596132337Z  INFO tedge_api::workflow::log::command_log: => moving to restart-tedge-agent @ restart
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.598178076Z  INFO tedge_agent::operation_workflows::actor: Moving restart-tedge-agent operation to restarting state before running: sudo systemctl restart tedge-agent
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.600628428Z  INFO tedge_api::workflow::log::command_log: => moving to restart-tedge-agent @ restarting
###
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.746644761Z  INFO tedge_agent::operation_workflows::actor: Processing restart-tedge-agent operation restarting step with script: /etc/tedge/operations/tedge-agent-pid.sh test 'MainPID=65'
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750150648Z  INFO Runtime: Shutting down
###
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750256876Z  INFO tedge_agent::http_server::actor: Shutdown
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750493256Z  INFO Runtime: Actor has finished: LogManager-10
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750503204Z  INFO Runtime: Actor has finished: FileCacheActor-3
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750508755Z  INFO Runtime: Actor has finished: HttpServer-1
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750514105Z  INFO Runtime: Actor has finished: MQTT-5
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750519525Z  INFO Runtime: Actor has finished: FsWatcher-6
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750525115Z  INFO Runtime: Actor has finished: RestartManagerActor-11
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750530565Z  INFO Runtime: Actor has finished: HealthMonitorActor-15
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750535695Z  INFO Runtime: Actor has finished: Script-13
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750540814Z  INFO Runtime: Actor has finished: Uploader-8
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750545924Z  INFO Runtime: Actor has finished: SoftwareManagerActor-12
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750551074Z  INFO Runtime: Actor has finished: Downloader-7
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750555882Z  INFO Runtime: Actor has finished: EntityStoreServer-2
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750560862Z  INFO Runtime: Actor has finished: ConfigManager-9
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750566462Z  INFO Runtime: Actor has finished: Signal-Handler-4
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.750582532Z  INFO Runtime: Actor has finished: TedgetoTeConverter-0
Mar 03 11:02:15 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:15.757794078Z  INFO mqtt_channel::connection: MQTT connection closed
Mar 03 11:02:25 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:25.98133186Z  INFO tedge_api::workflow::log::command_log: => moving to restart-tedge-agent @ failed
Mar 03 11:02:25 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:25.983760921Z ERROR Runtime: Actor WorkflowActor-14 has finished unsuccessfully: ChannelError(SendError(SendError { kind: Disconnected }))
Mar 03 11:02:25 8fdc23911090 tedge-agent[65]: 2025-03-03T11:02:25.983778414Z  INFO Runtime: All actors have finished

By contrast, while analysing the logs of successful test runs, it was observed that the agent shut down immediately after transitioning to the restarting state, before the execution of the script in that step had even started:

Mar 04 07:47:02 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:02.977896882Z  INFO tedge_agent::operation_workflows::actor: Processing restart-tedge-agent operation init step with script: /etc/tedge/operations/tedge-agent-pid.sh probe
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.030997469Z  INFO tedge_api::workflow::log::command_log: => moving to restart-tedge-agent @ restart
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.033825629Z  INFO tedge_agent::operation_workflows::actor: Moving restart-tedge-agent operation to restarting state before running: sudo systemctl restart tedge-agent
###
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.040206436Z  INFO tedge_api::workflow::log::command_log: => moving to restart-tedge-agent @ restarting
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.051979516Z  INFO Runtime: Shutting down
###
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.052311096Z  INFO tedge_agent::http_server::actor: Shutdown
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.052608572Z  INFO Runtime: Actor has finished: RestartManagerActor-11
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.053113484Z  INFO Runtime: Actor has finished: EntityStoreServer-2
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.053126313Z  INFO Runtime: Actor has finished: FileCacheActor-3
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.053136637Z  INFO Runtime: Actor has finished: Uploader-8
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.053143392Z  INFO Runtime: Actor has finished: SoftwareManagerActor-12
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.053149609Z  INFO Runtime: Actor has finished: FsWatcher-6
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.053155294Z  INFO Runtime: Actor has finished: Script-13
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.053162599Z  INFO Runtime: Actor has finished: Downloader-7
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.053168864Z  INFO Runtime: Actor has finished: MQTT-5
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.053172879Z  INFO Runtime: Actor has finished: TedgetoTeConverter-0
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.053176811Z  INFO Runtime: Actor has finished: Signal-Handler-4
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.053181159Z  INFO Runtime: Actor has finished: HealthMonitorActor-15
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.053184892Z  INFO Runtime: Actor has finished: LogManager-10
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.053188609Z  INFO Runtime: Actor has finished: ConfigManager-9
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.053192443Z  INFO Runtime: Actor has finished: HttpServer-1
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.054860793Z  INFO Runtime: Actor has finished: WorkflowActor-14
Mar 04 07:47:03 cf36b377f991 tedge-agent[816]: 2025-03-04T07:47:03.054900738Z  INFO Runtime: All actors have finished

So it appears that the timing of the shutdown, whether it is processed before or after a script execution has been triggered, is what makes the difference here.

didier-wenzek (Contributor) commented:

The script associated with the restarting state, which actively checks that the agent restarts, is launched by the workflow engine only after the background script (the one restarting the agent) actually returns. However, the restart command might not take effect immediately, which occasionally leaves the agent enough time to launch the script before it is restarted.

The workflow is expected to work in both cases.

  • First case: the agent restarts before the script is launched.
    The workflow state (restarting) has been persisted before the background script restarting the agent was launched; as a result, on restart, the agent resumes this state and launches the script.
  • Second case: the agent launches the script before being killed by the restart command.
    As a result, the script is killed when the agent restarts.
    However, things are not properly handled here
    (see #3415 (comment), where the workflow jumps to the failed state when killed on restart).
    In order to have the script resumed on restart, the workflow must loop on-kill back to the restarting state (see the sketch at the end of this comment),
    and the script has to be improved to handle the timeout case
    (the workflow engine cannot distinguish a kill due to a restart from a kill due to a timeout).

=> The workflow engine is working as expected, except when a shutdown takes more than 60 seconds.
=> The restart workflow used by this test is fragile and can be improved.
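
A minimal sketch of the more robust restarting step suggested above, assuming the on_kill handler can point to a target state the same way on_success does (both that syntax and the 90-second timeout are assumptions, not verified against the workflow engine):

[restarting]
script = "/etc/tedge/operations/tedge-agent-pid.sh test ${.payload.tedge-agent-pid}"
timeout_second = 90
on_success = "tedge-agent-restarted"
# Loop back to the same state when the script is killed by the agent restart,
# so the check resumes once the agent is back up. The script itself would then
# have to detect and report a genuine timeout, since the engine cannot tell a
# kill caused by a restart from a kill caused by a timeout.
on_kill = "restarting"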
