Flaky system test: Tedge Agent.Workflows.Custom Operation.Trigger Agent Restart #3415
Comments
This will need looking into. It looks to be a legitimate failure where the restart of the agent was not successful: the logs do show that the agent started shutting down, but the validation step that checks whether the PID has changed ran while the tedge-agent was still shutting down.
A shutdown that took nearly 10 seconds is the cause of this failure. Here is the relevant excerpt from the `restart-tedge-agent` workflow:

```toml
[restart]
background_script = "sudo systemctl restart tedge-agent"
on_exec = "restarting"

[restarting]
script = "/etc/tedge/operations/tedge-agent-pid.sh test ${.payload.tedge-agent-pid}"
timeout_second = 10
on_success = "tedge-agent-restarted"
on_kill = { status = "failed", reason = "tedge-agent not restarted" }
```

The
But the restart completed 10 seconds later at
But the operation failed within the 10 second timeout that was configured for the |
While we still need to investigate why the |
While investigating why the
But, while analysing the logs of successful test runs, it was observed that the agent shut down immediately after transitioning to the
So, it appears that the timing of the shutdown, whether it is processed before or after a script execution is triggered, makes the difference here.
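To make the timing argument concrete, here is a minimal Python sketch of a PID-change check racing against a fixed timeout. It is only a model: `wait_for_pid_change`, `FakeClock` and `simulate` are hypothetical names, the PIDs are made up, and it does not reproduce the real `tedge-agent-pid.sh` script or how the workflow engine kills a script on timeout.

```python
import time
from typing import Callable, Optional


def wait_for_pid_change(
    old_pid: int,
    read_pid: Callable[[], Optional[int]],
    timeout: float,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the PID differs from old_pid or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        pid = read_pid()
        if pid is not None and pid != old_pid:
            return True  # a new process is running: the restart completed
        if clock() >= deadline:
            return False  # still the old PID (or none) when time ran out
        sleep(interval)


class FakeClock:
    """Deterministic clock so the simulation needs no real waiting."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def simulate(restart_after: float, timeout: float = 10.0) -> bool:
    """Model an agent whose new PID only appears `restart_after` seconds in."""
    fake = FakeClock()
    old_pid, new_pid = 100, 200  # made-up PIDs, for illustration only

    def read_pid() -> int:
        return new_pid if fake.now >= restart_after else old_pid

    return wait_for_pid_change(
        old_pid, read_pid, timeout, clock=fake.time, sleep=fake.sleep
    )


if __name__ == "__main__":
    print(simulate(3.0))   # restart well within the 10 s window
    print(simulate(10.5))  # restart just after the 10 s window
```

Under these assumptions, a restart that completes after 3 seconds passes the check, while one that completes just after the 10 second window fails it, even though the agent does eventually restart.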
The script (associated to the restarting state and actively checking that the agent restarts) The workflow is expected to work in both cases.
=> The workflow engine is working as expected except for shutdown taking more than 60 seconds.
Describe the bug
Flaky test:
Tedge Agent.Workflows.Custom Operation.Trigger Agent Restart
The workflow log indicates the following failure:
The following `ChannelError(SendError(SendError { kind: Disconnected }))` is captured in the tedge-agent log as well:
Failed Instances