Disconnected runtime now stops the agent loop #6829
base: main
async def stop_agent_loop_for_error(self):
    if self.controller is not None:
        await self.controller.set_agent_state_to(AgentState.ERROR)
I removed this method as...
- Contrary to its name, it does not actually stop the agent loop.
- Stopping the agent loop should be done in the conversation manager.
It didn't stop the loop in the sense of closing it, but it paused it, basically, right? And then the UI waited for the user to take action.
Can't we try, on refresh, to reconnect the runtime?
That is what this PR does. Before this PR, the problem was that the agent loop would still be running after the runtime disconnected, and therefore would not restart. Now, a disconnected runtime triggers the agent loop to stop, so that a page refresh will restart the agent loop and thereby trigger a reconnect / restart of the runtime.
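A minimal sketch of the behavior described above, not the actual OpenHands code: `ToyAgentLoop`, `ToyConversationManager`, `get_or_start_loop`, and `on_runtime_disconnected` are hypothetical names invented for illustration. It assumes a manager that keeps one loop per session id and reuses a loop that already exists, which is why a loop with a dead runtime would otherwise survive a refresh.

```python
import asyncio


class ToyAgentLoop:
    """Hypothetical stand-in for an agent loop bound to a runtime."""

    def __init__(self, sid: str) -> None:
        self.sid = sid
        self.runtime_connected = True
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class ToyConversationManager:
    """Hypothetical manager that keeps one loop per session id."""

    def __init__(self) -> None:
        self.loops: dict[str, ToyAgentLoop] = {}

    async def get_or_start_loop(self, sid: str) -> ToyAgentLoop:
        # A page refresh lands here. An existing loop is reused as-is,
        # so a loop whose runtime is gone is never replaced.
        if sid not in self.loops:
            self.loops[sid] = ToyAgentLoop(sid)
        return self.loops[sid]

    async def on_runtime_disconnected(self, sid: str) -> None:
        # Stop and forget the loop so the next refresh builds a new one.
        loop = self.loops.pop(sid, None)
        if loop is not None:
            await loop.close()


async def demo() -> tuple[bool, bool]:
    manager = ToyConversationManager()
    first = await manager.get_or_start_loop("s1")
    first.runtime_connected = False  # simulate the container going away

    # Without stopping the loop, a refresh returns the same broken loop.
    reused = await manager.get_or_start_loop("s1") is first

    # After the loop is stopped, a refresh creates a fresh loop.
    await manager.on_runtime_disconnected("s1")
    fresh = await manager.get_or_start_loop("s1")
    return reused, fresh is not first and fresh.runtime_connected


if __name__ == "__main__":
    print(asyncio.run(demo()))
```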
You're right, we need a solution for that behavior, thank you for this.
I'm not sure this is quite the way to do it, but I may be wrong. The message callback channel was intended for displaying error strings in the UI, so using it to close the entire session is a bit surprising. What if we reconnect the runtime at refresh time, and close the old agent loop then if it was disconnected? Does that make sense?
On a side note, I wonder what happens when the user has useful things to do even without a runtime available. Right now they can chat with the LLM, and they can try to create a delegate (these are not runnable actions, so they don't require a runtime). What if we add a summarization tool the user could use (it doesn't need a runtime either), or integrate MCP? Just wondering, maybe I'm missing something: would those still be possible with this PR?
I like this approach better. I'll update the PR.
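The refresh-time alternative suggested in this thread could look roughly like the sketch below. It is an illustration under assumptions, not the updated PR: `ToySession`, `RefreshingManager`, and `on_refresh` are hypothetical names, and the runtime health check is reduced to a boolean flag.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ToySession:
    """Hypothetical session holding an agent loop and its runtime state."""

    sid: str
    runtime_connected: bool = True
    closed: bool = False

    async def close(self) -> None:
        self.closed = True


class RefreshingManager:
    """Hypothetical manager that checks runtime health only at refresh."""

    def __init__(self) -> None:
        self.sessions: dict[str, ToySession] = {}

    async def on_refresh(self, sid: str) -> ToySession:
        existing = self.sessions.get(sid)
        if existing is not None and existing.runtime_connected:
            # Healthy session: keep it, nothing to restart.
            return existing
        if existing is not None:
            # Disconnected session: close the old loop before replacing it.
            await existing.close()
        self.sessions[sid] = ToySession(sid)
        return self.sessions[sid]


async def demo_refresh() -> tuple[bool, bool, bool]:
    manager = RefreshingManager()
    healthy = await manager.on_refresh("s1")
    same = await manager.on_refresh("s1") is healthy

    healthy.runtime_connected = False  # simulate a lost runtime
    replaced = await manager.on_refresh("s1")
    return same, healthy.closed, replaced is not healthy


if __name__ == "__main__":
    print(asyncio.run(demo_refresh()))
```

In this variant nothing is torn down while the runtime is missing, so actions that do not need a runtime keep working until the user refreshes.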
openhands/server/conversation_manager/standalone_conversation_manager.py
…:All-Hands-AI/OpenHands into fix-disconnected-runtime-stop-agent-loop
End-user friendly description of the problem this fixes or functionality that this introduces
The AgentLoop is now stopped when it detects that a runtime disconnects.
Give a summary of what the PR does, explaining any non-trivial design decisions
If the runtime stops (possibly due to an external error, running out of memory, or an issue with Kubernetes / Docker), the server would previously continue without a runtime, spewing errors but never actually handling them. After this change, the AgentLoop stops after the final error is emitted.
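As a rough sketch of "stop after the final error is emitted", assuming a loop that runs steps one at a time: `RuntimeDisconnected`, `run_agent_loop`, and the `emit` callback are hypothetical names for illustration and do not come from the OpenHands codebase.

```python
import asyncio
from typing import Awaitable, Callable


class RuntimeDisconnected(Exception):
    """Hypothetical error raised when the runtime is no longer reachable."""


async def run_agent_loop(
    steps: list[Callable[[], Awaitable[None]]],
    emit: Callable[[str], None],
) -> str:
    for step in steps:
        try:
            await step()
        except RuntimeDisconnected as exc:
            # Emit one final error, then stop instead of continuing
            # to run steps against a runtime that no longer exists.
            emit(f"error: {exc}")
            return "stopped"
    return "finished"


async def demo_stop() -> tuple[str, list[str]]:
    events: list[str] = []

    async def ok() -> None:
        events.append("ok")

    async def lost() -> None:
        raise RuntimeDisconnected("runtime disconnected")

    # The third step never runs because the loop stops at the failure.
    status = await run_agent_loop([ok, lost, ok], events.append)
    return status, events


if __name__ == "__main__":
    print(asyncio.run(demo_stop()))
```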
Example
[image]
This silly conversation...
If the docker container is deleted...
[image]
On main, subsequent prompts fail, telling users to refresh the page...
[image]
But refreshing the page does not clear the issue - the Agent remains in the Error state.
After the change, a page refresh will restart the agent loop (because it has been stopped!). The agent is still aware that something went wrong, as evidenced by the output from a continue prompt:
[image]
Link of any specific issues this addresses
To run this PR locally, use the following command: