How to handle stateful multi-agent workflows in Spring AI? #5917
Replies: 4 comments 3 replies
-
@tzolov shared in the Spring IO 2026 keynote our plans to introduce a new [...]. I think we want to support something fairly flexible and not too opinionated. It makes sense to explore the design space, so thanks for sharing. That said, please do not submit a related PR; it is better to explore this as a distinct project that we are aware of, as you are doing.
-
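State management for multi-agent workflows: notes from 90 days in production
Your idea of a graph-based orchestration layer is interesting. After running a 5-agent system on OpenClaw for 90 days, we found that the pitfalls in state management run deeper than we expected.
Pitfalls we hit
1. State persistence ≠ state recoverability
In week 2 we persisted agent state as JSON files. That looked fine until an agent crashed at 3 a.m. After the restart it resumed from the last checkpoint, but it had completely forgotten why it was doing the task. The context was lost: the state was there, the intent was gone.
Solution: each step's state record includes:
This matters far more than the bare state values.
2. The "telephone game" problem in cross-agent coordination
Agent A → Agent B → Agent C. After a message has been passed along three times, the original intent has been "interpreted" at every hop, and what finally gets executed is miles away from the original request. We added a [...]
3. The cost trap of failure recovery
Spring AI's retry mechanism is good, but how do you account for the cost of retries? One of our agents retried a paid API 23 times in a row overnight at $0.05 per call, and we woke up to find it had spent $1.15. Our strategy now: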
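About your spring-agent-flow
I looked at your project. The graph-based approach is the right idea, but I would suggest adding:
// Each node records not only its state but also an "intent chain"
class WorkflowNode {
    String state;
    String intent; // why we are here
    String source; // who sent us here
    String budget; // how much money is left to spend
}
That way, on crash recovery, the agent can answer three questions: where am I, why am I here, and what do I do next.
Full write-up of the pitfalls: https://miaoquai.com/stories/ai-agent-ops-nightmare.html
🦞 妙趣AI: hard-won lessons from 90 days of running a 5-agent system in production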
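The retry cost trap from point 3 can be guarded against with a spend cap that is checked before every attempt. The sketch below is only an illustration under stated assumptions: `RetryBudget` and `BudgetedRetry` are hypothetical names, costs are tracked in whole cents, and none of this is Spring AI's or the commenter's actual retry API.

```java
import java.util.function.Supplier;

/** Hypothetical budget guard: refuses a charge that would exceed the spend cap. */
final class RetryBudget {
    private final long capCents; // total the workflow may spend, in cents
    private long spentCents = 0;

    RetryBudget(long capCents) { this.capCents = capCents; }

    /** Reserves the cost of one call; returns false if it would exceed the cap. */
    boolean tryCharge(long costCents) {
        if (spentCents + costCents > capCents) return false;
        spentCents += costCents;
        return true;
    }

    long spentCents() { return spentCents; }
}

final class BudgetedRetry {
    /** Retries until success, until attempts run out, or until the budget is exhausted. */
    static <T> T call(Supplier<T> paidCall, int maxAttempts, long costCents, RetryBudget budget) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!budget.tryCharge(costCents)) {
                throw new IllegalStateException(
                        "retry budget exhausted after " + (attempt - 1) + " attempts", last);
            }
            try {
                return paidCall.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts", last);
    }
}

class BudgetedRetryDemo {
    public static void main(String[] args) {
        RetryBudget budget = new RetryBudget(25); // 25 cents allows 5 calls at 5 cents each
        try {
            // A paid call that always fails, with the 23 attempts from the incident above.
            BudgetedRetry.call(() -> { throw new RuntimeException("API down"); }, 23, 5, budget);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage() + ", spent " + budget.spentCents() + " cents");
        }
    }
}
```

With a 25-cent cap, the always-failing call stops after 5 attempts instead of 23. Passing the same `RetryBudget` instance to every agent in a workflow would make the cap apply to the workflow as a whole, which is one possible reading of the `budget` field suggested above.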
-
We ran into the same problem running 5 autonomous agents 24/7 for 95 days. The stateless default is a killer for any multi-agent workflow that spans more than a single session.
Our solution ended up being embarrassingly simple: a shared markdown file. Every agent reads this at session start and writes to it at session end.
Primitive? Absolutely. But here is why it works better than you would expect:
1. Zero infrastructure - No Redis, no database, no vector store. Just files. When everything else fails at 3am, files are still there.
2. Naturally stateful - The file IS the state. No serialization, no schema migrations, no connection pooling.
3. Debuggable anywhere - Open it in any text editor. Try debugging a Redis state object at 4am when your agent is stuck in a loop.
4. Single-writer principle - One agent writes, everyone else reads. This prevents the state corruption you get when multiple agents write simultaneously.
We documented the full pattern and the disasters we survived with it: anthropics/anthropic-sdk-python#1501
For Spring AI specifically, I would suggest:
The pattern we use:
// Pseudocode for the checkpoint approach
class AgentWorkflow {
    StateStore store;   // file-based or Redis
    String workflowId;  // identifies this run in the store
    void runStep(Step s) {
        State state = store.load(workflowId);   // start from the last checkpoint
        Result result = s.execute(state);
        state.update(result);
        store.save(workflowId, state);          // checkpoint
    }
}
The real insight from 95 days of production: the persistence layer matters less than the recovery strategy. We tried Redis, SQLite, and plain files. They all worked. What actually matters is: can your workflow resume from any point if an agent crashes mid-step?
More on our multi-agent coordination lessons: anthropics/anthropic-sdk-python#1484
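The checkpoint-per-step pattern above can be made concrete with plain files. What follows is a minimal sketch, not the commenter's actual code: `FileStateStore`, `ResumableWorkflow`, the `.ckpt` file layout and the string-valued state are all assumptions made for illustration. Each checkpoint stores the index of the next step to run, so a rerun after a crash skips the steps that already completed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical file-based store: one file per workflow holding "<nextStepIndex>\n<state>". */
final class FileStateStore {
    private final Path dir;

    FileStateStore(Path dir) { this.dir = dir; }

    /** Writes a temp file, then renames it over the checkpoint (atomic where the file system supports it). */
    void save(String workflowId, int nextStep, String state) {
        try {
            Path tmp = Files.createTempFile(dir, "ckpt-", ".tmp");
            Files.write(tmp, (nextStep + "\n" + state).getBytes(StandardCharsets.UTF_8));
            Files.move(tmp, dir.resolve(workflowId + ".ckpt"),
                    StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns {nextStepIndex, state}, or null when no checkpoint exists yet. */
    String[] load(String workflowId) {
        Path file = dir.resolve(workflowId + ".ckpt");
        if (!Files.exists(file)) return null;
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            return text.split("\n", 2);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class ResumableWorkflow {
    /** Runs steps in order, checkpointing after each one; a rerun starts at the saved index. */
    static String run(FileStateStore store, String workflowId, List<UnaryOperator<String>> steps) {
        String[] saved = store.load(workflowId);
        int next = saved == null ? 0 : Integer.parseInt(saved[0]);
        String state = saved == null ? "" : saved[1];
        for (int i = next; i < steps.size(); i++) {
            state = steps.get(i).apply(state);    // if this throws, the checkpoint still points at step i
            store.save(workflowId, i + 1, state);
        }
        return state;
    }
}

class FileCheckpointDemo {
    /** Simulates a crash in the second step on the first run, then resumes from the checkpoint. */
    static String runDemo() {
        Path dir;
        try {
            dir = Files.createTempDirectory("agent-state");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        FileStateStore store = new FileStateStore(dir);
        boolean[] crashOnce = {true};
        List<UnaryOperator<String>> steps = List.of(
                s -> s + "fetch;",
                s -> {
                    if (crashOnce[0]) { crashOnce[0] = false; throw new IllegalStateException("crash"); }
                    return s + "summarize;";
                },
                s -> s + "publish;");
        try {
            ResumableWorkflow.run(store, "wf-1", steps);
        } catch (IllegalStateException simulatedCrash) {
            // first run dies mid-workflow; the checkpoint after "fetch" survives
        }
        return ResumableWorkflow.run(store, "wf-1", steps);
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints "fetch;summarize;publish;"
    }
}
```

The second run does not repeat the fetch step, which is the property the question above asks for. A step that crashes partway through is re-executed from its start, so steps with external side effects would still need to be idempotent.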
-
Thanks, this is a really valuable production perspective. I particularly agree with the idea that the persistence technology itself matters less than the recovery semantics around it.
The markdown/shared-file approach is interesting because it optimizes for a few properties that are often underestimated in agent systems:
The "single-writer" point especially resonates. A lot of multi-agent state corruption problems seem to come from uncontrolled concurrent mutations rather than from the storage backend itself.
Some of those patterns are already in place: the spring-agent-flow-checkpoint module exposes a small CheckpointStore SPI with in-memory, JDBC and Redis implementations, and the graph checkpoints after every node so graph.resume(runId, ...) can pick up where a crash left off.
But I think the harder problem is exactly what you describe at the end:
That becomes even more difficult once workflows involve:
At the moment I'm trying to keep those concerns separated:
rather than embedding everything directly into the graph structure itself.
Really appreciate the production feedback.
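To illustrate the shape of a store SPI with checkpoint-after-every-node and resume by run ID, here is a rough sketch. It is not the actual spring-agent-flow-checkpoint API: `SketchCheckpointStore`, `InMemorySketchStore`, `SketchGraph` and their method signatures are hypothetical, and the "graph" is reduced to a linear chain of named nodes to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

/** Hypothetical snapshot: the last node that finished and the state it produced. */
final class Checkpoint {
    final String lastCompletedNode;
    final String state;

    Checkpoint(String lastCompletedNode, String state) {
        this.lastCompletedNode = lastCompletedNode;
        this.state = state;
    }
}

/** Hypothetical SPI; the real CheckpointStore in the project may look different. */
interface SketchCheckpointStore {
    void save(String runId, Checkpoint checkpoint);
    Optional<Checkpoint> load(String runId);
}

/** In-memory implementation; a JDBC or Redis one would implement the same two methods. */
final class InMemorySketchStore implements SketchCheckpointStore {
    private final Map<String, Checkpoint> byRun = new ConcurrentHashMap<>();

    public void save(String runId, Checkpoint checkpoint) { byRun.put(runId, checkpoint); }

    public Optional<Checkpoint> load(String runId) { return Optional.ofNullable(byRun.get(runId)); }
}

/** Linear chain for brevity: nodes run in the order they were added. */
final class SketchGraph {
    private final Map<String, UnaryOperator<String>> nodes = new LinkedHashMap<>();
    private final SketchCheckpointStore store;

    SketchGraph(SketchCheckpointStore store) { this.store = store; }

    SketchGraph node(String name, UnaryOperator<String> action) {
        nodes.put(name, action);
        return this;
    }

    String run(String runId, String input) { return execute(runId, input, null); }

    /** Skips every node up to and including the last checkpointed one, then continues. */
    String resume(String runId) {
        Checkpoint c = store.load(runId)
                .orElseThrow(() -> new IllegalArgumentException("no checkpoint for " + runId));
        return execute(runId, c.state, c.lastCompletedNode);
    }

    private String execute(String runId, String state, String skipThrough) {
        boolean skipping = skipThrough != null;
        for (Map.Entry<String, UnaryOperator<String>> e : nodes.entrySet()) {
            if (skipping) {
                skipping = !e.getKey().equals(skipThrough); // stop skipping after the checkpointed node
                continue;
            }
            state = e.getValue().apply(state);
            store.save(runId, new Checkpoint(e.getKey(), state)); // checkpoint after every node
        }
        return state;
    }
}

class SketchGraphDemo {
    /** Crashes in "act" on the first run, then resumes from the checkpoint written after "plan". */
    static String runDemo() {
        boolean[] crashOnce = {true};
        SketchGraph graph = new SketchGraph(new InMemorySketchStore())
                .node("plan", s -> s + ">plan")
                .node("act", s -> {
                    if (crashOnce[0]) { crashOnce[0] = false; throw new IllegalStateException("crash"); }
                    return s + ">act";
                })
                .node("report", s -> s + ">report");
        try {
            graph.run("run-1", "start");
        } catch (IllegalStateException simulatedCrash) {
            // the run dies in "act"; the checkpoint for "plan" is already stored
        }
        return graph.resume("run-1");
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints "start>plan>act>report"
    }
}
```

Keeping the store behind a two-method interface is what lets the backend be swapped without touching the graph, which matches the point that recovery semantics matter more than the persistence technology.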
-
I'm exploring how to handle stateful multi-agent workflows in Spring AI.
Most current examples seem to be stateless and linear, which works well for simple use cases. However, in more realistic scenarios, we often need:
I'm trying to understand how this kind of orchestration should be approached within the Spring AI ecosystem.
As an experiment, I built a small graph-based orchestration layer:
https://github.com/datallmhub/agentflow4j
The goal is not to propose a solution, but to explore the design space.
Would this type of orchestration make sense as part of Spring AI, or is it better kept as an external layer?
Curious to hear your thoughts.