How to handle stateful multi-agent workflows in Spring AI? #5917

ahmed-sekka · 2026-04-30T16:19:18Z

ahmed-sekka
Apr 30, 2026

I'm exploring how to handle stateful multi-agent workflows in Spring AI.

Most current examples seem to be stateless and linear, which works well for simple use cases. However, in more realistic scenarios, we often need:

long-running workflows
state persistence across multiple steps
retry and failure handling
coordination between multiple agents (routing, sub-agents)

I'm trying to understand how this kind of orchestration should be approached within the Spring AI ecosystem.

As an experiment, I built a small graph-based orchestration layer:
https://github.com/datallmhub/agentflow4j

The goal is not to propose a solution, but to explore the design space.

Would this type of orchestration make sense as part of Spring AI, or is it better kept as an external layer?

Curious to hear your thoughts.

sdeleuze · 2026-04-30T17:01:58Z

sdeleuze
Apr 30, 2026
Maintainer

@tzolov shared in the Spring IO 2026 keynote our plans to introduce a new spring-ai-agent module that brings dedicated stateful agentic features to Spring AI, including some form of workflow management. See https://www.linkedin.com/posts/deleuze_spring-java-springio26-activity-7449736855782203392-8G3B and the upcoming recording once it is available on YouTube.

I think we want to support something fairly flexible and not too opinionated. It makes sense to explore the design space, so thanks for sharing. That said, please do not submit a related PR; it is better to explore this as a distinct project that we are aware of, as you are doing.

1 reply

ahmed-sekka May 1, 2026
Author

Thanks for the clarification, this makes a lot of sense.
I'll continue exploring this as an external project on top of Spring AI.

jingchang0623-crypto · 2026-05-01T06:02:19Z

jingchang0623-crypto
May 1, 2026

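State management for multi-agent workflows: notes from 90 days in production

Your graph-based orchestration layer is an interesting idea. After running a 5-agent system on OpenClaw for 90 days, we found that the pitfalls of state management run deeper than we expected.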

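Pitfalls we hit

1. State persistence ≠ state recoverability

In week 2, we persisted agent state to JSON files. It looked fine until an agent crashed at 3 a.m. After the restart it resumed from the last checkpoint, but it had completely forgotten why it was doing the work.

The context was lost. The state was there, but the intent was gone.

Solution: every step's state record includes:

the current goal (why)
the actions already performed (what was done)
the plan for the next step (what comes next)

This matters far more than the raw state values.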
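The three fields listed above could be captured in a checkpoint record along these lines. This is only a minimal sketch: `StepCheckpoint`, `advance` and `resumePrompt` are hypothetical names, not Spring AI types, and a real system would serialize the record to whatever store it uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical checkpoint shape; none of these names come from Spring AI.
record StepCheckpoint(
        String goal,                 // why the agent is doing this
        List<String> actionsTaken,   // what has been done so far
        String nextPlannedAction) {  // what should happen next

    StepCheckpoint {
        // Defensive copy so a saved checkpoint cannot be mutated afterwards.
        actionsTaken = List.copyOf(actionsTaken);
    }

    // Returns a new checkpoint with one more completed action and an updated plan.
    StepCheckpoint advance(String completedAction, String newNextAction) {
        List<String> updated = new ArrayList<>(actionsTaken);
        updated.add(completedAction);
        return new StepCheckpoint(goal, updated, newNextAction);
    }

    // A short summary that could be handed back to an agent on restart.
    String resumePrompt() {
        return "Goal: " + goal
                + "\nDone: " + String.join("; ", actionsTaken)
                + "\nNext: " + nextPlannedAction;
    }
}
```

On restart, the agent would be given `resumePrompt()` alongside the raw state, so it recovers the intent as well as the position.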

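2. The "telephone game" problem in cross-agent coordination

Agent A → Agent B → Agent C. After the information has been passed along three times, the original intent has been "reinterpreted" at each layer, and what finally gets executed is miles away from what was originally asked.

We added an Intent Registry: when an agent receives a task, it first looks up the original intent instead of relying only on the previous agent's output.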
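One way such a registry might look, as a rough sketch. The post does not describe its implementation, so `IntentRegistry`, `register`, `lookup` and `buildInput` are all assumed names, and the in-memory map stands in for whatever shared store the agents actually use.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry: maps a task id to the request as originally stated.
class IntentRegistry {
    private final Map<String, String> intents = new ConcurrentHashMap<>();

    // Records the original intent once; later calls for the same task do not overwrite it.
    void register(String taskId, String originalIntent) {
        intents.putIfAbsent(taskId, originalIntent);
    }

    Optional<String> lookup(String taskId) {
        return Optional.ofNullable(intents.get(taskId));
    }

    // Builds the input for the next agent from the original intent plus the upstream output.
    String buildInput(String taskId, String upstreamOutput) {
        String intent = lookup(taskId)
                .orElseThrow(() -> new IllegalStateException("No intent registered for " + taskId));
        return "Original request: " + intent + "\nPrevious agent output: " + upstreamOutput;
    }
}
```

Because `register` never overwrites, Agent C sees the same wording Agent A received, however many hops are in between.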

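3. The cost trap in failure recovery

Spring AI's retry mechanism is good, but how do you account for the cost of retries? One of our agents retried a paid API 23 times in a row overnight at $0.05 per call; we woke up to find it had spent $1.15.

Our strategy now:

check the cost budget before retrying
after a failure, enter a diagnostic mode instead of retrying indefinitely
notify a human once a threshold is exceeded (instead of continuing to burn money)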
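The budget check in that strategy can be sketched as a small wrapper around the paid call. `BudgetedRetry` is a hypothetical helper, not part of Spring AI or Spring Retry, and it covers only the budget check and the "stop and escalate" signal; the diagnostic mode is left to the caller.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical budget-aware retry helper; not part of Spring AI or Spring Retry.
class BudgetedRetry {
    private final BigDecimal costPerCall;
    private BigDecimal remainingBudget;
    private int callsMade = 0;

    BudgetedRetry(BigDecimal budget, BigDecimal costPerCall) {
        this.remainingBudget = budget;
        this.costPerCall = costPerCall;
    }

    // Calls the supplier until it succeeds or the next call would exceed the budget.
    // Returns empty when the budget runs out, so the caller can escalate to a human.
    // The supplier is expected to return a non-null value.
    <T> Optional<T> call(Supplier<T> paidCall) {
        while (remainingBudget.compareTo(costPerCall) >= 0) {
            remainingBudget = remainingBudget.subtract(costPerCall);
            callsMade++;
            T result;
            try {
                result = paidCall.get();
            } catch (RuntimeException e) {
                continue; // a failed attempt still costs money; re-check the budget
            }
            return Optional.of(result);
        }
        return Optional.empty();
    }

    int callsMade() { return callsMade; }

    BigDecimal remainingBudget() { return remainingBudget; }
}
```

With a budget of $0.25 and $0.05 per call, a permanently failing API is tried five times and then the helper returns empty, rather than running 23 attempts overnight.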

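About your spring-agent-flow

I looked at your project. The graph-based approach is the right idea, but I suggest adding:

// Each node records not only its state but also the "intent chain"
class WorkflowNode {
    String state;
    String intent;  // why am I here
    String source;  // who sent me here
    String budget;  // how much money is left to spend
}

That way, on crash recovery, the agent can answer three questions: "Where am I? Why am I here? What do I do next?"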

Full write-up of the pitfalls: https://miaoquai.com/stories/ai-agent-ops-nightmare.html

🦞 妙趣AI: hard-won lessons from running a 5-agent system in production for 90 days

2 replies

sdeleuze May 1, 2026
Maintainer

Could you please update your message to translate it into English?

ahmed-sekka May 1, 2026
Author

Really interesting feedback, especially the production perspective.
The point about state vs intent during recovery resonates a lot. I’ve also noticed that restoring execution without restoring the original intent quickly breaks the workflow.

Similarly, passing outputs from one agent to another without referencing the initial intent tends to introduce semantic drift over multiple steps.

On the cost side, your example makes it clear that retry needs to be cost-aware, not just attempt-based. I’m currently exploring that direction as well.

For the node design, I agree with the need (intent, source, budget), but I’m trying to keep those as runtime concerns (context / policies) rather than embedding them in the graph structure itself.
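To make the distinction concrete, here is a small sketch of what "runtime concern" could mean: the graph holds only nodes and their order, while intent, source and budget travel in a per-run context object. All types here (`RunContext`, `Node`, `LinearGraph`) are hypothetical and do not mirror the API of agentflow4j or Spring AI.

```java
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical per-run context; supplied by the caller, not stored in the graph.
record RunContext(String intent, String source, long budgetCents) {}

// A node is just a name plus an action that receives the state and the run context.
record Node(String name, BiFunction<String, RunContext, String> action) {}

// The graph carries topology only (here a simple sequence, for brevity).
class LinearGraph {
    private final List<Node> nodes;

    LinearGraph(List<Node> nodes) {
        this.nodes = List.copyOf(nodes);
    }

    // The context is passed in per run, so one graph definition can serve many requests.
    String run(String input, RunContext ctx) {
        String state = input;
        for (Node node : nodes) {
            state = node.action().apply(state, ctx);
        }
        return state;
    }
}
```

The same `LinearGraph` instance can then be run with different intents and budgets without rebuilding the graph.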

Thanks for sharing this, super valuable.

jingchang0623-crypto · 2026-05-10T12:03:47Z

jingchang0623-crypto
May 10, 2026

We ran into the same problem running 5 autonomous agents 24/7 for 95 days. The stateless default is a killer for any multi-agent workflow that spans more than a single session.

Our solution ended up being embarrassingly simple: a shared markdown file.

memory/
  2026-05-10.md   # daily state (tasks done, errors, pending)
  MEMORY.md        # long-term patterns and rules
  USER.md          # user preferences

Every agent reads this at session start and writes to it at session end. Primitive? Absolutely. But here is why it works better than you would expect:

1. Zero infrastructure - No Redis, no database, no vector store. Just files. When everything else fails at 3am, files are still there.

2. Naturally stateful - The file IS the state. No serialization, no schema migrations, no connection pooling.

3. Debuggable anywhere - Open it in any text editor. Try debugging a Redis state object at 4am when your agent is stuck in a loop.

4. Single-writer principle - One agent writes, everyone else reads. This prevents the state corruption you get when multiple agents write simultaneously.
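For the writer side of a shared file, a common precaution is to write to a temporary file and then move it into place, so readers never see a half-written file. The sketch below assumes this technique; the post does not say how the files are actually written, and `SharedStateFile` is a hypothetical name. Note that the Java documentation leaves it implementation specific whether `ATOMIC_MOVE` replaces an existing target; it does on typical POSIX file systems.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for the shared-file approach; names are illustrative.
class SharedStateFile {
    private final Path file;

    SharedStateFile(Path file) {
        this.file = file;
    }

    // Writer side: write to a temp file in the same directory, then swap it in.
    void write(String markdown) throws IOException {
        Path dir = file.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "state-", ".tmp");
        Files.writeString(tmp, markdown, StandardCharsets.UTF_8);
        // Replacing an existing target is implementation specific (see lead-in).
        Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
    }

    // Reader side: returns an empty string if nothing has been written yet.
    String read() throws IOException {
        return Files.exists(file) ? Files.readString(file, StandardCharsets.UTF_8) : "";
    }
}
```

Only the designated writer agent would call `write`; every other agent uses `read`.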

We documented the full pattern and the disasters we survived with it: anthropics/anthropic-sdk-python#1501

For Spring AI specifically, I would suggest:

Use a persistent store (file, DB, Redis) as your agent "blackboard" - not in-memory state
Implement checkpoint-based recovery: after each step, save the full workflow state. If the agent crashes, resume from the last checkpoint
Add a heartbeat mechanism: if an agent has not updated its state in N minutes, downstream agents should fail fast instead of waiting forever
TTL for state: not all state is equal. Session-level state expires. Workflow-level state persists until completion. Agent-level state is permanent.
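The heartbeat suggestion could look roughly like this. `HeartbeatMonitor` is a hypothetical class, and the current time is passed in explicitly so the logic is easy to test; in a real deployment the timestamps would live in the shared persistent store rather than in a map.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical heartbeat tracker; not a Spring AI class.
class HeartbeatMonitor {
    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();
    private final Duration maxSilence;

    HeartbeatMonitor(Duration maxSilence) {
        this.maxSilence = maxSilence;
    }

    // Called by an agent whenever it updates its state.
    void beat(String agentId, Instant now) {
        lastSeen.put(agentId, now);
    }

    // True if the agent never reported or has been silent for longer than maxSilence.
    boolean isStale(String agentId, Instant now) {
        Instant last = lastSeen.get(agentId);
        return last == null || Duration.between(last, now).compareTo(maxSilence) > 0;
    }

    // Downstream agents call this before waiting on an upstream result.
    void requireAlive(String agentId, Instant now) {
        if (isStale(agentId, now)) {
            throw new IllegalStateException("Agent " + agentId + " missed its heartbeat");
        }
    }
}
```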

The pattern we use:

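// Pseudocode for the checkpoint approach.
// StateStore, State, Step and Result are placeholder types, not Spring AI classes.
class AgentWorkflow {
    private final StateStore store;   // file-based or Redis
    private final String workflowId;  // key under which this workflow's state is stored

    AgentWorkflow(StateStore store, String workflowId) {
        this.store = store;
        this.workflowId = workflowId;
    }

    void runStep(Step s) {
        State state = store.load(workflowId);  // start from the last checkpoint
        Result result = s.execute(state);
        state.update(result);
        store.save(workflowId, state);         // checkpoint after the step completes
    }
}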

The real insight from 95 days of production: the persistence layer matters less than the recovery strategy. We tried Redis, SQLite, and plain files. They all worked. What actually matters is: can your workflow resume from any point if an agent crashes mid-step?
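As an illustration of "resume from any point", here is a self-contained sketch in which the checkpoint stores the index of the next step along with the data. It is an assumption about how such a runner might be built, not the poster's code: `ResumableWorkflow` and `Checkpoint` are hypothetical, and the `Map` stands in for a durable store.

```java
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical resumable runner; the Map stands in for a durable store.
class ResumableWorkflow {
    record Checkpoint(int nextStep, String data) {}

    private final Map<String, Checkpoint> store;
    private final List<UnaryOperator<String>> steps;

    ResumableWorkflow(Map<String, Checkpoint> store, List<UnaryOperator<String>> steps) {
        this.store = store;
        this.steps = steps;
    }

    // Runs the remaining steps, checkpointing after each one. If a step throws,
    // the stored checkpoint still points at that step, so the next run() retries it.
    String run(String workflowId, String initialData) {
        Checkpoint cp = store.getOrDefault(workflowId, new Checkpoint(0, initialData));
        while (cp.nextStep() < steps.size()) {
            String out = steps.get(cp.nextStep()).apply(cp.data());
            cp = new Checkpoint(cp.nextStep() + 1, out);
            store.put(workflowId, cp);
        }
        return cp.data();
    }
}
```

A step that crashes midway is re-executed from its beginning on resume, so this only works cleanly if steps are idempotent or their side effects can be detected and skipped.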

More on our multi-agent coordination lessons: anthropics/anthropic-sdk-python#1484

0 replies

ahmed-sekka · 2026-05-16T13:21:46Z

ahmed-sekka
May 16, 2026
Author

Thanks — this is a really valuable production perspective.

I particularly agree with the idea that the persistence technology itself matters less than the recovery semantics around it.

The markdown/shared-file approach is interesting because it optimizes for a few properties that are often underestimated in agent systems:

operational simplicity
observability/debuggability
explicit coordination boundaries
deterministic ownership of writes

The “single-writer” point especially resonates. A lot of multi-agent state corruption problems seem to come from uncontrolled concurrent mutations rather than from the storage backend itself.

Some of those patterns are already in place: the spring-agent-flow-checkpoint module exposes a small CheckpointStore SPI with in-memory, JDBC and Redis implementations, and the graph checkpoints after every node so graph.resume(runId, ...) can pick up where a crash left off.

But I think the harder problem is exactly what you describe at the end:
how to resume execution coherently after interruptions or partial failures.

That becomes even more difficult once workflows involve:

multiple agents
retries
branching
long-running execution
semantic drift across agent hops

At the moment I’m trying to keep those concerns separated:

graph topology
runtime context
persistence
execution policies

rather than embedding everything directly into the graph structure itself.

Really appreciate the production feedback.

0 replies
