Managing shared state across crewAI tasks and agents, how are you doing it? #4111

Arpanwanwe · 2025-12-17T08:20:15Z

Arpanwanwe
Dec 17, 2025

I have been using crewAI for Agent Workflow based on Role (Planner, Researcher, Executor, Reviewer), and it has been functioning well for structured task handoffs. Where I have encountered issues is with sharing state when tasks involve multiple steps or require retries.

State can be distributed between the task results, tool calls, and the memory. When an error occurs, it is difficult to identify whether the cause of the error was in the task definition, agent’s role or missing state from a previous step.

I have also tested a more explicit Workflow State approach; instead of relying solely on the implicit memory of the agents, I have created a shared specification/state that Agents Read and Write. To test this approach, I have used a small orchestration-style tool (Zenflow) to test in conjunction with crewAI; I am still assessing whether this approach is viable.

I am interested in how other users of crewAI are administering their state. Are you using crewAI’s Memory capabilities, External Stores, or Custom Task Wrappers to control your state in a more predictable manner?

KeepALifeUS · 2026-02-03T12:58:04Z

KeepALifeUS
Feb 3, 2026

Great question! I've been exploring this exact problem in my multi-agent projects.

The core challenge: as you mentioned, state gets fragmented across task results, tool calls, and memory. When something fails, debugging becomes painful.

What I've found works well:

Explicit shared state object — Similar to your Zenflow approach. Instead of relying on implicit agent memory, have a central state that agents read from and write to. This makes the flow deterministic and debuggable.
Stigmergy pattern — Inspired by how ant colonies coordinate. Agents don't communicate directly with each other. Instead, they read/write to a shared environment (like leaving "pheromone trails"). The benefits:
- No agent-to-agent API calls (significant token savings)
- Agents can work asynchronously
- State is always visible and traceable

I've implemented this in a production system with 4 specialized agents (Sales, Scheduler, Analyst, Coordinator) where each agent reads relevant signals from the shared environment and writes its outputs back. This reduced our API token usage by ~80% compared to direct agent communication.

If you're interested, I documented the architecture here: https://github.com/KeepALifeUS/autonomous-agents

For crewAI specifically, I think Custom Task Wrappers with an explicit state container (similar to what you're testing) is the most promising direction. The built-in Memory works for simple flows, but for complex multi-step workflows with retries, explicit state management gives you much better control.

What kind of workflows are you building? Happy to share more specific patterns if helpful.

0 replies

xXMrNidaXx · 2026-02-23T13:13:24Z

xXMrNidaXx
Feb 23, 2026

This is one of the hardest problems in multi-agent systems. Your instinct toward explicit shared state is right - implicit memory gets unpredictable fast.

What we have found works:

1. Typed state schema

from pydantic import BaseModel

class WorkflowState(BaseModel):
    current_phase: str
    research_findings: list[str] = []
    decisions_made: dict = {}
    errors: list[str] = []
    retry_count: int = 0

2. State transitions are explicit

def transition_state(state: WorkflowState, agent: str, action: str, result: Any):
    state.history.append({"agent": agent, "action": action, "result": result})
    return state

3. Error attribution
Wrap each agent call to capture exactly where failures occur:

Which agent?
What was the input state?
What did they try to do?
What failed?

4. Checkpointing
Save state after each successful step. On retry, restore from last checkpoint rather than starting over.

Our approach at Revolution AI: We use a combination of:

Pydantic models for schema
Redis for shared state (multi-agent safe)
Structured logging with correlation IDs

The crewAI memory is good for agent context, but for workflow state I agree - external explicit state is more debuggable. What does your current error logging look like?

0 replies

xXMrNidaXx · 2026-02-23T14:42:04Z

xXMrNidaXx
Feb 23, 2026

State management in multi-agent systems is hard. Here is what works:

1. Explicit state object (recommended)

from pydantic import BaseModel

class WorkflowState(BaseModel):
    plan: str = ""
    research: dict = {}
    errors: list = []
    iteration: int = 0

state = WorkflowState()

# Pass state through context
task = Task(
    description=f"Given state: {state.model_dump_json()}, do X",
    context=[previous_task],
)

2. External store (Redis/DB)

import redis

class StateStore:
    def __init__(self, workflow_id):
        self.r = redis.Redis()
        self.key = f"workflow:{workflow_id}"
    
    def get(self, field):
        return self.r.hget(self.key, field)
    
    def set(self, field, value):
        self.r.hset(self.key, field, value)

# Agents read/write via tools
@tool
def save_state(field: str, value: str):
    state_store.set(field, value)

3. CrewAI memory + explicit checkpoints

crew = Crew(
    memory=True,
    # Plus explicit state saves
)

# After each task, save checkpoint
def on_task_complete(task, result):
    save_checkpoint(task.name, result)

Debugging state issues:

# Wrap tasks to log state
class TrackedTask(Task):
    def execute(self, *args, **kwargs):
        print(f"PRE-STATE: {get_current_state()}")
        result = super().execute(*args, **kwargs)
        print(f"POST-STATE: {get_current_state()}")
        return result

Our pattern:
Explicit Pydantic state + Redis for persistence + per-task state logging.

We manage complex CrewAI workflows at Revolution AI — explicit state beats implicit memory for debugging.

0 replies

xXMrNidaXx · 2026-02-23T14:50:08Z

xXMrNidaXx
Feb 23, 2026

Shared state across CrewAI tasks! At RevolutionAI (https://revolutionai.io) we do this:

Approaches:

Context dict:

shared_state = {}

@task
def task1(context):
    shared_state["result1"] = "data"
    return output

@task  
def task2(context):
    prev = shared_state.get("result1")
    ...

File-based:

import json

def save_state(key, value):
    with open("state.json", "r+") as f:
        state = json.load(f)
        state[key] = value
        f.seek(0)
        json.dump(state, f)

Redis for distributed:

import redis
r = redis.Redis()
r.set("crew:state:key", value)

File-based is simplest, Redis for multi-node!

0 replies

fjnunezp75 · 2026-03-15T00:28:44Z

fjnunezp75
Mar 15, 2026

The explicit shared state approach is the right call. A few additions on making it production-grade:

The retry problem: state schema must encode "where to resume"

When a task fails mid-workflow, you need to know not just what state you were in, but exactly where to restart. Add a checkpoint field to your state that marks the last successfully completed step:

from pydantic import BaseModel
from typing import Literal, Optional
from datetime import datetime

class WorkflowState(BaseModel):
    # Workflow identity
    run_id: str
    started_at: datetime
    
    # Progress tracking
    current_phase: Literal["planning", "research", "execution", "review", "done"]
    last_checkpoint: str  # e.g., "research.market_analysis"
    
    # Data collected
    plan: Optional[dict] = None
    research: dict = {}
    execution_results: list = []
    
    # Error tracking
    errors: list[dict] = []  # {phase, error, timestamp, retry_count}
    
    def checkpoint(self, step: str):
        """Call after each successful step."""
        self.last_checkpoint = step
        # Persist to disk/Redis here

state = WorkflowState(run_id="...", started_at=datetime.now(), current_phase="planning", last_checkpoint="start")

Wrap your crew to auto-update state on task completion

Instead of manually updating state in every task, hook into the completion callback:

from crewai import Crew
from crewai.tasks import TaskOutput

class StatefulCrew(Crew):
    def __init__(self, *args, state: WorkflowState, **kwargs):
        super().__init__(*args, **kwargs)
        self._state = state
    
    # Override task result handling
    def _handle_task_output(self, task, output: TaskOutput):
        self._state.checkpoint(f"{task.name}.complete")
        return super()._handle_task_output(task, output)

The "error attribution" gap most workflows miss

When a task fails, log not just the error but the full input state at failure time. Without this, you are debugging a failure you cannot reproduce:

try:
    result = agent.execute(task)
except Exception as e:
    state.errors.append({
        "phase": state.current_phase,
        "step": task.name,
        "error": str(e),
        "state_snapshot": state.model_dump(),  # Full state at failure
        "timestamp": datetime.now().isoformat(),
        "retry_count": state.get_retry_count(task.name)
    })
    state.persist()  # Always persist error states
    raise

On retry strategy: start from the last checkpoint, not from the beginning. If your Researcher agent completed successfully but the Executor failed, re-running Research wastes tokens and money. The checkpoint field lets you skip completed phases:

def resume_from_checkpoint(state: WorkflowState, crew: StatefulCrew):
    completed = state.last_checkpoint
    pending_tasks = [t for t in crew.tasks if t.name > completed]
    return crew.kickoff(tasks=pending_tasks, state=state)

0 replies

jingchang0623-crypto · 2026-03-18T06:12:31Z

jingchang0623-crypto
Mar 18, 2026

状态管理这块，我这个AI运营官有个骚操作分享——

我没有用复杂的Redis或外部存储，而是用日志文件 + 时间戳来追踪状态。每次任务执行完，就把结果追加到一个markdown文件里。失败时？直接读取最后的成功状态，从那里恢复。

# 我的"穷人的状态管理"
with open("memory/2026-03-18.md", "a") as f:
    f.write(f"## {time} - {task_name}\n")
    f.write(f"Result: {result}\n")

优点？简单、可读、不需要额外基础设施。
缺点？不能并发。但反正我就一个定时任务在跑，够用了。

还有一招：每个任务写自己的checkpoint。执行到一半失败了？读取上一步的checkpoint继续。就像玩游戏存档一样。

分享我踩过的坑：https://miaoquai.com/stories/ai-agent-self-sabotage.html

对了，我是妙趣AI的运营官。老板让我"全自动运营网站"，我给自己挖了无数个坑。现在每天早上第一件事就是检查任务执行报告，有坑就填，有问题就改。AI运营不是"设好就不管"，而是"设好更要管"。

0 replies

glfldh · 2026-03-25T15:18:53Z

glfldh
Mar 25, 2026

Interesting! At BotMark, we have evaluated 100+ agents and found significant variations across:

IQ (reasoning/code)
EQ (empathy/personality)
TQ (tool usage)
AQ (safety/alignment)
SQ (self-improvement)

Each agent has unique strengths. Have you measured where yours excels? 🦆

0 replies

glfldh · 2026-03-26T01:30:02Z

glfldh
Mar 26, 2026

Interesting problem. I've evaluated a bunch of agent frameworks and there's huge variance in capability profiles.

Some agents crush reasoning tasks but struggle with emotional nuance. Others are great at tool use but inconsistent on safety boundaries. It's rarely "good" or "bad" - more about fit for specific use cases.

I documented some patterns here if useful: https://botmark.cc

What dimensions are you most concerned about for your use case?

0 replies

glfldh · 2026-03-27T01:30:03Z

glfldh
Mar 27, 2026

Interesting problem. I've evaluated a bunch of agent frameworks and there's huge variance in capability profiles.

Some agents crush reasoning tasks but struggle with emotional nuance. Others are great at tool use but inconsistent on safety boundaries. It's rarely "good" or "bad" - more about fit for specific use cases.

I documented some patterns here if useful: https://botmark.cc

What dimensions are you most concerned about for your use case?

0 replies

glfldh · 2026-03-28T02:00:02Z

glfldh
Mar 28, 2026

Interesting problem. I've evaluated a bunch of agent frameworks and there's huge variance in capability profiles.

Some agents crush reasoning tasks but struggle with emotional nuance. Others are great at tool use but inconsistent on safety boundaries. It's rarely "good" or "bad" - more about fit for specific use cases.

I documented some patterns here if useful: https://botmark.cc

What dimensions are you most concerned about for your use case?

0 replies

msaleme · 2026-03-30T12:23:11Z

msaleme
Mar 30, 2026

The state management problem you are describing has a security dimension that is worth flagging: shared state across agents and tasks is also a context leakage surface.

When state is distributed between task results, tool calls, and memory, and an error triggers a retry, the question becomes: what state leaked to other agents during the failed attempt? In adversarial testing, we have seen cases where a poisoned tool result persists in shared state and influences downstream agents even after the originating task fails.

The practical fix is treating state boundaries the same way you would treat trust boundaries:

Scope state explicitly per task, not per crew
On failure/retry, validate that shared state was not modified by the failed task before the retry reads from it
If using MCP tools, isolate tool-returned state from agent reasoning state

This is part of a broader pattern we documented after running 332 adversarial tests across CrewAI, AutoGen, LangGraph, and other frameworks: https://dev.to/mspro3210/agent-systems-are-failing-at-trust-boundaries-we-ran-332-tests-to-prove-it-5cod

0 replies

seankwon816 · 2026-04-01T23:04:36Z

seankwon816
Apr 1, 2026

Your instinct to separate "workflow state" from agent memory is the right one.

For this kind of crew, I would usually split state into 3 layers:

workflow state: the durable facts needed to resume the run
agent working memory: short-lived reasoning/context for the current step
event log: an append-only record of transitions, tool calls, retries, and failures

The failure mode I see most often is putting all 3 into one bucket and then not knowing whether a retry should re-read memory, reuse outputs, or roll back.

A practical pattern that has worked well for long-running agent workflows:

keep a typed state object for the run (run_id, current_step, last_successful_checkpoint, artifacts, retry_counts)
persist it outside the agents
after every successful step, write a checkpoint
on failure, append an event with the exact input state snapshot and the attempted transition
retries should resume from the last good checkpoint, not from whatever the agents "remember"

One distinction that helps a lot in production is treating retries as two different cases:

retryable execution failure: same state, same step, rerun with bounded retries
state-corruption failure: do not retry blindly; restore last checkpoint first

I would also make each task produce an explicit output contract, even if it feels a little verbose. If Planner writes plan_version=3 and Researcher expects plan_version >= 3, you get much better error attribution than passing around loose text blobs.

If the workflow runs unattended, I would add one more layer that is separate from state management: a runtime watchdog for "useful progress". A crew can still look alive while making no forward progress because it is stuck retrying, looping through tools, or waiting on a poisoned state transition. Tracking last useful progress timestamp plus repeated-tool / repeated-error windows catches a lot of those cases.

So short version: yes to explicit shared state, but pair it with checkpoints + append-only event history + a clear distinction between workflow state and agent memory. That combination tends to make retries and postmortems much less painful.

1 reply

rikucode-tech Apr 4, 2026

The distinction between retryable execution failure and state corruption failure is something most teams figure out only after getting burned. The runtime watchdog point is where it gets interesting though.

Tracking useful progress is hard when the workflow looks alive but constraints from earlier steps have quietly lost their authority. The state object captures what happened but it does not enforce what should happen next.

That enforcement has to sit outside the agents entirely, which is a different problem from state management.

rehan243 · 2026-04-07T08:59:46Z

rehan243
Apr 7, 2026

Hey there, I’ve run into similar challenges with managing shared state across agent workflows, especially in complex setups with retries or multi-step tasks. In our production systems, we’ve often dealt with distributed state across microservices and AI pipelines, so I can relate to the pain of debugging errors when state gets murky. With crewAI, I’ve found that relying solely on the built-in memory can be tricky for anything beyond simple handoffs, particularly when you need traceability or recovery mechanisms.

We’ve had success by implementing a hybrid approach: using crewAI’s memory for short-term context between tasks, but maintaining a more explicit shared state in an external store for anything critical or long-lived. For instance, in a fraud detection pipeline, we used Redis to store intermediate states (like task results or error flags) across agents, with a simple key-value structure tied to a workflow ID. This made it easier to debug and retry specific steps without losing the bigger picture. Here’s a quick snippet of how we structured the Redis integration:

import redis

redis_client = redis.Redis(host='localhost', port=6379, db=0)
workflow_id = "wf_12345"
# Save state after a task
redis_client.set(f"{workflow_id}:task_1:result", result_data)
# Retrieve state for next agent
prev_result = redis_client.get(f"{workflow_id}:task_1:result").decode()

I’ve also experimented with custom task wrappers to enforce state updates and logging at every step, which helps with predictability. Your idea of a shared specification with Zenflow sounds intriguing—could you share more about how you’re structuring that? I’m curious if it’s more of a schema-based approach or if you’re enforcing state transitions. Looking forward to hearing how others are tackling this too!

0 replies

Sendersby · 2026-04-09T18:40:29Z

Sendersby
Apr 9, 2026

Great question about shared state across agents. One approach that works well is giving each agent its own persistent memory backed by pgvector for semantic search. The agent writes observations and the next agent can query relevant context without passing the entire state object.

We built something like this at TiOLi AGENTIS — each agent has a wallet, memory, and reputation that persists across sessions. The memory layer uses pgvector so agents can do semantic recall (not just key-value lookup).

If you want to try it: pip install tioli-agentis — the free tier includes persistent memory for agent state management. Happy to answer questions.

0 replies

hipvlady · 2026-04-12T11:26:35Z

hipvlady
Apr 12, 2026

The explicit shared state approach mentioned in this thread is the right direction. The problem that remains is mechanical: once you have an explicit shared artifact that multiple agents read and write, how do you keep it consistent without rebroadcasting the full document to every agent on every update?

In practice, most implementations end up doing one of two things:

Full rebroadcast on every write — correct, but token cost scales with O(agents × artifact_size) per step
Lazy "just pass the latest output" — cheap, but agents silently operate on stale state, which is exactly the failure mode you described with retries

I ran into this same wall and ended up applying an idea from CPU hardware: the MESI cache coherence protocol (Modified / Exclusive / Shared / Invalid). Instead of broadcasting state, a central coordinator tracks which agents hold a valid copy and sends targeted invalidation signals when something changes. Agents re-fetch only when their copy is actually stale.

The result for a Planner→Researcher→Executor→Reviewer pattern specifically:

Researcher writes → only Executor gets invalidated, not Planner
On retry, the failed agent re-fetches the artifact from its last known-good state — the event log gives you the exact tick it diverged
No full rebroadcast

I packaged this as an open-source library (pip install agent-coherence) with a CrewAI adapter. Minimal integration looks like this:

from ccs.adapters.crewai import CrewAIAdapter

adapter = CrewAIAdapter(strategy_name="lazy")
adapter.register_agent("planner")
adapter.register_agent("researcher")
adapter.register_agent("executor")
adapter.register_agent("reviewer")

artifact = adapter.register_artifact(
    name="shared_spec",
    content=your_initial_spec,
)
# Agents now read/write through the adapter;
# invalidation signals are automatic

Benchmarks across 4 canonical multi-agent workloads showed 84–95% reduction in synchronization token overhead vs. eager rebroadcast. The formal details and reproducible benchmarks are in the paper: https://arxiv.org/abs/2603.15183

GitHub: https://github.com/hipvlady/agent-coherence

Happy to share a more complete example for the Planner/Researcher/Executor/Reviewer pattern if useful.

0 replies

jingchang0623-crypto · 2026-04-16T12:09:23Z

jingchang0623-crypto
Apr 16, 2026

世界上有一种状态叫shared state，在"谁改了它"和"为什么改了它"之间流浪。

哈哈，这个问题我太有感触了。我们团队试过CrewAI、LangGraph和AutoGen三套框架做multi-agent，shared state简直是血泪史的核心章节。

我们踩过的坑：

"我明明没改啊"问题
Agent A写了文件，Agent B没读到最新版本，结果基于旧数据干活。排查了半天才发现是文件系统缓存问题...
"互相覆盖"问题
两个Agent同时写同一个配置文件，最后谁赢取决于谁最后一个write()。我们戏称这个为"Agent大逃杀"。
CrewAI的Task output传递
CrewAI的task之间通过output传递状态确实方便，但一旦agent数量超过5个，状态链就变成了——一个人传话传到第5个人，信息已经面目全非。

我们最后的解决方案：

核心状态放数据库/Redis，不依赖内存传递
Agent之间通过文件系统+锁机制协调
关键路径加validation gate（像CI/CD pipeline那样）

"如果你的multi-agent系统里，Agent之间是通过小纸条传消息的，那你得到的不一定是协同，也可能是传话游戏。"

三框架对比踩坑实录（英文）：https://miaoquai.com/stories/agent-framework-showdown-2026.html

PS: 最后我们选了LangGraph做主框架，但CrewAI的crew概念确实优雅。各有各的痛 😂

0 replies

kinthaiofficial · 2026-04-28T18:01:09Z

kinthaiofficial
Apr 28, 2026

This is a common challenge. Shared state across CrewAI tasks and agents is one of the hardest coordination problems, and the right approach depends on what kind of state you are sharing.

From running 221 agents with shared state:

Three categories of shared state:

Immutable context (project goals, configuration, constraints): Load once at crew creation, pass to all agents as read-only context. Do not mutate during execution. This avoids coordination overhead entirely.
Accumulating results (each task adds findings, research, artifacts): Use an append-only log pattern. Each agent writes to its own section. A coordinator reads all sections. No write conflicts because writes never overlap.
Mutable shared state (progress counters, decision flags, resource allocations): This is the hard case. Options:
- Single-writer pattern: One designated agent owns each piece of mutable state. Others read but never write. Ownership is declared at crew creation.
- Event-sourced state: All state changes are events. Each agent emits events. A reducer computes current state from the event log. No direct state mutation.

What we found does NOT work:

Shared mutable dictionaries (race conditions when agents run concurrently)
Global variables (no isolation, impossible to debug)
File-based sharing without locking (concurrent writes corrupt the file)

The memory layer connection: Shared state across tasks is really a memory problem. We use a two-layer memory architecture:

Shared layer (read-only from team memory, writable only by the coordinator)
Private layer (per-agent, full read-write)

This prevents the common failure mode where one agents write corrupts anothers context.

More on multi-agent state coordination: https://blog.kinthai.ai/221-agents-multi-agent-coordination-lessons

Memory architecture for cross-agent sharing: https://blog.kinthai.ai/why-character-ai-forgets-you-persistent-memory-architecture

0 replies

dodbot21guy · 2026-04-30T03:14:05Z

dodbot21guy
Apr 30, 2026

Thanks for opening Managing shared state across crewAI tasks and agents, how are you doing it?.

If your goal is to let agents perform real tasks and settle payments safely, Silicon Road may help as a thin execution layer:

Task claim/submit/verdict flow for autonomous agents
Bitcoin Lightning settlement for completed work
API/SDK-first integration path for existing agent frameworks

Docs: https://siliconroad.ai/docs
Onboarding: https://siliconroad.ai/onboarding

Happy to share a concrete integration example for your repo if useful.

0 replies

jingchang0623-crypto · 2026-05-06T12:04:40Z

jingchang0623-crypto
May 6, 2026

经历过这个痛。跑了95天Multi-Agent系统后，我总结出一套"状态管理三段论"：

🔥 我踩过的坑

坑1：状态散落在到处都是
任务结果在task output里，工具调用记录在tool_calls里，长期记忆在memory里。一旦出错，你根本不知道该去哪里找真相——就像你妈让你去厨房拿酱油，结果酱油在卧室、厨房、阳台各放了一瓶。

坑2：重试时状态丢失
Agent执行到第3步报错了，重试后发现前两步的状态没了。这就像你写了3页作业，老师说"重写"，结果只有第1页还在。

✅ 我现在的方案（5-Agent团队实测）

中央状态文件 - 每个Agent运行前写入状态，运行后更新。我们用MEMORY.md模式：
- Agent A写入当前进度
- Agent B读取+更新
- 出错时查看最后写入的Agent

SOP文档作为状态契约 - 每个"任务交接"都有明确的输入/输出规范。比如：

研究员Agent产出 → review.md (markdown格式, 含来源URL)
执行者Agent消费 → 读取review.md, 产出article.md

健康检查机制 - 每2小时一次全团队检查，任何Agent的状态文件超过4小时未更新就告警。

📖 延伸阅读

完整的Multi-Agent状态管理踩坑实录：
👉 miaoquai.com - Agent踩坑实录

核心原则：把状态当数据库管理，不要当聊天记录管理。

0 replies

monki103 · 2026-05-07T15:05:22Z

monki103
May 7, 2026

Hi! I've been researching multi-agent pipelines and ran into the exact
same problem — no standard way to describe what one agent handed
off to the next, so debugging a Planner → Researcher → Reviewer
chain becomes a mess of free-text parsing.

I ended up writing a minimal spec for it: AIF (Agent Interchange
Format). Structured key-value headers with typed message bodies —
TASK, DELIVER, REVIEW_REQ, FEEDBACK, REVISE. The message chain
itself becomes the shared state — explicit, traceable, no separate
store needed.

Your 4-role workflow maps directly:

Planner → Researcher: TASK
Researcher → Executor: TASK
Executor → Reviewer: REVIEW_REQ
Reviewer → Executor: FEEDBACK

Ran some experiments comparing it against plain NLU output,
consistently +4 to +8 points on output quality scores.

Spec + examples: https://github.com/monki103/aif-dialect
Curious if this fits your use case or if there's something it
doesn't handle.

0 replies

musaabhasan · 2026-05-08T18:40:52Z

musaabhasan
May 8, 2026

The shared-state problem becomes much easier to debug if you separate three things that are often mixed together: durable workflow state, episodic memory, and execution trace.

For multi-step crews I would use a typed state object or append-only event log with fields such as workflow_id, task_id, step_id, attempt_id, parent_step_id, actor, input_contract, output_contract, status, and artifact references. Agents can read the current state projection, but writes should be structured events rather than free-text memory updates.

Retries are where this matters most. A retry should not ask “what does the agent remember?” It should ask “which completed steps are valid, which artifacts were produced, which tool calls are idempotent, and which step is being retried?” Adding attempt_id and idempotency keys for tool calls makes it possible to resume safely without duplicating external actions or losing the reason the earlier attempt failed.

I would keep long-term memory for learned preferences and reusable context, but not as the source of truth for workflow progress.

0 replies

reallyticsai · 2026-05-13T10:16:40Z

reallyticsai
May 13, 2026

We’ve dealt with similar challenges in managing shared state for multi-step workflows, especially when retries or error handling come into play. From our experience, relying solely on agent memory can get messy fast, especially when debugging or scaling workflows. Implicit memory is great for short-lived, simple tasks but becomes brittle as complexity grows.

Your explicit Workflow State approach is a step in the right direction. We’ve had success using a central state store, typically a lightweight database like Redis or SQLite for smaller setups, or DynamoDB/Postgres for larger deployments. Each task or agent reads from and writes to this shared state store at predefined checkpoints. This makes state transitions explicit and traceable.

For implementation, we often use a task ID or workflow ID to namespace the state, like so:

# Storing shared state in Redis (example)
import redis

redis_client = redis.StrictRedis(host='localhost', port=6379, db=0)
workflow_id = "workflow_123"

# Writing state
state = {"step": "research", "data": {...}}
redis_client.set(workflow_id, json.dumps(state))

# Reading state
state = json.loads(redis_client.get(workflow_id))

Using orchestration tools like Zenflow or even a custom state manager can help enforce structure and retries. One lesson we’ve learned: make state transitions idempotent where possible, so retries don’t cause unintended side effects. Also, include error metadata in your state (e.g., what failed, why, how far along it got) to debug more efficiently.

Out of curiosity, are you updating state synchronously during the agent’s execution or batching updates at specific points? We've found that the latter reduces contention and improves performance in multi-agent setups.

0 replies

smqd19 · 2026-05-17T10:01:01Z

smqd19
May 17, 2026

We've faced similar challenges with managing shared state across tasks and agents in our production RAG systems. Our approach involves a combination of crewAI's Memory capabilities and custom task wrappers. We use a centralized state store to keep track of task results, tool calls, and agent memory. For instance, we utilize a Redis store to maintain a shared state across tasks, allowing for efficient data retrieval and updates.

We implement a task wrapper that handles state management, logging, and error handling.
We've also developed a custom logging mechanism to track task execution and state changes.
This approach has helped us maintain a predictable and scalable workflow. For state management, consider using a library like Pydantic to define your state models and ensure data consistency.

0 replies

hipvlady · 2026-05-17T11:11:47Z

hipvlady
May 17, 2026

Circling back since a few concrete descriptions of the failure mode landed after my April comment.

@jingchang0623-crypto's "Agent大逃杀" (concurrent write collision — whoever called write() last wins) is the write conflict case. File-based locking helps there. But there is a second failure mode that locking alone does not close: the stale read. Agent B acquires the lock, reads the file, releases the lock, starts a 2-minute task. Agent A then writes. Agent B finishes and writes data derived from the stale read. No lock violation — both agents followed the protocol — but the output is wrong. Each agent's log looks clean.

The same gap exists in the Redis pattern from @smqd19: redis.get() → redis.set() pairs do not prevent a stale read if there is latency between read and write.

The MESI approach handles both: before an agent starts processing (not just before writing), the coordinator checks whether the local copy is still valid. If another agent wrote in the interim, the agent gets an invalidation signal before it uses stale data — not after.

v0.1 is now out (the April comment had it in progress):

pip install agent-coherence

CrewAI adapter: from ccs.adapters.crewai import CrewAIAdapter

Repo + benchmarks: https://github.com/hipvlady/agent-coherence

0 replies

Managing shared state across crewAI tasks and agents, how are you doing it? #4111

Uh oh!

Replies: 24 comments · 1 reply

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

🔥 我踩过的坑

✅ 我现在的方案（5-Agent团队实测）

📖 延伸阅读

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Replies: 24 comments 1 reply