This guide walks you through adopting the template for your existing multi-agent system.
What this means in practice:
You have some "Foreman" / orchestrator that:
- Receives a user or business request.
- Breaks it into subtasks.
- Routes those to specialist agents.
- Collects their outputs and returns a result.
This can be:
- A custom Python or TypeScript script coordinating multiple LLM calls.
- A framework like LangGraph, AutoGen, CrewAI, or AWS Multi-Agent Orchestrator.
- A simple "router" agent plus a set of tool-like agents behind an API.
You do not need anything fancy. But you should be able to point to one place in your code and say: "This is where tasks get decomposed and delegated." If you can do that, you satisfy this prerequisite.
Not there yet? A good starting reference is the AWS Multi-Agent Orchestrator guide or the patterns in Agentic-AI-Systems on GitHub, which covers orchestration patterns across multiple frameworks.
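If it helps to picture that "one place", here is a minimal sketch of a Foreman loop. Everything in it is hypothetical and only for illustration: the `Subtask` type, the `decompose` placeholder (normally an LLM call), and the stub agents are assumptions, not part of the template.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types for illustration; your orchestrator will have its own.
Agent = Callable[[str], str]  # takes a task description, returns output text


@dataclass
class Subtask:
    department: str
    description: str


def decompose(request: str) -> list[Subtask]:
    """Placeholder decomposition; in a real system this is usually an LLM call."""
    return [
        Subtask("build", f"Implement: {request}"),
        Subtask("growth", f"Draft announcement for: {request}"),
    ]


def run_foreman(request: str, agents: dict[str, Agent]) -> dict[str, str]:
    """The single place where tasks are decomposed and delegated."""
    results: dict[str, str] = {}
    for subtask in decompose(request):
        agent = agents[subtask.department]  # route to the specialist
        results[subtask.department] = agent(subtask.description)
    return results
```

If your system has a function you could swap in for `run_foreman`, whatever its shape, you meet this prerequisite.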
What this means in practice:
You can edit:
- The system/role prompts for your Foreman and department heads — what they think their job is, what tools they have, what "done" looks like.
- The order of steps they run — e.g., inserting "generate → self-critique → revise → return" instead of "generate → return".
Concretely, you can change either a prompt file or a small piece of code that defines a workflow or state machine. You don't need to own the full agent runtime, just the parts that define your agents' behavior.
In this repo, that maps to:
- `config/prompts/*` for role prompts (system messages).
- `architecture/WORKFLOWS.md` and the example scripts for execution flow patterns.
Why it matters here: Tier 1 reflection is just one extra step in the execution flow (call critic prompt, handle the verdict). If you can add a step, you can wire Tier 1 in an afternoon.
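As a rough illustration of "adding a step", the sketch below models a workflow as an ordered list of functions over a shared state dict. The step names and the state layout are assumptions made for this example, and `self_critique` is a stand-in for a real critic call.

```python
from typing import Callable

# A step takes the working state and returns the updated state.
Step = Callable[[dict], dict]


def generate(state: dict) -> dict:
    return {**state, "draft": f"draft for {state['task']}"}


def self_critique(state: dict) -> dict:
    # Stand-in for a critic LLM call; this stub always asks for one revision.
    return {**state, "verdict": "revise"}


def revise(state: dict) -> dict:
    if state.get("verdict") == "revise":
        return {**state, "draft": state["draft"] + " (revised)"}
    return state


def run_workflow(steps: list[Step], task: str) -> dict:
    state: dict = {"task": task}
    for step in steps:
        state = step(state)
    return state


# Before: generate -> return
baseline = [generate]
# After: generate -> self-critique -> revise -> return
with_reflection = [generate, self_critique, revise]
```

The point is only that the change from `baseline` to `with_reflection` is a one-line edit. If your framework expresses flows as a graph or state machine, the equivalent is adding one node and one edge.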
What this means in practice:
You can already send a prompt to a model and get back text:
- Via API: OpenAI, Anthropic, Google Gemini, Mistral, etc.
- Via local runtime: Ollama, vLLM, LM Studio, etc.
The minimum capability is a single call of the form `response = llm_client.complete(system_prompt, user_prompt)` that returns a string.
You do not need streaming, tool use, or function calling for Tiers 1 and 2. Structured JSON responses (either native JSON mode or parsed from text) are useful for the evaluation and reflection prompts.
In the examples, every Python stub assumes a simple `call_llm(prompt) -> dict` abstraction. Replace it with whatever client you use; the pattern doesn't change.
Starter references: OpenAI Python client and Anthropic Python SDK both have minimal working examples in their READMEs.
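One possible way to get from a string-returning client to the `call_llm(prompt) -> dict` shape is sketched below. It assumes your client can be wrapped as a `complete(system_prompt, user_prompt) -> str` function; `make_call_llm` and `parse_json_reply` are hypothetical helper names, not part of this repo or any SDK.

```python
import json
import re
from typing import Callable

# Assumption: your client can be exposed as (system_prompt, user_prompt) -> str.
CompleteFn = Callable[[str, str], str]


def parse_json_reply(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' in a model reply.

    This tolerates surrounding prose or code fences, but not replies that
    contain several separate JSON objects.
    """
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    return json.loads(match.group(0))


def make_call_llm(complete: CompleteFn, system_prompt: str) -> Callable[[str], dict]:
    """Build a call_llm(prompt) -> dict function around a text-completion client."""

    def call_llm(prompt: str) -> dict:
        return parse_json_reply(complete(system_prompt, prompt))

    return call_llm
```

If your provider offers a native JSON mode, you can likely drop the regex and pass the reply straight to `json.loads`.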
Already have a Foreman + department heads running? You probably satisfy all three prerequisites. Your first move is just:
- Map your agents into `AGENTS.md` (the "How to Map Your Current Setup" table).
- Wire the Tier 1 reflection loop (critic + revise) for one department. Growth is a good starting point; see `examples/growth_agent/` for a runnable demo.
- Start logging episodes using `self_improvement/evaluation/episode_schema.yaml`.
Once those three are in place, the evaluation and policy-update layers become "just configuration and prompts", not a redesign.
If you've got a Foreman (or equivalent orchestrator) and department-head agents already executing tasks, you don't need to rebuild anything. Focus on wiring in two things first:
- Pick your highest-volume department (the one producing the most output per day).
- In that agent's execution flow, insert one additional LLM call after it produces output but before it emits a completion signal. Use `self_improvement/reflection_loops/critic_prompt.md` as the prompt template.
- Parse the critic's JSON response. If `verdict` is `"pass"`, proceed as normal. If `"revise"`, let the agent fix the issues once. If `"escalate"`, signal `needs-review`.
- That's it. You now have a self-checking agent. Expand to other departments once you see the pattern working.
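The steps above can be sketched roughly as follows. This is not the repo's `reflection_loop.py`; it is a minimal illustration in which `agent`, `critic`, and `reviser` are hypothetical callables you would back with your own LLM calls. Re-running the critic after the single revision is also an assumption; you may prefer to emit `done` directly.

```python
from typing import Callable

Agent = Callable[[str], str]                # task -> output
Critic = Callable[[str, str], dict]         # (task, output) -> {"verdict": ...}
Reviser = Callable[[str, str, dict], str]   # (task, output, review) -> new output


def run_with_reflection(task: str, agent: Agent, critic: Critic,
                        reviser: Reviser) -> dict:
    """Run the agent once, critique the output, and allow at most one revision."""
    output = agent(task)
    review = critic(task, output)
    verdict = review.get("verdict")

    if verdict == "pass":
        signal = "done"
    elif verdict == "revise":
        output = reviser(task, output, review)  # fix the issues once
        recheck = critic(task, output)          # assumption: re-check the revision
        signal = "done" if recheck.get("verdict") == "pass" else "needs-review"
    else:
        # "escalate", or any verdict this sketch does not recognize
        signal = "needs-review"

    # "verdict" is the first critic verdict, which is the one worth logging.
    return {"output": output, "signal": signal, "verdict": verdict}
```

Treating unknown verdicts like `escalate` is a deliberate choice here: a malformed critic reply ends up in front of a human rather than being silently accepted.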
- After each task completes (regardless of signal), append a JSON object to a `.jsonl` file. One file per department per day works well.
- Start with the minimum fields: `episode_id`, `task_id`, `department`, `agent_id`, `task_description`, `result.signal`, and the reflection verdict if you've wired it.
- You don't need a database. Flat files are fine for the first 1000+ episodes.
- Once logs are flowing, you have the data foundation for Tier 2 evaluation and Tier 3 policy adaptation.
See docs/logging_and_episodes.md for the full schema and progressive detail levels.
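A flat-file logger along these lines can be very small. The sketch below makes several assumptions: the directory layout (`<log_dir>/<department>/<date>.jsonl`), a nested `result` object holding `signal`, and the `reflection_verdict` key name are guesses for illustration, so check them against the schema before relying on them.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def log_episode(log_dir: Path, *, task_id: str, department: str, agent_id: str,
                task_description: str, signal: str,
                reflection_verdict: Optional[str] = None) -> Path:
    """Append one episode as a JSON line; one file per department per UTC day."""
    episode = {
        "episode_id": str(uuid.uuid4()),
        "task_id": task_id,
        "department": department,
        "agent_id": agent_id,
        "task_description": task_description,
        "result": {"signal": signal},  # assumed nesting for result.signal
    }
    if reflection_verdict is not None:
        # Hypothetical key name; use whatever the schema specifies.
        episode["reflection_verdict"] = reflection_verdict

    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    path = log_dir / department / f"{today}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(episode) + "\n")
    return path
```

Call it once per task, right after the completion signal is known, for example `log_episode(Path("logs"), task_id="t-1", department="growth", agent_id="growth-head", task_description="...", signal="done", reflection_verdict="pass")`.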
- Tier 2 (evaluation rubrics): skip until you have 20+ logged episodes per department.
- Tier 3 (meta-agent policy updates): skip until Tier 2 is calibrated. See `docs/optional_paths.md` for the full decision tree.
- Prompt versioning: useful but not urgent. Add YAML front matter to your prompts when you're ready to track changes.
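To know when a department has crossed the 20-episode mark, a quick count over your log files is enough. This sketch assumes every `.jsonl` line is an episode object with a `department` field; the function names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path

TIER2_MIN_EPISODES = 20  # threshold taken from the guidance above


def episodes_per_department(log_dir: Path) -> Counter:
    """Count logged episodes per department across all .jsonl files under log_dir."""
    counts: Counter = Counter()
    for path in log_dir.rglob("*.jsonl"):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():  # skip blank lines
                    counts[json.loads(line)["department"]] += 1
    return counts


def tier2_ready(log_dir: Path) -> dict[str, bool]:
    """Map each department seen in the logs to whether it has enough episodes."""
    counts = episodes_per_department(log_dir)
    return {dept: n >= TIER2_MIN_EPISODES for dept, n in counts.items()}
```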
Open AGENTS.md and fill in the "How to Map Your Current Setup" table with your actual agent names and IDs. This makes the repo a live reference for your system, not just a template.
The steps below are for adopting the full template systematically. If you've already done the quick-start above, skip to Step 3.
Open AGENTS.md and map your existing agents onto the template:
- Identify your Foreman: whatever orchestrates your top-level task decomposition.
- List your departments: add or remove rows from the Department Heads table to match your actual team.
- Document sub-agents: for each department, note what focused tasks sub-agents handle.
- Define permissions: update `config/policies/spawn_rules.yaml` with your actual spawning rules.
If you have departments not in the template (e.g., "legal review", "data pipeline"), add them. The pattern is the same.
Open GOALS.md and fill in every bold placeholder:
- Business KPIs — what are you actually optimizing for?
- Department goals — what does "good output" look like per department?
- Quality targets — set initial targets. You'll calibrate these later.
- Self-improvement milestones — what does "better" mean for your system?
Tip: Start with conservative targets. You can tighten them once you have baseline data.
This is the highest-ROI first step. For each agent:
- Copy the critic prompt from `self_improvement/reflection_loops/critic_prompt.md`.
- Customize it for the agent's domain (use the department-specific additions as a starting point).
- Wire it in: after the agent produces output but before it emits a completion signal, insert a call to the critic prompt. See `self_improvement/reflection_loops/reflection_loop.py` for the pattern.
- Handle the verdict:
  - `pass` → emit `done`
  - `revise` → revise once, then emit `done` or `needs-review`
  - `escalate` → emit `needs-review`
Start with 1-2 departments (we recommend Build and Growth). Expand once you see the pattern working.
Before you can evaluate quality trends, you need data.
- Define your log storage: `.jsonl` files work fine to start. One file per department per day.
- Implement the logger: see `docs/logging_and_episodes.md` for the schema and `self_improvement/evaluation/episode_schema.yaml` for the full field list.
- Log at minimum: `task_id`, `department`, `agent_id`, `task_description`, result signal, reflection verdict.
- Verify by running a few tasks and checking the log files.
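The verification step can be automated with a small checker like the one below. It is a sketch under assumptions: the required keys are taken from the minimum list above, with the result signal assumed to live under a top-level `result` key. Adjust `REQUIRED_FIELDS` to match your actual schema.

```python
import json
from pathlib import Path

# Assumed top-level keys, based on the minimum fields listed above.
REQUIRED_FIELDS = ("task_id", "department", "agent_id", "task_description", "result")


def check_log_file(path: Path) -> list[str]:
    """Return one human-readable problem per bad line; an empty list means OK."""
    problems: list[str] = []
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                episode = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"{path.name}:{lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(episode, dict):
                problems.append(f"{path.name}:{lineno}: expected a JSON object")
                continue
            missing = [field for field in REQUIRED_FIELDS if field not in episode]
            if missing:
                problems.append(f"{path.name}:{lineno}: missing {', '.join(missing)}")
    return problems
```

Running this over a day's files after a handful of tasks is usually enough to catch a logger that writes partial records.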
Once you have 20+ logged episodes per department:
- Review the rubrics in `self_improvement/evaluation/rubrics.yaml`. Adjust criteria and weights for your departments.
- Set up the evaluator: after episodes are logged, run the evaluator prompt from `self_improvement/evaluation/evaluator_prompt.md` to score them.
- Calibrate: have humans score 20-30 episodes and compare with the automated evaluator. Adjust until correlation is acceptable (> 0.7).
- Track trends: compute rolling averages per department. Set up alerts for declining quality.
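For the calibration and trend steps, the arithmetic is simple enough to sketch with the standard library. Two assumptions are baked in: "correlation" is read as the Pearson coefficient (the guide does not say which measure to use), and "declining" is defined as the latest rolling mean falling more than a chosen amount below its peak. Both are illustrative choices, not the template's definitions.

```python
from collections import deque
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between paired human and automated scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / sqrt(var_x * var_y)


def rolling_mean(scores: Sequence[float], window: int) -> list[float]:
    """Mean of the last `window` scores at each position (shorter at the start)."""
    recent: deque = deque(maxlen=window)
    means = []
    for score in scores:
        recent.append(score)
        means.append(sum(recent) / len(recent))
    return means


def is_declining(scores: Sequence[float], window: int, drop: float) -> bool:
    """True if the latest rolling mean is more than `drop` below the peak one."""
    means = rolling_mean(scores, window)
    return bool(means) and max(means) - means[-1] > drop
```

With 20-30 human-scored episodes, `pearson(human_scores, evaluator_scores) > 0.7` is the check described above; `is_declining` can drive a per-department alert once you pick a window and a drop threshold that suit your score scale.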
See docs/optional_paths.md for detailed prerequisites. In short:
- Don't enable this until Tier 2 is stable.
- Ensure you have a human review process for proposed changes.
- Start with low-risk changes only (prompt clarifications, checklist additions).
| Priority | File | Action |
|---|---|---|
| 1 | `GOALS.md` | Fill in all bold placeholders |
| 2 | `AGENTS.md` | Add/remove departments to match your team |
| 3 | `config/policies/spawn_rules.yaml` | Set your actual spawn permissions |
| 4 | `config/policies/escalation_rules.yaml` | Set your escalation thresholds |
| 5 | `config/prompts/foreman.md` | Customize for your Foreman's actual behavior |
| 6 | `config/prompts/department_head.md` | One copy per department, customized |
| 7 | `self_improvement/reflection_loops/critic_prompt.md` | Customize critic for your first department |
Q: Do I need to use Python?
No. The .py files are stubs illustrating patterns. Implement in whatever language your agent system uses.
Q: Can I skip Tier 1 and go straight to Tier 2?
You can, but Tier 1 is nearly free (one extra LLM call) and catches obvious issues. Start there.
Q: How many episodes do I need before Tier 3 is useful?
At least 50 per department with Tier 2 scores. Fewer than that and the meta-agent won't have enough signal.
Q: Can I use this with LangChain / CrewAI / AutoGen / etc.?
Yes. This template defines patterns and configs, not a runtime. Map the concepts onto your framework's abstractions.