The definitive OpenAI, Anthropic, Google, MCP, Harness, Evals, and Production Agent Systems learning roadmap.
If this repository helps you, consider giving it a ⭐
The AI industry has entered the Agentic Era. Building production-grade AI systems now requires mastering agents, tool use, MCP, memory, long-running workflows, coding agents, agent harnesses, evals, and safety — but the knowledge is scattered across OpenAI blogs, Anthropic engineering posts, SDK docs, cookbooks, and research papers.
This repository consolidates 174 curated resources into one structured learning roadmap.
The goal: Become a world-class Agentic Engineer.
Pick the path that matches your starting point:
- New to agents: follow the Learning Roadmap from Phase 0 to Phase 6. Treat each
Read First,Then Read, andBuild Exerciseas a checklist. - Already building LLM apps: start at Phase 2 or Phase 3, then fill gaps in agent loop, tool calling, evals, and production engineering.
- Trying to build projects: use the phase-level
Build Exerciseprompts, then branch into Applied Practice Tracks for coding agents, security, code review, or SRE. - Looking for references: jump to the Full Reading Table. Read
P0first, useP1for implementation detail, and keepP2as optional background.
If you treat Claude Code as a coding CLI, many capabilities can feel like magic: it reads files, runs commands, edits code, delegates work, and stays oriented during complex tasks.
From an engineering perspective, the core is much simpler:
model + tools + one loop.
Understanding that loop makes the rest of the system easier to reason about:
- When the agent should plan first, and when it should act immediately
- Why an explicit todo list reduces drift in longer tasks
- Why subagents improve exploration while protecting the main context
- How skills, MCP, and hooks each add capability around the same core loop
These pages are based on the upstream English Markdown tutorials from shareAI-lab/mini-claude-code, with added Study Notes and inline source code for this handbook.
Supporting files are included in the same folder: requirements.txt, .env.example, v0_bash_agent_mini.py, and skills/.
Build shared vocabulary for workflow vs agent, tool loop, handoff, guardrails.
Should I build an agent? (4-question checklist from Barry Zhang's talk)
| Question | If No → Workflow | If Yes → Agent |
|---|---|---|
| Is the task complex enough? | Decision tree is fully mappable | Ambiguous problem space |
| Is the task valuable enough? | <$0.10 per run | >$1 per run, cost doesn't matter |
| Are all core capabilities doable? | Weak links break the chain | Model handles every step well |
| Is error cost low & detectable? | High cost + hard to detect → human-in-the-loop | Errors caught by tests/CI |
Think like the agent. Most failures come from designing with a human perspective. Put yourself inside the agent's context window: you only see ~10K–20K tokens (system prompt + tool descriptions + recent observations). Ask: does the agent have enough information to act correctly at each step?
→ Source: How We Build Effective Agents
| # | Title | Vendor |
|---|---|---|
| 1 | System Prompts | Anthropic |
| 2 | Prompt guidance | OpenAI |
| 3 | Function Calling | OpenAI |
| 4 | Tool use overview | Anthropic |
| 5 | Function calling - Gemini API | |
| 6 | Building effective agents | Anthropic |
| 7 | New tools for building agents | OpenAI |
| 8 | Agents SDK overview | OpenAI |
| Title | Vendor |
|---|---|
| How We Build Effective Agents: Barry Zhang, Anthropic | Anthropic |
| Phistory — Claude Code & Codex CLI System Prompt Diff History | Community |
| Coding Agents 101: The Art of Actually Getting Things Done | Cognition |
| OpenAI Agents SDK examples | OpenAI |
| Structured Outputs for Multi-Agent Systems | OpenAI |
Build a customer service/ticket triage agent: router → specialist → evaluator, with all outputs constrained by structured schemas.
Understand MCP server/client, remote vs local, tool loading, approval, connector boundaries.
| # | Title | Vendor |
|---|---|---|
| 1 | Introducing the Model Context Protocol | Anthropic |
| 2 | MCP and Connectors | OpenAI |
| 3 | Building MCP servers for ChatGPT Apps and API integrations | OpenAI |
| Title | Vendor |
|---|---|
| Code execution with MCP: Building more efficient agents | Anthropic |
| Writing effective tools for AI agents - with AI agents | Anthropic |
| Model Context Protocol - Codex | OpenAI |
| Build a Remote MCP server | Cloudflare |
| Introducing the MCP Registry | MCP |
| OpenAI Docs MCP | OpenAI |
| Build your ChatGPT UI | OpenAI |
Build a read-only repo/docs MCP server, then create an eval to verify the agent correctly cites documentation.
Learn to control context window, short/long-term memory, skills/plugins, CLAUDE.md/AGENTS.md.
| # | Title | Vendor |
|---|---|---|
| 1 | Agent Skills Specification | Agent Skills |
| 2 | Effective context engineering for AI agents | Anthropic |
| 3 | How the Open Knowledge Format can improve data sharing | Google Cloud |
| 4 | How Long Contexts Fail | Drew Breunig |
| 5 | Context Rot | Chroma |
| 6 | Progressive disclosure | Claude-Mem |
| 7 | Equipping agents for the real world with Agent Skills | Anthropic |
| 8 | Agent Skills | Anthropic |
| 9 | Skills | OpenAI |
| 10 | Building Reliable Agents with Memory and Compaction | OpenAI |
| Title | Vendor |
|---|---|
| Custom instructions with AGENTS.md - Codex | OpenAI |
| Best practices for Claude Code | Anthropic |
| Agent Skills - Codex | OpenAI |
| Skills in OpenAI API | OpenAI |
Implement the same task as a Skill/Plugin, then measure accuracy and token cost across three variants: no skill, long prompt, and skill-based.
Master agent runtime: event stream, thread, tool execution, state, sandbox, approval, recovery.
| # | Title | Vendor |
|---|---|---|
| 1 | Unrolling the Codex agent loop | OpenAI |
| 2 | Unlocking the Codex harness: how we built the App Server | OpenAI |
| 3 | Agent Harness Engineering: A Survey | Academic |
| 4 | Effective harnesses for long-running agents | Anthropic |
| 5 | Deep Agents | LangChain |
| Title | Vendor |
|---|---|
| Deep research | OpenAI |
| Open Deep Research | LangChain |
| The next evolution of the Agents SDK | OpenAI |
| Using PLANS.md for multi-hour problem solving | OpenAI |
| Build long-running AI agents that pause, resume, and never lose context with ADK | |
| Harness design for long-running application development | Anthropic |
| Scaling Managed Agents: Decoupling the brain from the hands | Anthropic |
Build a mini coding harness: plan file, shell tool, apply patch, test gate, event log, and resume capability.
Compare Codex vs Claude Code product/SDK forms; learn multi-agent, IDE, workspace collaboration.
| # | Title | Vendor |
|---|---|---|
| 1 | Introducing Codex | OpenAI |
| 2 | Best practices for Claude Code | Anthropic |
| 3 | How Claude Code works in large codebases | Anthropic |
| 4 | Enabling Claude Code to work more autonomously | Anthropic |
| Title | Vendor |
|---|---|
| Introducing the Codex app | OpenAI |
| Introducing workspace agents in ChatGPT | OpenAI |
| Apple's Xcode now supports Claude Agent SDK | Anthropic |
| Building Consistent Workflows with Codex CLI & Agents SDK | OpenAI |
| Best practices for Claude Code | Anthropic |
| The spec is dead, long live the spec! | Ravi on Product |
| How Anthropic teams use Claude Code | Anthropic |
| Multi-stack Web App Builds | Community |
Run both OpenAI/Codex and Claude Code style workflows on the same repo: issue → plan → patch → tests → PR summary.
Build pre/post-launch eval loop, trace loop, safety boundaries, permissions, regression monitoring.
| # | Title | Vendor |
|---|---|---|
| 1 | Demystifying evals for AI agents | Anthropic |
| 2 | The six generations of AI agents and how to eval them | Braintrust |
| 3 | Agent observability powers agent evaluation | LangChain |
| 4 | Agent Evaluation Readiness Checklist | LangChain |
| 5 | Build an Agent Improvement Loop with Traces, Evals, and Codex | OpenAI |
| 6 | Macro Evals for Agentic Systems | OpenAI |
| 7 | Testing Agent Skills Systematically with Evals | OpenAI |
| Title | Vendor |
|---|---|
| How we build evals for Deep Agents | LangChain |
| Deep Research Bench | FutureSearch |
| How to Evaluate Tool-Calling Agents | Arize |
| AI agent evaluation: How to test, debug, and improve agents in production | Arize |
| A Survey on Agent-as-a-Judge | Academic |
| Running Codex safely at OpenAI | OpenAI |
| How we contain Claude across products | Anthropic |
| Evals API Use-case - MCP Evaluation | OpenAI |
| Measuring AI agent autonomy in practice | Anthropic |
Build a smoke/macro eval suite for your agent: task success rate, tool misuse, prompt injection resistance, latency, cost, and human approval count.
Use these tracks after the core roadmap when you want to practice agentic engineering in real engineering workflows.
| Track | Start Here | Why It Matters |
|---|---|---|
| Agentic coding workflow | Coding Agents 101, How Claude Code works in large codebases, How Anthropic teams use Claude Code | Turns agent theory into day-to-day engineering habits: prompting, checkpoints, verification, parallel work, and team rollout. |
| Spec-driven building | The spec is dead, long live the spec!, Multi-stack Web App Builds | Treats specs, prompts, and assignments as executable source material for agents. |
| Context failure modes | How Long Contexts Fail, Context Rot, Progressive disclosure | Helps diagnose context poisoning, distraction, confusion, context degradation, and retrieval overload. |
| Evals and observability | Demystifying evals for AI agents, Agent observability powers agent evaluation, Agent Evaluation Readiness Checklist | Builds the feedback loop for traces, datasets, graders, offline/online evals, and regression gates. |
| Deep research agents | Deep research, Open Deep Research, Alibaba-NLP/DeepResearch | Practices long-running research agents: planning, search, MCP, citations, report synthesis, and benchmark-driven improvement. |
| MCP operations | Build a Remote MCP server, Introducing the MCP Registry | Shows how MCP moves from local prototypes to authenticated, discoverable, production-grade tool ecosystems. |
| Agent security | OWASP Top Ten, SAST vs. DAST vs. RASP, Copilot Remote Code Execution via Prompt Injection | Grounds agent security in classic AppSec plus new prompt-injection and tool-permission failure modes. |
| Code review systems | How to Review Code Effectively, AI-Assisted Assessment of Coding Practices in Modern Code Review, AI Code Review Implementation Best Practices | Connects human review quality with AI-assisted review, automated comments, and review policy design. |
| Production and SRE agents | ML and LLM system design, Introduction to Site Reliability Engineering, Observability Basics You Should Know | Extends agents beyond coding into incidents, observability, root-cause analysis, on-call, and production operations. |
Priority guide: P0 = must-read (architectural/conceptual), P1 = highly useful (implementation detail), P2 = optional context (background/releases).
| Priority | Title | Vendor | Topic | Key Idea | Date |
|---|---|---|---|---|---|
| P0 | OpenAI for Developers in 2025 | OpenAI | Agents; MCP; Platform | Annual overview: systematic walkthrough of Responses API, Agents SDK, AgentKit, Codex, MCP, Apps SDK, and AGENTS.md. | 2025-12-30 |
| P0 | New tools for building agents | OpenAI | Agents; Responses API; Tools | Key starting point for OpenAI's agent platform: Responses API, built-in web/file/computer tools, Agents SDK, tracing/observability. | 2025-03-11 |
| P0 | Introducing AgentKit | OpenAI | Agents; Evals; AgentKit | AgentKit, expanded evals, agent RFT: the official agent toolchain from prototype to production. | 2025-10-06 |
| P0 | Prompt guidance | OpenAI | Prompting; Models; Agent UX | Official model-specific prompting guidance for outcome-first prompts, reasoning effort, preambles, and validation rules in tool-heavy workflows. | Current docs |
| P0 | System Prompts | Anthropic | System prompts; Claude; Behavior | Claude web/mobile system prompt release notes; useful for studying production prompting patterns and behavioral scaffolding. | Current docs |
| P0 | Agents SDK overview | OpenAI | Agents; SDK | Official SDK entry point: concepts and boundaries of agent, tool, handoff, guardrail, and tracing. | Current docs |
| P0 | Introducing the Model Context Protocol | Anthropic | MCP; Standards | The origin article for MCP: an open standard connecting AI assistants to data, tools, and systems. | 2024-11-25 |
| P0 | Building effective agents | Anthropic | Agents; Patterns; Frameworks | Essential agent primer: workflow vs agent, prompt/tool/retrieval, orchestrator-worker, evaluator-optimizer patterns. | 2024-12-19 |
| P0 | Coding Agents 101: The Art of Actually Getting Things Done | Cognition | Coding agents; Workflows; Practice | Product-agnostic guide to prompting, delegation, verification, environment setup, security, and cost management for coding agents. | 2025-06 |
| P0 | New tools and features in the Responses API | OpenAI | MCP; Responses API; Tools | Responses API extended to remote MCP servers, image/code/file tools; see how OpenAI integrates MCP into its runtime. | 2025-05-21 |
| P0 | MCP and Connectors | OpenAI | MCP; Connectors; Responses API | Official guide to connecting remote MCP servers and connectors; includes approvals and security considerations. | Current docs |
| P0 | Building MCP servers for ChatGPT Apps and API integrations | OpenAI | MCP; ChatGPT Apps; API | Official guide to writing MCP servers: supply tools/knowledge to ChatGPT Apps, deep research, and API integrations. | Current docs |
| P0 | Deep research | OpenAI | Deep research; MCP; API | Official guide to deep research models, including web search, file search, remote MCP servers, code interpreter, and security risks. | Current docs |
| P0 | Building a Deep Research MCP Server | OpenAI | MCP; Deep research | Minimal implementation of a search/fetch MCP server for Deep Research. | 2025-06-25 |
| P0 | Model Context Protocol - Codex | OpenAI | MCP; Codex | How Codex CLI/IDE connects to MCP servers, adding Figma, browser, docs, and internal tool context to agents. | Current docs |
| P0 | Introducing Codex | OpenAI | Agents; Coding; Sandbox | Cloud-based software engineering agent: parallel tasks, repo sandbox, running tests/linters/type checkers, producing auditable evidence. | 2025-05-16 |
| P0 | Agent Harness Engineering: A Survey | Academic | Harness; Taxonomy; Agent architecture | Survey that frames harness engineering as its own system layer and introduces the ETCLOVG taxonomy: Execution, Tooling, Context, Lifecycle, Observability, Verification, and Governance. | 2026 |
| P0 | Unrolling the Codex agent loop | OpenAI | Harness; Agent loop; Codex | How Codex CLI chains prompt, tool schema, MCP tools, Responses API, and context management into an agent loop. | 2026-01-23 |
| P0 | Unlocking the Codex harness: how we built the App Server | OpenAI | Harness; Codex App Server; JSON-RPC | Core harness article: Codex core, App Server, JSON-RPC, streaming progress, approval, diff, and thread management. | 2026-02-04 |
| P0 | From model to agent: Equipping the Responses API with a computer environment | OpenAI | Harness; Responses API; Sandbox | Responses API + shell tool + hosted containers form the agent runtime; essential for understanding the model-to-agent execution environment. | 2026-03-10 |
| P0 | Harness engineering: leveraging Codex in an agent-first world | OpenAI | Harness; Agent-first engineering | Design product code, tests, CI, docs, and observability to be agent-readable/executable; learn agent-first repo organization. | 2026-02-11 |
| P0 | The next evolution of the Agents SDK | OpenAI | Harness; Agents SDK; MCP; Skills | Agents SDK harness becomes more complete: memory, sandbox orchestration, Codex-like filesystem tools, MCP, skills, AGENTS.md. | 2026-04-15 |
| P0 | Building Consistent Workflows with Codex CLI & Agents SDK | OpenAI | MCP; Codex; Agents SDK | Codex CLI as an MCP server integrated with Agents SDK; real multi-agent dev workflow. | 2025-10-01 |
| P0 | Building Reliable Agents with Memory and Compaction | OpenAI | Memory; Compaction; Reliability | Memory and compaction design for long-context/multi-turn agents. | 2026-05-01 |
| P0 | Build an Agent Improvement Loop with Traces, Evals, and Codex | OpenAI | Evals; Traces; Self-improvement | Connect traces, evals, and Codex fixes into an agent improvement loop. | 2026-05-12 |
| P0 | Eval Driven System Design - From Prototype to Production | OpenAI | Evals; Production | Use evals as the driving force for system design; ideal for moving agents from demo to production. | 2025-06-02 |
| P0 | Testing Agent Skills Systematically with Evals | OpenAI | Evals; Skills; Agents | Systematically test agent skills with evals; establish quality gates before skill release. | 2026-01-22 |
| P0 | Evals API Use-case - MCP Evaluation | OpenAI | MCP; Evals | Evaluate QA/retrieval capabilities with MCP tools; ideal for building an MCP regression suite. | 2025-06-09 |
| P0 | The six generations of AI agents and how to eval them | Braintrust | Evals; Agent architecture; Harness | Maps six generations of agent architecture to the eval strategy each generation requires, from prompts to AI harnesses. | 2026-05-21 |
| P0 | Agent observability powers agent evaluation | LangChain | Evals; Observability; Traces | Explains why traces are the source of truth for agent behavior and how observability feeds evaluation. | 2026-01-27 |
| P0 | Running Codex safely at OpenAI | OpenAI | Safety; Sandbox; Codex | How OpenAI runs Codex internally: sandbox, approvals, network policy, agent-native telemetry. | 2026-05-20 |
| P0 | Building Governed AI Agents - A Practical Guide to Agentic Scaffolding | OpenAI | Governance; Guardrails; Agents | Governed agent scaffolding: permissions, guardrails, auditing, and organizational policies. | 2026-02-23 |
| P0 | Macro Evals for Agentic Systems | OpenAI | Evals; Agentic systems | Evaluate agents at the end-to-end/macro level, not just individual step outputs. | 2026-05-19 |
| P0 | Best practices for Claude Code | Anthropic | Coding agents; Claude Code | Claude Code methodology: verification loop, explore-plan-code, CLAUDE.md, permissions, MCP, subagents, context management. | 2025-04-18 |
| P0 | Best practices for Claude Code | Anthropic | Claude Code; Coding agents; Workflow | Official Claude Code docs for planning, CLAUDE.md, verification, tool use, and team workflows. | Current docs |
| P0 | How Claude Code works in large codebases | Anthropic | Claude Code; Large codebases; Enterprise | Patterns for large-codebase Claude Code adoption: layered CLAUDE.md, hooks, skills, plugins, MCP, LSP, subagents, and rollout ownership. | 2026-05-14 |
| P0 | How we built our multi-agent research system | Anthropic | Agents; Multi-agent; Research | Claude Research multi-agent architecture: planner + parallel research agents + synthesis; production multi-agent experience. | 2025-06-13 |
| P0 | Writing effective tools for AI agents - with AI agents | Anthropic | Tools; MCP; Evals | Tool quality determines agent quality: tool descriptions, context budget, eval, and letting Claude optimize its own tools. | 2025-09-11 |
| P0 | Effective context engineering for AI agents | Anthropic | Context; Agents | Context is the agent's core resource: selection, compression, isolation, persistence, and context pollution control. | 2025-09-29 |
| P0 | How Long Contexts Fail | Drew Breunig | Context; Long context; Agents | Taxonomy of context poisoning, distraction, confusion, and clash; practical fixes for overloaded agent contexts. | 2025-06-22 |
| P0 | Context Rot | Chroma | Context; Long-context evals | Research on how LLM performance degrades as input grows, especially with distractors and similar-but-wrong context. | 2025-07-16 |
| P0 | Enabling Claude Code to work more autonomously | Anthropic | Claude Code; Agent SDK; Subagents | Claude Agent SDK, subagents, hooks, background tasks, checkpoints, and other autonomous coding agent capabilities. | 2025-09-29 |
| P0 | Equipping agents for the real world with Agent Skills | Anthropic | Skills; Agents | Agent Skills as modular capability packages: instructions, resources, scripts — reducing context burden and improving reliability. | 2025-10-16 |
| P0 | Agent Skills | Anthropic | Skills; Claude; Progressive disclosure | Official Claude Agent Skills docs: modular instructions, metadata, scripts, resources, and on-demand loading across Claude products. | Current docs |
| P0 | Skills | OpenAI | Skills; API; Shell environments | Official OpenAI API guide for uploading, managing, and attaching reusable Skills to hosted and local shell environments. | Current docs |
| P0 | Agent Skills Specification | Agent Skills | Skills; Specification; Progressive disclosure | Complete skill package format: SKILL.md frontmatter, optional scripts/references/assets, file references, and validation. | Current docs |
| P0 | Code execution with MCP: Building more efficient agents | Anthropic | MCP; Code execution; Context | Key article on MCP scale challenges: reduce token overhead with code execution/on-demand tools; learn progressive disclosure. | 2025-11-04 |
| P0 | Introducing advanced tool use on Claude Developer Platform | Anthropic | Tools; MCP; Advanced tool use | Tool search, deferred loading, programmatic tool calling; solving context pollution from large numbers of MCP tools. | 2025-11-24 |
| P0 | Effective harnesses for long-running agents | Anthropic | Harness; Long-running agents | Essential harness reading: working across multiple context windows, task logging, external state, agent self-recovery. | 2025-11-26 |
| P0 | Deep Agents | LangChain | Harness; Long-running agents; Deep research | Opinionated open-source agent harness for planning, context management, subagents, filesystem, memory, and human-in-the-loop workflows. | Current repo |
| P0 | Demystifying evals for AI agents | Anthropic | Evals; Agents | Agent evals are more complex than static evals: multi-turn, tools, state changes, creative solutions, failure taxonomy. | 2026-01-09 |
| P0 | Measuring AI agent autonomy in practice | Anthropic | Agents; Autonomy; Measurement | Quantify agent autonomy using metrics like task duration and supervision needs; ideal for building autonomy benchmarks. | 2026-02-18 |
| P0 | Harness design for long-running application development | Anthropic | Harness; Application development | Harness design patterns for delegating long-running app development tasks to agents; compare with OpenAI Codex harness. | 2026-03-24 |
| P0 | Scaling Managed Agents: Decoupling the brain from the hands | Anthropic | Managed agents; Harness | Decouple the model brain from execution hands/harness, keeping interfaces stable as the harness evolves. | 2026-04-08 |
| P0 | How we contain Claude across products | Anthropic | Safety; Containment; Agents | Blast radius of powerful agent releases, human-in-the-loop, and containment strategies. | 2026-05-25 |
| P1 | Structured Outputs for Multi-Agent Systems | OpenAI | Agents; Multi-agent; Structured outputs | Use strict schemas to constrain structured messages and handoffs between multiple agents. | 2024-08-06 |
| P1 | Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku | Anthropic | Agents; Computer use | Claude computer use beta starting point: the model uses a computer via screenshots and actions. | 2024-10-22 |
| P1 | Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet | Anthropic | Agents; Coding; Evals | SWE-bench agent scaffolding article: same model performance strongly depends on harness/scaffolding. | 2025-01-06 |
| P1 | Introducing Operator | OpenAI | Agents; Computer use; Safety | Early product form of browser-based agents: model clicks, types, and executes tasks on web pages, emphasizing user confirmation and safety boundaries. | 2025-01-23 |
| P1 | Computer-Using Agent | OpenAI | Agents; Computer use | Understand how CUA combines vision, mouse/keyboard actions, and environment feedback into an agent loop; compare with Claude computer use. | 2025-01-23 |
| P1 | Claude 3.7 Sonnet and Claude Code | Anthropic | Agents; Coding; Claude Code | Early release of Claude Code, marking Claude's entry into the agentic coding tool space. | 2025-02-24 |
| P1 | The think tool: Enabling Claude to stop and think in complex tool use situations | Anthropic | Tools; Reasoning; Agents | Give the model an explicit think tool in complex tool-use chains; learn tool design for policy-heavy/multi-step decisions. | 2025-03-20 |
| P1 | Evaluating Agents with Langfuse | OpenAI | Evals; Agents | Observe and evaluate Agents SDK runs with Langfuse; learn tracing/eval workflows. | 2025-03-31 |
| P1 | Parallel Agents with the OpenAI Agents SDK | OpenAI | Agents; Parallelism; Agents SDK | Parallel agent patterns: decompose tasks, execute in parallel, aggregate results. | 2025-05-01 |
| P1 | Multi-Agent Portfolio Collaboration with OpenAI Agents SDK | OpenAI | Agents; Multi-agent; Portfolio | Multi-agent collaboration business example: research, analysis, combined output. | 2025-05-28 |
| P1 | MCP-Powered Agentic Voice Framework | OpenAI | MCP; Voice; Agents | Voice agent + MCP paradigm: real-time interaction, tool extension, task execution. | 2025-06-17 |
| P1 | Deep Research API with the Agents SDK | OpenAI | Agents; Deep research; Agents SDK | Integrate Deep Research API into Agents SDK workflows. | 2025-06-25 |
| P1 | Open Deep Research | LangChain | Deep research; LangGraph; MCP | Configurable open-source deep research agent that supports multiple model providers, search tools, MCP servers, and benchmark evaluation. | Current repo |
| P1 | Alibaba-NLP/DeepResearch | Alibaba | Deep research; Open-weight models; Web agents | Open-weight Tongyi DeepResearch model/repo for long-horizon information-seeking benchmarks; useful after learning the harness and API layers. | Current repo |
| P1 | Desktop Extensions: One-click MCP server installation for Claude Desktop | Anthropic | MCP; Claude Desktop; Packaging | Package local MCP servers as one-click install extensions; learn MCP distribution/installation/local permission issues. | 2025-06-26 |
| P1 | Building a Supply-Chain Copilot with OpenAI Agent SDK and Databricks MCP Servers | OpenAI | MCP; Agents; Databricks | Enterprise data platform MCP + Agent SDK business agent example. | 2025-07-08 |
| P1 | Introducing ChatGPT agent: bridging research and action | OpenAI | Agents; ChatGPT; Computer use | End-user-facing ChatGPT agent: combining research, browser, computer use, file/slide capabilities. | 2025-07-17 |
| P1 | ChatGPT agent System Card | OpenAI | Agents; Safety; Evals | Learn pre-launch risk classification, evaluation, permissions, human confirmation, and abuse prevention for agent products. | 2025-07-17 |
| P1 | Context Engineering - Short-Term Memory Management with Sessions | OpenAI | Context; Sessions; Agents | How short-term memory/session state affects agent reliability. | 2025-09-09 |
| P1 | How the Open Knowledge Format can improve data sharing | Google Cloud | Knowledge; Context; Data agents; Standards | Introduces OKF as a YAML-based way to package schemas, metrics, APIs, docs, and governance context for humans and AI agents. | 2025-10-09 |
| P1 | Progressive disclosure | Claude-Mem | Context; Memory; Progressive disclosure | Make retrieval costs visible and let the agent fetch details on demand, reducing context pollution and attention waste. | Current docs |
| P1 | Introducing upgrades to Codex | OpenAI | Agents; Coding; IDE | Codex evolves from research preview to daily dev tool: CLI, IDE, web/mobile collaboration, and more independent task execution. | 2025-09-15 |
| P1 | Introducing Claude Sonnet 4.5 | Anthropic | Agents; Claude Agent SDK; Computer use | Sonnet 4.5 emphasizes coding, complex agents, computer use, with simultaneous Agent SDK launch. | 2025-09-29 |
| P1 | Introducing apps in ChatGPT and the new Apps SDK | OpenAI | MCP; Apps; ChatGPT | Apps SDK extends UI and tool server via MCP; entry point for understanding the ChatGPT app / MCP app ecosystem. | 2025-10-06 |
| P1 | Build your ChatGPT UI | OpenAI | MCP; Apps SDK; UI | Build custom UI components that turn structured MCP tool results into interactive ChatGPT app interfaces. | Current docs |
| P1 | Codex is now generally available | OpenAI | Agents; Coding; Codex SDK | Codex GA, Slack integration, Codex SDK, admin tools; see how coding agents enter enterprise management. | 2025-10-06 |
| P1 | Using PLANS.md for multi-hour problem solving | OpenAI | Codex; Long-running; Planning | ExecPlan files and cross-context task management for multi-hour coding-agent work. | 2025-10-07 |
| P1 | Beyond permission prompts: making Claude Code more secure and autonomous | Anthropic | Safety; Permissions; Claude Code | From simple permission prompts to fine-grained security policies, reducing autonomous mode risk and interruptions. | 2025-10-20 |
| P1 | Introducing Aardvark: OpenAI's agentic security researcher | OpenAI | Agents; Security | Security-domain agent form: continuous scanning, issue verification, fix suggestions; later integrated as Codex Security. | 2025-10-30 |
| P1 | Build a coding agent with GPT 5.1 | OpenAI | Agents; Coding | Build a coding agent from scratch: understand file editing, command execution, loops, and verification. | 2025-11-13 |
| P1 | OpenAI co-founds Agentic AI Foundation | OpenAI | MCP; Standards; AGENTS.md | MCP, AGENTS.md, and agent standards enter the Linux Foundation/AAIF context; understand ecosystem standardization. | 2025-12-09 |
| P1 | Donating MCP and establishing the Agentic AI Foundation | Anthropic | MCP; Standards; AAIF | Anthropic donates MCP to Linux Foundation/AAIF; read alongside OpenAI's AAIF article. | 2025-12-09 |
| P1 | Context Engineering for Personalization - Long-Term Memory Notes | OpenAI | Context; Long-term memory; Agents | How long-term memory serves as agent personalization/state management. | 2026-01-05 |
| P1 | Supercharging Codex with JetBrains MCP at Skyscanner | OpenAI | MCP; Codex; IDE | Real IDE/MCP case study: how Codex CLI accesses IDE context and dev tools via JetBrains MCP. | 2026-01-11 |
| P1 | Designing AI-resistant technical evaluations | Anthropic | Evals; Technical hiring | How strong agents continuously break technical evaluations; relevant to benchmark contamination prevention and eval design. | 2026-01-21 |
| P1 | Agent Evaluation Readiness Checklist | LangChain | Evals; Agents; Checklist | Practical checklist for selecting eval levels, constructing datasets, designing graders, and connecting offline and online evals. | 2026 |
| P1 | Inside OpenAI's in-house data agent | OpenAI | Agents; Data; Memory | Internal data agent case study: memory, Codex, data context, reliability; learn enterprise knowledge/data agents. | 2026-01-29 |
| P1 | Introducing the Codex app | OpenAI | Agents; Coding; Multi-agent | Desktop command center for agents: multi-threaded/parallel long tasks, project-level agent workflows. | 2026-02-02 |
| P1 | Apple's Xcode now supports Claude Agent SDK | Anthropic | Claude Agent SDK; Xcode; MCP | Embed Claude Agent SDK in Xcode: harness, subagents, background tasks, plugins, MCP. | 2026-02-03 |
| P1 | Quantifying infrastructure noise in agentic coding evals | Anthropic | Evals; Coding agents; Infrastructure | Environment configuration significantly impacts scores in agentic coding evals; control infrastructure noise in both production and benchmarks. | 2026-02-05 |
| P1 | Building a C compiler with a team of parallel Claudes | Anthropic | Multi-agent; Coding; Long-running | Parallel Claude teams completing large engineering tasks; learn multi-agent division of labor, coordination, and long-running execution. | 2026-02-05 |
| P1 | Codex Security: now in research preview | OpenAI | Agents; Security; Codex | Productization of an agentic security researcher: vulnerability discovery, verification, fix suggestions, reducing triage noise. | 2026-03-06 |
| P1 | Eval awareness in Claude Opus 4.6's BrowseComp performance | Anthropic | Evals; Agent awareness | Risk of models recognizing/adapting to evaluations; relevant to agent benchmark credibility discussions. | 2026-03-06 |
| P1 | How we built Claude Code auto mode: a safer way to skip permissions | Anthropic | Safety; Permissions; Autonomy | Claude Code auto mode risk classification, allow/block rules, exception handling, and security testing. | 2026-03-25 |
| P1 | How we build evals for Deep Agents | LangChain | Evals; Deep agents; Traces | Targeted eval design for deep agents: select production behaviors, tag evals, inspect traces, and avoid false confidence from broad but shallow suites. | 2026-03-26 |
| P1 | Deep Research Bench | FutureSearch | Evals; Deep research; Benchmark | Benchmark for web research agents using offline web snapshots and carefully curated answers to make results more stable and objective. | 2025-06-25 |
| P1 | How to Evaluate Tool-Calling Agents | Arize | Evals; Tool calling; Trajectories | Evaluation workflow for tool selection, tool arguments, trajectories, and LLM-as-judge scoring of tool-using agents. | 2026 |
| P1 | A Survey on Agent-as-a-Judge | Academic | Evals; Agent-as-a-judge; Survey | Survey of agent-based evaluation methods that extend LLM-as-judge with multi-step reasoning, tools, and external observation. | 2026 |
| P1 | Migrate a Legacy Codebase with Sandbox Agents | OpenAI | Agents; Sandbox; Evals | Sandbox agent evaluation and execution patterns in large legacy code migrations. | 2026-04-07 |
| P1 | Codex for (almost) everything | OpenAI | Agents; Codex; MCP; Plugins | Codex app expanded to Windows/macOS, computer use, in-app browser, memory, plugins, MCP servers. | 2026-04-16 |
| P1 | Computer Use Agents in Daytona Sandboxes | OpenAI | Computer use; Sandbox; Agents | Computer-use agents and sandbox runtimes; compare with Operator/CUA/Claude computer use. | 2026-04-19 |
| P1 | Introducing workspace agents in ChatGPT | OpenAI | Agents; Workspace; Governance | Workspace agents: shared agents, permissions, tools, memory, safeguards; ideal for team collaboration agent design. | 2026-04-22 |
| P1 | Building workspace agents in ChatGPT to complete repeatable, end-to-end work | OpenAI | Workspace agents; ChatGPT | Practical workspace agents for repeatable end-to-end team workflows. | 2026-04-22 |
| P1 | Speeding up agentic workflows with WebSockets in the Responses API | OpenAI | Agents; Latency; Responses API | Optimize latency by treating agentic rollouts as long-lived connections/tasks; learn production agent transport and caching. | 2026-05-01 |
| P1 | Agents for financial services | Anthropic | Agents; Finance; MCP | Ten ready-to-run agent templates, Claude Code/Cowork plugins, Managed Agents cookbooks, MCP app. | 2026-05-05 |
| P1 | Migrate from the Claude Agent SDK to the OpenAI Agents SDK | OpenAI | Agents SDK; Migration | Compare Claude Agent SDK and OpenAI Agents SDK from a migration perspective; ideal for dual-stack learning. | 2026-05-07 |
| P1 | AI agent evaluation: How to test, debug, and improve agents in production | Arize | Evals; Production; Observability | Production-oriented agent eval guide covering planning, memory, traces, debugging, and improvement loops. | 2026 |
| P1 | Build long-running AI agents that pause, resume, and never lose context with ADK | Harness; Long-running agents; ADK | Practical ADK tutorial for durable state machines, persistent sessions, event-driven resume, multi-agent delegation, and evals. | 2026-05-12 | |
| P1 | Building a safe, effective sandbox to enable Codex on Windows | OpenAI | Safety; Sandbox; Codex | Coding agent sandbox design on Windows: file access, network restrictions, approval tradeoffs. | 2026-05-13 |
| P1 | Building self-improving tax agents with Codex | OpenAI | Agents; Evals; Self-improvement | Combine production traces, expert feedback, Codex loop, and eval infrastructure into self-improving business agents. | 2026-05-27 |
| P1 | SchemaFlow: Agentic Database Change Impact Analysis, SQL Generation, and Eval Guardrails | OpenAI | Evals; SQL; Agent guardrails | Guardrails and eval guardrails examples for data/SQL agents. | 2026-06-05 |
| P1 | Agents SDK quickstart | OpenAI | Agents; SDK | Quickly build a minimal agent; understand the code patterns of run, tool, and handoff. | Current docs |
| P1 | OpenAI Agents SDK examples | OpenAI | Agents SDK; Patterns; Examples | Practical examples for agent patterns, MCP, memory, guardrails, approvals, handoffs, and streaming. | Current docs |
| P1 | MCP Apps compatibility in ChatGPT | OpenAI | MCP; Apps SDK; UI | Understand MCP Apps UI standards, iframe/bridge, and compatibility between ChatGPT and other hosts. | Current docs |
| P1 | Use Codex with the Agents SDK | OpenAI | MCP; Codex; Agents SDK | Use Codex as an MCP server for other agents to call; ideal for multi-agent dev workflows. | Current docs |
| P1 | Agent approvals and security - Codex | OpenAI | Safety; Approvals; Codex | Official reference for Codex approval modes, sandbox, network access; read alongside OpenAI/Anthropic safety articles. | Current docs |
| P1 | Agent Skills - Codex | OpenAI | Codex; Skills; Plugins | Skills/Plugins as reusable workflow packages; compare with Anthropic Agent Skills. | Current docs |
| P1 | Skills in OpenAI API | OpenAI | Skills; OpenAI API | Cookbook example for using Skills in the OpenAI API and connecting skill bundles to agent workflows. | Current docs |
| P1 | Custom instructions with AGENTS.md - Codex | OpenAI | AGENTS.md; Context | How AGENTS.md provides persistent project specifications for agents; establish repo-level agent contracts. | Current docs |
| P1 | Agents SDK integrations and observability | OpenAI | Observability; MCP; Tracing | Tracing, MCP integration, provider/observability; essential for production agent debugging. | Current docs |
| P1 | Secure MCP Tunnel | OpenAI | MCP; Security; Private tools | Securely expose private/intranet MCP servers to supported OpenAI surfaces; ideal for enterprise deployment. | Current docs |
| P1 | How Claude Code works | Anthropic | Claude Code; Agentic loop; Harness | Under-the-hood architecture of Claude Code: the agentic loop (gather context → act → verify), built-in tool categories, context window management, and extension points. | Current docs |
| P1 | The spec is dead, long live the spec! | Ravi on Product | Specs; Product; Agentic development | Argues that specs and prompts become durable source material when AI can generate implementation rapidly. | 2025-07-31 |
| P1 | Build a Remote MCP server | Cloudflare | MCP; Remote servers; Authentication | Practical guide to deploying remote MCP servers with Streamable HTTP, OAuth, session state, and authorization boundaries. | Current docs |
| P1 | Introducing the MCP Registry | MCP | MCP; Registry; Discovery | Official preview of the MCP Registry as a source of truth for discovering and distributing public MCP servers. | 2025-09-08 |
| P1 | How Anthropic teams use Claude Code | Anthropic | Claude Code; Team workflows; Case studies | Internal Anthropic examples across data infrastructure, product, security, inference, design, legal, and RL engineering. | Current PDF |
| P1 | SAST vs. DAST vs. RASP | Splunk | Security; AppSec; Testing | Clear comparison of static, dynamic, and runtime application security testing methods for agent safety baselines. | Current article |
| P1 | GitHub Copilot: Remote Code Execution via Prompt Injection | Embrace The Red | Security; Prompt injection; Coding agents | Concrete RCE case study showing how agent-controlled configuration changes can collapse permission boundaries. | 2025-08-12 |
| P1 | Finding vulnerabilities in modern web apps using Claude Code and OpenAI Codex | Semgrep | Security; Coding agents; Vulnerability research | Empirical study of Claude Code and Codex on vulnerability discovery, including true positives, false positives, and failure modes. | 2025-09-02 |
| P1 | AI Agents Are Here. So Are the Threats. | Unit 42 | Security; Agent threats; Prompt injection | Threat scenarios for multi-agent systems: instruction extraction, tool misuse, internal access, impersonation, and RCE. | Current article |
| P1 | OWASP Top Ten | OWASP | Security; Web applications; AppSec | Foundational web application risk taxonomy; useful baseline when asking agents to build or review web apps. | Current project |
| P1 | How to review code effectively | GitHub | Code review; Engineering practice | Staff-engineer philosophy for effective code review: reviewer intent, clarity, scope, and human communication. | Current article |
| P1 | AI-Assisted Assessment of Coding Practices in Modern Code Review | Academic | Code review; AI review; Google | AutoCommenter paper: architecture, deployment, and evaluation of an LLM-assisted code-review system at Google scale. | 2024-05-22 |
| P1 | AI code review implementation and best practices | Graphite | Code review; AI review; Workflow | Implementation checklist for introducing AI code review into repository hooks, policies, team rules, and review workflows. | Current article |
| P1 | ML and LLM system design: 800 case studies to learn from | Evidently AI | ML systems; LLM systems; Case studies | Database of production ML and LLM case studies from 150+ companies, including GenAI, RAG, AI agents, evaluation, and deployment architecture examples. | 2025-12-22 |
| P1 | Introduction to Site Reliability Engineering | SRE; Production; Reliability | Foundational SRE framing: software engineering applied to operations, toil reduction, risk, and reliable production systems. | Current book | |
| P1 | Traces & Spans: Observability Basics You Should Know | Last9 | Observability; Tracing; Production | Practical primer on traces and spans for debugging distributed systems and giving agents useful production evidence. | 2025-04-23 |
| P1 | Kubernetes Troubleshooting in Resolve AI | Resolve AI | SRE agents; Kubernetes; Troubleshooting | Production-agent case study for Kubernetes root-cause analysis across pods, deployments, logs, metrics, and infrastructure signals. | 2026-05-21 |
| P1 | The role of multi agent systems in making software engineers AI-native | Resolve AI | Multi-agent; SRE; AI-native engineering | Argues that production engineering needs specialized multi-agent systems for parallel investigation and domain-aware coordination. | 2026-03-20 |
| P0 | learn-claude-code | Community | Harness; Agent loop; Tools; Context | Hands-on 20-lesson tutorial building a Claude Code–like agent harness from scratch: agent loop, tool integration, context compaction, multi-agent coordination, permissions, MCP plugins. | 2026 |
| P0 | Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems | Academic | Agent architecture; Claude Code; Design space | Deep technical analysis of Claude Code's architecture: agentic loop, permission system, context compaction, extensibility (MCP/plugins/skills/hooks), subagent delegation, and comparison with open-source alternatives. | 2026-04-14 |
| P0 | Function Calling | OpenAI | Tools; Function calling; API | Official guide to function/tool calling: define functions with JSON schemas, handle model tool calls, execute and return results. | Current docs |
| P0 | Tool use overview | Anthropic | Tools; Tool use; API | Connect Claude to external tools and APIs: client vs server tools, the agentic loop, strict schema conformance, and when Claude decides to call tools. | Current docs |
| P0 | Function calling - Gemini API | Tools; Function calling; API | Enable Gemini models to connect with external tools via function calling: single-turn, multi-turn, parallel, and sequential function chains. | Current docs | |
| P2 | Vulnerability Prompt Analysis with O3 | Community | Security; Prompting; Vulnerability research | Concrete vulnerability-analysis prompt used with o3; useful as a prompt artifact to study, not a general framework. | 2025 |
| P2 | Code Reviews: Just Do It | Coding Horror | Code review; Engineering practice | Classic argument for peer review as one of the highest-leverage software quality practices. | 2006-01-21 |
| P2 | Code Review Essentials for Software Teams | Blake Smith | Code review; Team practice | Practical code review hierarchy: shared mental models, design clarity, pull request quality, and constructive feedback. | 2015-02-09 |
| P2 | Lessons from millions of AI code reviews | Greptile | AI code review; Lessons | Talk on patterns from large-scale AI code review usage; useful qualitative context for review-agent design. | Video |
| P2 | Multi-stack Web App Builds | Community | Assignments; Full-stack; Practice | Practice assignment for multi-stack web app builds; useful as an applied testbed for coding agents. | Current repo |
| P2 | AI Production Engineer | Resolve AI | SRE agents; Product case study | Product deep dive on autonomous production engineering agents for alerts, RCA, remediation, and post-incident review. | 2026-03-28 |
| P2 | The Top 5 Benefits of Agentic AI in On-call Engineering | Resolve AI | SRE agents; On-call; Operations | Overview of agentic AI benefits for incident response, dynamic knowledge, and on-call operations. | 2025-07-25 |
| P2 | Orchestrating Agents: Routines and Handoffs (archived) | OpenAI | Agents; Handoffs; Orchestration | Historical cookbook for routines and handoffs; useful conceptually, but archived and not the current recommended implementation path. | 2024-10-10 |
| P2 | Introducing Contextual Retrieval | Anthropic | Context; Retrieval; RAG | Not agent-specific, but important for agent RAG/context: prepend context to chunks before retrieval to improve recall. | 2024-09-19 |
| P2 | Developing a computer use model | Anthropic | Computer use; Agents | More technical explanation of how the computer-use model moves the mouse, clicks, types, and reads screen feedback. | 2024-10-22 |
| P2 | Introducing Claude 4 | Anthropic | Agents; Coding; Long-running | Overview of Claude Opus/Sonnet 4 capabilities: coding, advanced reasoning, agent workflows. | 2025-05-22 |
| P2 | Claude for Financial Services | Anthropic | Agents; Connectors; Finance | Vertical industry agent/connector productization case; understand data, permissions, and tool integration in finance. | 2025-07-15 |
| P2 | Advancing Claude for Financial Services | Anthropic | Agents; Skills; Finance | Claude for Excel, real-time data connectors, pre-built Agent Skills for vertical industry productization. | 2025-10-27 |
| P2 | Introducing GPT-5.3-Codex | OpenAI | Agents; Coding model; Evals | Codex-native model and long-running coding/terminal/agentic benchmarks; understand how model capabilities serve the harness. | 2026-02-05 |
| P2 | Introducing OpenAI Frontier | OpenAI | Agents; Enterprise; Governance | Enterprise AI coworker/agent platform: shared context, onboarding, permissions, guardrails, governance. | 2026-02-10 |
| P2 | Introducing Claude Sonnet 4.6 | Anthropic | Agents; Planning; Computer use | Sonnet 4.6 emphasizes coding, computer use, long-context reasoning, agent planning. | 2026-02-17 |
| P2 | Introducing Claude Opus 4.6 | Anthropic | Agents; Long-running; Tool use | Model release perspective on long-running tasks, agentic harness, subagents, and tool call capabilities. | 2026-02-25 |
| P2 | Introducing Claude Opus 4.7 | Anthropic | Agents; Long-running; Coding | Stronger software engineering and long-running task performance; track how model capabilities impact agent workloads. | 2026-04-16 |
| P2 | An update on recent Claude Code quality reports | Anthropic | Reliability; Claude Code; Agent SDK | Postmortem on Claude Code/Agent SDK quality regression; learn agent product operations and regression control. | 2026-04-23 |
| P2 | Introducing Claude Opus 4.8 | Anthropic | Agents; Dynamic workflows; Long-running | Dynamic workflows, hundreds of parallel subagents, long-running agentic tasks — latest model/product direction. | 2026-05-28 |
| P2 | Codex for every role, tool, and workflow | OpenAI | Agents; Codex; Plugins | Codex expands from development to knowledge work: role-specific plugins, Sites, annotations, parallel workflows. | 2026-06-02 |
| P2 | Codex is becoming a productivity tool for everyone | OpenAI | Agents; Knowledge work | Usage data shows how non-developers use Codex for reports, spreadsheets, research, automation, and lightweight tools. | 2026-06-02 |
| P2 | OpenAI Docs MCP | OpenAI | MCP; Docs; Context | Official OpenAI docs MCP server; connect docs directly to local agents/IDEs. | Current docs |
| P2 | Codex SDK | OpenAI | Codex SDK; Automation | Programmatically control Codex in CI/CD or internal tools; embed coding agents into existing workflows. | Current docs |
| P2 | When AI builds itself | Anthropic | Agents; Recursive self-improvement; Safety | How AI systems accelerate their own development through recursive self-improvement; three possible futures and the need for verifiable coordination. | 2026-05 |
- AI Engineers
- Agent Engineers
- LLM Engineers
- Platform Engineers
- Research Engineers
- AI Startup Founders
Contributions are welcome. If you find:
- New OpenAI resources
- New Anthropic resources
- MCP updates
- Agent evaluation frameworks
- Production engineering articles
Please open a pull request.
The goal of this project is to become the System Design Primer for Agentic Engineering.
If you're serious about building production AI agents, start here.