Commit 0251d63

Merge pull request #1 from simonraj79/v2-migration
2 parents d1532b4 + 8f2da90 commit 0251d63

620 files changed

Lines changed: 67097 additions & 8533 deletions

Lines changed: 25 additions & 0 deletions
---
name: adk-architecture
description: ADK architectural knowledge — graph orchestration, resumption, execution flow, node contracts, observability, and LLM context orchestration. Use this skill whenever you need to understand the architecture, event flow, or state management of the ADK system, or when designing or modifying core components. Triggers on "how does X work", "design of", "architecture of", "event flow", "resumption state", "checkpoint", "BaseNode", "NodeRunner".
---

# ADK Architecture Guide

## Core Interfaces (references/interfaces/)
- [BaseNode](references/interfaces/base-node.md) — node contract, output/streaming, state/routing, HITL, configuration
- [Workflow](references/interfaces/workflow.md) — graph orchestration, dynamic nodes (tracking/dedup/resume), transitive dynamic nodes, interrupt propagation, design rules for node authors
- [Runner](references/interfaces/runner.md) — public interface for executing workflows and agents; documents the entry methods `run` and `run_async`
- [Agent](references/interfaces/agent.md) — blueprint defining identity, instructions, and tools; documents that `run` is the preferred entry point
- [BaseAgent](references/interfaces/base-agent.md) — base class for all agents; defines the subclassing contract with `_run_impl` as the primary override point
- [Event](references/interfaces/event.md) — core data structure for state reconstruction and communication; represents a conversation turn or action

## Key Principles (references/principles/)
- [API Principles](references/principles/api-principles.md) — stability, backward compatibility, and self-containment. Use when making design choices that affect the public API surface.

## Runtime Knowledge (references/architecture/)
- [Context](references/architecture/context.md) — 1:1 node-context mapping, InvocationContext singleton, property reference
- [NodeRunner](references/architecture/node-runner.md) — two communication channels, execution flow, output delegation. Internal runtime details.
- [Runner Roles](references/architecture/runner-roles.md) — Runner vs NodeRunner vs Workflow separation; explains why they are kept separate to avoid deadlocks
- [Checkpoint and Resume](references/architecture/checkpoint-resume.md) — HITL lifecycle, `rerun_on_resume`, `run_id`
- [Observability](references/architecture/observability.md) — span-on-Context design, NodeRunner integration, correlated logs, metrics
- [LLM Context Orchestration](references/architecture/llm-context-orchestration.md) — relationship between events and LLM context, task delegation translation, branch isolation. Use when modifying event processing, context preparation for LLMs, or debugging context pollution issues.
Lines changed: 69 additions & 0 deletions
# Checkpoint and Resume Lifecycle

HITL (Human-in-the-Loop) follows this pattern:

1. **Interrupt**: The node yields an event with `long_running_tool_ids`. Each ancestor propagates the interrupt upward via `ctx.interrupt_ids`.
2. **Persist**: Only the leaf node's interrupt event is persisted to the session. Workflow sets `ctx._interrupt_ids` directly (no internal event needed).
3. **Resume**: The user sends a `FunctionResponse` message. The Runner scans session events to find the matching `invocation_id`, then reconstructs node state from the persisted events.
4. **Continue**: The interrupted node receives the FunctionResponse and continues execution. Downstream nodes receive the resumed node's output.
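The interrupt propagation in step 1 can be sketched with stand-in classes. This is an illustrative model only: the `Event` and `Context` dataclasses below are simplified stand-ins, not the real ADK types.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    # IDs of long-running tools awaiting a human response
    long_running_tool_ids: set[str] = field(default_factory=set)


@dataclass
class Context:
    interrupt_ids: set[str] = field(default_factory=set)
    parent: "Context | None" = None


def track_interrupt(ctx: Context, event: Event) -> None:
    """Record the event's interrupt IDs on ctx and every ancestor."""
    if not event.long_running_tool_ids:
        return
    node: Context | None = ctx
    while node is not None:
        node.interrupt_ids |= event.long_running_tool_ids
        node = node.parent


root = Context()
child = Context(parent=root)
track_interrupt(child, Event(long_running_tool_ids={"fc-1"}))
assert root.interrupt_ids == {"fc-1"}
```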
## run_id on resume

Resumed nodes reuse the `run_id` from the original execution. From the node's perspective, the execution never paused — events before and after the resume share the same run_id.

Fresh dispatches (first run, loop re-trigger) get a new run_id.
## Resume behavior by `rerun_on_resume`

A node with multiple interrupt IDs may receive partial FunctionResponses (only some resolved). The behavior depends on `rerun_on_resume`:

**`rerun_on_resume=True`** (Workflow, orchestration nodes):

| FRs received | Status | Behavior |
|---|---|---|
| Partial | PENDING | Re-execute immediately with partial `resume_inputs`. The node handles the remaining interrupts internally (e.g., Workflow dispatches resolved children and keeps unresolved ones WAITING). |
| All | PENDING | Re-execute with all `resume_inputs`. |

This is critical for Workflow — when one child's FunctionResponse arrives, it re-runs immediately to dispatch that resolved child rather than waiting for all children's responses.

**`rerun_on_resume=False`** (leaf nodes, simple HITL):

| FRs received | Status | Behavior |
|---|---|---|
| Partial | WAITING | Stay waiting; all FunctionResponses are required. |
| All | COMPLETED | Auto-complete. Output = aggregated `resolved_responses`. No re-execution. |
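The two tables condense into one decision function. A minimal sketch, assuming the statuses above; the string statuses and parameter names are illustrative stand-ins for the real enums and fields:

```python
def resume_status(rerun_on_resume: bool,
                  interrupt_ids: set[str],
                  resolved: set[str]) -> str:
    """Sketch of the resume decision: which status a node enters
    once some (or all) of its FunctionResponses have arrived."""
    unresolved = interrupt_ids - resolved
    if rerun_on_resume:
        # Orchestration nodes re-execute as soon as any FR arrives.
        return "PENDING"
    # Leaf nodes wait until every interrupt has a FunctionResponse.
    return "WAITING" if unresolved else "COMPLETED"


assert resume_status(True, {"fc-1", "fc-2"}, {"fc-1"}) == "PENDING"
assert resume_status(False, {"fc-1", "fc-2"}, {"fc-1"}) == "WAITING"
assert resume_status(False, {"fc-1", "fc-2"}, {"fc-1", "fc-2"}) == "COMPLETED"
```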
## Resume with prior output and interrupts

A node can produce output AND interrupt in the same execution (e.g., a Workflow where child A completes with output and child B interrupts). On resume:

- Some interrupt IDs are resolved (provided in `resume_inputs`)
- Remaining interrupt IDs carry forward via `prior_interrupt_ids`
- Prior output carries forward via `prior_output`
- NodeRunner pre-populates ctx with these values before re-executing

```python
runner = NodeRunner(
    node=node,
    parent_ctx=ctx,
    run_id=prior_run_id,            # reuse the original run_id
    prior_output=cached_output,     # output from the earlier run
    prior_interrupt_ids={'fc-2'},   # still unresolved
)
child_ctx = await runner.run(
    node_input=input,
    resume_inputs={'fc-1': response},
)
```
Lines changed: 104 additions & 0 deletions
# Context

## Architecture

The runtime uses two scoping objects:

- **InvocationContext** — singleton per invocation. Holds shared state (session, services, event queue) accessible by all nodes. Pydantic model at `agents/invocation_context.py`.
- **Context** — one per node execution. Holds per-node results (output, route, interrupt_ids) and provides the API surface for node code. At `agents/context.py`.

Every Context holds a reference to the same InvocationContext (`_invocation_context`). Service access (artifacts, memory, auth) is delegated through it.

```
Root Context                      ← created by Runner from IC
└── Context [runner.node]         ← the root node (e.g., Workflow)
    ├── Context [child_a]         ← child node A
    └── Context [child_b]         ← child node B
        └── Context [grandchild]  ← nested child
```

The Runner creates `root_ctx = Context(ic)` as the tree root and passes it as `parent_ctx` to `NodeRunner(node=self.node)`. The root Context has no node_path or run_id — it exists solely as the parent for the Runner's root node. All Contexts in the tree share the same InvocationContext singleton.

InvocationContext contents:

- `session`, `agent`, `user_content`
- `invocation_id`, `app_name`, `user_id`
- Services: `artifact_service`, `memory_service`, `credential_service`
- `run_config`, `live_request_queue`
- `process_queue` — shared event queue consumed by the main loop
41+
42+
Every node execution gets its own Context instance. The relationship
43+
is strictly 1:1: one node, one Context. The Context tree mirrors the
44+
node execution tree.
45+
46+
**NodeRunner** creates the child Context from the parent's Context
47+
via `_create_child_context()`. The child inherits:
48+
49+
- `_invocation_context` — same singleton (shared session, services)
50+
- `node_path` — parent path + node name (e.g., `wf/child_a`)
51+
- `run_id` — unique per execution (reused on resume)
52+
- `event_author` — inherited from parent
53+
- `schedule_dynamic_node_internal` — inherited from parent
54+
55+
The child does NOT inherit output, route, or interrupt_ids — those
56+
are per-execution results, starting fresh (unless resume carries
57+
forward `prior_output` / `prior_interrupt_ids`).
58+
## Node result properties

These properties on Context are the primary mechanism for communicating results between nodes:

- **`ctx.output`** — the node's result value. Set once per execution, either via `yield value` (the framework sets it) or via `ctx.output = X` directly. A second write raises `ValueError`.
- **`ctx.route`** — routing value for conditional edges. Set independently of output. Workflow-specific.
- **`ctx.interrupt_ids`** — accumulated interrupt IDs. Read-only for user code; set by the framework when the node yields an Event with `long_running_tool_ids`.

Output and interrupts can coexist — the orchestrator's `_finalize` decides what to propagate. The orchestrator reads these properties after the child node finishes.
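The set-once semantics of `ctx.output` can be modeled as a write-once property. An illustrative sketch only, not the real implementation:

```python
class NodeResult:
    """Write-once output holder: a second assignment raises ValueError."""

    _UNSET = object()  # sentinel so that None is a legal output value

    def __init__(self) -> None:
        self._output = NodeResult._UNSET

    @property
    def output(self):
        return None if self._output is NodeResult._UNSET else self._output

    @output.setter
    def output(self, value) -> None:
        if self._output is not NodeResult._UNSET:
            raise ValueError("output already set for this execution")
        self._output = value


ctx = NodeResult()
ctx.output = "result"
assert ctx.output == "result"
```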
## Class hierarchy

```
ReadonlyContext (agents/readonly_context.py)
└── Context (agents/context.py)
```

**ReadonlyContext** — read-only view used in callbacks and plugins:

- `user_content`, `invocation_id`, `agent_name`
- `state` (returns `MappingProxyType` — immutable view)
- `session`, `user_id`, `run_config`

**Context(ReadonlyContext)** — full read-write context for node execution. Extends ReadonlyContext with mutable state, node results, workflow metadata, and service methods. See the property reference below.

## Property reference

| Category | Properties |
|---|---|
| State & actions | `state` (mutable `State`), `actions` (EventActions) |
| Node results | `output`, `route`, `interrupt_ids` (read-only) |
| Workflow | `node_path`, `run_id`, `triggered_by`, `in_nodes`, `resume_inputs`, `retry_count`, `event_author` |
| Methods | `run_node()`, `get_next_child_run_id()` |
| Artifacts | `load_artifact()`, `save_artifact()`, `list_artifacts()` |
| Memory | `search_memory()`, `add_session_to_memory()`, `add_events_to_memory()`, `add_memory()` |
| Auth | `request_credential()`, `load_credential()`, `save_credential()` |
| Tools | `request_confirmation()`, `function_call_id` |
Lines changed: 42 additions & 0 deletions
# LLM Context Orchestration from Events

## Core Principle

In ADK, there is a clear distinction between the **Event Stream** and the **LLM Context**:

- **Events are the ground truth**: immutable records of what has happened in a session (user messages, model responses, tool calls, results). They serve as the audit log and persistence state.
- **LLM Context is an orchestrated view**: the context passed to an LLM is not merely a dump of the raw event log. It is a carefully orchestrated view, filtered and transformed to match the specific role, task, and branch of the agent currently executing.

## Orchestration Strategies

The framework translates events into LLM context using several strategies:

### 1. Task Delegation Translation

When a coordinator agent delegates a task to a sub-agent (Task Agent) via a tool call:

- **Source event**: The coordinator calls a tool like `request_task_<sub_agent_name>(args...)`.
- **Orchestrated context**:
  - The arguments of the `request_task_<sub_agent_name>` tool call are extracted and placed in the **System Instruction (SI)**, or treated as the core instruction for the sub-agent.
  - The first user message presented to the sub-agent is synthesized to represent the goal (e.g., "Finish task of [sub_agent_name] with arguments [args]").
- **Goal**: Isolate the sub-agent from the coordinator's full history and give it a crisp, clear starting point.
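The translation can be sketched with a hypothetical helper. The message template is quoted from the text above; the function name, signature, and SI wording are illustrative assumptions, not the framework's API:

```python
def synthesize_task_context(sub_agent_name: str, args: dict) -> tuple[str, str]:
    """Build (system_instruction, first_user_message) for a sub-agent
    from the coordinator's request_task_<sub_agent_name> tool call."""
    # Hypothetical SI wording; only the delegation idea comes from the text.
    system_instruction = f"You are {sub_agent_name}. Task arguments: {args}"
    # First user message template as described in the text.
    first_user_message = f"Finish task of [{sub_agent_name}] with arguments [{args}]"
    return system_instruction, first_user_message


si, msg = synthesize_task_context("researcher", {"topic": "ADK"})
assert "researcher" in si and "researcher" in msg
```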
### 2. Branch Isolation

In complex workflows with parallel execution:

- **Source events**: Events from all nodes and branches are stored in the same session chronologically.
- **Orchestrated context**: The framework filters events by `branch` (e.g., `node:path.name`). An agent only sees events that belong to its own execution path.
- **Goal**: Prevent cross-node event pollution and ensure deterministic behavior in isolated tasks.
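A sketch of branch filtering, assuming the `node:path.name` format above. The rule that an agent also sees its ancestors' events is an assumption added for illustration, not stated in the text:

```python
from dataclasses import dataclass


@dataclass
class Event:
    branch: str  # e.g., "node:wf.a"
    text: str


def events_for_branch(events: list[Event], branch: str) -> list[Event]:
    """Keep only events on the agent's own execution path:
    its exact branch, plus (assumed) ancestor branches."""
    return [
        e for e in events
        if e.branch == branch or branch.startswith(e.branch + ".")
    ]


log = [
    Event("node:wf", "setup"),
    Event("node:wf.a", "A's turn"),
    Event("node:wf.b", "B's turn"),
]
visible = events_for_branch(log, "node:wf.a")
assert [e.text for e in visible] == ["setup", "A's turn"]
```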
### 3. History Trimming and Compaction

To prevent context window overflow and stale instruction loops:

- **Source events**: A long history of retries, tool calls, and interactions.
- **Orchestrated context**: The framework may trim older events or summarize them (event compaction). In task mode, it might keep only the essential setup events, ignoring stale retry loops that would otherwise confuse the LLM.
- **Goal**: Maintain a focused and efficient context window for the LLM.

## Summary

The relationship is one of **source vs. view**. Events are the source of truth for the session, while LLM context is a highly orchestrated view of that truth, tailored for the active agent.
Lines changed: 76 additions & 0 deletions
# NodeRunner

NodeRunner is the per-node executor. It drives `BaseNode.run()`, creates the child Context, enriches events, and writes results to ctx.

## Two communication channels

The runtime has two distinct channels for data flow:

- **Context** — parent ↔ child communication. Output, route, state, resume_inputs, and interrupt_ids flow through ctx. The orchestrator reads ctx after the child completes to decide what to do next.
- **Event** — persistence and streaming. Events are appended to the session and streamed to the caller. They carry messages, state deltas, function calls, and interrupt markers.

A node writes to **ctx** to communicate with its parent. A node yields **Events** to persist data and stream messages to the user.
## Execution flow

```
Orchestrator
│
├─ NodeRunner(node=child, parent_ctx=ctx)
│   │
│   ├─ _create_child_context()        → child Context
│   ├─ _execute_node()                → iterate node.run()
│   │   ├─ _track_event_in_context()  → write to ctx
│   │   └─ _enqueue_event()           → enrich + persist
│   ├─ _flush_output_and_deltas()     → emit deferred output
│   └─ return child ctx
│
└─ reads ctx.output, ctx.route, ctx.interrupt_ids
```

1. **Create child Context** — inherits `_invocation_context` (shared singleton), builds `node_path` from the parent, assigns `run_id`.

2. **Iterate `node.run()`** — for each yielded Event:

   - **Track in context** — `_track_event_in_context` writes output, route, and interrupt_ids from the event to ctx (the source of truth).
   - **Enrich** — `_enrich_event` stamps metadata before persistence:
     - `event.author` — node name (or `event_author` override)
     - `event.invocation_id` — from InvocationContext
     - `event.node_info.path` — full path (e.g., `wf/child_a`)
     - `event.node_info.run_id` — unique per execution
     - `event.node_info.output_for` — ancestor paths when `use_as_output=True`
   - **Flush deltas** — for non-partial events, `_flush_deltas` moves pending state/artifact deltas from `ctx.actions` onto the event before enqueueing.
   - **Enqueue** — `ic.enqueue_event` puts the event on the shared process queue for session persistence.

3. **Flush deferred output** — if `ctx.output` was set directly (not via yield), `_flush_output_and_deltas` emits the output Event after `_run_impl` returns, bundling any remaining state/artifact deltas onto the same Event.

4. **Return child ctx** — the orchestrator reads `ctx.output`, `ctx.route`, and `ctx.interrupt_ids`.
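The enrichment step above can be sketched as follows. The `Event` and `NodeInfo` shapes mirror the bullet list but are simplified stand-ins, not the real classes:

```python
from dataclasses import dataclass, field


@dataclass
class NodeInfo:
    path: str = ""
    run_id: str = ""
    output_for: list[str] = field(default_factory=list)


@dataclass
class Event:
    author: str = ""
    invocation_id: str = ""
    node_info: NodeInfo = field(default_factory=NodeInfo)


def enrich_event(event: Event, node_name: str, invocation_id: str,
                 node_path: str, run_id: str) -> Event:
    """Stamp identity/routing metadata before the event is persisted."""
    # Keep a pre-set author (the event_author override); else use the node name.
    event.author = event.author or node_name
    event.invocation_id = invocation_id
    event.node_info.path = node_path
    event.node_info.run_id = run_id
    return event


e = enrich_event(Event(), "child_a", "inv-1", "wf/child_a", "run-7")
assert e.node_info.path == "wf/child_a"
assert e.author == "child_a"
```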
## Output delegation (`use_as_output`)

When a child is scheduled with `use_as_output=True`, its output Event also counts as the parent's output. NodeRunner:

- Sets `ctx._output_delegated = True` on the parent
- Skips emitting the parent's own output Event
- Stamps `event.node_info.output_for` with the ancestor paths
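The delegation rule can be sketched as below; the names mirror the bullets above but are stand-ins for NodeRunner internals, not the real implementation:

```python
from dataclasses import dataclass


@dataclass
class ParentCtx:
    node_path: str
    output_delegated: bool = False


def delegate_output(output_for: list[str], parent: ParentCtx,
                    use_as_output: bool) -> list[str]:
    """When the child was scheduled with use_as_output=True, mark the
    parent as delegated and stamp its path onto the child's output Event."""
    if use_as_output:
        parent.output_delegated = True
        output_for = [*output_for, parent.node_path]
    return output_for


parent = ParentCtx(node_path="wf")
stamped = delegate_output([], parent, use_as_output=True)
assert stamped == ["wf"] and parent.output_delegated
```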
