An operate-time framework, built around the most common FDE case there is: "deploy a customer-support agent for a client. What components does it have? The client says it's too slow — what's the root cause? Here are the request logs — what do you look for?" That's three questions in sequence, and they share one substrate: the component map. You debug against the architecture. This file gives you the reference architecture, the latency decomposition, and a worked log-triage drill.
Pairs with `agent-shapes-catalog.md` (the shapes these components are built from) and `4-dimensional-testing.md` (production observability as a test dimension).
When asked "what components does the agent have," don't list model + prompt. Walk the request path. A production CS agent is a pipeline of specialized components, each with one job and its own failure and latency profile:
| # | Component | Job | Deterministic or LLM |
|---|---|---|---|
| 1 | Gateway / auth | Authenticate the user, load their entitlements/tenant | Deterministic |
| 2 | Intake / PII scrub | Normalize the message; redact PII before it hits the model | Deterministic |
| 3 | Router / classifier | Intent + urgency; route to self-serve, agent, or human | LLM or small classifier |
| 4 | Retriever | Pull relevant KB articles / account context (RAG) | Deterministic (vector + keyword) |
| 5 | Context builder | Assemble the prompt: policy, retrieved docs, account state | Deterministic |
| 6 | Drafter | Generate the response | LLM |
| 7 | Policy / compliance critic | Check the draft against policy (refunds, promises, tone) | Deterministic + LLM-as-judge |
| 8 | Tool executor | Take gated actions (order lookup, refund, ticket update) | Deterministic, per-tool authz |
| 9 | Escalation / HITL | Hand off to a human when confidence or risk fails a threshold | Deterministic gate |
| 10 | Response gate | Final output scan (PII, secrets, forbidden content) before send | Deterministic |
| 11 | Observability / trace | Immutable per-request trace: inputs, spans, latencies, decisions | Deterministic |
The senior signal: naming the deterministic gates (2, 7, 8, 9, 10) as separate from the LLM synthesis (6) — the same hybrid discipline the rest of the repo teaches. The trace (11) is what makes Parts 2 and 3 possible; you can't debug what you didn't instrument.
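To make the "can't debug what you didn't instrument" point concrete, here is a minimal sketch of per-component spans wrapped around three of the rows above. Everything in it is hypothetical: the `Trace` class, the `handle` function, the card-number regex, and the string-formatting stand-in for the drafter are illustrations, not the repo's implementation.

```python
import re
import time
from contextlib import contextmanager

# Illustrative PII pattern only: 16 digits in groups of four.
CARD_PATTERN = re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b")


class Trace:
    """Per-request record of how long each component took, in milliseconds."""

    def __init__(self, request_id):
        self.request_id = request_id
        self.spans = {}

    @contextmanager
    def span(self, component):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the component raises, so failures are still timed.
            self.spans[component] = (time.perf_counter() - start) * 1000.0


def handle(message, trace):
    with trace.span("pii_scrub"):       # deterministic gate (row 2)
        cleaned = CARD_PATTERN.sub("[REDACTED]", message)
    with trace.span("drafter"):         # stand-in for the LLM call (row 6)
        draft = f"Thanks for reaching out. You wrote: {cleaned}"
    with trace.span("response_gate"):   # deterministic gate (row 10)
        if CARD_PATTERN.search(draft):
            raise ValueError("draft still contains card-like digits")
    return draft


trace = Trace("req-demo")
reply = handle("My card 4111 1111 1111 1111 was charged twice", trace)
print(reply)
print(sorted(trace.spans))
```

The design choice worth noticing is that the gates and the drafter are timed the same way, so a slow gate shows up in the trace just as clearly as a slow model call.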
"It's slow" is not a diagnosis. Decompose end-to-end latency by component and find the tail. The buckets, roughly in request order:
| Bucket | Where the time goes | How to confirm | Common fix |
|---|---|---|---|
| Network / ingress | Client↔server round trips, TLS, payload size | Compare server-side span sum vs client-observed total | CDN/edge, smaller payloads, keep-alive |
| Auth / entitlement | Slow authz lookup or downstream identity call | Auth span duration in the trace | Cache entitlements per session |
| Retrieval | Vector search + KB fetch; large top-k; cold index | Retrieval span; top-k size; index warm state | Lower top-k, cache hot docs, warm the index |
| Model inference | Time-to-first-token + generation; model size; output length | TTFT vs total generation in the trace | Smaller/faster model for routing; stream tokens; cap output length |
| Tool / API calls | External systems (order DB, payments); sequential vs parallel | Per-tool span; are calls serialized? | Parallelize independent calls; timeout + cache |
| Orchestration overhead | Agent loops, re-planning, retries | Step count per request; retry count | Cap steps; fix the retry trigger |
| Cold start | First request after a scale-to-zero idle | Latency correlated with a preceding idle gap | Warm pool / min replicas |
| Queueing / concurrency | Requests waiting on a saturated worker pool or rate limit | Queue-wait span; concurrency vs limit | Scale workers; raise/shard rate limits |
| Guardrail / critic passes | Extra LLM calls for the critic or a re-draft loop | Critic span; re-draft count | Make the critic deterministic; single-pass |
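The "parallelize independent calls; timeout" fix from the Tool / API row can be sketched with `asyncio`. This is an assumption-laden illustration: `fake_tool` sleeps instead of calling a real system, and the tool names, delays, and 0.2 s timeout are made up for the demo.

```python
import asyncio


async def fake_tool(name, delay_s):
    """Hypothetical stand-in for an external call; sleeps instead of doing I/O."""
    await asyncio.sleep(delay_s)
    return name


async def call_tools(calls, timeout_s):
    # Each call gets its own timeout; gather runs them concurrently.
    guarded = [
        asyncio.wait_for(fake_tool(name, delay), timeout_s)
        for name, delay in calls
    ]
    # return_exceptions=True keeps one slow tool from discarding the other results.
    return await asyncio.gather(*guarded, return_exceptions=True)


results = asyncio.run(
    call_tools([("order", 0.05), ("kb", 0.05), ("payments", 1.0)], timeout_s=0.2)
)
for item in results:
    print("timed out" if isinstance(item, asyncio.TimeoutError) else item)
```

Run serially, the same three calls would cost the sum of their latencies. Run concurrently with a per-call timeout, the wall-clock cost is bounded by the timeout.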
The method, said out loud: "First I'd look at whether the p50 is slow or only the p99 — a slow p50 is systemic (model, retrieval, cold start); a slow tail is usually queueing, retries, or a fallback to a bigger model. Then I'd decompose one slow trace by span and find which component owns the time. Averages lie; I'd work off p95/p99, not the mean." Naming p50-vs-tail and averages-lie is the differentiator.
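A small sketch of why averages lie, using the seven `total` values from the sample trace in the log-triage drill. The nearest-rank `percentile` helper is one common definition chosen for simplicity, not necessarily what a given metrics backend computes.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Request totals in ms, taken from the sample trace in the drill.
totals_ms = [1420, 1380, 9850, 1510, 14200, 1460, 8900]

summary = {
    "mean": round(mean(totals_ms)),
    "p50": percentile(totals_ms, 50),
    "p95": percentile(totals_ms, 95),
}
print(summary)  # {'mean': 5531, 'p50': 1510, 'p95': 14200}
```

The mean (about 5.5 s) describes no request that actually happened: most requests finish in about 1.5 s, and the slow ones take 9 to 14 s.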
Prompt (what an interviewer hands you): "Here are per-request trace logs. The client says the agent got slow this week. What do you look for?"
Sample (each line is one request's span breakdown, ms):
```
req  total   auth  retrieval  ttft  gen    tools             steps  model
001  1,420   30    180        220   640    order:150         1      sonnet
002  1,380   28    170        210   610    order:160         1      sonnet
003  9,850   31    175        215   630    order:145         1      sonnet   <-- outlier
004  1,510   35    240        230   660    order:160         1      sonnet
005  14,200  29    182        980   5,900  order:150,kb:120  4      opus     <-- outlier
006  1,460   30    190        225   640    order:155         1      sonnet
007  8,900   33    170        —     —      —                 6      sonnet   <-- outlier
```
What you look for (the reference checklist):
- p50 vs tail. p50 is ~1.5s and healthy (the four normal requests run 1.4 to 1.5s); only some requests blow up. So it's a tail problem, not systemic. Don't touch the model or retrieval baseline.
- Decompose each outlier by span:
  - req 003: every span is normal but `total` is 9.8s. The time is unaccounted for by the spans, which means it's outside the instrumented work: queue wait or a cold start. Check concurrency vs worker limit at that timestamp.
  - req 005: `model=opus` (not sonnet), `gen=5,900ms`, `steps=4`. This request fell back to a bigger model and looped. Root cause is an escalation/fallback trigger firing when it shouldn't, or a re-plan loop. The model swap and the step count together are the smoking gun.
  - req 007: `steps=6`, and `ttft`/`gen`/`tools` are blank (`—`): the request never completed a generation, likely a retry storm or a tool timeout causing re-planning. Look at the retry/timeout logs for that request.
- Correlate to "this week." What changed? A deploy, a prompt change that raised the fallback rate, a KB re-index that cooled the cache, a traffic increase that saturated the pool. The logs point at which component; the change-log tells you why now.
- What's missing from the logs that you'd add: queue-wait span, retry count, cache hit/miss, and cost per request. If req 003's slowness is unaccounted for, the instrumentation itself has a gap, and "add the missing span" is a legitimate answer.
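The checklist above can be sketched as a small triage script over the sample rows. The thresholds (`OUTLIER_FACTOR`, `UNACCOUNTED_SHARE`), the flag names, and the choice of `sonnet` as the baseline model are assumptions made for illustration. The `tools` column is the sum of the per-tool times in each row, and `None` marks a blank span.

```python
from statistics import median

# (req, total, auth, retrieval, ttft, gen, tools, steps, model), all times in ms.
ROWS = [
    ("001", 1420, 30, 180, 220, 640, 150, 1, "sonnet"),
    ("002", 1380, 28, 170, 210, 610, 160, 1, "sonnet"),
    ("003", 9850, 31, 175, 215, 630, 145, 1, "sonnet"),
    ("004", 1510, 35, 240, 230, 660, 160, 1, "sonnet"),
    ("005", 14200, 29, 182, 980, 5900, 270, 4, "opus"),
    ("006", 1460, 30, 190, 225, 640, 155, 1, "sonnet"),
    ("007", 8900, 33, 170, None, None, None, 6, "sonnet"),
]

BASELINE_MODEL = "sonnet"
OUTLIER_FACTOR = 2.0      # assumed: outlier = total above 2x the median
UNACCOUNTED_SHARE = 0.5   # assumed: flag when spans explain under half the total


def triage(rows):
    """Return {req: [flags]} for outlier requests only."""
    cutoff = OUTLIER_FACTOR * median(row[1] for row in rows)
    findings = {}
    for req, total, auth, retrieval, ttft, gen, tools, steps, model in rows:
        if total <= cutoff:
            continue  # baseline request, nothing to explain
        flags = []
        spans = [auth, retrieval, ttft, gen, tools]
        if None in spans:
            flags.append("incomplete-generation")
        elif (total - sum(spans)) / total > UNACCOUNTED_SHARE:
            flags.append("unaccounted-time")
        if model != BASELINE_MODEL:
            flags.append("model-fallback")
        if steps > 1:
            flags.append("multi-step")
        findings[req] = flags
    return findings


findings = triage(ROWS)
for req, flags in findings.items():
    print(req, flags)
```

The flags only say which component to look at. They don't say why the problem started this week, which is what the change-log correlation is for.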
The one-sentence diagnosis: "Baseline is healthy; the tail is three distinct causes — queue/cold-start (003), an over-eager model-fallback-and-loop (005), and a retry storm (007) — so I'd fix the fallback trigger and the retry policy first since those are self-inflicted, then add queue-wait and retry-count to the trace to confirm 003."
That answer — separating baseline from tail, attributing each outlier to a named component, and flagging the instrumentation gap — is what a senior FDE sounds like on this case.
- "What components?" → walk the 11-row request path (Part 1), grouping deterministic gates vs LLM synthesis.
- "Why is it slow?" → p50 vs tail first, then decompose one trace by the Part 2 buckets, then correlate to what changed this week.
- "Read these logs." → run the Part 3 checklist: baseline vs outliers, attribute each outlier to a span/component, name the missing instrumentation.
```
COMPONENTS (debug against these):
  gateway/auth · PII-scrub · router · retriever · context-builder ·
  drafter(LLM) · policy-critic · tool-executor · escalation/HITL ·
  response-gate · trace
  (name the deterministic GATES separately from the LLM drafter)

LATENCY ROOT-CAUSE:
  1. p50 slow  = systemic (model / retrieval / cold start)
     tail slow = queueing / retries / model-fallback
  2. decompose ONE trace by span; find the component that owns the time
  3. correlate to what changed this week (deploy, reindex, traffic)
  work off p95/p99; averages lie

LOG TRIAGE:
  baseline vs outliers -> attribute each outlier to a named span ->
  flag the missing instrumentation (queue-wait, retries, cache, cost)
```