Skip to content

Commit a3e0cde

Browse files
LEANDERANTONYclaude
andcommitted
docs: ADR-031 + DEVLOG Day 59 — resume-builder agentic architecture
Documents the architecture across Slices 1A through 1F (commits 055fd60, 78184fe, ff2c93a, bcd64d5, 23ec230, 22a4e73, a2699d8, 674c994). The ADR captures the non-obvious decisions worth preserving: - Native Responses-API tool calling over LangGraph (zero clear value-add at this scope for an enormous dependency cost; ChatGPT itself runs on this loop) - `run_tool_loop` lives on `OpenAIService` so the cross-cutting concerns (budget enforcement, cost-trace, usage recording) stay in one place - Iteration cap = 12, raised from 5 after the QA replay caught the serial-fetch regression (6 sequential GitHub READMEs exhausted the original cap → regex fallback → quality regression) - Tool registry + JSON-error contract: errors are first-class outputs, never raised across the tool boundary, so the loop can react to a tool failure without crashing the turn - **The function-wrap pattern for `web_search`** — the most non-obvious decision. OpenAI's built-in `{"type": "web_search"}` is INCOMPATIBLE with JSON mode (`400 - "Web Search cannot be used with JSON mode."`). Our intake REQUIRES json_object. So we wrap web_search as a function tool whose dispatcher fires a separate inner responses.create without json_object. Main loop stays JSON-mode-safe; agent gets research capability on-demand. - **Schema-strictness pact-tests** as mandatory CI guard against the `dict[K, V]` silent-400 trap (the structuring schema bug from Day 58) - **Pair-registry pact-tests** as mandatory CI guard against the silent-fallback drift class (the RESUME_THEMES vs SUPPORTED_THEMES bug from Day 58) - Character-budget history slicing (replaces the hard `[-12:]` cap) - `proactive_offer` as a distinct JSON channel + click-to-accept chip - `pending_followups` with the TRIGGER PRIORITY rule for open-ended questions ("what else?" → surface oldest follow-up first) Explicit non-decisions called out: - LangGraph not added - External web-search provider (Tavily/Brave/Exa) not integrated yet - `pending_followups` has no UI panel in v1 (data flows through API) - Eval at 10 fixtures, not the 15-20 the parked plan calls for DEVLOG Day 59 covers Slices 1E and 1F in narrative form for future session readers — what the bugs were, why the chosen fix was the right one, what the verification shape looks like. Added to docs/adr/README.md under the "AI workflow" section, right after ADR-016 (the original conversational resume-builder ADR) since ADR-031 explicitly extends it from form-filler to tool-using agent. Verification: 202 hermetic tests green across the affected suites. 10/10 LLM scenarios green on the live API in the agentic runner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 674c994 commit a3e0cde

3 files changed

Lines changed: 514 additions & 0 deletions

File tree

docs/DEVLOG.md

Lines changed: 184 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1942,3 +1942,187 @@ This session, end-to-end:
19421942
Six commits. The user's original "the projects sections couldn't
19431943
generate the data" complaint and the modern_blue-looks-like-classic_ats
19441944
complaint both resolved, with regression tests pinning each fix down.
1945+
1946+
## Day 59: Phase 2 of the agentic upgrade — promise tracking, web_search, ADR-031
1947+
1948+
Slices 1E and 1F close out Phase 2 from `report.md`. ADR-031 documents
1949+
the whole arc (1A through 1F).
1950+
1951+
### Slice 1E: `pending_followups[]` promise tracking
1952+
1953+
The agent today acknowledges deferrals ("we can do the summary later
1954+
based on the projects") and then evaporates them. Slice 1E gives the
1955+
agent a memory across turns:
1956+
1957+
- New session field: `pending_followups: list[str]` — each entry a
1958+
short topic string the LLM owns end-to-end.
1959+
- New JSON channels on the intake turn: `add_followups: list[str]`
1960+
(new commitments captured this turn) + `resolved_followups:
1961+
list[str]` (items addressed this turn). The service applies
1962+
resolutions first (substring + case-insensitive match — the model
1963+
sometimes paraphrases its own wording), then adds new commitments
1964+
(dedupe by case-insensitive equality), then caps at 12 outstanding.
1965+
- Prompt receives an `Outstanding Follow-ups` block each turn so the
1966+
agent can see what's open.
1967+
- TRIGGER PRIORITY rule made the behavior reliable: when the user
1968+
asks an open-ended question (`"what else?"` / `"what's next?"` /
1969+
`"anything missing?"`), the agent surfaces the OLDEST outstanding
1970+
follow-up before asking for a new field. Without this rule,
1971+
gpt-5.4@medium preferred new collection questions and the eval
1972+
flagged it on the first calibrated run.
1973+
- Session persistence (export/restore JSON) carries the field through
1974+
with a backward-compatible default of `[]` for older saves.
1975+
- `_serialize_session` exposes `pending_followups` so an optional UI
1976+
surface can render an "outstanding items" panel later (v1 relies on
1977+
the agent's natural assistant_message + proactive_offer behavior).
1978+
1979+
Verification:
1980+
- 3 hermetic unit tests for the apply logic (add → resolve via
1981+
paraphrased substring; 14 items + dedupe + cap-at-12; export →
1982+
restore round-trip preserves the list).
1983+
- 1 new conversational eval scenario
1984+
(`promise_tracking_remembers_deferred_publication`): user defers a
1985+
publication on turn 3, gives skills on turn 4, asks "what else do
1986+
you need from me?" on turn 5 → agent must add the deferral to
1987+
follow-ups AND resurface it on turn 5 (behavior matcher accepts
1988+
"publication" / "paper" / "earlier you mentioned" / "graph
1989+
neural").
1990+
- 8/8 LLM scenarios green on the live API. Commit `a2699d8`.
1991+
1992+
### Slice 1F: `web_search` tool via function-wrapped OpenAI built-in
1993+
1994+
The user's original ask included "you have all the capabilities to
1995+
access urls if provided or browse web yourself." Slice 1A gave the
1996+
GitHub-URL path (`fetch_github_readme`); Slice 1F delivers the
1997+
general-web path.
1998+
1999+
The non-obvious decision (worth preserving for future readers):
2000+
2001+
**OpenAI's built-in `{"type": "web_search"}` is INCOMPATIBLE with
2002+
JSON mode.** The API returns:
2003+
```
2004+
400 - "Web Search cannot be used with JSON mode."
2005+
```
2006+
Our intake contract REQUIRES `text.format = json_object` (structured
2007+
envelope with `draft_updates` / `assistant_message` / `status` /
2008+
etc.). Removing JSON mode would force ad-hoc parsing and degrade
2009+
reliability. So the naive "add `web_search` to the tool list"
2010+
approach silently 400'd every intake turn and the service fell back
2011+
to the regex step-machine — exactly the silent-fallback pattern
2012+
Slice 1D pact-tests were built to catch. The agentic eval surfaced
2013+
the regression immediately: 3/10 passing.
2014+
2015+
The function-wrap is the fix:
2016+
- `web_search` is exposed as a FUNCTION tool to the agent (`{"type":
2017+
"function", "name": "web_search", "parameters": {"query": ...}}`).
2018+
- When the agent calls it, the dispatcher fires a SEPARATE inner
2019+
`responses.create` — WITHOUT `json_object`, WITH OpenAI's built-in
2020+
`{"type": "web_search"}` enabled — and returns the synthesized
2021+
text as the function_call_output.
2022+
- Main loop stays JSON-mode-safe; the agent gets a research
2023+
capability on-demand.
2024+
- Zero new dependencies, no new API key, no new HTTP client. Same
2025+
shape as `fetch_github_readme` from the loop's perspective.
2026+
2027+
Cost shape: each invocation = one extra `responses.create` call
2028+
(gpt-5.4-mini, ~600 tokens). Realistic usage per session: 0-2
2029+
invocations (prompt explicitly tells the agent "use SPARINGLY").
2030+
Latency: +1-2s per search.
2031+
2032+
The prompt teaches the agent:
2033+
- DO use it for: "what does a Senior MLE role at Anthropic typically
2034+
expect?", "what's standard for a fintech compliance officer
2035+
resume?", "compare Stripe vs Adyen engineering bar".
2036+
- DO NOT use it for: anything the user already shared, generic
2037+
resume advice, small talk, speculative queries ("what salary will
2038+
I get?" — refuse politely instead).
2039+
- When citing, attribute the source ("based on what I read on
2040+
Levels.fyi…") rather than asserting as fact.
2041+
2042+
Hermetic tests (`tests/backend/test_resume_builder_tools.py`,
2043+
6 new cases via a stubbed OpenAI client):
2044+
- success path returns synthesized text + asserts the inner call
2045+
does NOT use json_object format
2046+
- empty query → reject
2047+
- no openai_service → structured error
2048+
- inner-call exception → captured as `search_dispatch_failed`,
2049+
never raised across the tool boundary
2050+
- oversize result → truncated at 8 KB with `…[truncated]` marker
2051+
- `execute_tool` does NOT leak `openai_service` into
2052+
`fetch_github_readme`'s kwargs (would crash since fetch is
2053+
HTTP-only)
2054+
2055+
Conversational eval scenarios (2 new):
2056+
- `web_search_fires_on_external_context_question`: user asks
2057+
"what does Anthropic typically look for on a Senior MLE resume?"
2058+
→ agent fires web_search → grounded answer with source attribution
2059+
- `web_search_skipped_for_user_provided_info`: user is sharing their
2060+
own background → agent does NOT burn a search (no "according to"
2061+
citations in the reply)
2062+
2063+
Verification:
2064+
- 145 hermetic tests across affected suites green
2065+
- 10/10 LLM scenarios pass on the live API on the first calibrated
2066+
run. Inspection of the external-context scenario shows the model
2067+
fires `web_search` ONCE, receives a grounded answer ("Anthropic's
2068+
Senior MLE postings tend to emphasize strong Python + ML +
2069+
software engineering, production ML systems, and measurable impact
2070+
like scale, latency, reliability, or cost improvements..."), and
2071+
synthesizes a tailored reply with no hallucination.
2072+
2073+
Commit `674c994`.
2074+
2075+
### ADR-031: documenting the whole arc
2076+
2077+
`docs/adr/ADR-031-resume-builder-agentic-architecture.md` records the
2078+
architecture decisions across Slices 1A through 1F:
2079+
- Native Responses-API tool calling over LangGraph (zero clear
2080+
value-add at this scope for an enormous dependency cost)
2081+
- `run_tool_loop` lives on `OpenAIService` (cross-cutting concerns
2082+
like budget + cost-trace already live there)
2083+
- Iteration cap = 12 (raised from 5 after the QA replay caught the
2084+
serial-fetch regression)
2085+
- Tool registry + JSON-error contract (errors are first-class
2086+
outputs, never raised across the tool boundary)
2087+
- The function-wrap pattern for `web_search` (and the JSON-mode
2088+
incompatibility rationale)
2089+
- Schema-strictness pact-tests as mandatory CI guard against the
2090+
`dict[K, V]` silent-400 trap
2091+
- Pair-registry pact-tests as mandatory CI guard against the
2092+
silent-fallback drift class (the
2093+
`RESUME_THEMES` vs `SUPPORTED_THEMES` bug that surfaced this
2094+
session is the canonical example)
2095+
- Character-budget history slicing (replaces the hard `[-12:]` cap)
2096+
- `proactive_offer` as a distinct JSON channel + click-to-accept
2097+
chip in the UI
2098+
- `pending_followups` with the TRIGGER PRIORITY rule for open-ended
2099+
questions
2100+
2101+
The ADR explicitly names the trade-offs we're NOT making (no
2102+
LangGraph, no external search provider yet, no UI surface for
2103+
`pending_followups`, eval at 10 not 15-20) and the follow-ups
2104+
parked for Phase 3 (eval expansion, external search provider eval,
2105+
UI surface, multi-provider agentic eval).
2106+
2107+
### Session arc, end-to-end
2108+
2109+
The user's original three complaints, all resolved:
2110+
2111+
| Complaint | Resolution |
2112+
|---|---|
2113+
| Agent hallucinated URL-fetching capability | `fetch_github_readme` tool + honesty patch (1A) |
2114+
| "Each new question feels like a fresh GPT instance" | Full conversation history with char-budget guard (1B) |
2115+
| Agent forgot deferred commitments | `pending_followups[]` with TRIGGER PRIORITY (1E) |
2116+
2117+
Plus two silent-fallback bugs that lived in production for weeks
2118+
were caught + pact-tested:
2119+
- Schema 400 on `dict[K, V]` (the structuring LLM had been failing
2120+
on every export since the schema landed — see Day 58)
2121+
- Theme registry drift (4 of 6 themes silently rendering as
2122+
classic_ats since Phase 2a — see Day 58)
2123+
2124+
Phase 1 + Phase 2 of the parked agentic-upgrade plan: **shipped**.
2125+
What's parked for Phase 3: eval expansion to 15-20 fixtures, an
2126+
optional `pending_followups` UI panel, external web-search provider
2127+
integration (only if a quality gap surfaces), and a multi-provider
2128+
agentic eval when ADR-028 D1 lands.

0 commit comments

Comments
 (0)