| 1 | +# GPT-4o — v3.0 Case Study |
| 2 | + |
| 3 | +**Date:** 2026-05-14 |
| 4 | +**Persona:** Quill |
| 5 | +**Model:** GPT-4o |
| 6 | +**Prompt version:** v3.0 (manifest `sha256:d900fe7e...`, generator `codifide-python-3.0.0`) |
| 7 | +**Task:** Content moderation pipeline (Programs 1–5) |
| 8 | + |
| 9 | +--- |
| 10 | + |
| 11 | +## Score: 4/5 first-attempt successes |
| 12 | + |
| 13 | +| Program | Result | Notes | |
| 14 | +|---------|--------|-------| |
| 15 | +| P1 — Keyword classifier | ✅ First attempt | Correct. Nested `or` → flat `or` was a style note, not a real error. | |
| 16 | +| P2 — Confidence-gated refusal | ✅ First attempt | Correct. Both entry points correct. | |
| 17 | +| P3 — Escalation router | ❌ EffectViolation | `main` calls `io.say` but declares `effects {}`. | |
| 18 | +| P4 — Pipeline with I/O | ✅ First attempt | Correct. Double-print behavior noted. | |
| 19 | +| P5 — Content-addressed composition | ✅ First attempt | Correct structure, placeholder hashes appropriate. | |
| 20 | + |
| 21 | +--- |
| 22 | + |
| 23 | +## Program-by-program analysis |
| 24 | + |
| 25 | +### P1 — Keyword classifier ✅ |
| 26 | + |
| 27 | +Clean first attempt. GPT-4o noted that `or` is variadic and self-corrected |
| 28 | +from nested `or(a, or(b, c))` to flat `or(a, b, c)`. Both forms work; the |
| 29 | +flat form is idiomatic. Correct use of `lower()` for case normalization. |
| 30 | + |
| 31 | +**Verified:** `python3 -m codifide run` → `safe`. Correct. |
| 32 | + |
| 33 | +### P2 — Confidence-gated refusal ✅ |
| 34 | + |
| 35 | +Clean first attempt. Inlined `classify_content` rather than importing it — |
| 36 | +correct choice given the task spec. The `believe label / ge(conf(label), 0.70) => label / else => bottom` pattern is exactly right. |
| 37 | + |
| 38 | +- `main_unsafe` → `unsafe` ✅ |
| 39 | +- `main_uncertain` → `uncertain` ✅ (0.75 ≥ 0.70 correctly reasoned) |
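The gate described above reduces to a threshold check on the label's confidence. The following is a minimal Python sketch, not Codifide code: it assumes `bottom` can be modeled as `None` and that a confidence value travels with the label. `Belief` and `gate` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Belief:
    """Hypothetical stand-in for a label that carries a confidence."""
    label: str
    conf: float


def gate(belief: Belief, threshold: float = 0.70) -> Optional[str]:
    """Return the label when confidence meets the threshold, else None (bottom)."""
    return belief.label if belief.conf >= threshold else None


# 0.75 is the confidence quoted in the readout; 0.50 is an invented value
# used only to show the refusal branch.
passed = gate(Belief("uncertain", 0.75))
refused = gate(Belief("unsafe", 0.50))
```

Under these assumptions the comparison is inclusive, so a confidence of exactly 0.70 passes the gate, which matches the use of `ge` in the pattern.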
| 40 | + |
| 41 | +### P3 — Escalation router ❌ |
| 42 | + |
| 43 | +**The failure:** `main` calls `io.say` but declares `effects {}`: |
| 44 | + |
| 45 | +```codifide |
| 46 | +def main |
| 47 | + intent "test route_message with two messages" |
| 48 | + sig () -> Unit |
| 49 | + effects {} # ← wrong — io.say requires {io.stdout} |
| 50 | + cand |
| 51 | + io.say(route_message("this message contains spam")) |
| 52 | + io.say(route_message("hello world")) |
| 53 | +``` |
| 54 | + |
| 55 | +**Runtime error:** `EffectViolation: 'test route_message with two messages' performed effect 'io.stdout' which is not in its declared set []` |
| 56 | + |
| 57 | +**The fix:** declare `effects {io.stdout}` on `main`. A one-line change.
| 58 | + |
| 59 | +**What GPT-4o got right:** `route_message` itself is correctly declared pure |
| 60 | +(`effects {}`). The routing logic using `if/then/else` is correct and idiomatic. |
| 61 | +The `else bottom` removal was sound reasoning (moderate already propagates bottom). |
| 62 | + |
| 63 | +**What went wrong:** GPT-4o added `io.say` calls to `main` for testing but |
| 64 | +forgot to update `main`'s effects declaration. This is the same class of error |
| 65 | +as T1-3 and T1-4 — the transitive effect rule is understood in principle but |
| 66 | +missed in practice when adding I/O to a test harness. |
| 67 | + |
| 68 | +**Verified fix:** `effects {io.stdout}` on `main` → `blocked\nescalate-to-human`. ✅ |
| 69 | + |
| 70 | +### P4 — Pipeline with I/O ✅ |
| 71 | + |
| 72 | +Clean first attempt. Correctly declared `effects {io.stdout}` on both |
| 73 | +`run_pipeline` and `main`. The `decision <- route_message(message) / io.say(decision) / decision` pattern is correct — bind, print, return. Double-print noted. |
| 74 | + |
| 75 | +**Verified:** `blocked\nblocked`. ✅ |
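The doubled output is consistent with `run_pipeline` printing the decision and then returning it to a caller that prints it again. The Python sketch below models that reading. It assumes that `main` prints the returned value and that the input is a spam message, neither of which the readout states explicitly, and it uses a stand-in `route_message`.

```python
import io
from contextlib import redirect_stdout


def route_message(message: str) -> str:
    # Stand-in for the real router; only the spam check is modeled here.
    return "blocked" if "spam" in message else "escalate-to-human"


def run_pipeline(message: str) -> str:
    decision = route_message(message)  # bind
    print(decision)                    # print
    return decision                    # return


def main() -> None:
    # Assumed: the caller prints the returned decision a second time.
    print(run_pipeline("this message contains spam"))


# Capture stdout so the doubled line can be inspected.
buffer = io.StringIO()
with redirect_stdout(buffer):
    main()
captured = buffer.getvalue()
```

If the second print is unwanted, either the caller or the pipeline can drop its print. The bind, print, return shape itself is unaffected.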
| 76 | + |
| 77 | +### P5 — Content-addressed composition ✅ |
| 78 | + |
| 79 | +Correct structure. Imported `classify_content` and `route_message` (skipped |
| 80 | +`moderate` — reasonable since `route_message` inlines `moderate` in this |
| 81 | +version). Placeholder hashes used appropriately. |
| 82 | + |
| 83 | +--- |
| 84 | + |
| 85 | +## Findings for the A-Team |
| 86 | + |
| 87 | +### No new findings |
| 88 | + |
| 89 | +The P3 failure is a known pattern — transitive effect declaration missed when |
| 90 | +adding I/O to a test harness. This appeared in T1-3 and T1-4 as well. It is |
| 91 | +documented in AGENT_QUICKREF.md under "Every `def` must declare `effects`." |
| 92 | + |
| 93 | +The fix is always the same: add the effect label to the declaring function. |
| 94 | +The error message is clear (`performed effect 'io.stdout' which is not in its |
| 95 | +declared set []`). No new documentation or parser changes needed. |
| 96 | + |
| 97 | +--- |
| 98 | + |
| 99 | +## Comparison to prior case studies |
| 100 | + |
| 101 | +| Session | Model | Score | Key failure | |
| 102 | +|---------|-------|-------|-------------| |
| 103 | +| T1-1 | GPT-4o | 3/5 | P3, P5 | |
| 104 | +| T1-2 | GPT-4o | 4/5 | P5 | |
| 105 | +| T1-3 | Claude | 4/5 | P5 | |
| 106 | +| T1-4 | Claude | 3/5 | P3, P5 | |
| 107 | +| v2.0 Relay | GPT-4o | 5/5 | — | |
| 108 | +| v3.0 Gemini 2.5 Pro | Gemini 2.5 Pro | 4/5 | P3 (FIND-G1, fixed) | |
| 109 | +| **v3.0 GPT-4o** | **GPT-4o** | **4/5** | **P3 (`effects {}` on a `main` that calls `io.say`)** |
| 110 | + |
| 111 | +GPT-4o scores 4/5 on the v3.0 prompt. The failure is a known pattern, not a |
| 112 | +new finding. Programs 1, 2, 4, and 5 were all first-attempt successes. |
| 113 | + |
| 114 | +Notable: GPT-4o's P3 failure is different from Gemini's P3 failure. Gemini |
| 115 | +hit a parser constraint (believe arm formatting, now fixed). GPT-4o hit a |
| 116 | +runtime constraint (missing effect declaration). Both are P3 failures but |
| 117 | +from different root causes. |
| 118 | + |
| 119 | +--- |
| 120 | + |
| 121 | +## Assessment |
| 122 | + |
| 123 | +Two consecutive 4/5 runs (Gemini, GPT-4o) with different P3 failure modes
| 124 | +suggest P3 is the hardest program in the task spec. The P3 failures so far are:
| 125 | + |
| 126 | +- **Gemini:** believe arm formatting (parser, now fixed) |
| 127 | +- **GPT-4o:** missing `effects {io.stdout}` on test harness `main` |
| 128 | +- **T1-4 Claude:** bind-before-when (parser, fixed in v2.0) |
| 129 | + |
| 130 | +The common thread: P3 is where agents add complexity (routing logic, I/O, |
| 131 | +or multi-step dispatch) and make a small structural mistake. The task spec |
| 132 | +may benefit from a note reminding agents to check effect declarations on
| 133 | +every `def` they write, including a test-harness `main`.
| 134 | + |
| 135 | +No action items beyond filing this readout. |
| 136 | + |
| 137 | +--- |
| 138 | + |
| 139 | +*Filed by: Douglas Jones + Claude* |