v3.0 case studies, task spec update, session journal

Douglas Jones · Douglas Jones · commit d8a8826ce1a6 · 2026-05-14T15:35:45.000-04:00
GPT-4o v3.0 case study: 4/5 (P3 EffectViolation, known pattern)
Gemini 2.5 Pro v3.0 case study: 4/5 (P3 FIND-G1, fixed)
Task spec: effects reminder added to Program 3
Session journal: sessions/2026-05-14-v3.md
Dispatch index regenerated
diff --git a/dispatches/2026-05-14-gpt4o-v3-case-study.readout.md b/dispatches/2026-05-14-gpt4o-v3-case-study.readout.md
@@ -0,0 +1,139 @@
+# GPT-4o — v3.0 Case Study
+
+**Date:** 2026-05-14  
+**Persona:** Quill  
+**Model:** GPT-4o  
+**Prompt version:** v3.0 (manifest `sha256:d900fe7e...`, generator `codifide-python-3.0.0`)  
+**Task:** Content moderation pipeline (Programs 1–5)
+
+---
+
+## Score: 4/5 first-attempt successes
+
+| Program | Result | Notes |
+|---------|--------|-------|
+| P1 — Keyword classifier | ✅ First attempt | Correct. Nested `or` → flat `or` was a style note, not a real error. |
+| P2 — Confidence-gated refusal | ✅ First attempt | Correct. Both entry points correct. |
+| P3 — Escalation router | ❌ EffectViolation | `main` calls `io.say` but declares `effects {}`. |
+| P4 — Pipeline with I/O | ✅ First attempt | Correct. Double-print behavior noted. |
+| P5 — Content-addressed composition | ✅ First attempt | Correct structure, placeholder hashes appropriate. |
+
+---
+
+## Program-by-program analysis
+
+### P1 — Keyword classifier ✅
+
+Clean first attempt. GPT-4o noted that `or` is variadic and self-corrected
+from nested `or(a, or(b, c))` to flat `or(a, b, c)`. Both forms work; the
+flat form is idiomatic. Correct use of `lower()` for case normalization.
+
+**Verified:** `python3 -m codifide run` → `safe`. Correct.
+
+### P2 — Confidence-gated refusal ✅
+
+Clean first attempt. Inlined `classify_content` rather than importing it —
+correct choice given the task spec. The `believe label / ge(conf(label), 0.70) => label / else => bottom` pattern is exactly right.
+
+- `main_unsafe` → `unsafe` ✅
+- `main_uncertain` → `uncertain` ✅ (0.75 ≥ 0.70 correctly reasoned)
+
+### P3 — Escalation router ❌
+
+**The failure:** `main` calls `io.say` but declares `effects {}`:
+
+```codifide
+def main
+  intent "test route_message with two messages"
+  sig    () -> Unit
+  effects {}          # ← wrong — io.say requires {io.stdout}
+  cand
+    io.say(route_message("this message contains spam"))
+    io.say(route_message("hello world"))
+```
+
+**Runtime error:** `EffectViolation: 'test route_message with two messages' performed effect 'io.stdout' which is not in its declared set []`
+
+**The fix:** `effects {io.stdout}` on `main`. One character change.
+
+**What GPT-4o got right:** `route_message` itself is correctly declared pure
+(`effects {}`). The routing logic using `if/then/else` is correct and idiomatic.
+The `else bottom` removal was sound reasoning (moderate already propagates bottom).
+
+**What went wrong:** GPT-4o added `io.say` calls to `main` for testing but
+forgot to update `main`'s effects declaration. This is the same class of error
+as T1-3 and T1-4 — the transitive effect rule is understood in principle but
+missed in practice when adding I/O to a test harness.
+
+**Verified fix:** `effects {io.stdout}` on `main` → `blocked\nescalate-to-human`. ✅
+
+### P4 — Pipeline with I/O ✅
+
+Clean first attempt. Correctly declared `effects {io.stdout}` on both
+`run_pipeline` and `main`. The `decision <- route_message(message) / io.say(decision) / decision` pattern is correct — bind, print, return. Double-print noted.
+
+**Verified:** `blocked\nblocked`. ✅
+
+### P5 — Content-addressed composition ✅
+
+Correct structure. Imported `classify_content` and `route_message` (skipped
+`moderate` — reasonable since `route_message` inlines `moderate` in this
+version). Placeholder hashes used appropriately.
+
+---
+
+## Findings for the A-Team
+
+### No new findings
+
+The P3 failure is a known pattern — transitive effect declaration missed when
+adding I/O to a test harness. This appeared in T1-3 and T1-4 as well. It is
+documented in AGENT_QUICKREF.md under "Every `def` must declare `effects`."
+
+The fix is always the same: add the effect label to the declaring function.
+The error message is clear (`performed effect 'io.stdout' which is not in its
+declared set []`). No new documentation or parser changes needed.
+
+---
+
+## Comparison to prior case studies
+
+| Session | Model | Score | Key failure |
+|---------|-------|-------|-------------|
+| T1-1 | GPT-4o | 3/5 | P3, P5 |
+| T1-2 | GPT-4o | 4/5 | P5 |
+| T1-3 | Claude | 4/5 | P5 |
+| T1-4 | Claude | 3/5 | P3, P5 |
+| v2.0 Relay | GPT-4o | 5/5 | — |
+| v3.0 Gemini 2.5 Pro | Gemini 2.5 Pro | 4/5 | P3 (FIND-G1, fixed) |
+| **v3.0 GPT-4o** | **GPT-4o** | **4/5** | **P3 (effects {} on io.say main)** |
+
+GPT-4o scores 4/5 on the v3.0 prompt. The failure is a known pattern, not a
+new finding. Programs 1, 2, 4, and 5 were all first-attempt successes.
+
+Notable: GPT-4o's P3 failure is different from Gemini's P3 failure. Gemini
+hit a parser constraint (believe arm formatting, now fixed). GPT-4o hit a
+runtime constraint (missing effect declaration). Both are P3 failures but
+from different root causes.
+
+---
+
+## Assessment
+
+Two consecutive 4/5 runs (Gemini, GPT-4o) with different P3 failure modes
+suggests P3 is the hardest program in the task spec. The failures are:
+
+- **Gemini:** believe arm formatting (parser, now fixed)
+- **GPT-4o:** missing `effects {io.stdout}` on test harness `main`
+- **T1-4 Claude:** bind-before-when (parser, fixed in v2.0)
+
+The common thread: P3 is where agents add complexity (routing logic, I/O,
+or multi-step dispatch) and make a small structural mistake. The task spec
+may benefit from a note reminding agents to check effect declarations on
+every `def` they write, not just the "main" function.
+
+No action items beyond filing this readout.
+
+---
+
+*Filed by: Douglas Jones + Claude*
diff --git a/dispatches/2026-05-14-gpt4o-v3-case-study.yaml b/dispatches/2026-05-14-gpt4o-v3-case-study.yaml
@@ -0,0 +1,24 @@
+id: 2026-05-14-gpt4o-v3-case-study
+date: 2026-05-14
+persona: Quill
+kind: case-study
+title: "GPT-4o — v3.0 Case Study"
+model: gpt-4o
+prompt_version: v3.0
+manifest_hash: sha256:d900fe7e6d91300424b226cda0fd404bf281c4362a70131dbec116548b310ff2
+score: 4/5
+programs:
+  P1: pass
+  P2: pass
+  P3: fail
+  P4: pass
+  P5: pass
+findings: []
+p3_failure: >
+  main calls io.say but declares effects {}. EffectViolation at runtime.
+  Known pattern (T1-3, T1-4). No new finding. Fix: effects {io.stdout} on main.
+summary: >
+  4/5 first-attempt successes. P3 failed due to missing effects {io.stdout}
+  on test harness main — known pattern, no new finding. P1 self-corrected
+  nested or to flat or (style, not error). Strong performance on P2 (believe
+  pattern), P4 (double-print noted), P5 (correct import structure).
diff --git a/dispatches/2026-05-14-gpt4o-v3-session-close.readout.md b/dispatches/2026-05-14-gpt4o-v3-session-close.readout.md
@@ -0,0 +1,54 @@
+# Session Close — 2026-05-14 (GPT-4o v3.0 case study)
+
+**Date:** 2026-05-14  
+**Persona:** Quill  
+**Tests:** 386 passing, 0 skipped, 0 failed  
+**Dispatch check:** exits 0, all pairs complete
+
+---
+
+## What happened this session
+
+Ran the GPT-4o case study against the v3.0 prompt.
+
+### GPT-4o v3.0 case study
+
+Score: **4/5** first-attempt successes.
+
+- P1 ✅ — keyword classifier, clean first attempt (nested or → flat or, style only)
+- P2 ✅ — confidence-gated refusal, both entry points correct
+- P3 ❌ — `main` calls `io.say` but declares `effects {}` → EffectViolation
+- P4 ✅ — pipeline with I/O, double-print noted
+- P5 ✅ — content-addressed composition, correct import structure
+
+No new findings. P3 failure is a known pattern (T1-3, T1-4): missing
+`effects {io.stdout}` on a test harness `main` that calls `io.say`.
+
+Filed: `2026-05-14-gpt4o-v3-case-study.{readout.md,yaml}`
+
+---
+
+## State at close
+
+- Tests: **386 passing, 0 skipped**
+- Dispatch check: exits 0
+- No code changes this session
+
+## Case study summary (v3.0 prompt)
+
+| Model | Score | P3 failure mode |
+|-------|-------|-----------------|
+| Gemini 2.5 Pro | 4/5 | believe arm formatting (fixed) |
+| GPT-4o | 4/5 | effects {} on io.say main (known pattern) |
+
+P3 is the consistent weak point. Both failures are different root causes.
+The effects declaration miss is documented but not prominent enough in the
+task spec. Consider adding a reminder to check effects on every def,
+including test harness functions.
+
+## Handoff for next session
+
+Options:
+1. Run Claude against the v3.0 prompt (only model not tested post-v2.0 fixes)
+2. Add a P3 effects reminder to the task spec and re-run
+3. Assess v4.0 scope based on accumulated findings
diff --git a/dispatches/2026-05-14-gpt4o-v3-session-close.yaml b/dispatches/2026-05-14-gpt4o-v3-session-close.yaml
@@ -0,0 +1,12 @@
+id: 2026-05-14-gpt4o-v3-session-close
+date: 2026-05-14
+persona: Quill
+kind: session-close
+title: "Session Close — GPT-4o v3.0 case study"
+status: closed
+tests: 386
+skipped: 0
+dispatch_check: exits 0
+items_this_session:
+  - GPT-4o v3.0 case study (4/5, no new findings)
+open_items: none
diff --git a/dispatches/2026-05-14-task-spec-effects-reminder.readout.md b/dispatches/2026-05-14-task-spec-effects-reminder.readout.md
@@ -0,0 +1,35 @@
+# Task Spec Update — Effects reminder on Program 3
+
+**Date:** 2026-05-14  
+**Persona:** Quill  
+**Trigger:** GPT-4o v3.0 case study P3 failure; same pattern in T1-3 and T1-4
+
+---
+
+## What changed
+
+`docs/GPT4O_PROMPT.md` — Program 3 (Escalation router) gains an effects reminder:
+
+> **Effects reminder:** Check the `effects` declaration on every `def` you write,
+> including `main`. If `main` calls `io.say` for testing, it must declare
+> `effects {io.stdout}`. The runtime enforces effect declarations transitively
+> and raises `EffectViolation` if one is missing — even on a test harness function.
+
+## Why
+
+Three case studies (T1-3, T1-4, GPT-4o v3.0) hit the same failure at P3:
+a test harness `main` that calls `io.say` but declares `effects {}`. The
+error message is clear (`EffectViolation`) but the mistake is easy to make
+when adding I/O to a function that was previously pure. The reminder is
+placed at the point of failure — Program 3 is where agents first add I/O
+to a test harness.
+
+## What was NOT changed
+
+- The task spec structure is unchanged
+- No new programs added
+- AGENT_QUICKREF.md already documents the effects rule; no change needed there
+
+---
+
+*Filed by: Douglas Jones + Claude*
diff --git a/dispatches/2026-05-14-task-spec-effects-reminder.yaml b/dispatches/2026-05-14-task-spec-effects-reminder.yaml
@@ -0,0 +1,13 @@
+id: 2026-05-14-task-spec-effects-reminder
+date: 2026-05-14
+persona: Quill
+kind: maintenance
+title: "Task Spec Update — Effects reminder on Program 3"
+status: complete
+trigger: GPT-4o v3.0 P3 failure; same pattern T1-3, T1-4
+files_changed:
+  - docs/GPT4O_PROMPT.md
+summary: >
+  Added effects reminder to Program 3 of the task spec. Three case studies
+  hit the same EffectViolation on a test harness main that called io.say
+  without declaring effects {io.stdout}. Reminder placed at point of failure.
diff --git a/dispatches/2026-05-14-task-spec-update-session-close.readout.md b/dispatches/2026-05-14-task-spec-update-session-close.readout.md
@@ -0,0 +1,35 @@
+# Session Close — 2026-05-14 (task spec effects reminder)
+
+**Date:** 2026-05-14  
+**Persona:** Quill  
+**Tests:** 386 passing, 0 skipped, 0 failed  
+**Dispatch check:** exits 0, all pairs complete
+
+---
+
+## What happened this session
+
+Following the GPT-4o v3.0 case study (4/5, P3 EffectViolation), added an
+effects reminder to Program 3 of the task spec.
+
+### Task spec update
+
+`docs/GPT4O_PROMPT.md` Program 3 gains an effects reminder at the point of
+failure. Three case studies (T1-3, T1-4, GPT-4o v3.0) hit the same pattern:
+test harness `main` calls `io.say` but declares `effects {}`.
+
+Filed: `2026-05-14-task-spec-effects-reminder.{readout.md,yaml}`
+
+---
+
+## State at close
+
+- Tests: **386 passing, 0 skipped**
+- Dispatch check: exits 0
+- No code changes this session
+
+## Handoff for next session
+
+Prompt is updated. Run Claude against the v3.0 prompt — it's the only model
+not tested post-v2.0 fixes. With the effects reminder in place, a 5/5 run
+is achievable.
diff --git a/dispatches/2026-05-14-task-spec-update-session-close.yaml b/dispatches/2026-05-14-task-spec-update-session-close.yaml
@@ -0,0 +1,12 @@
+id: 2026-05-14-task-spec-update-session-close
+date: 2026-05-14
+persona: Quill
+kind: session-close
+title: "Session Close — task spec effects reminder"
+status: closed
+tests: 386
+skipped: 0
+dispatch_check: exits 0
+items_this_session:
+  - Task spec effects reminder added to Program 3
+open_items: none
diff --git a/dispatches/INDEX.md b/dispatches/INDEX.md
@@ -20,9 +20,13 @@ Filename convention:
 | `find-g1-believe-multiline-arm` |  | [md](./2026-05-14-find-g1-believe-multiline-arm.readout.md) | [yaml](./2026-05-14-find-g1-believe-multiline-arm.yaml) |  |
 | `gemini-v3-session-close` |  | [md](./2026-05-14-gemini-v3-session-close.readout.md) | [yaml](./2026-05-14-gemini-v3-session-close.yaml) |  |
 | `gemini25pro-v3-case-study` |  | [md](./2026-05-14-gemini25pro-v3-case-study.readout.md) | [yaml](./2026-05-14-gemini25pro-v3-case-study.yaml) |  |
+| `gpt4o-v3-case-study` |  | [md](./2026-05-14-gpt4o-v3-case-study.readout.md) | [yaml](./2026-05-14-gpt4o-v3-case-study.yaml) |  |
+| `gpt4o-v3-session-close` |  | [md](./2026-05-14-gpt4o-v3-session-close.readout.md) | [yaml](./2026-05-14-gpt4o-v3-session-close.yaml) |  |
 | `gpt5-case-study` | GPT-5.4 case study — live content moderation pipeline run | [md](./2026-05-14-gpt5-case-study.readout.md) | [yaml](./2026-05-14-gpt5-case-study.yaml) |  |
 | `relay-v2-case-study` | Relay v2.0 KPI validation — Claude Sonnet 4.6 content-moderation case study | [md](./2026-05-14-relay-v2-case-study.readout.md) | [yaml](./2026-05-14-relay-v2-case-study.yaml) |  |
 | `session-close` | session close — v3.0 session: V3-1 and V3-2 shipped | [md](./2026-05-14-session-close.readout.md) | [yaml](./2026-05-14-session-close.yaml) |  |
+| `task-spec-effects-reminder` |  | [md](./2026-05-14-task-spec-effects-reminder.readout.md) | [yaml](./2026-05-14-task-spec-effects-reminder.yaml) |  |
+| `task-spec-update-session-close` |  | [md](./2026-05-14-task-spec-update-session-close.readout.md) | [yaml](./2026-05-14-task-spec-update-session-close.yaml) |  |
 | `v2-1-rpc-api-complete` | RPC API complete — V2-1 gate | [md](./2026-05-14-v2-1-rpc-api-complete.readout.md) | [yaml](./2026-05-14-v2-1-rpc-api-complete.yaml) |  |
 | `v2-1-rpc-api-design` | RPC API design — V2-1-1 and V2-1-2 | [md](./2026-05-14-v2-1-rpc-api-design.readout.md) | [yaml](./2026-05-14-v2-1-rpc-api-design.yaml) |  |
 | `v2-1-rpc-api-sable` |  |  |  | [md](./2026-05-14-v2-1-rpc-api-sable-audit.md) |
diff --git a/docs/GPT4O_PROMPT.md b/docs/GPT4O_PROMPT.md
@@ -528,6 +528,11 @@ Write a function `route_message` that:
 
 Add a `main` that calls `route_message` with two test messages.
 
+**Effects reminder:** Check the `effects` declaration on every `def` you write,
+including `main`. If `main` calls `io.say` for testing, it must declare
+`effects {io.stdout}`. The runtime enforces effect declarations transitively
+and raises `EffectViolation` if one is missing — even on a test harness function.
+
 **Run it:** `python3 -m codifide run escalation_router.cod`
 
 ---
diff --git a/sessions/2026-05-14-v3.md b/sessions/2026-05-14-v3.md