Commit d8a8826

author
Douglas Jones
committed
v3.0 case studies, task spec update, session journal
GPT-4o v3.0 case study: 4/5 (P3 EffectViolation, known pattern)
Gemini 2.5 Pro v3.0 case study: 4/5 (P3 FIND-G1, fixed)
Task spec: effects reminder added to Program 3
Session journal: sessions/2026-05-14-v3.md
Dispatch index regenerated
1 parent 7263794 commit d8a8826

11 files changed

Lines changed: 514 additions & 0 deletions
Lines changed: 139 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,139 @@
1+
# GPT-4o — v3.0 Case Study
2+
3+
**Date:** 2026-05-14
4+
**Persona:** Quill
5+
**Model:** GPT-4o
6+
**Prompt version:** v3.0 (manifest `sha256:d900fe7e...`, generator `codifide-python-3.0.0`)
7+
**Task:** Content moderation pipeline (Programs 1–5)
8+
9+
---
10+
11+
## Score: 4/5 first-attempt successes
12+
13+
| Program | Result | Notes |
|---------|--------|-------|
| P1 — Keyword classifier | ✅ First attempt | Correct. Nested `or` → flat `or` was a style note, not a real error. |
| P2 — Confidence-gated refusal | ✅ First attempt | Correct. Both entry points correct. |
| P3 — Escalation router | ❌ EffectViolation | `main` calls `io.say` but declares `effects {}`. |
| P4 — Pipeline with I/O | ✅ First attempt | Correct. Double-print behavior noted. |
| P5 — Content-addressed composition | ✅ First attempt | Correct structure, placeholder hashes appropriate. |
20+
21+
---
22+
23+
## Program-by-program analysis
24+
25+
### P1 — Keyword classifier ✅
26+
27+
Clean first attempt. GPT-4o noted that `or` is variadic and self-corrected
28+
from nested `or(a, or(b, c))` to flat `or(a, b, c)`. Both forms work; the
29+
flat form is idiomatic. Correct use of `lower()` for case normalization.
30+
31+
**Verified:** `python3 -m codifide run` → `safe`. Correct.
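
The equivalence of the nested and flat forms noted for P1 can be illustrated outside Codifide. The following is a minimal Python sketch, assuming `or` behaves like an ordinary variadic logical disjunction; `variadic_or` is a hypothetical stand-in, not the Codifide built-in.

```python
def variadic_or(*args: bool) -> bool:
    """Hypothetical stand-in for a variadic `or`: true if any argument is true."""
    return any(args)


# Nested and flat call shapes agree for every combination of three booleans.
for a in (False, True):
    for b in (False, True):
        for c in (False, True):
            assert variadic_or(a, variadic_or(b, c)) == variadic_or(a, b, c)
```

Because disjunction is associative, the choice between the two shapes is purely stylistic, which matches the readout's "style note" classification.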
32+
33+
### P2 — Confidence-gated refusal ✅
34+
35+
Clean first attempt. Inlined `classify_content` rather than importing it —
36+
correct choice given the task spec. The `believe label / ge(conf(label), 0.70) => label / else => bottom` pattern is exactly right.
37+
38+
- `main_unsafe` → `unsafe`
- `main_uncertain` → `uncertain` ✅ (0.75 ≥ 0.70 correctly reasoned)
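
The gating rule above can be sketched in Python to make the threshold arithmetic explicit. This is an illustration under assumptions, not the Codifide semantics: `Belief` and `gate` are hypothetical names, and `None` stands in for `bottom`. Only the 0.70 threshold and the 0.75 confidence come from the readout.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD = 0.70  # gate value quoted in the readout


@dataclass(frozen=True)
class Belief:
    """Hypothetical pairing of a label with its confidence."""
    label: str
    conf: float


def gate(belief: Belief) -> Optional[str]:
    """Return the label when confidence clears the threshold, else None (standing in for bottom)."""
    return belief.label if belief.conf >= THRESHOLD else None


print(gate(Belief("uncertain", 0.75)))  # clears the gate because 0.75 >= 0.70
print(gate(Belief("uncertain", 0.50)))  # below the gate, so None stands in for bottom
```

The comparison is inclusive (`>=`), mirroring `ge` in the pattern, so a confidence of exactly 0.70 would also pass.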
40+
41+
### P3 — Escalation router ❌
42+
43+
**The failure:** `main` calls `io.say` but declares `effects {}`:
44+
45+
```codifide
def main
intent "test route_message with two messages"
sig () -> Unit
effects {} # ← wrong — io.say requires {io.stdout}
cand
io.say(route_message("this message contains spam"))
io.say(route_message("hello world"))
```
54+
55+
**Runtime error:** `EffectViolation: 'test route_message with two messages' performed effect 'io.stdout' which is not in its declared set []`
56+
57+
**The fix:** `effects {io.stdout}` on `main`. A one-line change.
58+
59+
**What GPT-4o got right:** `route_message` itself is correctly declared pure
60+
(`effects {}`). The routing logic using `if/then/else` is correct and idiomatic.
61+
The `else bottom` removal was sound reasoning (moderate already propagates bottom).
62+
63+
**What went wrong:** GPT-4o added `io.say` calls to `main` for testing but
64+
forgot to update `main`'s effects declaration. This is the same class of error
65+
as T1-3 and T1-4 — the transitive effect rule is understood in principle but
66+
missed in practice when adding I/O to a test harness.
67+
68+
**Verified fix:** `effects {io.stdout}` on `main` → `blocked\nescalate-to-human`. ✅
69+
70+
### P4 — Pipeline with I/O ✅
71+
72+
Clean first attempt. Correctly declared `effects {io.stdout}` on both
73+
`run_pipeline` and `main`. The `decision <- route_message(message) / io.say(decision) / decision` pattern is correct — bind, print, return. Double-print noted.
74+
75+
**Verified:** `blocked\nblocked`. ✅
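
The double print follows from the bind, print, return shape when the caller also prints the returned value. The Python sketch below assumes that is what `main` does in P4; `route_message` here is a hypothetical keyword rule chosen only to agree with the outputs quoted in this readout.

```python
import io
from contextlib import redirect_stdout


def route_message(message: str) -> str:
    """Hypothetical router: the keyword rule is illustrative only."""
    return "blocked" if "spam" in message else "escalate-to-human"


def run_pipeline(message: str) -> str:
    decision = route_message(message)  # bind
    print(decision)                    # print (first output line)
    return decision                    # return


def main() -> None:
    # Printing the returned value produces the second, duplicate line.
    print(run_pipeline("this message contains spam"))


buffer = io.StringIO()
with redirect_stdout(buffer):
    main()
assert buffer.getvalue() == "blocked\nblocked\n"
```

Dropping either the inner or the outer print would yield a single line; which one to keep depends on whether `run_pipeline` is meant to own the output.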
76+
77+
### P5 — Content-addressed composition ✅
78+
79+
Correct structure. Imported `classify_content` and `route_message` (skipped
80+
`moderate` — reasonable since `route_message` inlines `moderate` in this
81+
version). Placeholder hashes used appropriately.
82+
83+
---
84+
85+
## Findings for the A-Team
86+
87+
### No new findings
88+
89+
The P3 failure is a known pattern — transitive effect declaration missed when
90+
adding I/O to a test harness. This appeared in T1-3 and T1-4 as well. It is
91+
documented in AGENT_QUICKREF.md under "Every `def` must declare `effects`."
92+
93+
The fix is always the same: add the effect label to the declaring function.
94+
The error message is clear (`performed effect 'io.stdout' which is not in its
95+
declared set []`). No new documentation or parser changes needed.
96+
97+
---
98+
99+
## Comparison to prior case studies
100+
101+
| Session | Model | Score | Key failure |
|---------|-------|-------|-------------|
| T1-1 | GPT-4o | 3/5 | P3, P5 |
| T1-2 | GPT-4o | 4/5 | P5 |
| T1-3 | Claude | 4/5 | P5 |
| T1-4 | Claude | 3/5 | P3, P5 |
| v2.0 Relay | GPT-4o | 5/5 | none |
| v3.0 Gemini 2.5 Pro | Gemini 2.5 Pro | 4/5 | P3 (FIND-G1, fixed) |
| **v3.0 GPT-4o** | **GPT-4o** | **4/5** | **P3 (effects {} on io.say main)** |
110+
111+
GPT-4o scores 4/5 on the v3.0 prompt. The failure is a known pattern, not a
112+
new finding. Programs 1, 2, 4, and 5 were all first-attempt successes.
113+
114+
Notable: GPT-4o's P3 failure is different from Gemini's P3 failure. Gemini
115+
hit a parser constraint (believe arm formatting, now fixed). GPT-4o hit a
116+
runtime constraint (missing effect declaration). Both are P3 failures but
117+
from different root causes.
118+
119+
---
120+
121+
## Assessment
122+
123+
Two consecutive 4/5 runs (Gemini, GPT-4o) with different P3 failure modes
suggest P3 is the hardest program in the task spec. The failures are:
125+
126+
- **Gemini:** believe arm formatting (parser, now fixed)
127+
- **GPT-4o:** missing `effects {io.stdout}` on test harness `main`
128+
- **T1-4 Claude:** bind-before-when (parser, fixed in v2.0)
129+
130+
The common thread: P3 is where agents add complexity (routing logic, I/O,
or multi-step dispatch) and make a small structural mistake. The task spec
may benefit from a note reminding agents to check the effect declaration on
every `def` they write, including a test harness `main`.
134+
135+
No action items beyond filing this readout.
136+
137+
---
138+
139+
*Filed by: Douglas Jones + Claude*
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
1+
id: 2026-05-14-gpt4o-v3-case-study
2+
date: 2026-05-14
3+
persona: Quill
4+
kind: case-study
5+
title: "GPT-4o — v3.0 Case Study"
6+
model: gpt-4o
7+
prompt_version: v3.0
8+
manifest_hash: sha256:d900fe7e6d91300424b226cda0fd404bf281c4362a70131dbec116548b310ff2
9+
score: 4/5
10+
programs:
11+
P1: pass
12+
P2: pass
13+
P3: fail
14+
P4: pass
15+
P5: pass
16+
findings: []
17+
p3_failure: >
18+
main calls io.say but declares effects {}. EffectViolation at runtime.
19+
Known pattern (T1-3, T1-4). No new finding. Fix: effects {io.stdout} on main.
20+
summary: >
21+
4/5 first-attempt successes. P3 failed due to missing effects {io.stdout}
22+
on test harness main — known pattern, no new finding. P1 self-corrected
23+
nested or to flat or (style, not error). Strong performance on P2 (believe
24+
pattern), P4 (double-print noted), P5 (correct import structure).
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
1+
# Session Close — 2026-05-14 (GPT-4o v3.0 case study)
2+
3+
**Date:** 2026-05-14
4+
**Persona:** Quill
5+
**Tests:** 386 passing, 0 skipped, 0 failed
6+
**Dispatch check:** exits 0, all pairs complete
7+
8+
---
9+
10+
## What happened this session
11+
12+
Ran the GPT-4o case study against the v3.0 prompt.
13+
14+
### GPT-4o v3.0 case study
15+
16+
Score: **4/5** first-attempt successes.
17+
18+
- P1 ✅ — keyword classifier, clean first attempt (nested or → flat or, style only)
19+
- P2 ✅ — confidence-gated refusal, both entry points correct
20+
- P3 ❌ — `main` calls `io.say` but declares `effects {}` → EffectViolation
21+
- P4 ✅ — pipeline with I/O, double-print noted
22+
- P5 ✅ — content-addressed composition, correct import structure
23+
24+
No new findings. P3 failure is a known pattern (T1-3, T1-4): missing
25+
`effects {io.stdout}` on a test harness `main` that calls `io.say`.
26+
27+
Filed: `2026-05-14-gpt4o-v3-case-study.{readout.md,yaml}`
28+
29+
---
30+
31+
## State at close
32+
33+
- Tests: **386 passing, 0 skipped**
34+
- Dispatch check: exits 0
35+
- No code changes this session
36+
37+
## Case study summary (v3.0 prompt)
38+
39+
| Model | Score | P3 failure mode |
|-------|-------|-----------------|
| Gemini 2.5 Pro | 4/5 | believe arm formatting (fixed) |
| GPT-4o | 4/5 | effects {} on io.say main (known pattern) |
43+
44+
P3 is the consistent weak point. The two failures have different root causes.
45+
The effects declaration miss is documented but not prominent enough in the
46+
task spec. Consider adding a reminder to check effects on every def,
47+
including test harness functions.
48+
49+
## Handoff for next session
50+
51+
Options:
52+
1. Run Claude against the v3.0 prompt (only model not tested post-v2.0 fixes)
53+
2. Add a P3 effects reminder to the task spec and re-run
54+
3. Assess v4.0 scope based on accumulated findings
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
1+
id: 2026-05-14-gpt4o-v3-session-close
2+
date: 2026-05-14
3+
persona: Quill
4+
kind: session-close
5+
title: "Session Close — GPT-4o v3.0 case study"
6+
status: closed
7+
tests: 386
8+
skipped: 0
9+
dispatch_check: exits 0
10+
items_this_session:
11+
- GPT-4o v3.0 case study (4/5, no new findings)
12+
open_items: none
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
# Task Spec Update — Effects reminder on Program 3
2+
3+
**Date:** 2026-05-14
4+
**Persona:** Quill
5+
**Trigger:** GPT-4o v3.0 case study P3 failure; same pattern in T1-3 and T1-4
6+
7+
---
8+
9+
## What changed
10+
11+
`docs/GPT4O_PROMPT.md` — Program 3 (Escalation router) gains an effects reminder:
12+
13+
> **Effects reminder:** Check the `effects` declaration on every `def` you write,
14+
> including `main`. If `main` calls `io.say` for testing, it must declare
15+
> `effects {io.stdout}`. The runtime enforces effect declarations transitively
16+
> and raises `EffectViolation` if one is missing — even on a test harness function.
17+
18+
## Why
19+
20+
Three case studies (T1-3, T1-4, GPT-4o v3.0) hit the same failure at P3:
21+
a test harness `main` that calls `io.say` but declares `effects {}`. The
22+
error message is clear (`EffectViolation`) but the mistake is easy to make
23+
when adding I/O to a function that was previously pure. The reminder is
24+
placed at the point of failure — Program 3 is where agents first add I/O
25+
to a test harness.
26+
27+
## What was NOT changed
28+
29+
- The task spec structure is unchanged
30+
- No new programs added
31+
- AGENT_QUICKREF.md already documents the effects rule; no change needed there
32+
33+
---
34+
35+
*Filed by: Douglas Jones + Claude*
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
id: 2026-05-14-task-spec-effects-reminder
2+
date: 2026-05-14
3+
persona: Quill
4+
kind: maintenance
5+
title: "Task Spec Update — Effects reminder on Program 3"
6+
status: complete
7+
trigger: GPT-4o v3.0 P3 failure; same pattern T1-3, T1-4
8+
files_changed:
9+
- docs/GPT4O_PROMPT.md
10+
summary: >
11+
Added effects reminder to Program 3 of the task spec. Three case studies
12+
hit the same EffectViolation on a test harness main that called io.say
13+
without declaring effects {io.stdout}. Reminder placed at point of failure.
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
# Session Close — 2026-05-14 (task spec effects reminder)
2+
3+
**Date:** 2026-05-14
4+
**Persona:** Quill
5+
**Tests:** 386 passing, 0 skipped, 0 failed
6+
**Dispatch check:** exits 0, all pairs complete
7+
8+
---
9+
10+
## What happened this session
11+
12+
Following the GPT-4o v3.0 case study (4/5, P3 EffectViolation), added an
13+
effects reminder to Program 3 of the task spec.
14+
15+
### Task spec update
16+
17+
`docs/GPT4O_PROMPT.md` Program 3 gains an effects reminder at the point of
18+
failure. Three case studies (T1-3, T1-4, GPT-4o v3.0) hit the same pattern:
19+
test harness `main` calls `io.say` but declares `effects {}`.
20+
21+
Filed: `2026-05-14-task-spec-effects-reminder.{readout.md,yaml}`
22+
23+
---
24+
25+
## State at close
26+
27+
- Tests: **386 passing, 0 skipped**
28+
- Dispatch check: exits 0
29+
- No code changes this session
30+
31+
## Handoff for next session
32+
33+
Prompt is updated. Run Claude against the v3.0 prompt — it's the only model
34+
not tested post-v2.0 fixes. With the effects reminder in place, a 5/5 run
35+
is achievable.
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
1+
id: 2026-05-14-task-spec-update-session-close
2+
date: 2026-05-14
3+
persona: Quill
4+
kind: session-close
5+
title: "Session Close — task spec effects reminder"
6+
status: closed
7+
tests: 386
8+
skipped: 0
9+
dispatch_check: exits 0
10+
items_this_session:
11+
- Task spec effects reminder added to Program 3
12+
open_items: none

dispatches/INDEX.md

Lines changed: 4 additions & 0 deletions
@@ -20,9 +20,13 @@ Filename convention:
2020
| `find-g1-believe-multiline-arm` | | [md](./2026-05-14-find-g1-believe-multiline-arm.readout.md) | [yaml](./2026-05-14-find-g1-believe-multiline-arm.yaml) | |
2121
| `gemini-v3-session-close` | | [md](./2026-05-14-gemini-v3-session-close.readout.md) | [yaml](./2026-05-14-gemini-v3-session-close.yaml) | |
2222
| `gemini25pro-v3-case-study` | | [md](./2026-05-14-gemini25pro-v3-case-study.readout.md) | [yaml](./2026-05-14-gemini25pro-v3-case-study.yaml) | |
23+
| `gpt4o-v3-case-study` | | [md](./2026-05-14-gpt4o-v3-case-study.readout.md) | [yaml](./2026-05-14-gpt4o-v3-case-study.yaml) | |
24+
| `gpt4o-v3-session-close` | | [md](./2026-05-14-gpt4o-v3-session-close.readout.md) | [yaml](./2026-05-14-gpt4o-v3-session-close.yaml) | |
2325
| `gpt5-case-study` | GPT-5.4 case study — live content moderation pipeline run | [md](./2026-05-14-gpt5-case-study.readout.md) | [yaml](./2026-05-14-gpt5-case-study.yaml) | |
2426
| `relay-v2-case-study` | Relay v2.0 KPI validation — Claude Sonnet 4.6 content-moderation case study | [md](./2026-05-14-relay-v2-case-study.readout.md) | [yaml](./2026-05-14-relay-v2-case-study.yaml) | |
2527
| `session-close` | session close — v3.0 session: V3-1 and V3-2 shipped | [md](./2026-05-14-session-close.readout.md) | [yaml](./2026-05-14-session-close.yaml) | |
28+
| `task-spec-effects-reminder` | | [md](./2026-05-14-task-spec-effects-reminder.readout.md) | [yaml](./2026-05-14-task-spec-effects-reminder.yaml) | |
29+
| `task-spec-update-session-close` | | [md](./2026-05-14-task-spec-update-session-close.readout.md) | [yaml](./2026-05-14-task-spec-update-session-close.yaml) | |
2630
| `v2-1-rpc-api-complete` | RPC API complete — V2-1 gate | [md](./2026-05-14-v2-1-rpc-api-complete.readout.md) | [yaml](./2026-05-14-v2-1-rpc-api-complete.yaml) | |
2731
| `v2-1-rpc-api-design` | RPC API design — V2-1-1 and V2-1-2 | [md](./2026-05-14-v2-1-rpc-api-design.readout.md) | [yaml](./2026-05-14-v2-1-rpc-api-design.yaml) | |
2832
| `v2-1-rpc-api-sable` | | | | [md](./2026-05-14-v2-1-rpc-api-sable-audit.md) |

docs/GPT4O_PROMPT.md

Lines changed: 5 additions & 0 deletions
@@ -528,6 +528,11 @@ Write a function `route_message` that:
528528

529529
Add a `main` that calls `route_message` with two test messages.
530530

531+
**Effects reminder:** Check the `effects` declaration on every `def` you write,
532+
including `main`. If `main` calls `io.say` for testing, it must declare
533+
`effects {io.stdout}`. The runtime enforces effect declarations transitively
534+
and raises `EffectViolation` if one is missing — even on a test harness function.
535+
531536
**Run it:** `python3 -m codifide run escalation_router.cod`
532537

533538
---
