You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: docs/training-data-vs-practice.adoc
+105Lines changed: 105 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -101,6 +101,111 @@ Four findings, each robust across the four models:
101
101
102
102
The fourth point is the strongest form of the argument. We could not test older model generations — Claude 3.x is no longer reachable through this account — but we did not need to. The gap shows up on the latest models; older ones only widen it.
103
103
104
+
== Cross-Model Validation
105
+
https://github.com/JensGrote[Jens Grote]
106
+
2026-06-09
107
+
108
+
_The experiment above tested only Claude tiers. A reproduction of the full A–E battery against GPT-5, GPT-5-mini and Gemini 2.5 Flash confirms the mechanism is model-family-independent — and surfaces a third failure mode the Claude-only test could not reveal._
109
+
110
+
=== Setup
111
+
112
+
The same five framings (A–E) and five prior-mapping probes (P1–P5) were run against three non-Claude models, each in a clean session without system prompts or custom instructions:
113
+
114
+
* **GPT-5** (OpenAI, May 2026)
115
+
* **GPT-5-mini** (OpenAI, May 2026)
116
+
* **Gemini 2.5 Flash** (Google, May 2026)
117
+
118
+
Raw outputs: `anchor-activation-test-20260609/`.
119
+
120
+
=== The Cockburn Anchor Fires Universally
121
+
122
+
In framing C ("fully-dressed use cases according to Cockburn"), every model produces the full apparatus — Primary Actor, Stakeholders & Interests, Main Success Scenario, Extensions, goal level — regardless of vendor or size. The dense prior transcends model families.
123
+
124
+
=== Three Failure Modes
125
+
126
+
Where the original experiment found two behaviours in framing E (transparent substitution on Opus; silent substitution on Haiku), the cross-model test surfaces a third.
127
+
128
+
==== 1. Transparent substitution
129
+
130
+
The model recognises the gap and says so. GPT-5-mini on probe P5:
131
+
132
+
[quote, GPT-5-mini, P5]
133
+
____
134
+
I'm not certain which "Use‑Case 3.0" you mean — several different authors and communities have used that label. Could you tell me which source or context you're referring to?
135
+
____
136
+
137
+
This is the safest failure — you _know_ the anchor did not fire.
138
+
139
+
==== 2. Silent substitution
140
+
141
+
The model delivers content under the requested label, but the content belongs to an older concept. Gemini 2.5 Flash on framing E prints a document headed "Use-Case 3.0 slices" whose body is recognisably Use-Case 2.0 slice technique. The label says 3.0; the content is 2.0.
142
+
143
+
This confirms the original finding — and shows it is not Claude-specific. Gemini does it too.
144
+
145
+
==== 3. Confabulation
146
+
147
+
The model invents a confident, internally consistent but _fictitious_ description. This is the hardest failure to detect.
148
+
149
+
GPT-5 on probe P5 produces "10 principles of Use-Case 3.0": _Outcome‑first, Usage‑centric, Separate need from solution, Slice for flow, Example‑ and test‑driven, Just‑enough just‑in‑time detail, Scalable and fractal, Single shared model, Holistic quality, Visual and collaborative._
150
+
151
+
Gemini 2.5 Flash goes further — it attributes "10 foundational principles" including _Universally Applicable_, _Focus on the Big Picture_, _Tell the Whole Story_, _Trigger Conversations_, _Prioritize Readability_.
152
+
153
+
**Neither list matches reality.** The published Use-Case Foundation (Jacobson, 2024) lists _nine_ principles with different wording (https://www.ivarjacobson.com/publications/articles/use-cases-ultimate-guide[IJI guide]). The count is wrong, the names are invented. This is not a denser prior — it is fabrication.
154
+
155
+
Confabulation is more dangerous than silent substitution because:
156
+
157
+
* It passes superficial review (numbered principles _look_ authoritative)
158
+
* It creates false confidence
159
+
* It is undetectable without domain knowledge or source verification
160
+
161
+
=== The Anchor Viability Horizon
162
+
163
+
These results point forward. An anchor's viability is not permanent — it tracks the density of the training data at training time.
164
+
165
+
* **Stable anchors** (Cockburn, GoF, SOLID): densely represented, fire reliably across all model families for years to come.
166
+
* **Emerging anchors** (Use-Case 2.0): fire on strong models, degrade on smaller ones. Viability will improve as published corpus grows.
167
+
* **Pre-anchor concepts** (Use-Case 3.0): not yet viable. Too thin in the corpus — they confabulate rather than activate. These belong in *contracts*, not anchors.
168
+
169
+
As models retrain on newer data, the boundary shifts. A concept that confabulates today may become a reliable anchor in a future generation — but only if the published corpus grows to match. The catalog should periodically re-run this battery to track which concepts have crossed the density threshold.
170
+
171
+
=== Summary
172
+
173
+
[cols="1,2,2,2", options="header"]
174
+
|===
175
+
| Framing | GPT-5 | GPT-5-mini | Gemini 2.5 Flash
176
+
177
+
| *A — no anchor*
178
+
| Generic requirements list.
179
+
| Generic requirements list.
180
+
| Sequential phases. No use-case structure.
181
+
182
+
| *B — "use cases"*
183
+
| Full use-case model with actors and specification.
184
+
| Use-case structure with actors and flows.
185
+
| Use-case structure, less detailed than C.
186
+
187
+
| *C — "Cockburn fully-dressed"*
188
+
| Full apparatus: goal level, stakeholders, MSS, extensions, guarantees.
189
+
| Full apparatus.
190
+
| Full apparatus: primary actor, scope, level, stakeholders, MSS, extensions.
191
+
192
+
| *D — "Use-Case 2.0 slices"*
193
+
| Real slices with acceptance criteria.
194
+
| Real slices, incremental.
195
+
| Real slices, walking skeleton first.
196
+
197
+
| *E — "Use-Case 3.0 slices"*
198
+
| Slices labelled "3.0" — content is 2.0. *Confabulates on prior probe.*
199
+
| Slices labelled "3.0" — content is 2.0. *Transparent on prior probe.*
200
+
| Slices labelled "3.0" — content is 2.0. *Confabulates on prior probe.*
201
+
|===
202
+
203
+
=== Implications
204
+
205
+
. The anchor mechanism is *model-family-independent*. The catalog's value holds across vendors.
206
+
. The three-layer model (anchor / contract / article) is not optional — confabulation makes weak-prior terms actively dangerous as anchors.
207
+
. The catalog needs a *viability test* per anchor: can the term be confabulated? If yes, it is not ready.
208
+
104
209
== Why the Anchor Lags Real Practice
105
210
106
211
There is a second gap, and Simon named it. A model's knowledge is a snapshot of what people _wrote down_, weighted by how much they wrote. Day-to-day practice is mostly not written down: that teams skip the fully-dressed ceremony, that they treat Cockburn and Jacobson as one thing, that user stories absorbed much of the job. None of that sits in the corpus at volume, so none of it shapes the prior.
0 commit comments