Commit 5639077

Release v1.0.14 semantic harness checker
1 parent a19b087 commit 5639077

9 files changed

Lines changed: 304 additions & 34 deletions

docs/releases/v1.0.14.md

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
# v1.0.14 - Semantic harness checker

This release publishes the latest AHE improvement from the local OpenCode
harness into the public starter kit, while keeping local providers, private MCP
configuration, credentials, raw transcripts, and machine-specific paths out of
the repository.

## Highlights

- Replaced brittle literal prompt-text checks in `opencode/scripts/check-harness.mjs`
  with bounded semantic regex checks for lead-router contract prose.
- Kept structural checks exact for security-sensitive permissions and agent
  identifiers, including `edit: deny`, shell allowlist entries, and agent names.
- Added literal fallbacks for the most critical lead prohibition checks so the
  checker remains conservative where it matters.
- Added public-safe AHE summaries for iterations 014 and 015, covering the
  baseline diagnosis, manifest, validation, and keep decision.

## Public-safety notes

- No local provider blocks are included.
- No local MCP configuration is included.
- No credentials, auth files, or private environment variables are included.
- No raw transcripts are included.
- No machine-specific absolute paths are included in the new public AHE records.

## Validation

Validated locally with:

```bash
node --check opencode/scripts/check-harness.mjs
node scripts/check-harness.mjs
./scripts/check.sh
git diff --check
```

Additional targeted smokes verified that equivalent wording rewrites pass while
real omissions and `edit: deny` permission changes still fail.
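
The targeted smokes described above can be approximated in memory. The sketch below is only an illustration: `miniCheck` and the sample documents are hypothetical stand-ins, not the real checker or the real `agents/lead.md`, and only the regex is taken from this release.

```javascript
// Minimal stand-in for the checker: returns a list of failure messages.
// `miniCheck` and the sample documents are hypothetical, not part of the repo.
function miniCheck(doc) {
  const failures = [];
  // Structural token: must match exactly.
  if (!doc.includes("edit: deny")) failures.push("missing edit: deny");
  // Semantic rule: bounded regex taken from this release's checker.
  if (!/\bdo\s+not\s+(edit|modify|change|alter)\s+code\b/i.test(doc)) {
    failures.push("missing lead edit prohibition");
  }
  return failures;
}

const baseline = "---\nedit: deny\n---\nYou are a fast router. Do not edit code.";
const smokes = {
  baseline,
  // Equivalent wording: should still pass.
  rewrite: baseline.replace("Do not edit code", "Do not modify code"),
  // Rule removed entirely: should fail.
  omission: baseline.replace(" Do not edit code.", ""),
  // Permission weakened: should fail on the exact structural check.
  permission: baseline.replace("edit: deny", "edit: ask"),
};

const outcome = Object.fromEntries(
  Object.entries(smokes).map(([name, doc]) => [name, miniCheck(doc)]),
);
console.log(outcome);
```

In this sketch `baseline` and `rewrite` produce no failures, while `omission` and `permission` each produce exactly one.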
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# Token vs Semantics

## Structural Tokens

These remain exact literals:

- `edit: deny`
- `"cd": allow`
- `"cd *": allow`
- `"which": allow`
- `"which *": allow`
- agent identifiers such as `developer`, `researcher`, `designer`, and
  `specifier`

## Semantic Tokens

These are prompt-contract concepts and can be verified with bounded regexes:

- fast router behavior;
- asking the user when real ambiguity changes routing;
- lead must not edit code;
- lead must not develop or deeply investigate code;
- implementation corrections return to `developer`;
- handoffs are self-contained;
- diff review belongs to `reviewer`;
- discovery is delegated to `researcher`.

## Acceptance Bar

The semantic checker is acceptable only if it passes equivalent rewrites while
still failing real omissions and structural permission changes.
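
One way to see why structural tokens stay exact is to compare an exact substring check with a loosened pattern. This is a rough sketch under assumptions: the frontmatter fragments, `missingTokens`, and `looseEdit` are hypothetical and only illustrate the risk; they are not taken from the checker.

```javascript
// Hypothetical frontmatter fragments; the real file layout may differ.
const safe = 'permission:\n  edit: deny\n  bash:\n    "cd *": allow';
const weakened = 'permission:\n  edit: ask\n  bash:\n    "cd *": allow';

// Two structural tokens from the list above, matched as exact substrings.
const structuralTokens = ["edit: deny", '"cd *": allow'];
const missingTokens = (text) =>
  structuralTokens.filter((token) => !text.includes(token));

// A deliberately loose pattern (not from the commit) that accepts any value.
const looseEdit = /edit:\s*\w+/;

const report = {
  safeMissing: missingTokens(safe),
  weakenedMissing: missingTokens(weakened),
  looseAcceptsWeakened: looseEdit.test(weakened),
};
console.log(report);
```

The exact check reports `edit: deny` as missing from the weakened fragment, whereas the loose pattern would accept `edit: ask`, which is the kind of silent permission change the acceptance bar is meant to catch.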
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# Analysis: iteration-014-harness-baseline

The useful root cause was checker brittleness, not agent behavior. The checker
was using exact `String.includes()` checks for rules that are semantic prompt
contracts, such as "Do not edit code" and "delegate to `researcher`".

The recommended change was narrow:

- keep YAML permissions and agent names as exact literals;
- replace semantic phrase tokens with bounded regex checks;
- keep literal fallbacks for the most critical prohibition rules;
- do not change routing, providers, models, permissions, or agent behavior.

No public raw transcripts or machine-local paths are required for this evidence.
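
The brittleness described in this analysis can be reproduced in a few lines. The sketch is illustrative only: `original` and `rewritten` are hypothetical prompt snippets, and the regex is the one this commit adds for the "do not edit code" rule.

```javascript
// Hypothetical prompt snippets: same rule, different wording.
const original = "Do not edit code. Delegate to `developer`.";
const rewritten = "Do not modify code. Delegate to `developer`.";

// Old style: exact substring check.
const literalCheck = (text) => text.includes("Do not edit code");

// New style: the bounded regex added to check-harness.mjs in this commit.
const semanticCheck = (text) =>
  /\bdo\s+not\s+(edit|modify|change|alter)\s+code\b/i.test(text);

const results = {
  literalOriginal: literalCheck(original),
  literalRewritten: literalCheck(rewritten),
  semanticOriginal: semanticCheck(original),
  semanticRewritten: semanticCheck(rewritten),
};
console.log(results);
```

Only `literalRewritten` comes out `false`: the literal check rejects a rewrite that preserves the rule, while the regex accepts both wordings.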
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
# Iteration 014 - Harness Baseline

## Status

- Checker baseline: `node scripts/check-harness.mjs` passed before the semantic-checker change.
- Evidence type: `static_contract`.
- Scope: inspect prior AHE runs and identify low-risk harness improvement opportunities.

## Findings

1. Runtime replay coverage is still limited across older AHE iterations.
2. `check-harness.mjs` used literal phrase checks for semantic lead-router rules.
3. Manual benchmark scenarios exist, but not all are executed systematically.
4. `evolution_history.md` is not a complete iteration index.

## Selected Opportunity

Only finding 2 justified an immediate harness change. Literal phrase checks made
the checker brittle when prompt text was rewritten with equivalent meaning. The
narrowest improvement was to keep structural tokens exact while replacing
semantic phrase checks with flexible regex patterns.

## Deferred Opportunities

- Execute more `transcript_replay` scenarios before changing routing behavior.
- Complete the evolution history/index in a separate documentation pass.
- Improve regex coverage only when a concrete false negative is found.

## Validation Recommendation

The next iteration should verify that:

- current documents still pass the checker;
- equivalent wording rewrites pass;
- real omissions still fail;
- structural permissions and agent identifiers remain exact.
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
# Analysis: iteration-015-semantic-checker

Post-change validation confirms the checker improvement should be kept.

The main review risk was false confidence: semantic regexes can become too broad
or too narrow. The public implementation mitigates this by:

- keeping structural permissions and agent identifiers as exact literals;
- keeping literal fallbacks for critical lead prohibition checks;
- adding rewrite and omission smoke tests during validation;
- keeping the change scoped to `scripts/check-harness.mjs`.

No routing, model, provider, MCP, permission, or agent behavior changed.
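
The "too broad or too narrow" risk named above can be made concrete with a small comparison. This is a sketch under assumptions: the two sample sentences and the `tooBroad` and `tooNarrow` patterns are invented for illustration, and only `bounded` is a pattern from the commit.

```javascript
// Hypothetical text that does NOT state the prohibition.
const unrelated = "The developer may edit files; lead reviews code ownership.";
// Hypothetical text that states it with equivalent wording.
const equivalent = "The lead stays a router: do not alter code.";

// Too broad (illustrative): any "edit" eventually followed by "code".
const tooBroad = /edit[\s\S]*code/i;
// Too narrow (illustrative): effectively the old literal check.
const tooNarrow = /Do not edit code/;
// Bounded: the pattern this commit uses for the rule.
const bounded = /\bdo\s+not\s+(edit|modify|change|alter)\s+code\b/i;

const verdicts = {
  broadOnUnrelated: tooBroad.test(unrelated), // false positive
  boundedOnUnrelated: bounded.test(unrelated),
  narrowOnEquivalent: tooNarrow.test(equivalent), // false negative
  boundedOnEquivalent: bounded.test(equivalent),
};
console.log(verdicts);
```

Here the broad pattern accepts text that never states the rule and the narrow one rejects a faithful rewrite. The bounded pattern avoids both errors for these two samples, though it is not a guarantee for arbitrary wording.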
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
{
  "iteration": 15,
  "evaluates_iteration": 15,
  "results": [
    {
      "change_id": "chg-1-semantic-checker",
      "predicted_fixes_confirmed": [
        "Equivalent wording rewrites pass the checker.",
        "Real rule omissions still fail the checker.",
        "Structural permission and agent identifier checks remain exact.",
        "No routing, provider, model, MCP, credential, or agent behavior is changed."
      ],
      "predicted_fixes_not_confirmed": [],
      "risk_tasks_regressed": [],
      "risk_tasks_not_regressed": [
        "Do not make regex patterns broad enough to pass unrelated text.",
        "Keep literal fallbacks for the most critical lead prohibition checks.",
        "Keep permissions and agent identifiers as exact literals.",
        "Do not include private providers, MCPs, credentials, local paths, or raw transcripts."
      ],
      "unpredicted_regressions": [],
      "decision": "keep",
      "evidence": [
        "docs/ai/evolution/runs/iteration-015-semantic-checker/evaluation.md",
        "docs/ai/evolution/runs/iteration-015-semantic-checker/change_manifest.json",
        "scripts/check-harness.mjs"
      ],
      "notes": "The public sync ports the local semantic-checker improvement to the English packaged harness and includes only sanitized summary evidence."
    }
  ]
}
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
{
  "iteration": 15,
  "changes": [
    {
      "id": "chg-1-semantic-checker",
      "type": "improvement",
      "description": "Replace brittle literal prompt-text checks in check-harness.mjs with bounded semantic regex checks while preserving exact matching for structural permissions and agent identifiers.",
      "files": [
        "scripts/check-harness.mjs"
      ],
      "failure_pattern": "The checker failed when lead-router contract text was rewritten with equivalent wording, even though the rule meaning was preserved.",
      "evidence": [
        "docs/ai/evolution/runs/iteration-014-harness-baseline/evaluation.md",
        "docs/ai/evolution/runs/iteration-014-harness-baseline/analysis/overview.md",
        "docs/ai/evolution/runs/iteration-014-harness-baseline/analysis/detail/token-vs-semantic.md"
      ],
      "root_cause": "The verification mechanism treated prompt prose as exact tokens instead of semantic contract rules.",
      "component_touched": "tool",
      "predicted_fixes": [
        "Equivalent wording rewrites pass the checker.",
        "Real rule omissions still fail the checker.",
        "Structural permission and agent identifier checks remain exact.",
        "No routing, provider, model, MCP, credential, or agent behavior is changed."
      ],
      "risk_tasks": [
        "Do not make regex patterns broad enough to pass unrelated text.",
        "Keep literal fallbacks for the most critical lead prohibition checks.",
        "Keep permissions and agent identifiers as exact literals.",
        "Do not include private providers, MCPs, credentials, local paths, or raw transcripts."
      ],
      "constraint_level": "tool",
      "why_this_component": "The defect is in the mechanical checker, not in agent routing or model configuration.",
      "post_change_validation": [
        "Run `node --check scripts/check-harness.mjs`.",
        "Run `node scripts/check-harness.mjs` from the packaged opencode directory.",
        "Run semantic rewrite smoke tests on temporary copies.",
        "Run omission smoke tests on temporary copies.",
        "Run `git diff --check`."
      ],
      "decision_criteria": {
        "keep": "Checker passes current docs, passes equivalent rewrites, fails real omissions, and keeps structural checks exact.",
        "improve": "Some equivalent rewrites still fail or some real omissions pass.",
        "rollback+pivot": "Regex checks become less reliable than literal checks."
      }
    }
  ]
}
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# Iteration 015 - Semantic Checker Evaluation

## Result

Decision: `keep`.

The public checker now preserves exact structural checks and uses bounded
semantic regex checks for lead-router prompt prose.

## Scenarios

| Scenario | Expected | Actual |
| --- | --- | --- |
| `node --check scripts/check-harness.mjs` | pass | pass |
| `node scripts/check-harness.mjs` | pass | pass |
| Equivalent lead wording rewrite | pass | pass |
| Equivalent `commands.md` wording rewrite | pass | pass |
| Remove "Do not edit code" rule | fail | fail |
| Change `edit: deny` to `edit: ask` | fail | fail |
| `git diff --check` | pass | pass |

## Notes

- Bounded regexes are intentionally not a full natural-language parser.
- Critical lead prohibition checks keep literal fallbacks.
- No public artifact includes raw transcripts, private providers, MCP config,
  credentials, or local machine paths.

opencode/scripts/check-harness.mjs

Lines changed: 66 additions & 34 deletions
@@ -121,6 +121,28 @@ function checkConfig() {
   }
 }

+/**
+ * Verify that all given regex patterns match the text.
+ * Each pattern is either a RegExp, a string treated as literal, or an object
+ * with a regex and optional literal fallbacks.
+ */
+function checkSemantic(text, patterns, label) {
+  for (let i = 0; i < patterns.length; i++) {
+    const p = patterns[i];
+    const re =
+      p instanceof RegExp
+        ? p
+        : p.regex instanceof RegExp
+          ? p.regex
+          : new RegExp("\\b" + p.replace(/[.*+?^${}()|[\]\\]/g, "\\$&") + "\\b", "i");
+    const fallbackLiterals = p.fallbackLiterals ?? [];
+    if (!re.test(text) && !fallbackLiterals.some((token) => text.includes(token))) {
+      const desc = p instanceof RegExp ? p.source : p.description ?? p.regex?.source ?? p;
+      fail(`${label}[${i}]: no match for /${desc}/`);
+    }
+  }
+}
+
 function checkLeadRouterContract() {
   const text = read("agents/lead.md");
   const frontmatter = frontmatterBlock("agents/lead.md");
@@ -141,53 +163,63 @@ function checkLeadRouterContract() {
   }

   for (const token of [
-    "fast router",
     "developer",
     "researcher",
     "designer",
     "specifier",
-    "Ask the user",
-    "real ambiguity",
-    "Do not edit code",
-    "whole loop of the same free-form request",
-    "bounded task back to `developer`",
-    "implementation correction goes back to `developer`",
-    "does not develop",
-    "does not deeply",
-    "minimum",
-    "context needed to",
-    "delegate to `researcher`",
+    "`researcher`",
     "`reviewer`",
-    "Do not mentally implement the solution before delegating",
-    "handoff to another",
-    "must be self-contained",
-    "Do not review a diff yourself",
   ]) {
     if (!text.includes(token)) fail(`agents/lead.md: missing ${token}`);
   }

+  checkSemantic(text, [
+    /\b(fast|quick|rapid|agile|lightweight)\s+router\b/i,
+    /\b(ask|consult)\s+the\s+user\b/i,
+    /\b(real|genuine|material|meaningful)\s+ambiguity\b/i,
+    {
+      regex: /\bdo\s+not\s+(edit|modify|change|alter)\s+code\b/i,
+      fallbackLiterals: ["Do not edit code"],
+    },
+    /\b(whole|entire)\s+(loop|cycle)\s+of\s+the\s+same\s+free-form\s+request\b/i,
+    /\b(send|return|route|pass)\s+a\s+bounded\s+(task|correction)\s+back\s+to\s+\`developer\`/i,
+    /\b(implementation\s+)?(correction|fix|adjustment|change)\s+goes\s+back\s+to\s+\`developer\`/i,
+    {
+      regex: /\bdoes\s+not\s+(develop|code|implement|program)\b/i,
+      fallbackLiterals: ["does not develop"],
+    },
+    {
+      regex: /\bdoes\s+not\s+(deeply\s+)?(investigate|inspect|analyze|analyse)\s+code\b/i,
+      fallbackLiterals: ["does not deeply"],
+    },
+    /\bminimum\s+context\s+needed\s+to\b/i,
+    /\bdelegate\s+to\s+\`researcher\`/i,
+    /\bdo\s+not\s+(mentally\s+)?(implement|design|solve|build)\s+the\s+(solution|problem)\s+before\s+delegating\b/i,
+    /\bhandoff\s+to\s+another\b/i,
+    /\bmust\s+be\s+self-contained\b/i,
+    /\bdo\s+not\s+(review|inspect)\s+a\s+diff\s+yourself\b/i,
+  ], "agents/lead.md semantic invariant");
+
   const docs = read("docs/ai/harness/agents.md");
-  for (const token of [
-    "`lead` does not edit files",
-    "later adjustments for that",
-    "same free-form request go back to `developer`",
-    "`lead` does not develop",
-    "delegates substantive discovery to `researcher`",
-    "belongs to `reviewer`",
-    "Every `lead` handoff to another agent must be self-contained",
-  ]) {
-    if (!docs.includes(token)) fail(`docs/ai/harness/agents.md: missing ${token}`);
+  if (!docs.includes("belongs to `reviewer`")) {
+    fail("docs/ai/harness/agents.md: missing belongs to `reviewer`");
   }

+  checkSemantic(docs, [
+    /\`lead\`\s+does\s+not\s+(edit|modify|change|alter)\s+files\b/i,
+    /\b(later|subsequent)\s+(adjustments|corrections|fixes|changes)\s+for\s+that\s+same\s+free-form\s+request\s+go\s+back\s+to\s+\`developer\`/i,
+    /\`lead\`\s+does\s+not\s+(develop|code|implement|program)\b/i,
+    /\bdelegates\s+(substantive|meaningful|deep)\s+discovery\s+to\s+\`researcher\`/i,
+    /\bevery\s+\`lead\`\s+handoff\s+to\s+another\s+agent\s+must\s+be\s+self-contained\b/i,
+  ], "docs/ai/harness/agents.md semantic invariant");
+
   const commandDocs = read("docs/ai/harness/commands.md");
-  for (const token of [
-    "understand code behavior",
-    "delegates to `researcher`",
-    "delegate review to",
-    "`lead` does not replace `reviewer`",
-  ]) {
-    if (!commandDocs.includes(token)) fail(`docs/ai/harness/commands.md: missing ${token}`);
-  }
+  checkSemantic(commandDocs, [
+    /\bunderstand\s+(code\s+behavior|how\s+the\s+code\s+works|the\s+code)\b/i,
+    /\bdelegates\s+to\s+\`researcher\`/i,
+    /\bdelegate\s+review\s+to\s+\`reviewer\`/i,
+    /\`lead\`\s+does\s+not\s+replace\s+\`reviewer\`/i,
+  ], "docs/ai/harness/commands.md semantic invariant");
 }

 function checkFrontmatter() {
