Release v1.0.14 semantic harness checker

juancarlosrodicio · juancarlosrodicio · commit 5639077acb1f · 2026-05-19T14:53:29.000+02:00
diff --git a/docs/releases/v1.0.14.md b/docs/releases/v1.0.14.md
@@ -0,0 +1,39 @@
+# v1.0.14 - Semantic harness checker
+
+This release publishes the latest AHE improvement from the local OpenCode
+harness into the public starter kit, while keeping local providers, private MCP
+configuration, credentials, raw transcripts, and machine-specific paths out of
+the repository.
+
+## Highlights
+
+- Replaced brittle literal prompt-text checks in `opencode/scripts/check-harness.mjs`
+  with bounded semantic regex checks for lead-router contract prose.
+- Kept structural checks exact for security-sensitive permissions and agent
+  identifiers, including `edit: deny`, shell allowlist entries, and agent names.
+- Added literal fallbacks for the most critical lead prohibition checks so the
+  checker remains conservative where it matters.
+- Added public-safe AHE summaries for iterations 014 and 015, covering the
+  baseline diagnosis, manifest, validation, and keep decision.
+
+## Public-safety notes
+
+- No local provider blocks are included.
+- No local MCP configuration is included.
+- No credentials, auth files, or private environment variables are included.
+- No raw transcripts are included.
+- No machine-specific absolute paths are included in the new public AHE records.
+
+## Validation
+
+Validated locally with:
+
+```bash
+node --check opencode/scripts/check-harness.mjs
+node scripts/check-harness.mjs
+./scripts/check.sh
+git diff --check
+```
+
+Additional targeted smokes verified that equivalent wording rewrites pass while
+real omissions and `edit: deny` permission changes still fail.
diff --git a/opencode/docs/ai/evolution/runs/iteration-014-harness-baseline/analysis/detail/token-vs-semantic.md b/opencode/docs/ai/evolution/runs/iteration-014-harness-baseline/analysis/detail/token-vs-semantic.md
@@ -0,0 +1,31 @@
+# Token vs Semantics
+
+## Structural Tokens
+
+These remain exact literals:
+
+- `edit: deny`
+- `"cd": allow`
+- `"cd *": allow`
+- `"which": allow`
+- `"which *": allow`
+- agent identifiers such as `developer`, `researcher`, `designer`, and
+  `specifier`
+
+## Semantic Tokens
+
+These are prompt-contract concepts and can be verified with bounded regexes:
+
+- fast router behavior;
+- asking the user when real ambiguity changes routing;
+- lead must not edit code;
+- lead must not develop or deeply investigate code;
+- implementation corrections return to `developer`;
+- handoffs are self-contained;
+- diff review belongs to `reviewer`;
+- discovery is delegated to `researcher`.
+
+## Acceptance Bar
+
+The semantic checker is acceptable only if it passes equivalent rewrites while
+still failing real omissions and structural permission changes.
diff --git a/opencode/docs/ai/evolution/runs/iteration-014-harness-baseline/analysis/overview.md b/opencode/docs/ai/evolution/runs/iteration-014-harness-baseline/analysis/overview.md
@@ -0,0 +1,14 @@
+# Analysis: iteration-014-harness-baseline
+
+The useful root cause was checker brittleness, not agent behavior. The checker
+was using exact `String.includes()` checks for rules that are semantic prompt
+contracts, such as "Do not edit code" and "delegate to `researcher`".
+
+The recommended change was narrow:
+
+- keep YAML permissions and agent names as exact literals;
+- replace semantic phrase tokens with bounded regex checks;
+- keep literal fallbacks for the most critical prohibition rules;
+- do not change routing, providers, models, permissions, or agent behavior.
+
+No public raw transcripts or machine-local paths are required for this evidence.
diff --git a/opencode/docs/ai/evolution/runs/iteration-014-harness-baseline/evaluation.md b/opencode/docs/ai/evolution/runs/iteration-014-harness-baseline/evaluation.md
@@ -0,0 +1,36 @@
+# Iteration 014 - Harness Baseline
+
+## Status
+
+- Checker baseline: `node scripts/check-harness.mjs` passed before the semantic-checker change.
+- Evidence type: `static_contract`.
+- Scope: inspect prior AHE runs and identify low-risk harness improvement opportunities.
+
+## Findings
+
+1. Runtime replay coverage is still limited across older AHE iterations.
+2. `check-harness.mjs` used literal phrase checks for semantic lead-router rules.
+3. Manual benchmark scenarios exist, but not all are executed systematically.
+4. `evolution_history.md` is not a complete iteration index.
+
+## Selected Opportunity
+
+Only finding 2 justified an immediate harness change. Literal phrase checks made
+the checker brittle when prompt text was rewritten with equivalent meaning. The
+narrowest improvement was to keep structural tokens exact while replacing
+semantic phrase checks with flexible regex patterns.
+
+## Deferred Opportunities
+
+- Execute more `transcript_replay` scenarios before changing routing behavior.
+- Complete the evolution history/index in a separate documentation pass.
+- Improve regex coverage only when a concrete false negative is found.
+
+## Validation Recommendation
+
+The next iteration should verify that:
+
+- current documents still pass the checker;
+- equivalent wording rewrites pass;
+- real omissions still fail;
+- structural permissions and agent identifiers remain exact.
diff --git a/opencode/docs/ai/evolution/runs/iteration-015-semantic-checker/analysis/overview.md b/opencode/docs/ai/evolution/runs/iteration-015-semantic-checker/analysis/overview.md
@@ -0,0 +1,13 @@
+# Analysis: iteration-015-semantic-checker
+
+Post-change validation confirms the checker improvement should be kept.
+
+The main review risk was false confidence: semantic regexes can become too broad
+or too narrow. The public implementation mitigates this by:
+
+- keeping structural permissions and agent identifiers as exact literals;
+- keeping literal fallbacks for critical lead prohibition checks;
+- adding rewrite and omission smoke tests during validation;
+- keeping the change scoped to `scripts/check-harness.mjs`.
+
+No routing, model, provider, MCP, permission, or agent behavior changed.
diff --git a/opencode/docs/ai/evolution/runs/iteration-015-semantic-checker/change_evaluation.json b/opencode/docs/ai/evolution/runs/iteration-015-semantic-checker/change_evaluation.json
@@ -0,0 +1,31 @@
+{
+  "iteration": 15,
+  "evaluates_iteration": 15,
+  "results": [
+    {
+      "change_id": "chg-1-semantic-checker",
+      "predicted_fixes_confirmed": [
+        "Equivalent wording rewrites pass the checker.",
+        "Real rule omissions still fail the checker.",
+        "Structural permission and agent identifier checks remain exact.",
+        "No routing, provider, model, MCP, credential, or agent behavior is changed."
+      ],
+      "predicted_fixes_not_confirmed": [],
+      "risk_tasks_regressed": [],
+      "risk_tasks_not_regressed": [
+        "Do not make regex patterns broad enough to pass unrelated text.",
+        "Keep literal fallbacks for the most critical lead prohibition checks.",
+        "Keep permissions and agent identifiers as exact literals.",
+        "Do not include private providers, MCPs, credentials, local paths, or raw transcripts."
+      ],
+      "unpredicted_regressions": [],
+      "decision": "keep",
+      "evidence": [
+        "docs/ai/evolution/runs/iteration-015-semantic-checker/evaluation.md",
+        "docs/ai/evolution/runs/iteration-015-semantic-checker/change_manifest.json",
+        "scripts/check-harness.mjs"
+      ],
+      "notes": "The public sync ports the local semantic-checker improvement to the English packaged harness and includes only sanitized summary evidence."
+    }
+  ]
+}
diff --git a/opencode/docs/ai/evolution/runs/iteration-015-semantic-checker/change_manifest.json b/opencode/docs/ai/evolution/runs/iteration-015-semantic-checker/change_manifest.json
@@ -0,0 +1,47 @@
+{
+  "iteration": 15,
+  "changes": [
+    {
+      "id": "chg-1-semantic-checker",
+      "type": "improvement",
+      "description": "Replace brittle literal prompt-text checks in check-harness.mjs with bounded semantic regex checks while preserving exact matching for structural permissions and agent identifiers.",
+      "files": [
+        "scripts/check-harness.mjs"
+      ],
+      "failure_pattern": "The checker failed when lead-router contract text was rewritten with equivalent wording, even though the rule meaning was preserved.",
+      "evidence": [
+        "docs/ai/evolution/runs/iteration-014-harness-baseline/evaluation.md",
+        "docs/ai/evolution/runs/iteration-014-harness-baseline/analysis/overview.md",
+        "docs/ai/evolution/runs/iteration-014-harness-baseline/analysis/detail/token-vs-semantic.md"
+      ],
+      "root_cause": "The verification mechanism treated prompt prose as exact tokens instead of semantic contract rules.",
+      "component_touched": "tool",
+      "predicted_fixes": [
+        "Equivalent wording rewrites pass the checker.",
+        "Real rule omissions still fail the checker.",
+        "Structural permission and agent identifier checks remain exact.",
+        "No routing, provider, model, MCP, credential, or agent behavior is changed."
+      ],
+      "risk_tasks": [
+        "Do not make regex patterns broad enough to pass unrelated text.",
+        "Keep literal fallbacks for the most critical lead prohibition checks.",
+        "Keep permissions and agent identifiers as exact literals.",
+        "Do not include private providers, MCPs, credentials, local paths, or raw transcripts."
+      ],
+      "constraint_level": "tool",
+      "why_this_component": "The defect is in the mechanical checker, not in agent routing or model configuration.",
+      "post_change_validation": [
+        "Run `node --check scripts/check-harness.mjs`.",
+        "Run `node scripts/check-harness.mjs` from the packaged opencode directory.",
+        "Run semantic rewrite smoke tests on temporary copies.",
+        "Run omission smoke tests on temporary copies.",
+        "Run `git diff --check`."
+      ],
+      "decision_criteria": {
+        "keep": "Checker passes current docs, passes equivalent rewrites, fails real omissions, and keeps structural checks exact.",
+        "improve": "Some equivalent rewrites still fail or some real omissions pass.",
+        "rollback+pivot": "Regex checks become less reliable than literal checks."
+      }
+    }
+  ]
+}
diff --git a/opencode/docs/ai/evolution/runs/iteration-015-semantic-checker/evaluation.md b/opencode/docs/ai/evolution/runs/iteration-015-semantic-checker/evaluation.md
@@ -0,0 +1,27 @@
+# Iteration 015 - Semantic Checker Evaluation
+
+## Result
+
+Decision: `keep`.
+
+The public checker now preserves exact structural checks and uses bounded
+semantic regex checks for lead-router prompt prose.
+
+## Scenarios
+
+| Scenario | Expected | Actual |
+| --- | --- | --- |
+| `node --check scripts/check-harness.mjs` | pass | pass |
+| `node scripts/check-harness.mjs` | pass | pass |
+| Equivalent lead wording rewrite | pass | pass |
+| Equivalent `commands.md` wording rewrite | pass | pass |
+| Remove "Do not edit code" rule | fail | fail |
+| Change `edit: deny` to `edit: ask` | fail | fail |
+| `git diff --check` | pass | pass |
+
+## Notes
+
+- Bounded regexes are intentionally not a full natural-language parser.
+- Critical lead prohibition checks keep literal fallbacks.
+- No public artifact includes raw transcripts, private providers, MCP config,
+  credentials, or local machine paths.
diff --git a/opencode/scripts/check-harness.mjs b/opencode/scripts/check-harness.mjs
@@ -121,6 +121,28 @@ function checkConfig() {
   }
 }
 
+/**
+ * Verify that all given regex patterns match the text.
+ * Each pattern is either a RegExp, a string treated as literal, or an object
+ * with a regex and optional literal fallbacks.
+ */
+function checkSemantic(text, patterns, label) {
+  for (let i = 0; i < patterns.length; i++) {
+    const p = patterns[i];
+    const re =
+      p instanceof RegExp
+        ? p
+        : p.regex instanceof RegExp
+          ? p.regex
+          : new RegExp("\\b" + p.replace(/[.*+?^${}()|[\]\\]/g, "\\$&") + "\\b", "i");
+    const fallbackLiterals = p.fallbackLiterals ?? [];
+    if (!re.test(text) && !fallbackLiterals.some((token) => text.includes(token))) {
+      const desc = p instanceof RegExp ? p.source : p.description ?? p.regex?.source ?? p;
+      fail(`${label}[${i}]: no match for /${desc}/`);
+    }
+  }
+}
+
 function checkLeadRouterContract() {
   const text = read("agents/lead.md");
   const frontmatter = frontmatterBlock("agents/lead.md");
@@ -141,53 +163,63 @@ function checkLeadRouterContract() {
   }
 
   for (const token of [
-    "fast router",
     "developer",
     "researcher",
     "designer",
     "specifier",
-    "Ask the user",
-    "real ambiguity",
-    "Do not edit code",
-    "whole loop of the same free-form request",
-    "bounded task back to `developer`",
-    "implementation correction goes back to `developer`",
-    "does not develop",
-    "does not deeply",
-    "minimum",
-    "context needed to",
-    "delegate to `researcher`",
+    "`researcher`",
     "`reviewer`",
-    "Do not mentally implement the solution before delegating",
-    "handoff to another",
-    "must be self-contained",
-    "Do not review a diff yourself",
   ]) {
     if (!text.includes(token)) fail(`agents/lead.md: missing ${token}`);
   }
 
+  checkSemantic(text, [
+    /\b(fast|quick|rapid|agile|lightweight)\s+router\b/i,
+    /\b(ask|consult)\s+the\s+user\b/i,
+    /\b(real|genuine|material|meaningful)\s+ambiguity\b/i,
+    {
+      regex: /\bdo\s+not\s+(edit|modify|change|alter)\s+code\b/i,
+      fallbackLiterals: ["Do not edit code"],
+    },
+    /\b(whole|entire)\s+(loop|cycle)\s+of\s+the\s+same\s+free-form\s+request\b/i,
+    /\b(send|return|route|pass)\s+a\s+bounded\s+(task|correction)\s+back\s+to\s+\`developer\`/i,
+    /\b(implementation\s+)?(correction|fix|adjustment|change)\s+goes\s+back\s+to\s+\`developer\`/i,
+    {
+      regex: /\bdoes\s+not\s+(develop|code|implement|program)\b/i,
+      fallbackLiterals: ["does not develop"],
+    },
+    {
+      regex: /\bdoes\s+not\s+(deeply\s+)?(investigate|inspect|analyze|analyse)\s+code\b/i,
+      fallbackLiterals: ["does not deeply"],
+    },
+    /\bminimum\s+context\s+needed\s+to\b/i,
+    /\bdelegate\s+to\s+\`researcher\`/i,
+    /\bdo\s+not\s+(mentally\s+)?(implement|design|solve|build)\s+the\s+(solution|problem)\s+before\s+delegating\b/i,
+    /\bhandoff\s+to\s+another\b/i,
+    /\bmust\s+be\s+self-contained\b/i,
+    /\bdo\s+not\s+(review|inspect)\s+a\s+diff\s+yourself\b/i,
+  ], "agents/lead.md semantic invariant");
+
   const docs = read("docs/ai/harness/agents.md");
-  for (const token of [
-    "`lead` does not edit files",
-    "later adjustments for that",
-    "same free-form request go back to `developer`",
-    "`lead` does not develop",
-    "delegates substantive discovery to `researcher`",
-    "belongs to `reviewer`",
-    "Every `lead` handoff to another agent must be self-contained",
-  ]) {
-    if (!docs.includes(token)) fail(`docs/ai/harness/agents.md: missing ${token}`);
+  if (!docs.includes("belongs to `reviewer`")) {
+    fail("docs/ai/harness/agents.md: missing belongs to `reviewer`");
   }
 
+  checkSemantic(docs, [
+    /\`lead\`\s+does\s+not\s+(edit|modify|change|alter)\s+files\b/i,
+    /\b(later|subsequent)\s+(adjustments|corrections|fixes|changes)\s+for\s+that\s+same\s+free-form\s+request\s+go\s+back\s+to\s+\`developer\`/i,
+    /\`lead\`\s+does\s+not\s+(develop|code|implement|program)\b/i,
+    /\bdelegates\s+(substantive|meaningful|deep)\s+discovery\s+to\s+\`researcher\`/i,
+    /\bevery\s+\`lead\`\s+handoff\s+to\s+another\s+agent\s+must\s+be\s+self-contained\b/i,
+  ], "docs/ai/harness/agents.md semantic invariant");
+
   const commandDocs = read("docs/ai/harness/commands.md");
-  for (const token of [
-    "understand code behavior",
-    "delegates to `researcher`",
-    "delegate review to",
-    "`lead` does not replace `reviewer`",
-  ]) {
-    if (!commandDocs.includes(token)) fail(`docs/ai/harness/commands.md: missing ${token}`);
-  }
+  checkSemantic(commandDocs, [
+    /\bunderstand\s+(code\s+behavior|how\s+the\s+code\s+works|the\s+code)\b/i,
+    /\bdelegates\s+to\s+\`researcher\`/i,
+    /\bdelegate\s+review\s+to\s+\`reviewer\`/i,
+    /\`lead\`\s+does\s+not\s+replace\s+\`reviewer\`/i,
+  ], "docs/ai/harness/commands.md semantic invariant");
 }
 
 function checkFrontmatter() {