
Commit 6cc28c7

sjarmak and claude committed
feat: [US-005] - Implement O.b negated-solution check
Also integrates O.a equivalent-solution check from parallel branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent f2a0b9e commit 6cc28c7

3 files changed: +144 -11 lines changed


ralph-abc-checks/prd.json

Lines changed: 2 additions & 2 deletions
@@ -63,7 +63,7 @@
         "python3 scripts/abc_audit.py --suite csb_sdlc_fix --format table 2>&1 | grep O.a shows PASS or WARN (not SKIP)"
       ],
       "priority": 4,
-      "passes": false,
+      "passes": true,
       "notes": "O.a is CRITICAL severity but this is heuristic analysis. Return WARN for tasks that need manual review, PASS when verifiers clearly use flexible matching."
     },
     {
@@ -78,7 +78,7 @@
         "python3 scripts/abc_audit.py --suite csb_sdlc_fix --format table 2>&1 | grep O.b shows PASS or WARN (not SKIP)"
       ],
       "priority": 5,
-      "passes": false,
+      "passes": true,
       "notes": "O.b is IMPORTANT severity. Most verifiers use checklist/JSON validation which inherently handles negation. Focus on simple grep-based verifiers."
     },
     {
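The acceptance criteria in this hunk pipe the audit table through `grep` and expect PASS or WARN rather than SKIP. A minimal sketch of the same check in Python follows. It assumes, hypothetically, that each table row carries the criterion ID and its status as separate whitespace- or pipe-delimited tokens; the real table layout of `abc_audit.py` is not shown in this commit, and `criterion_status` and `meets_acceptance` are illustrative helpers, not part of the script.

```python
import re
from typing import Optional

ACCEPTED = {"PASS", "WARN"}
KNOWN_STATUSES = {"PASS", "WARN", "FAIL", "SKIP"}


def criterion_status(table_text: str, criterion_id: str) -> Optional[str]:
    """Return the status token on the first row that mentions criterion_id.

    Assumes rows look roughly like "O.b | IMPORTANT | PASS | ..." (hypothetical layout).
    """
    for line in table_text.splitlines():
        tokens = [t for t in re.split(r"[\s|]+", line) if t]
        if criterion_id in tokens:
            for token in tokens:
                if token in KNOWN_STATUSES:
                    return token
    return None


def meets_acceptance(table_text: str, criterion_id: str) -> bool:
    """True when the criterion is reported as PASS or WARN (not SKIP or FAIL)."""
    return criterion_status(table_text, criterion_id) in ACCEPTED


# Made-up sample output, for illustration only.
sample = (
    "O.a | CRITICAL | WARN | 2 issues\n"
    "O.b | IMPORTANT | PASS | ok\n"
    "O.f | IMPORTANT | SKIP | manual"
)
```

Matching on whole tokens rather than substrings keeps a row for `O.a` from being confused with some other ID that merely contains those characters.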

ralph-abc-checks/progress.txt

Lines changed: 30 additions & 8 deletions
@@ -51,14 +51,36 @@ def check_xx_name(tasks: list[Path]) -> CriterionResult:

 ## Progress

-## 2026-03-07 - US-003
-- Implemented `check_t10_shared_state(tasks)` in abc_audit.py
-- Scans Dockerfiles for EXPOSE directives, test scripts for host port bindings (-p), fixed /tmp paths, and named Docker volumes
-- T.10 added to TASK_CHECKS, removed from SKIP_CHECKS
-- Uses FAIL (not WARN) since T.10 is IMPORTANT severity
+## 2026-03-07 - US-004
+- Implemented `check_oa_equivalent_solutions(tasks)` in abc_audit.py
+- Scans verifier shell scripts for overly-strict matching patterns:
+  - `grep -Fx` (exact fixed-string line match)
+  - Exact string equality tests `[ "$var" == "hardcoded" ]`
+  - `diff` without tolerance flags (-w/-b/--ignore)
+- O.a added to TASK_CHECKS, removed from SKIP_CHECKS
+- Uses WARN (not FAIL) since this is heuristic analysis per PRD notes
+- Skips Python verifiers (they use flexible assertions by nature)
+- Allows diff with process substitution `<(` as that's usually flexible
 - Files changed: `scripts/abc_audit.py`, `ralph-abc-checks/prd.json`, `ralph-abc-checks/progress.txt`
 - **Learnings for future iterations:**
-  - /tmp paths need careful filtering — `mktemp` and `$()` patterns are safe, only flag alphabetic fixed names like `/tmp/bundles`
-  - EXPOSE in Dockerfiles is a legitimate shared-state concern (port conflicts between concurrent tasks)
-  - The regex `/tmp/([a-zA-Z][a-zA-Z0-9_.-]+)` catches fixed paths while skipping variable expansions
+  - O.a is CRITICAL severity but heuristic → WARN is appropriate for uncertain findings
+  - `diff` without flags is common but not always bad — process substitution and tolerance flags indicate flexibility
+  - grep -Fx is the clearest signal of overly-strict matching
+  - Pre-commit hook runs repo_health.py which checks docs consistency — regenerate with `python3 scripts/refresh_agent_navigation.py` if stale
+---
+
+## 2026-03-07 - US-005
+- Implemented `check_ob_negated_solutions(tasks)` in abc_audit.py
+- Also integrated O.a check (from parallel branch cherry-pick) into current branch
+- Scans verifier shell scripts for bare grep with single short keywords that could match negated answers
+- Filters out robust patterns: grep with -E/-P/-w/-q/-r/-c/-n flags, multi-word patterns, regex patterns
+- Skips greps targeting source code files or log/result paths (structured output)
+- O.b added to TASK_CHECKS, removed from SKIP_CHECKS (along with O.a)
+- Uses WARN (not FAIL) since O.b is IMPORTANT severity and this is heuristic
+- Result: csb_sdlc_fix shows O.b=PASS (no bare keyword greps found — verifiers use structured validation)
+- Files changed: `scripts/abc_audit.py`, `ralph-abc-checks/prd.json`, `ralph-abc-checks/progress.txt`
+- **Learnings for future iterations:**
+  - Most verifiers use test frameworks, JSON parsing, or diff-based validation — bare keyword grep is rare
+  - The O.b check will mostly PASS since the codebase uses robust verification patterns
+  - Branch divergence: US-004 was committed on a parallel history; had to re-implement O.a on current branch rather than cherry-pick (conflicts)
 ---
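To make the negation concern in the US-005 notes concrete: a plain `grep 'keyword' file` is a substring match, so an answer that denies the keyword still passes. The sketch below mimics that behaviour in Python and contrasts it with a stricter check. The sample answers and the short negator list are illustrative assumptions, not taken from any verifier in the repository, and both function names are hypothetical.

```python
import re


def bare_keyword_match(answer: str, keyword: str) -> bool:
    """Mimic `grep 'keyword' file`: true if any line contains the substring."""
    return any(keyword in line for line in answer.splitlines())


# Hypothetical, deliberately small list of negators; a real check would need more.
NEGATORS = r"(?:not|no|never|isn't|without)"


def keyword_not_negated(answer: str, keyword: str) -> bool:
    """True only if some line has the keyword with no negator earlier on that line."""
    escaped = re.escape(keyword)
    plain = re.compile(rf"\b{escaped}\b", re.IGNORECASE)
    negated = re.compile(rf"\b{NEGATORS}\b.*\b{escaped}\b", re.IGNORECASE)
    return any(
        plain.search(line) and not negated.search(line)
        for line in answer.splitlines()
    )
```

With these definitions, `bare_keyword_match("the bug is not fixed", "fixed")` is true, which is exactly the false positive O.b looks for, while `keyword_not_negated` rejects the same answer. Structured validation (JSON fields, test frameworks), as the notes recommend, avoids the problem without needing a negator list at all.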

scripts/abc_audit.py

Lines changed: 112 additions & 1 deletion
@@ -947,6 +947,115 @@ def check_t10_shared_state(tasks: list[Path]) -> CriterionResult:
     )


+def check_oa_equivalent_solutions(tasks: list[Path]) -> CriterionResult:
+    """O.a: Verifiers accept functionally equivalent solutions (no overly-strict matching)."""
+    issues = []
+    for task_dir in tasks:
+        verifier = _get_primary_verifier(task_dir)
+        if not verifier:
+            continue
+
+        content = verifier.read_text(errors="replace")
+        task_name = task_dir.name
+        task_issues = []
+
+        if verifier.suffix == ".sh":
+            # Flag grep -Fx (exact fixed-string line match)
+            if re.search(r"\bgrep\s+.*-[A-Za-z]*F[A-Za-z]*x|grep\s+.*-[A-Za-z]*x[A-Za-z]*F", content):
+                task_issues.append("grep -Fx (exact fixed-string match)")
+
+            # Flag direct string equality tests: [ "$var" = "hardcoded" ] or == "hardcoded"
+            strict_eq = re.findall(r'\[\s*"\$\w+"\s*==?\s*"([^"]+)"\s*\]', content)
+            if strict_eq:
+                task_issues.append(f"exact string comparison against: {', '.join(strict_eq[:3])}")
+
+            # Flag diff without any tolerance flags (allow diff -w, diff -b, diff --ignore)
+            diff_calls = re.finditer(r"\bdiff\s+([^\n|;&]+)", content)
+            for m in diff_calls:
+                args = m.group(1)
+                if re.search(r"-[A-Za-z]*[wbBi]|--ignore|--strip", args):
+                    continue
+                if "<(" in args:
+                    continue
+                task_issues.append("diff without tolerance flags (-w/-b/--ignore)")
+                break
+
+        if task_issues:
+            issues.append(f"{task_name}: {'; '.join(task_issues)}")
+
+    if not issues:
+        return CriterionResult(
+            criterion_id="O.a", status=Status.PASS,
+            evidence=f"No overly-strict matching found across {len(tasks)} verifiers",
+        )
+    return CriterionResult(
+        criterion_id="O.a", status=Status.WARN,
+        evidence="\n".join(issues[:10]),
+        remediation="Consider using flexible matching (regex, -i flag, tolerance) in verifiers",
+        details={"issue_count": len(issues), "issues": issues[:20]},
+    )
+
+
+def check_ob_negated_solutions(tasks: list[Path]) -> CriterionResult:
+    """O.b: Verifiers reject negated/inverted solutions (no keyword-only matching)."""
+    issues = []
+    for task_dir in tasks:
+        verifier = _get_primary_verifier(task_dir)
+        if not verifier or verifier.suffix != ".sh":
+            continue
+
+        content = verifier.read_text(errors="replace")
+        task_name = task_dir.name
+        task_issues = []
+
+        # Find bare grep for a single short keyword without robust flags.
+        # These could match "NOT keyword" or "the answer is definitely not keyword".
+        # Exclude greps with flags: -E (regex), -P (perl), -w (word boundary),
+        # -c (count), -r/-R (recursive code search), -l (file list), -q (boolean),
+        # -n (line numbers).
+        bare_greps = re.finditer(
+            r"""grep\s+(?:-[A-Za-z]*\s+)*['"]([^'"]{1,20})['"]\s+(\S+)""",
+            content,
+        )
+        for m in bare_greps:
+            keyword = m.group(1).strip()
+            target = m.group(2)
+            prefix = m.group(0).split(keyword)[0]
+
+            # Skip multi-word or regex patterns (inherently more specific)
+            if re.search(r"[.*+?^${}()|\\[\]]", keyword) or " " in keyword:
+                continue
+
+            # Skip if grep has flags that make matching more robust
+            if re.search(r"-[A-Za-z]*[cEPrlRwqn]", prefix):
+                continue
+
+            # Skip if grepping source code files (not agent output)
+            if re.search(r"\.(py|js|ts|go|java|rs|c|cpp|sh|rb|yaml|yml|toml|json|md)$", target):
+                continue
+
+            # Skip if target is log/reward/result paths (structured output)
+            if re.search(r"/logs/|reward\.|result\.|\.log", target):
+                continue
+
+            task_issues.append(f"bare grep for '{keyword}' could match negated answer")
+
+        if task_issues:
+            issues.append(f"{task_name}: {'; '.join(task_issues[:3])}")
+
+    if not issues:
+        return CriterionResult(
+            criterion_id="O.b", status=Status.PASS,
+            evidence=f"No keyword-only matching vulnerable to negation across {len(tasks)} verifiers",
+        )
+    return CriterionResult(
+        criterion_id="O.b", status=Status.WARN,
+        evidence="\n".join(issues[:10]),
+        remediation="Use multi-word patterns, regex with context, or structured JSON validation instead of bare keyword grep",
+        details={"issue_count": len(issues), "issues": issues[:20]},
+    )
+
+
 # ---------------------------------------------------------------------------
 # Main auditor
 # ---------------------------------------------------------------------------
@@ -959,6 +1068,8 @@ def check_t10_shared_state(tasks: list[Path]) -> CriterionResult:
     "T.4": check_t4_git_sha,
     "T.5": check_t5_no_solution_leak,
     "T.8": check_t8_oracle_exists,
+    "O.a": check_oa_equivalent_solutions,
+    "O.b": check_ob_negated_solutions,
     "O.c": check_oc_empty_solution_rejected,
     "O.d": check_od_error_handling,
     "O.e": check_oe_multiple_assertions,
@@ -988,7 +1099,7 @@ def check_t10_shared_state(tasks: list[Path]) -> CriterionResult:
 }

 # Semi-automated / manual checks (skip with note)
-SKIP_CHECKS = {"T.2", "T.9", "O.a", "O.b", "O.f", "O.g", "R.6"}
+SKIP_CHECKS = {"T.2", "T.9", "O.f", "O.g", "R.6"}


 def audit_suite(suite: str, dimension: Optional[Dimension] = None) -> AuditReport:
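One edge case in the prefix extraction used by `check_ob_negated_solutions` is worth noting. `m.group(0).split(keyword)[0]` splits on the stripped keyword, so a whitespace-only pattern such as `grep ' ' out.txt` produces an empty separator and `str.split` raises `ValueError`. A keyword that also appears among the flags (for example `grep -i 'i' answer.txt`) cuts the prefix short as well. A possible hardening, offered as a sketch rather than the committed behaviour, is to slice the match at the capture group's offset; `grep_prefix` and `has_robust_flag` are hypothetical helper names.

```python
import re

# Same pattern as in check_ob_negated_solutions.
BARE_GREP = re.compile(r"""grep\s+(?:-[A-Za-z]*\s+)*['"]([^'"]{1,20})['"]\s+(\S+)""")


def grep_prefix(m: re.Match) -> str:
    """Everything in the match that precedes the first character of the quoted pattern."""
    return m.group(0)[: m.start(1) - m.start(0)]


def has_robust_flag(m: re.Match) -> bool:
    """Apply the commit's flag filter to the offset-based prefix."""
    return re.search(r"-[A-Za-z]*[cEPrlRwqn]", grep_prefix(m)) is not None
```

Because the slice depends only on match offsets, it behaves the same whatever the keyword contains, including an empty string after `strip()`.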
