|
| 1 | +## Codebase Patterns |
| 2 | + |
| 3 | +### Verifier file structure |
| 4 | +- Each task has: benchmarks/{suite}/{task_name}/tests/test.sh |
| 5 | +- Many tasks also have tests/eval.sh — test.sh is a thin wrapper that does `exec bash "$(dirname "$0")/eval.sh" "$@"` |
| 6 | +- SG-only tasks have tests/sgonly_verifier_wrapper.sh that restores repos before verification |
| 7 | + |
| 8 | +### Standard test.sh wrapper pattern (most common) |
| 9 | +```bash |
| 10 | +#!/bin/bash |
| 11 | +# test.sh — Harbor compatibility wrapper |
| 12 | +[ -f /tmp/.sg_only_mode ] && [ -f /tests/sgonly_verifier_wrapper.sh ] && source /tests/sgonly_verifier_wrapper.sh |
| 13 | +exec bash "$(dirname "$0")/eval.sh" "$@" |
| 14 | +``` |
| 15 | + |
| 16 | +### Standard eval.sh pattern (oracle-based tasks) |
| 17 | +```bash |
| 18 | +#!/bin/bash |
| 19 | +set -eo pipefail |
| 20 | +ANSWER_PATH="/workspace/answer.json" |
| 21 | +REWARD_FILE="/logs/verifier/reward.txt" |
| 22 | +mkdir -p "$(dirname "$REWARD_FILE")" |
| 23 | + |
| 24 | +# ... validation logic ... |
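| 25 | +SCORE=0  # under set -e a non-zero exit would abort the script before $? is read, so capture it via || |
| 26 | +python3 /tests/oracle_checks.py --answer "$ANSWER_PATH" --spec /tests/task_spec.json || SCORE=$? |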
| 27 | +# Write reward |
| 28 | +echo "$SCORE" > "$REWARD_FILE" |
| 29 | +``` |
| 30 | + |
| 31 | +### Reward output convention |
| 32 | +- Write a float in the range 0.0-1.0 to /logs/verifier/reward.txt |
| 33 | +- Some older tasks write to reward.txt in the current directory instead |
| 34 | +- Harbor reads from /logs/verifier/reward.txt |
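A minimal sketch of a guard for this convention, assuming a hypothetical helper named `write_reward` (not a function from the repository). It rejects anything that is not a plain number in the 0.0-1.0 range before writing the file:

```bash
#!/bin/bash
# Hypothetical helper: write a reward only if it is a number in [0.0, 1.0].
# The default path follows the convention above; the helper itself is an assumption.
write_reward() {
  local score="$1" file="${2:-/logs/verifier/reward.txt}"
  # awk exits 0 only when the value looks like a non-negative decimal and is <= 1.
  if ! awk -v s="$score" 'BEGIN { exit !(s ~ /^[0-9]+(\.[0-9]+)?$/ && s + 0 <= 1) }'; then
    echo "invalid reward: $score" >&2
    return 1
  fi
  mkdir -p "$(dirname "$file")"
  echo "$score" > "$file"
}
```

Usage would look like `write_reward 0.75`, or `write_reward 0.75 ./reward.txt` for an older task that writes to the current directory.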
| 35 | + |
| 36 | +### The 2 broken verifiers |
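| 37 | +- k8s-rbac-auth-audit-001 (csb_sdlc_secure): flagged as lacking set -e and not writing reward.txt; US-001 found it already has set -eo pipefail and writes /logs/verifier/reward.txt |
| 38 | +- grafana-platform-orient-001 (csb_sdlc_understand): test.sh was a thin wrapper exec'ing a non-existent eval.sh, so it had no set -e and never wrote reward.txt (fixed in US-002) |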
| 39 | + |
| 40 | +### Language field normalization |
| 41 | +- Canonical values: go, java, cpp, python, typescript, rust, csharp, javascript |
| 42 | +- Non-canonical values found and normalized: "c++"→cpp, "c#"→csharp, "java/c++"→java,cpp, "c++/c/javascript"→cpp,c,javascript, "go,protobuf"→go, "mixed"→typescript,go, "unknown"→determined per-task |
| 43 | +- "c" kept as canonical (curl, linux kernel, postgres are genuinely C, not C++) |
| 44 | +- task.toml format: language = "value" (quoted string in TOML) |
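The mapping above could be expressed as a small shell helper. This is a sketch only: `normalize_language` is a hypothetical name, not something from the repository, and the "mixed" and "unknown" values are deliberately rejected because the notes say those were resolved per task.

```bash
#!/bin/bash
# Hypothetical helper: map a raw language value to its canonical form.
# Values that need a per-task decision ("mixed", "unknown", anything else) return 1.
normalize_language() {
  case "$1" in
    "c++")              echo "cpp" ;;
    "c#")               echo "csharp" ;;
    "java/c++")         echo "java,cpp" ;;
    "c++/c/javascript") echo "cpp,c,javascript" ;;
    "go,protobuf")      echo "go" ;;
    # Already canonical, including "c", which is kept as its own value.
    go|java|cpp|python|typescript|rust|csharp|javascript|c) echo "$1" ;;
    *) echo "needs manual review: $1" >&2; return 1 ;;
  esac
}
```

For example, `normalize_language 'java/c++'` prints `java,cpp`, while `normalize_language unknown` fails so the task can be inspected by hand.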
| 45 | + |
| 46 | +### ABC audit command |
| 47 | +- python3 scripts/abc_audit.py --suite {suite} --format table |
| 48 | +- python3 scripts/abc_audit.py --all --format table |
| 49 | +- Grade A = all critical pass + >=80% important pass |
| 50 | +- Grade D = any critical fail |
| 51 | + |
| 52 | +## Progress |
| 53 | + |
| 54 | +## 2026-03-07 - US-001 |
| 55 | +- k8s-rbac-auth-audit-001 verifier already had set -eo pipefail and writes /logs/verifier/reward.txt |
| 56 | +- No changes needed — csb_sdlc_secure already at Grade A |
| 57 | +- Files changed: none |
| 58 | +- **Learnings for future iterations:** |
| 59 | + - Always verify current state before assuming something is broken |
| 60 | + - The k8s task uses verifier_lib.sh + IR pipeline pattern (different from checklist pattern) |
| 61 | +--- |
| 62 | + |
| 63 | +## 2026-03-07 - US-002 |
| 64 | +- Replaced thin wrapper test.sh (exec to non-existent eval.sh) with standard checklist verifier |
| 65 | +- Copied exact pattern from argocd-arch-orient-001/tests/test.sh (sibling task) |
| 66 | +- Files changed: benchmarks/csb_sdlc_understand/grafana-platform-orient-001/tests/test.sh |
| 67 | +- **Learnings for future iterations:** |
| 68 | + - The understand suite uses two verifier patterns: checklist (most tasks) and IR pipeline (k8s) |
| 69 | + - ground_truth.json in understand suite has files/symbols format, NOT required_findings format |
| 70 | + - The checklist verifier degrades gracefully when GT lacks required_findings (scores 0 from empty lists) |
| 71 | + - _get_primary_verifier checks if test.sh is an eval wrapper; if eval.sh is missing, it falls back to test.sh |
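The fallback in the last learning can be sketched in shell. This is an illustration under assumptions: `primary_verifier` is a hypothetical name, and detecting a wrapper by grepping test.sh for `eval.sh` is a guess at how `_get_primary_verifier` decides, which the notes do not spell out.

```bash
#!/bin/bash
# Hypothetical sketch: pick the file that actually holds the verifier logic.
# If test.sh mentions eval.sh and eval.sh exists, use eval.sh; otherwise use test.sh.
primary_verifier() {
  local tests_dir="$1"
  if grep -q 'eval\.sh' "$tests_dir/test.sh" 2>/dev/null && [ -f "$tests_dir/eval.sh" ]; then
    echo "$tests_dir/eval.sh"
  else
    echo "$tests_dir/test.sh"
  fi
}
```

With this shape, a wrapper that points at a missing eval.sh resolves to test.sh itself, which matches the state of grafana-platform-orient-001 before the fix.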
| 72 | +--- |
| 73 | + |
| 74 | +## 2026-03-07 - US-003 |
| 75 | +- Normalized language field in 94 task.toml files across all suites |
| 76 | +- Bulk replacements: c++→cpp (60), c#→csharp (1), java/c++→java,cpp (8), c++/c/javascript→cpp,c,javascript (3), go,protobuf→go (1), mixed→typescript,go (1) |
| 77 | +- Individually determined 20 "unknown" tasks: 17→go, 2→python, 1→javascript (via instruction.md/Dockerfile analysis) |
| 78 | +- Files changed: 94 task.toml files across benchmarks/csb_org_* and benchmarks/csb_sdlc_* |
| 79 | +- **Learnings for future iterations:** |
| 80 | + - xargs with sed doesn't work reliably for `+` chars in patterns; use `while IFS= read -r f; do sed ... "$f"; done` instead (`IFS=` and `-r` keep paths with spaces or backslashes intact) |
| 81 | + - "c" is a legitimate canonical value (10 tasks: curl, linux kernel, postgres) — PRD canonical list omitted it |
| 82 | + - `languages_involved` is a separate TOML array field, not `language` — grep on `^language` avoids false matches |
| 83 | + - Most "unknown" Org tasks are go (prometheus, kubernetes, etcd, grafana ecosystem) |
| 84 | + - Branch management: verify branch AFTER checkout, not just before — stash+checkout can silently fail |
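A minimal sketch of the `while read` loop from the first learning, applied to the `c++` to `cpp` replacement. It assumes GNU sed (for `-i` without a suffix argument), and `fix_language_field` is a hypothetical name. In sed's basic regular expressions an unescaped `+` is a literal character, so the pattern needs no escaping.

```bash
#!/bin/bash
# Hypothetical helper: rewrite language = "c++" to language = "cpp"
# in every task.toml under the given directory.
fix_language_field() {
  local root="$1"
  # grep -F treats the pattern as a fixed string, so "+" is not special here either.
  grep -rlF --include=task.toml 'language = "c++"' "$root" |
    while IFS= read -r f; do
      # Anchoring on ^language leaves the separate languages_involved array untouched.
      sed -i 's|^language = "c++"$|language = "cpp"|' "$f"
    done
}
```

It could be run as `fix_language_field benchmarks`. Under `set -o pipefail` the function would return non-zero when no file matches, because grep exits 1 in that case.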
| 85 | +--- |