Commit a1cdc1e

sjarmak and claude committed
feat: US-005, US-006, US-007 - TaskSpec schema, oracle checks, validity gate
- US-004: Verified pre-existing use case registry (100 entries, schema valid)
- US-005: Created schemas/mcp_task_spec.schema.json (SWE-Factory + PRDBench)
- US-006: Created scripts/ccb_metrics/oracle_checks.py (7 check functions, CLI)
- US-007: Created scripts/validate_mcp_task_instance.py (fail2pass gate)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 5c06cf3 commit a1cdc1e

5 files changed, +1238 -4 lines changed

ralph-mcp-unique/prd.json

Lines changed: 4 additions & 4 deletions
@@ -120,7 +120,7 @@
 "python3 -c \"import json; d=json.load(open('configs/use_case_registry.json')); assert len(d['use_cases'])==100\" succeeds"
 ],
 "priority": 4,
-"passes": false,
+"passes": true,
 "notes": "G-category tasks should have mcp_suite='ccb_mcp_crossorg' (not crosshost). Mark G-category entries as deferred with a note that cross-host is TBD. Category I tasks should have oracle_type='hybrid' (test_ratio + context verification)."
 },
 {
@@ -139,7 +139,7 @@
 "python3 -c \"import json; json.load(open('schemas/mcp_task_spec.schema.json'))\" succeeds"
 ],
 "priority": 5,
-"passes": false,
+"passes": true,
 "notes": "Key change from v1: removed min_recall/min_precision from oracle (thresholds deferred per Q8). Added test_ratio check type for Category I. Added deepsearch_variant section for tasks that have a DS-specific variant."
 },
 {
@@ -161,7 +161,7 @@
 "Stdlib only. python3 -m py_compile succeeds. Doctests pass."
 ],
 "priority": 6,
-"passes": false,
+"passes": true,
 "notes": "Key change from v1: NO pass/fail thresholds. Returns raw recall/precision/F1 scores. The composite score is a simple mean across all configured checks. Exit code 0 means 'agent did something' (composite > 0), exit 1 means 'total failure' (composite == 0). This enables threshold calibration later. Added check_test_ratio for Category I."
 },
 {
@@ -178,7 +178,7 @@
 "python3 -m py_compile succeeds"
 ],
 "priority": 7,
-"passes": false,
+"passes": true,
 "notes": "Same as v1 but adapted for the threshold-free scoring approach. Gold answer must score > 0, empty must score exactly 0."
 },
 {

ralph-mcp-unique/progress.txt

Lines changed: 73 additions & 0 deletions
@@ -83,3 +83,76 @@
 - SG indexing delay is the bottleneck — create mirrors early, verify later
 - `.gitignore` has `ralph-*/` pattern; must use `git add -f` for new files under ralph-mcp-unique/
 ---
+[2026-02-20 20:05:28 UTC] Iteration 2 timed out after 900s
+[2026-02-20 20:08:47 UTC] Ralph start | tool=claude | max_iterations=20 | timeout_sec=1200
+[2026-02-20 20:08:47 UTC] Iteration 1 started
+
+## 2026-02-20 - US-003 Status Update
+- All 7 sg-benchmarks mirror repos exist on GitHub with content pushed (verified via gh api tree checks)
+- kubernetes-client-go (30 entries), kubernetes-api (38), expressjs-express (16), lodash (21), prisma-prisma (37), grafana-loki (41), grafana-mimir (39)
+- **BLOCKED**: Sourcegraph indexing has not occurred yet. All 7 repos return empty results from mcp__sourcegraph__list_repos and keyword_search
+- Cannot mark US-003 passes=true until SG indexing is verified (criterion 8)
+- **Learnings for future iterations:**
+- GitHub API `size: 0` is delayed stat; use `git/trees/main` to verify content
+- SG auto-crawl of sg-benchmarks org may take hours to days for new repos
+- The original repos (kubernetes/client-go, grafana/loki, etc.) are also NOT natively indexed on this SG instance
+---
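The `git/trees/main` learning above can be illustrated with a small sketch. It assumes the standard GitHub Git Trees endpoint (`GET /repos/{owner}/{repo}/git/trees/{ref}`) and its `tree` array of entries with a `type` field; the helper names and the sample response are hypothetical and are not taken from this commit. No network call is made here, so the response parsing can be checked offline.

```python
from typing import Any


def tree_url(owner: str, repo: str, ref: str = "main") -> str:
    """Build the Git Trees API URL for a ref, asking for a recursive listing."""
    return f"https://api.github.com/repos/{owner}/{repo}/git/trees/{ref}?recursive=1"


def count_blobs(tree_response: dict[str, Any]) -> int:
    """Count file entries (type == 'blob') in a Git Trees API response body."""
    return sum(
        1 for entry in tree_response.get("tree", []) if entry.get("type") == "blob"
    )


def has_content(tree_response: dict[str, Any]) -> bool:
    """A repo counts as populated if its tree lists at least one file."""
    return count_blobs(tree_response) > 0


# Hypothetical response body, trimmed to the fields this sketch reads.
sample_response = {
    "sha": "0000000",
    "tree": [
        {"path": "README.md", "type": "blob"},
        {"path": "src", "type": "tree"},
        {"path": "src/index.js", "type": "blob"},
    ],
    "truncated": False,
}

if __name__ == "__main__":
    print(tree_url("sg-benchmarks", "lodash"))
    print("populated:", has_content(sample_response))
```

Checking the tree listing rather than the repository `size` field sidesteps the delayed statistic noted in the log, at the cost of one extra API request per repository.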
+
+## 2026-02-20 - US-004: Verified (pre-existing implementation)
+- configs/use_case_registry.json already existed with all 100 entries (created in prior session)
+- Validated: all 100 IDs present, correct categories (10 per cat), correct mcp_suite mapping
+- Categories A, B, D, E: fully populated with mcp_unique, mcp_capabilities_required, oracle_type, difficulty, estimated_repos_needed, deepsearch_relevant
+- Categories C, F, G, H, J: oracle_type='tbd' (stubs). Category I: oracle_type='hybrid' per PRD notes
+- deepsearch_relevant: E=10/10, J=4/10 true
+- salesforce_flags: 100/100 entries populated
+- Schema validation: PASSED against schemas/use_case_registry.schema.json
+- Assertion: len(use_cases)==100 PASSED
+- Files changed: prd.json (US-004 passes=true)
+- **Learnings for future iterations:**
+- Always validate pre-existing work before assuming it needs to be redone
+- jsonschema Python package is available for schema validation
+---
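The registry checks listed above (100 entries, 10 per category) could be reproduced with a stdlib-only sketch along these lines. The `use_cases`, `category`, and `oracle_type` keys appear in the log; the `id` key, the function name, and the synthetic demo registry are assumptions made for illustration, and this is not the validation code used in the commit.

```python
from collections import Counter


def check_registry(registry: dict, expected_total: int = 100,
                   expected_per_category: int = 10) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks pass."""
    use_cases = registry["use_cases"]
    problems = []

    if len(use_cases) != expected_total:
        problems.append(f"expected {expected_total} entries, found {len(use_cases)}")

    # Assumed key: each entry carries a unique "id".
    ids = [uc["id"] for uc in use_cases]
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids present")

    per_category = Counter(uc["category"] for uc in use_cases)
    for category, count in sorted(per_category.items()):
        if count != expected_per_category:
            problems.append(f"category {category}: {count} entries")

    return problems


# Synthetic registry: categories A..J with 10 placeholder entries each.
demo_registry = {
    "use_cases": [
        {"id": f"{cat}-{n:02d}", "category": cat, "oracle_type": "tbd"}
        for cat in "ABCDEFGHIJ"
        for n in range(1, 11)
    ]
}

if __name__ == "__main__":
    print(check_registry(demo_registry) or "registry OK")
```

Returning a list of problems rather than raising on the first failure makes it easier to see every mismatch in one run when re-validating pre-existing work.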
+
+## 2026-02-20 - US-005: TaskSpec schema merging SWE-Factory + PRDBench patterns
+- Created `schemas/mcp_task_spec.schema.json` (JSON Schema draft 2020-12)
+- Top-level required: id (CCX-<family>-<NNN> pattern), family, use_case_id, category, mcp_suite, prd, artifacts, evaluation, logging
+- prd section: user_story, constraints, success_definition, seed_prompt
+- artifacts.oracle: required_files, required_symbols, required_references, dependency_chains (NO thresholds)
+- evaluation: modes (deterministic|rubric_judge|hybrid), checks (7 types incl test_ratio), eval_script, pass_exit_code
+- Optional: rubric_judge (criteria array + weight), deepsearch_variant (enabled, variant_prompt, synthesis_focus)
+- Validated with 3 test instances: minimal, hybrid (Cat I test_ratio), Deep Search variant
+- Files changed: `schemas/mcp_task_spec.schema.json` (new)
+- **Learnings for future iterations:**
+- jsonschema on this system is Draft7 (not Draft202012Validator) — use Draft7Validator.check_schema()
+- Pattern regex for task ID: `^CCX-[a-z0-9-]+-[0-9]{3}$`
+- additionalProperties: false catches structural errors early
+---
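The task ID pattern recorded above can be exercised directly with Python's `re` module. The regex is copied from the log; the helper name and the example IDs are hypothetical and only show which shapes the pattern accepts or rejects.

```python
import re

# Pattern copied verbatim from the progress log entry above.
TASK_ID_RE = re.compile(r"^CCX-[a-z0-9-]+-[0-9]{3}$")


def is_valid_task_id(task_id: str) -> bool:
    """True if the ID is CCX-<family>-<NNN> with a lowercase family and 3 digits."""
    return TASK_ID_RE.fullmatch(task_id) is not None


if __name__ == "__main__":
    # Example IDs are made up for illustration.
    for candidate in ["CCX-dep-trace-001", "CCX-Dep-001", "CCX-dep-01", "CCX-dep-0001"]:
        print(candidate, is_valid_task_id(candidate))
```

Because the family segment allows hyphens, multi-word families such as the hypothetical `dep-trace` match, while uppercase letters or a counter that is not exactly three digits do not.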
+
+## 2026-02-20 - US-006: Oracle and evaluation check library
+- Created `scripts/ccb_metrics/oracle_checks.py` — stdlib-only Python library
+- 7 check functions: file_set_match, symbol_resolution, dependency_chain, provenance, keyword_presence, json_schema, test_ratio
+- run_all_checks() aggregates results, computes composite_score as mean of individual primary scores
+- CLI mode: `python3 oracle_checks.py --answer <path> --spec <path> [--verbose]`
+- Exit code 0 if composite > 0 (useful output), 1 if composite == 0 (total failure)
+- 16 doctests all pass, py_compile succeeds
+- Tested with sample data: gold answer scores 0.6167, empty answer scores 0.0000
+- Files changed: `scripts/ccb_metrics/oracle_checks.py` (new)
+- **Learnings for future iterations:**
+- _get_primary_score() maps check types to their primary metric (f1, recall, chain_recall, etc.)
+- For json_schema_match, use boolean valid as 0/1 score (no jsonschema dep in stdlib mode)
+- test_ratio runs subprocess with 300s timeout — adequate for unit test suites
+---
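The threshold-free scoring described above (raw precision/recall/F1 per check, a composite that is the plain mean of primary scores, and an exit code driven only by whether the composite is nonzero) might look roughly like the sketch below. The function names and sample file sets are hypothetical; the real `oracle_checks.py` is not shown in this diff, so this illustrates the rule rather than its implementation.

```python
def precision_recall_f1(expected: set, actual: set) -> dict:
    """Set-overlap metrics, e.g. for a file-set style check. No thresholds applied."""
    hits = len(expected & actual)
    precision = hits / len(actual) if actual else 0.0
    recall = hits / len(expected) if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def composite_score(primary_scores: list[float]) -> float:
    """Simple mean across the configured checks; 0.0 when no checks ran."""
    return sum(primary_scores) / len(primary_scores) if primary_scores else 0.0


def exit_code(composite: float) -> int:
    """0 when the answer scored anything at all, 1 on total failure."""
    return 0 if composite > 0 else 1


if __name__ == "__main__":
    # Hypothetical expected and reported file sets.
    files = precision_recall_f1({"a.go", "b.go"}, {"a.go", "c.go"})
    total = composite_score([files["f1"], 1.0, 0.0])
    print(files, total, exit_code(total))
```

Keeping the raw scores and deferring any pass/fail cutoff means thresholds can be calibrated later from collected results, which is the motivation given in the prd.json notes.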
+
+## 2026-02-20 - US-007: Automated validity filtering (fail2pass gate)
+- Created `scripts/validate_mcp_task_instance.py` — fail2pass validation gate
+- CLI: --task-dir (multiple), --verbose, --fix
+- Checks gold oracle answer (must score > 0) and empty answer (must score 0)
+- Reports: VALID, DEGENERATE_PASS, DEGENERATE_FAIL, BROKEN per task
+- --fix mode generates stub oracle_answer.json from task_spec.json oracle definitions
+- Tested: valid task → VALID (exit 0), missing oracle → BROKEN (exit 1), --fix → generates stub → VALID
+- Files changed: `scripts/validate_mcp_task_instance.py` (new)
+- **Learnings for future iterations:**
+- oracle_answer.json searched in both task root and tests/ subdirectory
+- Stub generation builds "text" field by concatenating all repo/path/symbol refs (for provenance checks)
+- tempfile used for empty answer to avoid disk pollution
+---
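One plausible reading of the fail2pass gate above is sketched here. The four verdict names and the two rules (gold must score > 0, empty must score exactly 0) come from the log. The mapping of each failure to a verdict, the precedence between them, and the aggregate exit code across several tasks are assumptions, and the function names are hypothetical.

```python
from typing import Optional


def classify_task(gold_score: Optional[float], empty_score: Optional[float]) -> str:
    """Map the two oracle scores to a verdict. None means the score could not be computed."""
    if gold_score is None or empty_score is None:
        return "BROKEN"            # e.g. missing oracle answer or spec
    if empty_score > 0:
        return "DEGENERATE_PASS"   # assumed meaning: an empty answer still earns credit
    if gold_score <= 0:
        return "DEGENERATE_FAIL"   # assumed meaning: even the gold answer earns nothing
    return "VALID"


def gate_exit_code(verdicts: list[str]) -> int:
    """Assumed aggregate rule: exit 0 only if every checked task is VALID."""
    return 0 if all(v == "VALID" for v in verdicts) else 1


if __name__ == "__main__":
    print(classify_task(0.6167, 0.0))   # gold and empty scores reported for US-006
    print(classify_task(None, 0.0))
    print(gate_exit_code(["VALID", "BROKEN"]))
```

Under this reading, a task is only admitted when the oracle can tell a correct answer apart from no answer at all, which is what makes the later threshold calibration meaningful.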