83 | 83 | - SG indexing delay is the bottleneck — create mirrors early, verify later |
84 | 84 | - `.gitignore` has `ralph-*/` pattern; must use `git add -f` for new files under ralph-mcp-unique/ |
85 | 85 | --- |
| 86 | +[2026-02-20 20:05:28 UTC] Iteration 2 timed out after 900s |
| 87 | +[2026-02-20 20:08:47 UTC] Ralph start | tool=claude | max_iterations=20 | timeout_sec=1200 |
| 88 | +[2026-02-20 20:08:47 UTC] Iteration 1 started |
| 89 | + |
| 90 | +## 2026-02-20 - US-003 Status Update |
| 91 | +- All 7 sg-benchmarks mirror repos exist on GitHub with content pushed (verified via gh api tree checks) |
| 92 | +- kubernetes-client-go (30 entries), kubernetes-api (38), expressjs-express (16), lodash (21), prisma-prisma (37), grafana-loki (41), grafana-mimir (39) |
| 93 | +- **BLOCKED**: Sourcegraph indexing has not occurred yet. All 7 repos return empty results from mcp__sourcegraph__list_repos and keyword_search |
| 94 | +- Cannot mark US-003 passes=true until SG indexing is verified (criterion 8) |
| 95 | +- **Learnings for future iterations:** |
| 96 | + - GitHub API `size: 0` is a delayed stat; use `git/trees/main` to verify content
| 97 | + - SG auto-crawl of sg-benchmarks org may take hours to days for new repos |
| 98 | + - The original repos (kubernetes/client-go, grafana/loki, etc.) are also NOT natively indexed on this SG instance |
| 99 | +--- |
| 100 | + |
| 101 | +## 2026-02-20 - US-004: Verified (pre-existing implementation) |
| 102 | +- configs/use_case_registry.json already existed with all 100 entries (created in prior session) |
| 103 | +- Validated: all 100 IDs present, correct categories (10 per cat), correct mcp_suite mapping |
| 104 | +- Categories A, B, D, E: fully populated with mcp_unique, mcp_capabilities_required, oracle_type, difficulty, estimated_repos_needed, deepsearch_relevant |
| 105 | +- Categories C, F, G, H, J: oracle_type='tbd' (stubs). Category I: oracle_type='hybrid' per PRD notes |
| 106 | +- deepsearch_relevant: E=10/10, J=4/10 true |
| 107 | +- salesforce_flags: 100/100 entries populated |
| 108 | +- Schema validation: PASSED against schemas/use_case_registry.schema.json |
| 109 | +- Assertion: len(use_cases)==100 PASSED |
| 110 | +- Files changed: prd.json (US-004 passes=true) |
| 111 | +- **Learnings for future iterations:** |
| 112 | + - Always validate pre-existing work before assuming it needs to be redone |
| 113 | + - jsonschema Python package is available for schema validation |
| 114 | +--- |
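The US-004 registry assertions (100 entries, 10 per category) could be re-run with a check along these lines. This is a minimal sketch, not the validation actually used: the function name `check_registry` and the per-entry keys `id` and `category` are assumptions about the layout of `configs/use_case_registry.json`.

```python
from collections import Counter


def check_registry(use_cases, expected_total=100, per_category=10):
    """Return a list of problems found; an empty list means the checks passed.

    Assumes each entry is a dict with "id" and "category" keys (hypothetical layout).
    """
    problems = []
    if len(use_cases) != expected_total:
        problems.append(f"expected {expected_total} entries, found {len(use_cases)}")

    ids = [uc["id"] for uc in use_cases]
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids present")

    # Count entries per category and flag any category with the wrong size.
    counts = Counter(uc["category"] for uc in use_cases)
    for category, n in sorted(counts.items()):
        if n != per_category:
            problems.append(f"category {category}: expected {per_category}, found {n}")
    return problems
```

Returning a list of problems rather than raising on the first failure makes it easier to log every discrepancy in a single iteration.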
| 115 | + |
| 116 | +## 2026-02-20 - US-005: TaskSpec schema merging SWE-Factory + PRDBench patterns |
| 117 | +- Created `schemas/mcp_task_spec.schema.json` (JSON Schema draft 2020-12) |
| 118 | +- Top-level required: id (CCX-<family>-<NNN> pattern), family, use_case_id, category, mcp_suite, prd, artifacts, evaluation, logging |
| 119 | +- prd section: user_story, constraints, success_definition, seed_prompt |
| 120 | +- artifacts.oracle: required_files, required_symbols, required_references, dependency_chains (NO thresholds) |
| 121 | +- evaluation: modes (deterministic|rubric_judge|hybrid), checks (7 types incl test_ratio), eval_script, pass_exit_code |
| 122 | +- Optional: rubric_judge (criteria array + weight), deepsearch_variant (enabled, variant_prompt, synthesis_focus) |
| 123 | +- Validated with 3 test instances: minimal, hybrid (Cat I test_ratio), Deep Search variant |
| 124 | +- Files changed: `schemas/mcp_task_spec.schema.json` (new) |
| 125 | +- **Learnings for future iterations:** |
| 126 | + - jsonschema on this system supports up to Draft 7 (Draft202012Validator is unavailable), so use Draft7Validator.check_schema()
| 127 | + - Pattern regex for task ID: `^CCX-[a-z0-9-]+-[0-9]{3}$` |
| 128 | + - additionalProperties: false catches structural errors early |
| 129 | +--- |
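The task ID pattern recorded in the US-005 learnings can be exercised directly with Python's `re` module. A minimal sketch; the helper name `is_valid_task_id` and the sample IDs in the usage note are hypothetical and do not come from the schema file.

```python
import re

# Pattern copied from the US-005 learnings above.
TASK_ID_RE = re.compile(r"^CCX-[a-z0-9-]+-[0-9]{3}$")


def is_valid_task_id(task_id: str) -> bool:
    """True if task_id has the form CCX-<family>-<NNN> with a lowercase family."""
    return TASK_ID_RE.fullmatch(task_id) is not None
```

For example, `is_valid_task_id("CCX-crossrepo-001")` is accepted, while an uppercase family or a two-digit suffix is rejected.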
| 130 | + |
| 131 | +## 2026-02-20 - US-006: Oracle and evaluation check library |
| 132 | +- Created `scripts/ccb_metrics/oracle_checks.py` — stdlib-only Python library |
| 133 | +- 7 check functions: file_set_match, symbol_resolution, dependency_chain, provenance, keyword_presence, json_schema, test_ratio |
| 134 | +- run_all_checks() aggregates results, computes composite_score as mean of individual primary scores |
| 135 | +- CLI mode: `python3 oracle_checks.py --answer <path> --spec <path> [--verbose]` |
| 136 | +- Exit code 0 if composite > 0 (useful output), 1 if composite == 0 (total failure) |
| 137 | +- 16 doctests all pass, py_compile succeeds |
| 138 | +- Tested with sample data: gold answer scores 0.6167, empty answer scores 0.0000 |
| 139 | +- Files changed: `scripts/ccb_metrics/oracle_checks.py` (new) |
| 140 | +- **Learnings for future iterations:** |
| 141 | + - _get_primary_score() maps check types to their primary metric (f1, recall, chain_recall, etc.) |
| 142 | + - For json_schema_match, use boolean valid as 0/1 score (no jsonschema dep in stdlib mode) |
| 143 | + - test_ratio runs subprocess with 300s timeout — adequate for unit test suites |
| 144 | +--- |
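The aggregation and exit-code rule described for US-006 can be sketched as follows. This is an illustration, not the code in `oracle_checks.py`: the function names and the choice to return 0.0 when no checks ran are assumptions.

```python
from statistics import mean


def composite_score(primary_scores):
    """Mean of the per-check primary scores.

    primary_scores maps a check type to its primary metric in [0, 1].
    Returning 0.0 for an empty mapping is an assumption of this sketch.
    """
    values = list(primary_scores.values())
    return mean(values) if values else 0.0


def exit_code(composite):
    """0 if the answer produced any useful output, 1 on total failure."""
    return 0 if composite > 0 else 1
```

With this rule, an answer that satisfies even one check partially exits 0, so the exit code distinguishes "something useful" from "nothing at all" rather than pass from fail.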
| 145 | + |
| 146 | +## 2026-02-20 - US-007: Automated validity filtering (fail2pass gate) |
| 147 | +- Created `scripts/validate_mcp_task_instance.py` — fail2pass validation gate |
| 148 | +- CLI: --task-dir (multiple), --verbose, --fix |
| 149 | +- Checks gold oracle answer (must score > 0) and empty answer (must score 0) |
| 150 | +- Reports: VALID, DEGENERATE_PASS, DEGENERATE_FAIL, BROKEN per task |
| 151 | +- --fix mode generates stub oracle_answer.json from task_spec.json oracle definitions |
| 152 | +- Tested: valid task → VALID (exit 0), missing oracle → BROKEN (exit 1), --fix → generates stub → VALID |
| 153 | +- Files changed: `scripts/validate_mcp_task_instance.py` (new) |
| 154 | +- **Learnings for future iterations:** |
| 155 | + - oracle_answer.json searched in both task root and tests/ subdirectory |
| 156 | + - Stub generation builds "text" field by concatenating all repo/path/symbol refs (for provenance checks) |
| 157 | + - tempfile used for empty answer to avoid disk pollution |
| 158 | +--- |