Commit 4ff6274

sjarmak and claude committed
feat: add ccb_test variance run for 4 gap tasks, close 3/4 gaps
- numpy-array-sum-perf-001, pandas-groupby-perf-001, sklearn-kmeans-perf-001 now have >= 3 valid runs in both configs
- curl-security-review-001 MCP has systemic RewardFileNotFoundError (verifier bug, not coverage issue; needs fix)
- 177/178 SDLC tasks now have >= 3 valid scored runs in both configs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 07735ed commit 4ff6274
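
The coverage rule in the commit message (at least 3 valid scored runs per task in both configs) can be sketched roughly as below. This is an illustration under stated assumptions, not the repository's actual tooling: the record layout (`task`, `config`, `status`, `reward`) and the names `is_valid_scored` and `find_coverage_gaps` are hypothetical. Only the two config names, the `passed`/`failed` statuses with a numeric reward, and the threshold of 3 come from the commit itself.

```python
from collections import Counter

# Config names as they appear in the results tables of this commit.
CONFIGS = ("baseline-local-direct", "mcp-remote-direct")
MIN_VALID_RUNS = 3  # threshold stated in the commit message


def is_valid_scored(record):
    """A run counts only if it passed or failed and carries a numeric reward."""
    reward = record.get("reward")
    return (
        record.get("status") in ("passed", "failed")
        and isinstance(reward, (int, float))
        and not isinstance(reward, bool)
    )


def find_coverage_gaps(records, min_runs=MIN_VALID_RUNS):
    """Return {task: {config: valid_run_count}} for tasks below the threshold.

    `records` is a hypothetical list of dicts, one per task run.
    """
    counts = Counter(
        (r["task"], r["config"]) for r in records if is_valid_scored(r)
    )
    gaps = {}
    for task in sorted({r["task"] for r in records}):
        per_config = {config: counts[(task, config)] for config in CONFIGS}
        if any(n < min_runs for n in per_config.values()):
            gaps[task] = per_config
    return gaps
```

Under these assumptions, a task whose MCP runs all end without a reward file (as reported for curl-security-review-001) would show up with a count of 0 for `mcp-remote-direct`, however many times it is rerun.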

176 files changed, +79909 -287 lines changed


docs/official_results/README.md

Lines changed: 4 additions & 2 deletions
@@ -2,7 +2,7 @@
 
 This bundle is generated from `runs/official/` and includes only valid scored tasks (`passed`/`failed` with numeric reward).
 
-Generated: `2026-03-01T19:12:02.665318+00:00`
+Generated: `2026-03-01T19:45:06.820720+00:00`
 
 ## Local Browse
 
@@ -69,7 +69,7 @@ Historical reruns/backfills remain available in `data/official_results.json` und
 | [ccb_refactor](suites/ccb_refactor.md) | `mcp-remote-direct` | 20 | 20 | 0.703 | 1.000 | ok |
 | [ccb_secure](suites/ccb_secure.md) | `baseline-local-direct` | 20 | 20 | 0.712 | 1.000 | ok |
 | [ccb_secure](suites/ccb_secure.md) | `mcp-remote-direct` | 24 | 20 | 0.707 | 0.958 | ok |
-| [ccb_test](suites/ccb_test.md) | `baseline-local-direct` | 20 | 20 | 0.484 | 0.700 | ok |
+| [ccb_test](suites/ccb_test.md) | `baseline-local-direct` | 20 | 20 | 0.473 | 0.700 | ok |
 | [ccb_test](suites/ccb_test.md) | `mcp-remote-direct` | 35 | 20 | 0.468 | 0.686 | ok |
 | [ccb_understand](suites/ccb_understand.md) | `baseline-local-direct` | 34 | 20 | 0.902 | 0.971 | ok |
 | [ccb_understand](suites/ccb_understand.md) | `mcp-remote-direct` | 48 | 20 | 0.873 | 0.979 | ok |
@@ -398,6 +398,8 @@ Historical reruns/backfills remain available in `data/official_results.json` und
 | [test_haiku_20260301_031851](runs/test_haiku_20260301_031851.md) | `ccb_test` | `mcp-remote-direct` | 8 | 0.769 | 1.000 |
 | [test_haiku_20260301_071232](runs/test_haiku_20260301_071232.md) | `ccb_test` | `baseline-local-direct` | 17 | 0.569 | 0.824 |
 | [test_haiku_20260301_071232](runs/test_haiku_20260301_071232.md) | `ccb_test` | `mcp-remote-direct` | 8 | 0.780 | 1.000 |
+| [test_haiku_20260301_192246](runs/test_haiku_20260301_192246.md) | `ccb_test` | `baseline-local-direct` | 4 | 0.128 | 0.250 |
+| [test_haiku_20260301_192246](runs/test_haiku_20260301_192246.md) | `ccb_test` | `mcp-remote-direct` | 3 | 0.000 | 0.000 |
 | [understand_haiku_20260224_001815](runs/understand_haiku_20260224_001815.md) | `ccb_understand` | `baseline-local-direct` | 20 | 0.533 | 0.650 |
 | [understand_haiku_20260224_001815](runs/understand_haiku_20260224_001815.md) | `ccb_understand` | `mcp-remote-direct` | 20 | 0.679 | 0.850 |
 | [understand_haiku_20260225_211346](runs/understand_haiku_20260225_211346.md) | `ccb_understand` | `baseline-local-direct` | 7 | 0.789 | 1.000 |

docs/official_results/audits/test_haiku_20260301_192246--baseline-local-direct--curl-security-review-001.json

Lines changed: 861 additions & 0 deletions
Large diffs are not rendered by default.

docs/official_results/audits/test_haiku_20260301_192246--baseline-local-direct--numpy-array-sum-perf-001.json

Lines changed: 2483 additions & 0 deletions

docs/official_results/audits/test_haiku_20260301_192246--baseline-local-direct--pandas-groupby-perf-001.json

Lines changed: 4594 additions & 0 deletions

docs/official_results/audits/test_haiku_20260301_192246--baseline-local-direct--sklearn-kmeans-perf-001.json

Lines changed: 1007 additions & 0 deletions

docs/official_results/audits/test_haiku_20260301_192246--mcp-remote-direct--sgonly_numpy-array-sum-perf-001.json

Lines changed: 2184 additions & 0 deletions

docs/official_results/audits/test_haiku_20260301_192246--mcp-remote-direct--sgonly_pandas-groupby-perf-001.json

Lines changed: 2779 additions & 0 deletions

docs/official_results/audits/test_haiku_20260301_192246--mcp-remote-direct--sgonly_sklearn-kmeans-perf-001.json

Lines changed: 1449 additions & 0 deletions
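
The audit files listed above appear to follow a `<run>--<config>--<task>.json` naming pattern. A minimal sketch of splitting such a name into its parts follows. The `AuditName` type and `parse_audit_filename` function are hypothetical names introduced for illustration, and treating `sgonly_` as a strippable prefix on the task name is an assumption based only on the MCP file names in this commit.

```python
from typing import NamedTuple


class AuditName(NamedTuple):
    run: str
    config: str
    task: str


def parse_audit_filename(path: str, task_prefix: str = "sgonly_") -> AuditName:
    """Split '<run>--<config>--<task>.json' into run, config, and task.

    `task_prefix` is an assumed variant marker; pass "" to keep the task
    name exactly as it appears in the file name.
    """
    name = path.rsplit("/", 1)[-1]
    if not name.endswith(".json"):
        raise ValueError(f"not a JSON audit file: {path!r}")
    parts = name[: -len(".json")].split("--")
    if len(parts) != 3:
        raise ValueError(f"expected '<run>--<config>--<task>.json': {path!r}")
    run, config, task = parts
    if task_prefix and task.startswith(task_prefix):
        task = task[len(task_prefix):]
    return AuditName(run, config, task)


example = parse_audit_filename(
    "docs/official_results/audits/"
    "test_haiku_20260301_192246--mcp-remote-direct--"
    "sgonly_numpy-array-sum-perf-001.json"
)
print(example.run, example.config, example.task)
```

Grouping the parsed names by `(task, config)` would give a quick cross-check of the file list against the run table, where this run shows 4 baseline rows and 3 MCP rows.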
