Commit 419b613
sjarmak and claude committed

Exclude broken-verifier runs and add dual-delta + multi-run visibility

- generate_manifest.py: skip trial dirs with __broken_verifier marker (69 false-zero runs no longer leak into run_history)
- compute_bootstrap_cis.py: report both delta_latest and delta_mean with bootstrap CIs, comparison table, and sign-flip detection
- export_official_results.py: sort tasks by canonical name, add benchmark source links, multi-run variance section, propagate broken-verifier exclusion to exported results

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
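The commit message mentions reporting both delta_latest and delta_mean with bootstrap CIs and sign-flip detection. compute_bootstrap_cis.py itself is not shown here, so the following is only a minimal sketch of the idea under stated assumptions: the function names, the input shape (task name mapped to a list of rewards, oldest run first), the percentile bootstrap, and the demo numbers are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(deltas, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-task deltas (assumed method)."""
    rng = random.Random(seed)
    n = len(deltas)
    # Resample tasks with replacement and record the mean of each resample.
    boots = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_boot))
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return mean(deltas), lo, hi


def dual_delta(baseline_runs, treatment_runs):
    """Both arguments map task -> [reward per run, oldest first] (hypothetical shape)."""
    tasks = sorted(set(baseline_runs) & set(treatment_runs))
    # delta_latest: compare only the most recent run of each task.
    delta_latest = [treatment_runs[t][-1] - baseline_runs[t][-1] for t in tasks]
    # delta_mean: compare the per-task average over all runs.
    delta_mean = [mean(treatment_runs[t]) - mean(baseline_runs[t]) for t in tasks]
    latest = bootstrap_ci(delta_latest)
    averaged = bootstrap_ci(delta_mean)
    return {
        "delta_latest": latest,
        "delta_mean": averaged,
        # Flag the case where the two point estimates disagree in sign.
        "sign_flip": latest[0] * averaged[0] < 0,
    }


# Made-up rewards, chosen so that the latest run and the run average disagree.
baseline = {"task-a": [0.2, 0.8], "task-b": [0.3, 0.9]}
treatment = {"task-a": [0.9, 0.5], "task-b": [0.9, 0.6]}
result = dual_delta(baseline, treatment)
print(result["sign_flip"])
```

Reporting the two deltas side by side makes it visible when a conclusion depends on which run happens to be the latest, which appears to be the motivation for the sign-flip check.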
1 parent fa3a1c0 commit 419b613

49 files changed, +40382 -49923 lines changed

docs/official_results/README.md

Lines changed: 37 additions & 46 deletions
Large diffs are not rendered by default.

docs/official_results/data/official_results.json

Lines changed: 38857 additions & 48823 deletions
Large diffs are not rendered by default.

docs/official_results/runs/ccb_mcp_compliance_haiku_20260226_035617.md

Lines changed: 3 additions & 5 deletions
@@ -12,14 +12,12 @@
 
 ## mcp-remote-direct
 
-- Valid tasks: `6`
-- Mean reward: `0.324`
-- Pass rate: `0.667`
+- Valid tasks: `4`
+- Mean reward: `0.485`
+- Pass rate: `1.000`
 
 | Task | Status | Reward | MCP Ratio | Tool Calls | Trace |
 |---|---|---:|---:|---:|---|
-| [mcp_CCX-compliance-115_TuPA0J](../tasks/ccb_mcp_compliance_haiku_20260226_035617--mcp-remote-direct--mcp_CCX-compliance-115_TuPA0J.md) | `failed` | 0.000 | 0.875 | 8 | traj, tx |
-| [mcp_CCX-compliance-118_bHjUIx](../tasks/ccb_mcp_compliance_haiku_20260226_035617--mcp-remote-direct--mcp_CCX-compliance-118_bHjUIx.md) | `failed` | 0.000 | 0.000 | 30 | traj, tx |
 | [mcp_CCX-compliance-052_pPFvGH](../tasks/ccb_mcp_compliance_haiku_20260226_035617--mcp-remote-direct--mcp_CCX-compliance-052_pPFvGH.md) | `passed` | 0.197 | 0.986 | 74 | traj, tx |
 | [mcp_CCX-compliance-053_p3juY2](../tasks/ccb_mcp_compliance_haiku_20260226_035617--mcp-remote-direct--mcp_CCX-compliance-053_p3juY2.md) | `passed` | 0.467 | 0.962 | 26 | traj, tx |
 | [mcp_ccx-compliance-051_YFml2Q](../tasks/ccb_mcp_compliance_haiku_20260226_035617--mcp-remote-direct--mcp_ccx-compliance-051_YFml2Q.md) | `passed` | 0.514 | 0.974 | 39 | traj, tx |

docs/official_results/runs/ccb_mcp_compliance_haiku_20260226_035622_variance.md

Lines changed: 3 additions & 5 deletions
@@ -12,14 +12,12 @@
 
 ## mcp-remote-direct
 
-- Valid tasks: `6`
-- Mean reward: `0.394`
-- Pass rate: `0.667`
+- Valid tasks: `4`
+- Mean reward: `0.590`
+- Pass rate: `1.000`
 
 | Task | Status | Reward | MCP Ratio | Tool Calls | Trace |
 |---|---|---:|---:|---:|---|
-| [mcp_CCX-compliance-115_W3myQD](../tasks/ccb_mcp_compliance_haiku_20260226_035622_variance--mcp-remote-direct--mcp_CCX-compliance-115_W3myQD.md) | `failed` | 0.000 | 0.857 | 7 | traj, tx |
-| [mcp_CCX-compliance-118_9HEjfe](../tasks/ccb_mcp_compliance_haiku_20260226_035622_variance--mcp-remote-direct--mcp_CCX-compliance-118_9HEjfe.md) | `failed` | 0.000 | 0.941 | 17 | traj, tx |
 | [mcp_CCX-compliance-052_j0NGKx](../tasks/ccb_mcp_compliance_haiku_20260226_035622_variance--mcp-remote-direct--mcp_CCX-compliance-052_j0NGKx.md) | `passed` | 0.663 | 0.982 | 55 | traj, tx |
 | [mcp_CCX-compliance-053_QYb1jb](../tasks/ccb_mcp_compliance_haiku_20260226_035622_variance--mcp-remote-direct--mcp_CCX-compliance-053_QYb1jb.md) | `passed` | 0.381 | 0.926 | 27 | traj, tx |
 | [mcp_ccx-compliance-051_YBC7VN](../tasks/ccb_mcp_compliance_haiku_20260226_035622_variance--mcp-remote-direct--mcp_ccx-compliance-051_YBC7VN.md) | `passed` | 0.417 | 0.976 | 41 | traj, tx |

docs/official_results/runs/ccb_mcp_compliance_haiku_20260226_035628_variance.md

Lines changed: 3 additions & 5 deletions
@@ -12,14 +12,12 @@
 
 ## mcp-remote-direct
 
-- Valid tasks: `6`
-- Mean reward: `0.365`
-- Pass rate: `0.667`
+- Valid tasks: `4`
+- Mean reward: `0.548`
+- Pass rate: `1.000`
 
 | Task | Status | Reward | MCP Ratio | Tool Calls | Trace |
 |---|---|---:|---:|---:|---|
-| [mcp_CCX-compliance-115_VaKqhC](../tasks/ccb_mcp_compliance_haiku_20260226_035628_variance--mcp-remote-direct--mcp_CCX-compliance-115_VaKqhC.md) | `failed` | 0.000 | 0.889 | 9 | traj, tx |
-| [mcp_CCX-compliance-118_vVcPZp](../tasks/ccb_mcp_compliance_haiku_20260226_035628_variance--mcp-remote-direct--mcp_CCX-compliance-118_vVcPZp.md) | `failed` | 0.000 | 0.941 | 17 | traj, tx |
 | [mcp_CCX-compliance-052_8DfElE](../tasks/ccb_mcp_compliance_haiku_20260226_035628_variance--mcp-remote-direct--mcp_CCX-compliance-052_8DfElE.md) | `passed` | 0.181 | 0.986 | 69 | traj, tx |
 | [mcp_CCX-compliance-053_A6hndW](../tasks/ccb_mcp_compliance_haiku_20260226_035628_variance--mcp-remote-direct--mcp_CCX-compliance-053_A6hndW.md) | `passed` | 0.651 | 0.913 | 23 | traj, tx |
 | [mcp_ccx-compliance-051_CKuWlR](../tasks/ccb_mcp_compliance_haiku_20260226_035628_variance--mcp-remote-direct--mcp_ccx-compliance-051_CKuWlR.md) | `passed` | 0.629 | 0.881 | 59 | traj, tx |

docs/official_results/runs/ccb_mcp_compliance_haiku_20260226_035633_variance.md

Lines changed: 3 additions & 5 deletions
@@ -12,14 +12,12 @@
 
 ## mcp-remote-direct
 
-- Valid tasks: `6`
-- Mean reward: `0.426`
-- Pass rate: `0.667`
+- Valid tasks: `4`
+- Mean reward: `0.638`
+- Pass rate: `1.000`
 
 | Task | Status | Reward | MCP Ratio | Tool Calls | Trace |
 |---|---|---:|---:|---:|---|
-| [mcp_CCX-compliance-115_yNfk8S](../tasks/ccb_mcp_compliance_haiku_20260226_035633_variance--mcp-remote-direct--mcp_CCX-compliance-115_yNfk8S.md) | `failed` | 0.000 | 0.900 | 10 | traj, tx |
-| [mcp_CCX-compliance-118_AvJ4Oy](../tasks/ccb_mcp_compliance_haiku_20260226_035633_variance--mcp-remote-direct--mcp_CCX-compliance-118_AvJ4Oy.md) | `failed` | 0.000 | 0.900 | 20 | traj, tx |
 | [mcp_CCX-compliance-052_IiH4T2](../tasks/ccb_mcp_compliance_haiku_20260226_035633_variance--mcp-remote-direct--mcp_CCX-compliance-052_IiH4T2.md) | `passed` | 0.137 | 0.000 | 85 | traj, tx |
 | [mcp_CCX-compliance-053_fgVrO8](../tasks/ccb_mcp_compliance_haiku_20260226_035633_variance--mcp-remote-direct--mcp_CCX-compliance-053_fgVrO8.md) | `passed` | 0.726 | 0.962 | 26 | traj, tx |
 | [mcp_ccx-compliance-051_90WMYT](../tasks/ccb_mcp_compliance_haiku_20260226_035633_variance--mcp-remote-direct--mcp_ccx-compliance-051_90WMYT.md) | `passed` | 0.846 | 0.932 | 44 | traj, tx |

docs/official_results/runs/ccb_mcp_crossrepo_tracing_haiku_20260226_035617.md

Lines changed: 3 additions & 4 deletions
@@ -2,13 +2,12 @@
 
 ## mcp-remote-direct
 
-- Valid tasks: `4`
-- Mean reward: `0.501`
-- Pass rate: `0.750`
+- Valid tasks: `3`
+- Mean reward: `0.669`
+- Pass rate: `1.000`
 
 | Task | Status | Reward | MCP Ratio | Tool Calls | Trace |
 |---|---|---:|---:|---:|---|
-| [mcp_CCX-dep-trace-116_Vh8Tjf](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035617--mcp-remote-direct--mcp_CCX-dep-trace-116_Vh8Tjf.md) | `failed` | 0.000 | 0.941 | 17 | traj, tx |
 | [mcp_CCX-config-trace-003_iEFkPj](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035617--mcp-remote-direct--mcp_CCX-config-trace-003_iEFkPj.md) | `passed` | 0.472 | 0.947 | 19 | traj, tx |
 | [mcp_CCX-dep-trace-002_liuH6e](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035617--mcp-remote-direct--mcp_CCX-dep-trace-002_liuH6e.md) | `passed` | 1.000 | 0.895 | 19 | traj, tx |
 | [mcp_CCX-dep-trace-102_6nybzq](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035617--mcp-remote-direct--mcp_CCX-dep-trace-102_6nybzq.md) | `passed` | 0.533 | 0.882 | 17 | traj, tx |
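The change to this report (one `failed` row dropped, summary lines recomputed over the remaining tasks) can be reproduced with a small sketch. It assumes the removed row is a broken-verifier trial, that the `__broken_verifier` marker named in the commit message is a file inside the trial directory, and that a task counts as passed when its reward is above zero; the helper names are hypothetical and are not taken from generate_manifest.py or export_official_results.py.

```python
import tempfile
from pathlib import Path

BROKEN_MARKER = "__broken_verifier"  # marker name from the commit message


def is_broken(trial_dir: Path) -> bool:
    # Assumption: the marker is a file (or directory) placed inside the trial dir.
    return (trial_dir / BROKEN_MARKER).exists()


def summarize(trial_rewards: dict) -> dict:
    """trial_rewards maps trial dir -> reward; broken-verifier trials are skipped."""
    rewards = [r for d, r in trial_rewards.items() if not is_broken(d)]
    n = len(rewards)
    return {
        "valid_tasks": n,
        "mean_reward": round(sum(rewards) / n, 3) if n else None,
        # Assumption: reward > 0 means the task passed.
        "pass_rate": round(sum(r > 0 for r in rewards) / n, 3) if n else None,
    }


# Rewards copied from the table rows of this report.
rows = {
    "mcp_CCX-dep-trace-116_Vh8Tjf": 0.000,
    "mcp_CCX-config-trace-003_iEFkPj": 0.472,
    "mcp_CCX-dep-trace-002_liuH6e": 1.000,
    "mcp_CCX-dep-trace-102_6nybzq": 0.533,
}

with tempfile.TemporaryDirectory() as root:
    trials = {}
    for name, reward in rows.items():
        trial_dir = Path(root) / name
        trial_dir.mkdir()
        trials[trial_dir] = reward
    before = summarize(trials)
    # Mark the zero-reward trial as a broken-verifier run, then summarize again.
    (Path(root) / "mcp_CCX-dep-trace-116_Vh8Tjf" / BROKEN_MARKER).touch()
    after = summarize(trials)

print(before)
print(after)
```

With these inputs the "before" summary matches the old lines (4 tasks, 0.501, 0.750). The "after" mean comes out near 0.668 rather than the reported 0.669, presumably because the table rewards are already rounded to three decimals.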

docs/official_results/runs/ccb_mcp_crossrepo_tracing_haiku_20260226_035622_variance.md

Lines changed: 3 additions & 4 deletions
@@ -2,13 +2,12 @@
 
 ## mcp-remote-direct
 
-- Valid tasks: `4`
-- Mean reward: `0.572`
-- Pass rate: `0.750`
+- Valid tasks: `3`
+- Mean reward: `0.762`
+- Pass rate: `1.000`
 
 | Task | Status | Reward | MCP Ratio | Tool Calls | Trace |
 |---|---|---:|---:|---:|---|
-| [mcp_CCX-dep-trace-116_qOPSG2](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035622_variance--mcp-remote-direct--mcp_CCX-dep-trace-116_qOPSG2.md) | `failed` | 0.000 | 0.957 | 23 | traj, tx |
 | [mcp_CCX-config-trace-003_9mmV8P](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035622_variance--mcp-remote-direct--mcp_CCX-config-trace-003_9mmV8P.md) | `passed` | 0.421 | 0.905 | 21 | traj, tx |
 | [mcp_CCX-dep-trace-002_wZ1b08](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035622_variance--mcp-remote-direct--mcp_CCX-dep-trace-002_wZ1b08.md) | `passed` | 1.000 | 0.941 | 17 | traj, tx |
 | [mcp_CCX-dep-trace-102_WJcLPU](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035622_variance--mcp-remote-direct--mcp_CCX-dep-trace-102_WJcLPU.md) | `passed` | 0.867 | 0.952 | 21 | traj, tx |

docs/official_results/runs/ccb_mcp_crossrepo_tracing_haiku_20260226_035628_variance.md

Lines changed: 3 additions & 4 deletions
@@ -2,13 +2,12 @@
 
 ## mcp-remote-direct
 
-- Valid tasks: `4`
-- Mean reward: `0.567`
-- Pass rate: `0.750`
+- Valid tasks: `3`
+- Mean reward: `0.756`
+- Pass rate: `1.000`
 
 | Task | Status | Reward | MCP Ratio | Tool Calls | Trace |
 |---|---|---:|---:|---:|---|
-| [mcp_CCX-dep-trace-116_7a7rE1](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035628_variance--mcp-remote-direct--mcp_CCX-dep-trace-116_7a7rE1.md) | `failed` | 0.000 | 0.952 | 21 | traj, tx |
 | [mcp_CCX-config-trace-003_xkTKc3](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035628_variance--mcp-remote-direct--mcp_CCX-config-trace-003_xkTKc3.md) | `passed` | 0.467 | 0.947 | 19 | traj, tx |
 | [mcp_CCX-dep-trace-002_v4KtU5](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035628_variance--mcp-remote-direct--mcp_CCX-dep-trace-002_v4KtU5.md) | `passed` | 1.000 | 0.909 | 22 | traj, tx |
 | [mcp_CCX-dep-trace-102_93NryZ](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035628_variance--mcp-remote-direct--mcp_CCX-dep-trace-102_93NryZ.md) | `passed` | 0.800 | 0.952 | 21 | traj, tx |

docs/official_results/runs/ccb_mcp_crossrepo_tracing_haiku_20260226_035633_variance.md

Lines changed: 3 additions & 4 deletions
@@ -2,13 +2,12 @@
 
 ## mcp-remote-direct
 
-- Valid tasks: `4`
-- Mean reward: `0.446`
-- Pass rate: `0.750`
+- Valid tasks: `3`
+- Mean reward: `0.595`
+- Pass rate: `1.000`
 
 | Task | Status | Reward | MCP Ratio | Tool Calls | Trace |
 |---|---|---:|---:|---:|---|
-| [mcp_CCX-dep-trace-116_BJUD1J](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035633_variance--mcp-remote-direct--mcp_CCX-dep-trace-116_BJUD1J.md) | `failed` | 0.000 | 0.967 | 30 | traj, tx |
 | [mcp_CCX-config-trace-003_xV8ARc](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035633_variance--mcp-remote-direct--mcp_CCX-config-trace-003_xV8ARc.md) | `passed` | 0.472 | 0.773 | 22 | traj, tx |
 | [mcp_CCX-dep-trace-002_9jCFFt](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035633_variance--mcp-remote-direct--mcp_CCX-dep-trace-002_9jCFFt.md) | `passed` | 0.846 | 0.923 | 13 | traj, tx |
 | [mcp_CCX-dep-trace-102_xB3mHY](../tasks/ccb_mcp_crossrepo_tracing_haiku_20260226_035633_variance--mcp-remote-direct--mcp_CCX-dep-trace-102_xB3mHY.md) | `passed` | 0.467 | 0.909 | 11 | traj, tx |
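The four crossrepo_tracing reports cover the same three tasks, so they give a small worked example of the multi-run variance and canonical-name sorting mentioned in the commit message. The sketch below is an assumption-laden illustration, not the logic of export_official_results.py: `canonical` is a hypothetical normalization (drop the `mcp_` prefix and the 6-character trial suffix, then lowercase), and the statistic is a plain sample standard deviation over the rewards in the tables.

```python
import re
from collections import defaultdict
from statistics import mean, stdev


def canonical(task_id: str) -> str:
    # Hypothetical normalization: strip the trial suffix and "mcp_" prefix, lowercase.
    name = re.sub(r"_[A-Za-z0-9]{6}$", "", task_id)
    return re.sub(r"^mcp_", "", name).lower()


# (task id, reward) pairs copied from the four crossrepo_tracing reports.
ROWS = [
    ("mcp_CCX-config-trace-003_iEFkPj", 0.472),
    ("mcp_CCX-dep-trace-002_liuH6e", 1.000),
    ("mcp_CCX-dep-trace-102_6nybzq", 0.533),
    ("mcp_CCX-config-trace-003_9mmV8P", 0.421),
    ("mcp_CCX-dep-trace-002_wZ1b08", 1.000),
    ("mcp_CCX-dep-trace-102_WJcLPU", 0.867),
    ("mcp_CCX-config-trace-003_xkTKc3", 0.467),
    ("mcp_CCX-dep-trace-002_v4KtU5", 1.000),
    ("mcp_CCX-dep-trace-102_93NryZ", 0.800),
    ("mcp_CCX-config-trace-003_xV8ARc", 0.472),
    ("mcp_CCX-dep-trace-002_9jCFFt", 0.846),
    ("mcp_CCX-dep-trace-102_xB3mHY", 0.467),
]


def variance_by_task(rows):
    """Group rewards by canonical task name and report per-task spread across runs."""
    grouped = defaultdict(list)
    for task_id, reward in rows:
        grouped[canonical(task_id)].append(reward)
    return {
        task: {
            "n": len(values),
            "mean": mean(values),
            # Sample standard deviation; undefined for a single run, so use 0.0.
            "stdev": stdev(values) if len(values) > 1 else 0.0,
        }
        for task, values in sorted(grouped.items())
    }


stats = variance_by_task(ROWS)
for task, s in stats.items():
    print(f"{task}: n={s['n']} mean={s['mean']:.3f} stdev={s['stdev']:.3f}")
```

Lowercasing before sorting also avoids the ordering seen in the compliance tables, where `mcp_ccx-compliance-051` sorts after the upper-case `mcp_CCX-...` entries.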
