Commit 272128f

sjarmak and claude committed
Rename "MCP-unique" terminology to "org-scale" across codebase
Org suites are organizational use-case benchmarks (cross-repo discovery, compliance, incident triage, migration, etc.) that both baseline and MCP agents can solve; they are not MCP-dependent.

This commit:
- Renames mcp_unique=true to org_scale=true in 241 task.toml files
- Updates 241 oracle_checks.py docstrings and 241 eval.sh comments
- Renames docs/MCP_UNIQUE_TASKS.md → docs/ORG_TASKS.md
- Renames docs/MCP_UNIQUE_CALIBRATION.md → docs/ORG_CALIBRATION.md
- Updates script variable names (MCP_UNIQUE_SUITES → ORG_SUITES, etc.)
- Adds 41 missing org tasks to selected_benchmark_tasks.json (T.7 PASS)
- All 20 suites remain Grade A with zero failures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
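The key rename in the task.toml files could be reproduced with a short script along these lines. This is a minimal sketch, not the tooling actually used for this commit: the helper names `rename_key` and `rename_in_tree` are hypothetical, and it assumes the key always appears at the start of a line as a plain `mcp_unique = ...` assignment, as it does in the hunks shown on this page.

```python
import re
from pathlib import Path

# Matches `mcp_unique =` at the start of a line (optional indentation),
# so mentions inside comments or string values are left alone.
_KEY_RE = re.compile(r"^([ \t]*)mcp_unique([ \t]*=)", flags=re.MULTILINE)


def rename_key(text: str) -> str:
    """Return `text` with the `mcp_unique` key renamed to `org_scale`."""
    return _KEY_RE.sub(r"\1org_scale\2", text)


def rename_in_tree(root: Path) -> int:
    """Rewrite every task.toml under `root`; return how many files changed."""
    changed = 0
    for path in sorted(root.rglob("task.toml")):
        original = path.read_text(encoding="utf-8")
        updated = rename_key(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed
```

Working on the raw text rather than parsing and re-serializing the TOML keeps every other line byte-for-byte identical, which is consistent with the one-line diffs shown for each task.toml.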
1 parent b2b168b commit 272128f

741 files changed, +1440 -769 lines changed

README.md

Lines changed: 3 additions & 3 deletions

benchmarks/README.md

Lines changed: 1 addition & 1 deletion

benchmarks/backups/csb_org_compliance_doe_trim/ccx-compliance-057/task.toml

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ time_limit_sec = 900
 mcp_suite = "csb_org_compliance"
 use_case_id = 57
 repo_set_id = "grafana-observability"
-mcp_unique = true
+org_scale = true
 verification_modes = ["artifact"]

 [verification]

benchmarks/backups/csb_org_compliance_doe_trim/ccx-compliance-057/tests/eval.sh

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 #!/bin/bash
-# eval.sh — MCP-unique benchmark evaluator for CCX-compliance-057
+# eval.sh — org-scale benchmark evaluator for CCX-compliance-057
 # Exit-code-first (SWE-Factory pattern):
 #   exit 0 — agent produced useful output (composite score > 0)
 #   exit 1 — total failure (composite score == 0 or missing answer)
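The exit-code convention described in the eval.sh header comments can be expressed as a small mapping from score to exit status. The sketch below is illustrative only: `exit_code_for` is a hypothetical helper, not code from this repository, and it assumes a missing answer is represented as `None` and that any non-positive score counts as total failure.

```python
from typing import Optional


def exit_code_for(composite_score: Optional[float]) -> int:
    """Map a composite score to the exit code described in the eval.sh header.

    `None` stands for a missing answer (an assumption made for this sketch).
    """
    if composite_score is None or composite_score <= 0:
        return 1  # total failure: score is 0 (or below) or the answer is missing
    return 0  # agent produced useful output: composite score > 0
```

A caller would typically pass the result to `sys.exit`, so the harness can tell a partial success from a total failure without parsing any output.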

benchmarks/backups/csb_org_compliance_doe_trim/ccx-compliance-057/tests/oracle_checks.py

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 #!/usr/bin/env python3
-"""Deterministic oracle check library for MCP-unique benchmark evaluation.
+"""Deterministic oracle check library for org-scale benchmark evaluation.

 Provides reusable check functions that eval.sh scripts invoke to score agent
 answers against closed-world oracle definitions. Returns raw scores (no

benchmarks/backups/csb_org_compliance_doe_trim/ccx-compliance-188/task.toml

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ time_limit_sec = 900
 mcp_suite = "csb_org_compliance"
 use_case_id = 188
 repo_set_id = "envoy-service-mesh"
-mcp_unique = true
+org_scale = true
 verification_modes = ["artifact"]

 [verification]

benchmarks/backups/csb_org_compliance_doe_trim/ccx-compliance-188/tests/eval.sh

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 #!/bin/bash
-# eval.sh — MCP-unique benchmark evaluator for CCX-compliance-188
+# eval.sh — org-scale benchmark evaluator for CCX-compliance-188
 # Exit-code-first (SWE-Factory pattern):
 #   exit 0 — agent produced useful output (composite score > 0)
 #   exit 1 — total failure (composite score == 0 or missing answer)

benchmarks/backups/csb_org_compliance_doe_trim/ccx-compliance-188/tests/oracle_checks.py

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 #!/usr/bin/env python3
-"""Deterministic oracle check library for MCP-unique benchmark evaluation.
+"""Deterministic oracle check library for org-scale benchmark evaluation.

 Provides reusable check functions that eval.sh scripts invoke to score agent
 answers against closed-world oracle definitions. Returns raw scores (no

benchmarks/backups/csb_org_compliance_extra/ccx-compliance-057-ds/task.toml

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ time_limit_sec = 1200
 mcp_suite = "csb_org_compliance"
 use_case_id = 57
 repo_set_id = "grafana-observability"
-mcp_unique = true
+org_scale = true
 deepsearch_relevant = true

 [verification]

benchmarks/backups/csb_org_compliance_extra/ccx-compliance-057-ds/tests/eval.sh

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 #!/bin/bash
-# eval.sh — MCP-unique benchmark evaluator for CCX-compliance-057-ds
+# eval.sh — org-scale benchmark evaluator for CCX-compliance-057-ds
 # Exit-code-first (SWE-Factory pattern):
 #   exit 0 — agent produced useful output (composite score > 0)
 #   exit 1 — total failure (composite score == 0 or missing answer)
