Commit c4f5cff

🤖 bench: switch Terminal Bench GPT 5.2 to GPT 5.4 (#2824)
## Summary

- switch the Terminal Bench workflow defaults/examples from `openai/gpt-5.2` to `openai/gpt-5.4`
- add GPT-5.4 leaderboard metadata while preserving the GPT-5.2 mapping for mixed or historical artifacts

## Validation

- `make static-check`
- `python3 -m py_compile benchmarks/terminal_bench/prepare_leaderboard_submission.py`
- targeted `python3` verification that workflow defaults now reference GPT 5.4 and metadata preserves both GPT 5.2 and GPT 5.4 entries

---

_Generated with `mux` • Model: `openai:gpt-5.4` • Thinking: `xhigh` • Cost: `$0.36`_

<!-- mux-attribution: model=openai:gpt-5.4 thinking=xhigh costs=0.36 -->
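
The commit does not include the targeted `python3` verification it mentions. As a rough sketch of what such a check could look like, the snippet below parses the default model list out of a workflow excerpt and asserts that GPT 5.4 has replaced GPT 5.2. The helper name `default_models` and the inline `WORKFLOW_SNIPPET` are assumptions made for illustration; a real check would presumably read `.github/workflows/nightly-terminal-bench.yml` from disk.

```python
import json
import re

# Hypothetical excerpt standing in for the workflow's default-model line;
# a real check would read the workflow file instead of an inline string.
WORKFLOW_SNIPPET = '''
echo 'models=["anthropic/claude-opus-4-6","openai/gpt-5.3-codex","openai/gpt-5.4","google/gemini-3-pro-preview","google/gemini-3-flash-preview"]' >> "$GITHUB_OUTPUT"
'''


def default_models(workflow_text: str) -> list[str]:
    """Extract the JSON array that the step writes as `models=` for the "all" case."""
    match = re.search(r"models=(\[[^\]]*\])", workflow_text)
    if match is None:
        raise ValueError("no default model list found")
    return json.loads(match.group(1))


models = default_models(WORKFLOW_SNIPPET)
assert "openai/gpt-5.4" in models      # new default is present
assert "openai/gpt-5.2" not in models  # old default is gone
```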
1 parent 07767f5 commit c4f5cff

3 files changed

Lines changed: 12 additions & 3 deletions

.github/workflows/nightly-terminal-bench.yml

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ on:
   workflow_dispatch:
     inputs:
       models:
-        description: 'Models to test (comma-separated, or "all" for opus-4-6 + gpt-5.3-codex + gpt-5.2 + google/gemini-3-pro-preview + google/gemini-3-flash-preview)'
+        description: 'Models to test (comma-separated, or "all" for opus-4-6 + gpt-5.3-codex + gpt-5.4 + google/gemini-3-pro-preview + google/gemini-3-flash-preview)'
         required: false
         default: "all"
         type: string
@@ -99,7 +99,7 @@ jobs:
           INPUT_MODELS: ${{ inputs.models }}
         run: |
           if [ "$INPUT_MODELS" = "all" ] || [ -z "$INPUT_MODELS" ]; then
-            echo 'models=["anthropic/claude-opus-4-6","openai/gpt-5.3-codex","openai/gpt-5.2","google/gemini-3-pro-preview","google/gemini-3-flash-preview"]' >> "$GITHUB_OUTPUT"
+            echo 'models=["anthropic/claude-opus-4-6","openai/gpt-5.3-codex","openai/gpt-5.4","google/gemini-3-pro-preview","google/gemini-3-flash-preview"]' >> "$GITHUB_OUTPUT"
           else
             # Convert comma-separated to JSON array
             models_json=$(echo "$INPUT_MODELS" | jq -R -s -c 'split(",") | map(gsub("^\\s+|\\s+$"; ""))')
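
For readers less familiar with `jq`, the model-selection step above can be approximated in Python. This is only a sketch of the visible logic: `resolve_models` and `DEFAULT_MODELS` are hypothetical names, and the hunk does not show what the workflow does with `models_json` afterwards, so the function simply returns the compact JSON string.

```python
import json

# Default list as it appears on the "+" line of the hunk.
DEFAULT_MODELS = [
    "anthropic/claude-opus-4-6",
    "openai/gpt-5.3-codex",
    "openai/gpt-5.4",
    "google/gemini-3-pro-preview",
    "google/gemini-3-flash-preview",
]


def resolve_models(input_models: str) -> str:
    """Return a compact JSON array of model names (hypothetical helper)."""
    # "all" or an empty input selects the default list, as in the shell `if`.
    if input_models == "all" or input_models == "":
        names = DEFAULT_MODELS
    else:
        # Approximates jq: split(",") | map(gsub("^\\s+|\\s+$"; ""))
        names = [part.strip() for part in input_models.split(",")]
    # Compact separators match the output style of `jq -c`.
    return json.dumps(names, separators=(",", ":"))


print(resolve_models("openai/gpt-5.4, anthropic/claude-opus-4-6"))
# prints ["openai/gpt-5.4","anthropic/claude-opus-4-6"]
```

Like the `jq` expression, this sketch trims whitespace around each entry but does not drop empty entries or validate model names.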

.github/workflows/terminal-bench.yml

Lines changed: 1 addition & 1 deletion
@@ -87,7 +87,7 @@ on:
         required: false
         type: string
       model_name:
-        description: "Model to use (e.g., anthropic/claude-opus-4-5, openai/gpt-5.2)"
+        description: "Model to use (e.g., anthropic/claude-opus-4-5, openai/gpt-5.4)"
         required: false
         type: string
       mux_run_args:

benchmarks/terminal_bench/prepare_leaderboard_submission.py

Lines changed: 9 additions & 0 deletions
@@ -102,13 +102,22 @@
         "model_org_display_name": "Anthropic",
         "folder_name": "Claude-Opus-4.6",
     },
+    # Keep historical GPT-5.2 metadata alongside the new GPT-5.4 bench target
+    # so mixed or older artifact sets still map to the canonical leaderboard names.
     "openai/gpt-5.2": {
         "model_name": "gpt-5.2",
         "model_provider": "openai",
         "model_display_name": "GPT-5.2",
         "model_org_display_name": "OpenAI",
         "folder_name": "GPT-5.2",
     },
+    "openai/gpt-5.4": {
+        "model_name": "gpt-5.4",
+        "model_provider": "openai",
+        "model_display_name": "GPT-5.4",
+        "model_org_display_name": "OpenAI",
+        "folder_name": "GPT-5.4",
+    },
     "openai/gpt-5-codex": {
         "model_name": "gpt-5-codex",
         "model_provider": "openai",
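
To illustrate why both entries are kept, here is a minimal sketch of how a mixed artifact set might be mapped to leaderboard folder names. The dict name `MODEL_METADATA` and the helper `folder_for` are assumptions; the diff shows neither the variable name nor how the script consumes these entries.

```python
# Hypothetical name; the diff does not show the dict's actual variable name.
MODEL_METADATA = {
    "openai/gpt-5.2": {
        "model_name": "gpt-5.2",
        "model_provider": "openai",
        "model_display_name": "GPT-5.2",
        "model_org_display_name": "OpenAI",
        "folder_name": "GPT-5.2",
    },
    "openai/gpt-5.4": {
        "model_name": "gpt-5.4",
        "model_provider": "openai",
        "model_display_name": "GPT-5.4",
        "model_org_display_name": "OpenAI",
        "folder_name": "GPT-5.4",
    },
}


def folder_for(model_id: str) -> str:
    """Look up the leaderboard folder for a model id (hypothetical helper)."""
    try:
        return MODEL_METADATA[model_id]["folder_name"]
    except KeyError:
        raise KeyError(f"no leaderboard metadata for {model_id!r}") from None


# A mixed artifact set: one older GPT-5.2 run and one newer GPT-5.4 run.
artifact_models = ["openai/gpt-5.2", "openai/gpt-5.4"]
print([folder_for(m) for m in artifact_models])
# prints ['GPT-5.2', 'GPT-5.4']
```

Had the GPT-5.2 entry been replaced rather than kept, the first lookup in this sketch would fail for older artifacts.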
