
Commit f3657f5

Copilot and mrjf authored
openevolve: wait for CI + run benchmark in CI; extract evaluate.sh
Agent-Logs-Url: https://github.com/githubnext/tsessebe/sessions/4ffc84f5-3ff8-4a4a-a946-14eeae0ee263
Co-authored-by: mrjf <180956+mrjf@users.noreply.github.com>
1 parent 9d3524a commit f3657f5

4 files changed

Lines changed: 216 additions & 21 deletions

.autoloop/programs/tsb-perf-evolve/evaluate.sh

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+#!/usr/bin/env bash
+# Evaluator for the tsb-perf-evolve OpenEvolve program.
+#
+# Both the autoloop agent (Step 6 of the OpenEvolve playbook) and CI (the
+# `benchmark` job in .github/workflows/ci.yml) invoke this script so they
+# produce comparable fitness numbers from identical commands.
+#
+# Output: a single JSON line on stdout with one of these shapes
+#   {"fitness": <number>, "tsb_mean_ms": <number>, "pandas_mean_ms": <number>}
+#   {"fitness": null, "rejected_reason": "<string>"}
+#
+# Exit code is always 0 — failures are encoded in the JSON so callers can
+# parse the result uniformly. Diagnostics go to stderr.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"
+
+cd "$REPO_ROOT"
+
+# 1. Validity — existing tests for sortValues must still pass.
+if ! bun test tests/core/series.sortValues.test.ts >/tmp/perf-evolve-tests.log 2>&1; then
+  echo '{"fitness": null, "rejected_reason": "tests failed"}'
+  exit 0
+fi
+
+# 2. Benchmark — tsb side.
+tsb_ms=$(bun run "$SCRIPT_DIR/code/benchmark.ts" \
+  | python3 -c "import json,sys; print(json.load(sys.stdin)['mean_ms'])")
+
+# 3. Benchmark — pandas side. Skip gracefully if pandas isn't available.
+if ! python3 -c 'import pandas' 2>/dev/null; then
+  pip3 install pandas --quiet 2>/dev/null || true
+fi
+pd_ms=$(python3 "$SCRIPT_DIR/code/benchmark.py" \
+  | python3 -c "import json,sys; print(json.load(sys.stdin)['mean_ms'])")
+
+# 4. Fitness = ratio. Lower is better.
+ratio=$(python3 -c "print(${tsb_ms} / ${pd_ms})")
+echo "{\"fitness\": ${ratio}, \"tsb_mean_ms\": ${tsb_ms}, \"pandas_mean_ms\": ${pd_ms}}"
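
The header comment promises an exit code of 0 with failures encoded in the JSON, but under `set -euo pipefail` a failing benchmark command inside `$(...)` would abort the script before any JSON is printed. Below is a minimal sketch of one way to keep that contract. The helper names `emit_rejection` and `emit_fitness` are hypothetical and not part of this commit, and the sketch assumes rejection reasons are plain strings without quotes.

```shell
#!/usr/bin/env bash
# Sketch only: emit_rejection and emit_fitness are hypothetical helpers,
# not part of the commit.

# Print the rejection shape from the evaluator's output contract.
# Assumes the reason contains no double quotes or backslashes.
emit_rejection() {
  printf '{"fitness": null, "rejected_reason": "%s"}\n' "$1"
}

# Print the success shape, or a rejection when either mean is missing,
# not a plain decimal number, or the pandas mean is zero.
emit_fitness() {
  local tsb_ms="$1" pd_ms="$2"
  local num='^[0-9]+([.][0-9]+)?$'
  if ! [[ "$tsb_ms" =~ $num && "$pd_ms" =~ $num ]]; then
    emit_rejection "benchmark produced a non-numeric mean"
    return 0
  fi
  local ratio
  # awk prints nothing when the divisor is not positive.
  ratio=$(awk -v t="$tsb_ms" -v p="$pd_ms" 'BEGIN { if (p > 0) printf "%.6f", t / p }')
  if [[ -z "$ratio" ]]; then
    emit_rejection "pandas mean is zero"
    return 0
  fi
  printf '{"fitness": %s, "tsb_mean_ms": %s, "pandas_mean_ms": %s}\n' \
    "$ratio" "$tsb_ms" "$pd_ms"
}
```

In the script, each benchmark assignment could then be followed by `|| { emit_rejection "tsb benchmark failed"; exit 0; }`, since a command on the left of `||` does not trigger `set -e`, and the final `echo` could become `emit_fitness "$tsb_ms" "$pd_ms"`.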

.autoloop/programs/tsb-perf-evolve/program.md

Lines changed: 19 additions & 20 deletions
@@ -55,26 +55,25 @@ Population state lives in the state file on the `memory/autoloop` branch under t
 ## Evaluation
 
 ```bash
-set -euo pipefail
-
-# 1. Validity — existing tests for sortValues must still pass.
-bun test tests/core/series.sortValues.test.ts >/tmp/perf-evolve-tests.log 2>&1 || {
-  echo '{"fitness": null, "rejected_reason": "tests failed"}'
-  exit 0
-}
-
-# 2. Benchmark — tsb side.
-tsb_ms=$(bun run .autoloop/programs/tsb-perf-evolve/code/benchmark.ts | python3 -c "import json,sys; print(json.load(sys.stdin)['mean_ms'])")
-
-# 3. Benchmark — pandas side. Skip gracefully if pandas isn't available.
-if ! python3 -c 'import pandas' 2>/dev/null; then
-  pip3 install pandas --quiet 2>/dev/null || true
-fi
-pd_ms=$(python3 .autoloop/programs/tsb-perf-evolve/code/benchmark.py | python3 -c "import json,sys; print(json.load(sys.stdin)['mean_ms'])")
-
-# 4. Fitness = ratio. Lower is better.
-ratio=$(python3 -c "print(${tsb_ms} / ${pd_ms})")
-echo "{\"fitness\": ${ratio}, \"tsb_mean_ms\": ${tsb_ms}, \"pandas_mean_ms\": ${pd_ms}}"
+bash .autoloop/programs/tsb-perf-evolve/evaluate.sh
+```
+
+The actual evaluator lives in `evaluate.sh` next to this file so the autoloop
+agent (Step 6 of the OpenEvolve playbook) and CI (the `benchmark` job in
+`.github/workflows/ci.yml`) invoke the **exact same** command and produce
+comparable fitness numbers. See that script for details.
+
+It runs the validity tests, then the tsb and pandas benchmarks, and prints a
+single JSON line on stdout:
+
+```json
+{"fitness": <number>, "tsb_mean_ms": <number>, "pandas_mean_ms": <number>}
+```
+
+or, if validity failed:
+
+```json
+{"fitness": null, "rejected_reason": "tests failed"}
 ```
 
 The metric is `fitness` (= `tsb_mean_ms / pandas_mean_ms`). **Lower is better.** A value below `1.0` means tsb is now faster than pandas on this workload.
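
To make the metric concrete, here is a small sketch that computes the ratio for made-up timings and reports which side is faster. The function name `describe_fitness` is hypothetical, and the numbers are illustrative inputs, not measured results.

```shell
#!/usr/bin/env bash
# Illustrative only: the timings passed in below are made up, not benchmark results.

# Print fitness = tsb_mean_ms / pandas_mean_ms and say which side is faster.
describe_fitness() {
  awk -v t="$1" -v p="$2" 'BEGIN {
    f = t / p
    verdict = (f < 1.0) ? "tsb faster" : "pandas faster or equal"
    printf "fitness=%.3f (%s)\n", f, verdict
  }'
}

describe_fitness 8 10    # prints: fitness=0.800 (tsb faster)
describe_fitness 15 10   # prints: fitness=1.500 (pandas faster or equal)
```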

.autoloop/strategies/openevolve/strategy.md

Lines changed: 47 additions & 1 deletion
@@ -75,6 +75,51 @@ Edit only the files listed in `program.md`'s Target section. The diff style for
 
 Run the evaluation command from `program.md`. Parse the metric.
 
+The in-sandbox evaluation is a *cheap pre-filter only* — the agent sandbox often cannot install `bun`, run `python3 -c 'import pandas'`, or otherwise reproduce realistic conditions (the `releaseassets.githubusercontent.com` firewall block is the common culprit). A null/missing metric here is **not** grounds for rejecting the candidate; that decision is deferred to Step 6.5.
+
+### Step 6.5. Wait for CI
+
+Before recording the candidate in the population (Step 7) or posting *any* iteration comment on the program issue / PR, wait for CI on the pushed commit. CI is the authoritative source of both correctness (Test & Lint / Build / Validate Python Examples) and fitness (the `OpenEvolve benchmark` check, which runs `bash .autoloop/programs/{program-name}/evaluate.sh` on a real runner with `bun` + `python3` + `pandas` installed).
+
+This step extends — and ties into — the generic `Step 5a → 5b → 5c` flow described in the autoloop workflow. OpenEvolve's only added requirement is that you must reach Step 5c (or the budget-exhausted handler) **before** writing the iteration comment, never after a speculative push.
+
+```bash
+# Resolve the PR — prefer the pre-step lookup, fall back to gh.
+PR=$(jq -r '.existing_pr // empty' /tmp/gh-aw/autoloop.json 2>/dev/null || true)
+if [ -z "$PR" ]; then
+  PR=$(gh pr list --head autoloop/{program-name} --json number -q '.[0].number')
+fi
+
+# Block until every required check terminates (or the wall-clock cap fires).
+gh pr checks "$PR" --watch --interval 30 --fail-fast || true
+
+# Determine an aggregate status. Same awk classifier as Step 5a in the
+# generic autoloop playbook — keep them in sync.
+status=$(gh pr checks "$PR" --json conclusion,state \
+  -q '.[] | (.conclusion // .state // "")' \
+  | awk '
+    BEGIN { r = "success" }
+    /^(FAILURE|CANCELLED|TIMED_OUT|ACTION_REQUIRED|STARTUP_FAILURE|STALE)$/ { r = "failure" }
+    /^(PENDING|QUEUED|IN_PROGRESS|WAITING|REQUESTED)$/ { if (r == "success") r = "pending" }
+    END { print r }')
+
+# Read the fitness from the OpenEvolve benchmark check-run (created by the
+# `benchmark` job in .github/workflows/ci.yml). Title format: `fitness=<num>`
+# or `fitness=null`. SHA = the HEAD of the PR after the latest push/fix.
+SHA=$(gh pr view "$PR" --json headRefOid -q '.headRefOid')
+fitness=$(gh api "repos/${GITHUB_REPOSITORY}/commits/${SHA}/check-runs" \
+  --jq '.check_runs[] | select(.name == "OpenEvolve benchmark") | .output.title' \
+  | sed -n 's/^fitness=//p' | head -n1)
+```
+
+Branch on `$status`:
+
+- **`success`** → record the candidate in the population with `fitness: <number>` from the check-run (or `fitness: null` only if the `OpenEvolve benchmark` check explicitly reported it that way — e.g., correctness held but the benchmark itself errored). Proceed to Step 7. The iteration comment is `✅ Accepted` with the real numeric fitness.
+- **`failure`** → enter the fix-retry loop from the generic autoloop Step 5b (up to 5 attempts, no-progress guard, 60-min wall-clock cap). Do **not** post an "accepted" comment. On a successful fix, loop back through the `gh pr checks --watch` block above on the new HEAD. On exhausted budget, mark the candidate `status: error` in the population with `fitness: null` and `pause_reason: "ci-fix-exhausted: <signature>"`, and post a `❌ Rejected` (or `⚠️ Error`) iteration comment that links to the failing run.
+- **`pending`** (the wall-clock cap fired before CI concluded) → don't post a speculative `⏳ Pending CI` comment. Record the candidate in the population with `fitness: null` and `status: pending-ci`, and leave a single reconciliation-pending comment on the PR/issue that the next iteration's Step 6.5 is allowed to overwrite when it reads the now-concluded status for this same SHA.
+
+In all three branches, the iteration comment posted to the program issue and PR must reflect *terminal* state — never `⏳ Pending CI` as a permanent label. Comments live forever; the pending placeholder is what produced the bug this step exists to fix.
+
 ### Step 7. Update the population
 
 Regardless of whether the iteration is accepted or rejected at the branch level, the candidate has been tried and should be recorded in the population — the population is a memory of what's been explored, not just what's been kept.
@@ -88,7 +133,8 @@ Append a new entry to the `## 🧬 Population` subsection in the state file usin
 
 Continue with the normal autoloop Step 5 (Accept or Reject → commit / discard, update state file's Machine State, Iteration History, Lessons Learned, etc.) as defined in the workflow. The only additional requirements from OpenEvolve are:
 
-- The Iteration History entry must include `operator`, `parent_id(s)`, `island`, and `fitness` fields (in addition to the normal status/change/metric/notes).
+- The Iteration History entry must include `operator`, `parent_id(s)`, `island`, and `fitness` fields (in addition to the normal status/change/metric/notes). The `fitness` value comes from the `OpenEvolve benchmark` check-run resolved in Step 6.5 — never from the in-sandbox Step 6 estimate.
+- The iteration comment posted to the program issue and PR must use the terminal status from Step 6.5 (`✅ Accepted` / `❌ Rejected` / `⚠️ Error` / `⏸ Pending-CI` only when the wall-clock cap genuinely fired). Never post `⏳ Pending CI` as a final state — that placeholder is what Step 6.5 exists to eliminate.
 - Lessons Learned additions should be phrased as *transferable heuristics* about the problem space, not as reports of what this iteration did. (E.g. "Hex layouts dominate grid layouts above n=20" — not "Iteration 17 tried a hex layout.")
 
 ## Feature dimensions
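
The status classifier and title parsing from Step 6.5 can be exercised without `gh` by feeding them canned input. The sketch below wraps them in two functions whose names (`classify_checks`, `parse_fitness_title`) are hypothetical. The awk program and sed expression are copied from the step, and the sample states and the `0.8421` value are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: classify_checks and parse_fitness_title are hypothetical wrappers
# around the awk and sed snippets from Step 6.5.

# Reduce one check state per line to success, failure, or pending.
# Any failure wins; pending applies only when nothing has failed.
classify_checks() {
  awk '
    BEGIN { r = "success" }
    /^(FAILURE|CANCELLED|TIMED_OUT|ACTION_REQUIRED|STARTUP_FAILURE|STALE)$/ { r = "failure" }
    /^(PENDING|QUEUED|IN_PROGRESS|WAITING|REQUESTED)$/ { if (r == "success") r = "pending" }
    END { print r }'
}

# Extract the value from a check-run title of the form fitness=<num> or fitness=null.
parse_fitness_title() {
  sed -n 's/^fitness=//p' | head -n1
}

printf 'SUCCESS\nNEUTRAL\n' | classify_checks                # prints: success
printf 'SUCCESS\nIN_PROGRESS\n' | classify_checks            # prints: pending
printf 'IN_PROGRESS\nFAILURE\nSUCCESS\n' | classify_checks   # prints: failure
echo 'fitness=0.8421' | parse_fitness_title                  # prints: 0.8421
```

One property worth noting: with empty input the classifier still prints `success`, because only the `BEGIN` and `END` rules run. A caller may want to confirm that at least one check was reported before trusting that result.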

.github/workflows/ci.yml

Lines changed: 109 additions & 0 deletions
@@ -11,6 +11,7 @@ on:
 
 permissions:
   contents: read
+  checks: write
 
 jobs:
   test:
@@ -76,3 +77,111 @@ jobs:
 
       - name: Validate Python playground examples
         run: python scripts/validate-python-examples.py playground/
+
+  benchmark:
+    # Run the OpenEvolve benchmark for autoloop *-evolve PRs so the autoloop
+    # agent can read a real fitness number from CI (see .autoloop/strategies/
+    # openevolve/strategy.md, Step 6.5). The sandbox the agent runs in cannot
+    # install bun reliably and so cannot measure fitness itself.
+    name: OpenEvolve benchmark
+    if: |
+      (github.event_name == 'pull_request' && startsWith(github.head_ref, 'autoloop/') && contains(github.head_ref, '-evolve'))
+      || (github.event_name == 'push' && startsWith(github.ref_name, 'autoloop/') && contains(github.ref_name, '-evolve'))
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      checks: write
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Setup Bun
+        uses: oven-sh/setup-bun@v2
+        with:
+          bun-version: latest
+
+      - name: Install dependencies
+        run: bun install
+
+      - name: Setup Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install Python dependencies
+        run: pip install pandas numpy
+
+      - name: Resolve program directory
+        id: program
+        run: |
+          # Resolve the program directory from the branch name:
+          # autoloop/<program-name> → .autoloop/programs/<program-name>/
+          BRANCH="${GITHUB_HEAD_REF:-${GITHUB_REF_NAME}}"
+          PROGRAM="${BRANCH#autoloop/}"
+          PROGRAM_DIR=".autoloop/programs/${PROGRAM}"
+          echo "program=${PROGRAM}" >> "$GITHUB_OUTPUT"
+          echo "program_dir=${PROGRAM_DIR}" >> "$GITHUB_OUTPUT"
+          if [ -x "${PROGRAM_DIR}/evaluate.sh" ]; then
+            echo "has_evaluator=true" >> "$GITHUB_OUTPUT"
+          else
+            echo "No evaluate.sh for program '${PROGRAM}' — skipping benchmark." >&2
+            echo "has_evaluator=false" >> "$GITHUB_OUTPUT"
+          fi
+
+      - name: Run OpenEvolve benchmark
+        id: bench
+        if: steps.program.outputs.has_evaluator == 'true'
+        run: |
+          PROGRAM_DIR="${{ steps.program.outputs.program_dir }}"
+          # evaluate.sh is contracted to always exit 0 and encode failures in
+          # the JSON, but we tolerate non-zero exits anyway and fall back to a
+          # null fitness so the check-run still gets created.
+          set +e
+          bash "${PROGRAM_DIR}/evaluate.sh" >/tmp/bench-result.json 2>/tmp/bench-stderr
+          rc=$?
+          set -e
+          if [ ! -s /tmp/bench-result.json ]; then
+            echo "{\"fitness\": null, \"rejected_reason\": \"evaluator produced no output (exit ${rc})\"}" \
+              > /tmp/bench-result.json
+          fi
+          cat /tmp/bench-result.json
+          fitness=$(jq -r '.fitness // "null"' /tmp/bench-result.json)
+          echo "fitness=${fitness}" >> "$GITHUB_OUTPUT"
+          # Compact JSON for the check-run output below.
+          echo "result_json=$(jq -c . /tmp/bench-result.json)" >> "$GITHUB_OUTPUT"
+
+      - name: Upload benchmark result
+        if: steps.program.outputs.has_evaluator == 'true'
+        uses: actions/upload-artifact@v4
+        with:
+          name: benchmark-result
+          path: /tmp/bench-result.json
+
+      - name: Attach fitness as check-run
+        if: steps.program.outputs.has_evaluator == 'true'
+        uses: actions/github-script@v7
+        env:
+          FITNESS: ${{ steps.bench.outputs.fitness }}
+          RESULT_JSON: ${{ steps.bench.outputs.result_json }}
+        with:
+          script: |
+            const fitness = process.env.FITNESS;
+            let result;
+            try {
+              result = JSON.parse(process.env.RESULT_JSON);
+            } catch {
+              result = { raw: process.env.RESULT_JSON };
+            }
+            const sha = context.payload.pull_request
+              ? context.payload.pull_request.head.sha
+              : context.sha;
+            await github.rest.checks.create({
+              ...context.repo,
+              name: "OpenEvolve benchmark",
+              head_sha: sha,
+              status: "completed",
+              conclusion: fitness === "null" ? "neutral" : "success",
+              output: {
+                title: `fitness=${fitness}`,
+                summary: "```json\n" + JSON.stringify(result, null, 2) + "\n```",
+              },
+            });
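
The job's `if:` condition and the "Resolve program directory" step together map a branch name to an evaluator path. The sketch below restates that logic as plain bash functions so it can be tried locally. The function names `is_evolve_branch` and `program_dir_for` are hypothetical, and `autoloop/docs-cleanup` is a made-up branch name used only to show the skip path.

```shell
#!/usr/bin/env bash
# Sketch: function names are hypothetical; the logic mirrors the workflow's
# job condition and its "Resolve program directory" step.

# True when the branch starts with autoloop/ and contains -evolve.
# Note: the Actions startsWith/contains functions ignore case, while
# this bash pattern match is case-sensitive.
is_evolve_branch() {
  [[ "$1" == autoloop/* && "$1" == *-evolve* ]]
}

# Strip the autoloop/ prefix and map the rest to the program directory.
program_dir_for() {
  local branch="$1"
  local program="${branch#autoloop/}"
  printf '.autoloop/programs/%s\n' "$program"
}

for b in autoloop/tsb-perf-evolve autoloop/docs-cleanup main; do
  if is_evolve_branch "$b"; then
    echo "$b -> $(program_dir_for "$b")"
  else
    echo "$b -> benchmark job skipped"
  fi
done
```

The workflow additionally requires `evaluate.sh` to pass the `-x` test, so a script committed without the executable bit would be reported as missing even though the job later invokes it through `bash`.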
