Commit c39ac73

Merge pull request #182 from githubnext/copilot/add-strategy-system-autoloop
Add strategy system to autoloop; ship AlphaEvolve as the first playbook
2 parents c117efc + 2897e6d commit c39ac73

13 files changed

Lines changed: 853 additions & 0 deletions

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
# tsb-perf-evolve — code/

This directory holds the **fixed inputs** for the program: the benchmark scripts and a small config. The autoloop iterations should rarely touch these files. The thing that *evolves* is `src/core/series.ts` (specifically the `sortValues` method) — see `../program.md` for the full picture.

## Files

- `config.yaml` — tunables read by the AlphaEvolve playbook (`exploitation_ratio`, `num_islands`, `population_size`, `archive_size`, dataset size).
- `benchmark.ts` — tsb-side benchmark. Builds a Series of `dataset_size` random floats with ~5% NaN, calls `sortValues` in a tight loop, prints `{"function": "Series.sortValues", "mean_ms": …, "iterations": …, "total_ms": …}`.
- `benchmark.py` — pandas-side benchmark. Builds an equivalent `pd.Series`, calls `.sort_values()` in the same loop structure, prints the same JSON shape.

The two benchmarks must stay aligned: same dataset size, same NaN ratio, same warm-up + measured iteration counts. If you tweak one, tweak the other.
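
One way to guard this alignment is a small check that compares the constants inlined in a benchmark against the flat `key: value` pairs in `config.yaml`. The sketch below is an assumption, not part of this commit: `parse_flat_yaml` and `find_drift` are hypothetical helpers, and the parser only handles the flat scalar layout this program's config uses.

```python
def parse_flat_yaml(text: str) -> dict[str, float]:
    """Parse flat `key: number` lines, ignoring blanks and `#` comments.

    Assumption: the config has no nesting, lists, or quoted strings.
    """
    values: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        values[key.strip()] = float(value.strip())
    return values


def find_drift(config: dict[str, float], inlined: dict[str, float]) -> list[str]:
    """Return keys whose inlined value differs from, or is missing in, the config."""
    return sorted(k for k, v in inlined.items() if config.get(k) != v)


# Stand-in for the benchmark section of config.yaml.
CONFIG_TEXT = """
dataset_size: 100000
nan_ratio: 0.05
warmup_iterations: 5
measured_iterations: 50
random_seed: 42
"""

# Values copied from the constants at the top of benchmark.py / benchmark.ts.
INLINED = {
    "dataset_size": 100_000,
    "nan_ratio": 0.05,
    "warmup_iterations": 5,
    "measured_iterations": 50,
    "random_seed": 42,
}

drift = find_drift(parse_flat_yaml(CONFIG_TEXT), INLINED)
print("in sync" if not drift else f"out of sync: {drift}")
```

Running a check like this before each benchmark run would surface a mismatch early, without adding a YAML parser dependency to the benchmarks themselves.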
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
"""pandas-side benchmark for Series.sort_values.

Output: a single JSON line on stdout with the shape
    {"function": "Series.sort_values", "mean_ms": <number>,
     "iterations": <number>, "total_ms": <number>}

Dataset shape and iteration counts mirror ./benchmark.ts — keep the two in
lockstep. Fixed seed for reproducibility across runs.
"""

from __future__ import annotations

import json
import sys
import time

import numpy as np
import pandas as pd

# Inlined from config.yaml (kept in sync with benchmark.ts).
DATASET_SIZE = 100_000
NAN_RATIO = 0.05
WARMUP_ITERATIONS = 5
MEASURED_ITERATIONS = 50
RANDOM_SEED = 42


def build_data() -> pd.Series:
    rng = np.random.default_rng(RANDOM_SEED)
    values = rng.uniform(-500_000.0, 500_000.0, size=DATASET_SIZE)
    nan_mask = rng.random(size=DATASET_SIZE) < NAN_RATIO
    values[nan_mask] = np.nan
    return pd.Series(values, dtype="float64")


def main() -> None:
    series = build_data()

    # Warm-up.
    for _ in range(WARMUP_ITERATIONS):
        series.sort_values()

    start = time.perf_counter()
    for _ in range(MEASURED_ITERATIONS):
        series.sort_values()
    total_s = time.perf_counter() - start
    total_ms = total_s * 1000.0
    mean_ms = total_ms / MEASURED_ITERATIONS

    result = {
        "function": "Series.sort_values",
        "mean_ms": mean_ms,
        "iterations": MEASURED_ITERATIONS,
        "total_ms": total_ms,
    }
    sys.stdout.write(json.dumps(result) + "\n")


if __name__ == "__main__":
    main()
Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
// tsb-side benchmark for Series.sortValues.
// Output: a single JSON line on stdout with the shape
//   {"function": "Series.sortValues", "mean_ms": <number>, "iterations": <number>, "total_ms": <number>}
//
// Dataset shape and iteration counts come from ./config.yaml — keep this file
// and ./benchmark.py in lockstep.

import { Series } from "../../../../src/index.ts";

// Inlined from config.yaml — the autoloop agent should keep these in sync.
// (No YAML parser dependency to keep this benchmark hermetic.)
const DATASET_SIZE = 100_000;
const NAN_RATIO = 0.05;
const WARMUP_ITERATIONS = 5;
const MEASURED_ITERATIONS = 50;
const RANDOM_SEED = 42;

// A tiny deterministic PRNG (mulberry32). Note: this is *not* the same
// algorithm as numpy's default_rng on the Python side, so for any given seed
// the two benchmarks will see different concrete values. They will still see
// the same *distribution* (uniform over [-500_000, 500_000) with the same NaN
// fraction), and that is what matters for a sorting micro-benchmark — the
// dataset shape, not the exact bit pattern. If you ever need byte-identical
// inputs across the two sides, swap mulberry32 for a portable PRNG that has a
// matching numpy implementation (e.g. PCG64).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function buildData(): readonly (number | null)[] {
  const rng = mulberry32(RANDOM_SEED);
  const out: (number | null)[] = new Array(DATASET_SIZE);
  for (let i = 0; i < DATASET_SIZE; i++) {
    out[i] = rng() < NAN_RATIO ? null : rng() * 1_000_000 - 500_000;
  }
  return out;
}

function nowMs(): number {
  return performance.now();
}

function main(): void {
  const data = buildData();
  const series = new Series<number | null>({ data, dtype: "float64" });

  // Warm-up — let the JIT specialize.
  for (let i = 0; i < WARMUP_ITERATIONS; i++) {
    series.sortValues();
  }

  const start = nowMs();
  for (let i = 0; i < MEASURED_ITERATIONS; i++) {
    series.sortValues();
  }
  const totalMs = nowMs() - start;
  const meanMs = totalMs / MEASURED_ITERATIONS;

  const result = {
    function: "Series.sortValues",
    mean_ms: meanMs,
    iterations: MEASURED_ITERATIONS,
    total_ms: totalMs,
  };
  process.stdout.write(`${JSON.stringify(result)}\n`);
}

main();
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
# AlphaEvolve tunables — read by strategy/alphaevolve.md every iteration.

# Operator weights. Must sum to 1.0. Defaults bias toward exploitation.
exploitation_ratio: 0.50
exploration_ratio: 0.30
crossover_ratio: 0.15
migration_ratio: 0.05

# Island count. Should match the number of islands enumerated in
# strategy/alphaevolve.md's "Pick parent(s)" section.
num_islands: 5

# MAP-Elites population caps.
population_size: 40
archive_size: 10

# Benchmark dataset shape. Inlined in benchmark.ts and benchmark.py; keep all three in sync.
dataset_size: 100000
nan_ratio: 0.05
warmup_iterations: 5
measured_iterations: 50
random_seed: 42
Lines changed: 80 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,80 @@
1+
---
2+
schedule: every 6h
3+
---
4+
5+
# tsb perf evolve — Series.sortValues vs pandas Series.sort_values
6+
7+
## Goal
8+
9+
Evolve the implementation of `Series.sortValues` (`src/core/series.ts`) so that, on the synthetic benchmark in `code/benchmark.ts`, tsb runs **at least as fast as pandas** on the equivalent `Series.sort_values` call (`code/benchmark.py`).
10+
11+
Concretely, we minimize the **ratio**
12+
13+
fitness = mean_ms_tsb / mean_ms_pandas
14+
15+
`fitness < 1.0` means tsb is faster than pandas; lower is better. We will keep iterating as long as fitness keeps improving.
16+
17+
This is a **performance-evolution program** — there is one self-contained artifact (`Series.sortValues`), one scalar fitness (the ratio), and many plausible algorithmic families to try (comparison sort, typed-array indirect sort, dtype-dispatched non-comparison sort, batched/SoA, etc.). It is the canonical case for the AlphaEvolve strategy.
18+
19+
### Validity invariants
20+
21+
A candidate is valid iff:
22+
23+
1. The existing test suite for `sortValues` passes: `bun test tests/core/series.sortValues.test.ts` (and any property tests that exercise it).
24+
2. The function signature is unchanged: `sortValues(ascending = true, naPosition: "first" | "last" = "last"): Series<T>`.
25+
3. No new runtime dependencies (devDependencies for benchmarking are fine).
26+
4. TypeScript strict mode is satisfied — no `any`, no `as` casts, no `@ts-ignore`.
27+
5. Behaviour is identical to the current implementation for: numeric (with NaN), string, mixed dtypes, ascending and descending, both `naPosition` values, and an empty Series.
28+
29+
The evaluator runs the test suite and the benchmark; if either fails, the candidate is rejected.

## Target

Only modify these files:
- `src/core/series.ts` — the `sortValues` method body (and any small private helpers inside `series.ts` that it calls). Keep the public signature unchanged.
- `.autoloop/programs/tsb-perf-evolve/code/**` — benchmark scripts and config. (You will rarely need to touch these — the evaluator is fixed; the benchmark dataset is fixed; only tweak if a candidate genuinely needs a new bench scenario.)

Do NOT modify:
- `tests/**` — test files (they are the validity oracle; do not weaken them).
- `README.md` — read-only.
- `.autoloop/programs/**` other than this program's `code/` dir.
- `.github/workflows/autoloop*` — autoloop workflow files.
- Any `src/**` file other than `src/core/series.ts`.

## Evolution Strategy

This program uses the **AlphaEvolve** strategy. On every iteration, read `strategy/alphaevolve.md` and follow it literally — it supersedes the generic analyze/accept/reject steps in the default autoloop loop.

Support files:
- `strategy/alphaevolve.md` — the runtime playbook (operators, parent selection, population rules).
- `strategy/prompts/mutation.md` — framing for exploitation and exploration operators.
- `strategy/prompts/crossover.md` — framing for crossover and migration operators.

Population state lives in the state file on the `memory/autoloop` branch under the `## 🧬 Population` subsection (see the playbook for the schema).

## Evaluation

```bash
set -euo pipefail

# 1. Validity — existing tests for sortValues must still pass.
bun test tests/core/series.sortValues.test.ts >/tmp/perf-evolve-tests.log 2>&1 || {
  echo '{"fitness": null, "rejected_reason": "tests failed"}'
  exit 0
}

# 2. Benchmark — tsb side.
tsb_ms=$(bun run .autoloop/programs/tsb-perf-evolve/code/benchmark.ts | python3 -c "import json,sys; print(json.load(sys.stdin)['mean_ms'])")

# 3. Benchmark — pandas side. Try to install pandas if it isn't already available.
if ! python3 -c 'import pandas' 2>/dev/null; then
  pip3 install pandas --quiet 2>/dev/null || true
fi
pd_ms=$(python3 .autoloop/programs/tsb-perf-evolve/code/benchmark.py | python3 -c "import json,sys; print(json.load(sys.stdin)['mean_ms'])")

# 4. Fitness = ratio. Lower is better.
ratio=$(python3 -c "print(${tsb_ms} / ${pd_ms})")
echo "{\"fitness\": ${ratio}, \"tsb_mean_ms\": ${tsb_ms}, \"pandas_mean_ms\": ${pd_ms}}"
```

The metric is `fitness` (= `tsb_mean_ms / pandas_mean_ms`). **Lower is better.** A value below `1.0` means tsb is now faster than pandas on this workload.
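
The ratio computed in step 4 of the script can also be written out as a small function. This is a sketch rather than the evaluator itself: `compute_fitness` is a hypothetical helper, the guard against a non-positive pandas timing is an added assumption, and the sample timings are the illustrative figures from the playbook's population schema (12.3 ms for tsb, 8.7 ms for pandas), not measured results.

```python
import json


def compute_fitness(tsb_json: str, pandas_json: str) -> dict[str, float]:
    """Combine the two benchmark JSON lines into the evaluator's output shape."""
    tsb_ms = json.loads(tsb_json)["mean_ms"]
    pandas_ms = json.loads(pandas_json)["mean_ms"]
    if pandas_ms <= 0:
        # Assumption: a non-positive baseline means the pandas run is unusable.
        raise ValueError("pandas mean_ms must be positive")
    return {
        "fitness": tsb_ms / pandas_ms,  # lower is better; below 1.0 beats pandas
        "tsb_mean_ms": tsb_ms,
        "pandas_mean_ms": pandas_ms,
    }


# Illustrative inputs only, shaped like the benchmark output lines.
tsb_line = '{"function": "Series.sortValues", "mean_ms": 12.3, "iterations": 50, "total_ms": 615.0}'
pandas_line = '{"function": "Series.sort_values", "mean_ms": 8.7, "iterations": 50, "total_ms": 435.0}'

result = compute_fitness(tsb_line, pandas_line)
print(json.dumps(result))
```

With these inputs the ratio is about 1.41, so such a candidate would still be slower than pandas.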
Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
# AlphaEvolve Strategy — tsb-perf-evolve

This file is the **runtime playbook** for this program. The autoloop agent reads it at the start of every iteration and follows it literally. It supersedes the generic "Analyze and Propose" / "Accept or Reject" steps in the default autoloop iteration loop — all other steps (state read, branch management, state file updates) still apply.

## Problem framing

The target artifact is the body of `Series.sortValues` in `src/core/series.ts`. Fitness is the ratio `tsb_mean_ms / pandas_mean_ms` measured on the fixed benchmark in `code/benchmark.ts` (and its pandas mirror `code/benchmark.py`); **lower is better**, with `< 1.0` meaning tsb is faster than pandas. A candidate is valid iff the existing tests for `sortValues` pass, the public signature is unchanged, no new runtime dependencies are added, TypeScript strict mode is satisfied, and behaviour matches the reference for numeric/string/mixed dtypes, both ascending values, and both `naPosition` settings.

## Per-iteration loop

### Step 1. Load state

1. Read `program.md` — Goal, Target, Evaluation.
2. Read the program's state file from the repo-memory folder (`tsb-perf-evolve.md`). Locate the `## 🧬 Population` subsection. If it does not exist, create it using the schema in [Population schema](#population-schema).
3. Read `code/config.yaml` for tunables (`exploitation_ratio`, `num_islands`, `population_size`, `archive_size`, `dataset_size`, etc.). Do not hard-code values you can read from config — the maintainer may have tuned them.
4. Read both prompt templates in `strategy/prompts/`. These frame how you reason about mutations and crossovers for sorting code.

### Step 2. Pick operator

Sample one operator using these weights (tuned for a perf problem with a small handful of plausible algorithmic families — exploitation-heavy because once an island has a working candidate, refinement usually pays):

| Operator | Default weight | When it fires |
|---|---|---|
| Exploitation | 0.50 | Refine one of the elites — the current best or a near-best. |
| Exploration | 0.30 | Generate a candidate from an **under-represented island** or a novel family. |
| Crossover | 0.15 | Combine ideas from two parents on different islands. |
| Migration | 0.05 | Take a technique that works on island A and port it into a solution on island B. |

Deterministic overrides (apply *before* sampling):

- If the population is empty or has one member → **Exploration** (seed diversity).
- If the last 3 statuses in `recent_statuses` are all `rejected` → force **Exploration** with a previously-unused island.
- If the last 5 statuses are all `rejected` → force **Migration** or a radically new island; also revisit any domain knowledge in `prompts/mutation.md` that has not yet been applied.

Record your chosen operator in the iteration's reasoning — the state file's Iteration History entry must include it.
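
The overrides and the weighted sampling in this step can be sketched as a single function. Several details here are assumptions rather than part of the playbook: `pick_operator` is a hypothetical name, `recent_statuses` is taken to be ordered oldest first, the 5-rejection rule is checked before the 3-rejection rule (any run of five rejections also satisfies the three-rejection condition), and the sketch returns `migration` where the playbook also allows a radically new island.

```python
import random
from typing import Optional

# Default weights from the table above; in practice they come from code/config.yaml.
DEFAULT_WEIGHTS = {
    "exploitation": 0.50,
    "exploration": 0.30,
    "crossover": 0.15,
    "migration": 0.05,
}


def pick_operator(
    population_size: int,
    recent_statuses: list[str],
    weights: dict[str, float] = DEFAULT_WEIGHTS,
    rng: Optional[random.Random] = None,
) -> str:
    """Apply the deterministic overrides, then fall back to weighted sampling."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("operator weights must sum to 1.0")

    # Override: an empty or single-member population needs seed diversity.
    if population_size <= 1:
        return "exploration"

    # Overrides on rejection streaks; the longer streak is checked first (assumption).
    if len(recent_statuses) >= 5 and all(s == "rejected" for s in recent_statuses[-5:]):
        return "migration"
    if len(recent_statuses) >= 3 and all(s == "rejected" for s in recent_statuses[-3:]):
        return "exploration"

    # No override fired: sample one operator according to the weights.
    rng = rng or random.Random()
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


print(pick_operator(0, []))
print(pick_operator(12, ["accepted", "rejected", "rejected", "rejected"]))
print(pick_operator(12, ["rejected"] * 5))
```

Passing a seeded `random.Random` makes the sampled branch reproducible, which can help when replaying an iteration from the state file.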

### Step 3. Pick parent(s)

**Islands** for this program (algorithmic families for sorting a 1-D numeric Series with NaN):

- **Island 0 — Comparison sort (objects)**: the current implementation — `Array.prototype.sort` over `{v, i}` pairs with a comparator that handles NaN.
- **Island 1 — Indirect typed-array sort**: copy values into a `Float64Array`, sort an index `Uint32Array` by that, then gather. NaN handled by partition.
- **Island 2 — Decorate-sort-undecorate with packed keys**: encode `(value, index)` into a single sortable representation (e.g. pack into a `BigInt64Array` or use parallel typed arrays), sort once, gather.
- **Island 3 — Non-comparison / radix**: dispatch on dtype; for finite floats, transform to a sortable unsigned representation and run an LSD radix sort, then untransform.
- **Island 4 — Hybrid**: small-input fast path (Array.prototype.sort) + large-input dispatch into one of the above families based on `dataset_size` and dtype.
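
For Island 3, the "sortable unsigned representation" is commonly obtained by reinterpreting each float's IEEE 754 bits as an unsigned integer and flipping bits so that integer order matches numeric order. The sketch below shows that transform and an LSD radix sort over the resulting keys. It is written in Python purely for illustration: the function names are hypothetical, a real candidate would work over typed arrays in TypeScript, and NaN values are assumed to have been partitioned out beforehand.

```python
import struct

SIGN_BIT = 1 << 63
ALL_BITS = (1 << 64) - 1


def float_to_sortable_key(value: float) -> int:
    """Map a finite float64 to a uint64 whose unsigned order matches numeric order."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    # Negative floats: flip every bit so larger magnitudes sort first.
    # Non-negative floats: set the sign bit so they sort after all negatives.
    return bits ^ ALL_BITS if bits & SIGN_BIT else bits | SIGN_BIT


def sortable_key_to_float(key: int) -> float:
    """Invert float_to_sortable_key."""
    bits = key ^ SIGN_BIT if key & SIGN_BIT else key ^ ALL_BITS
    (value,) = struct.unpack("<d", struct.pack("<Q", bits))
    return value


def radix_sort_floats(values: list[float]) -> list[float]:
    """LSD radix sort: eight stable passes of 8 bits over the transformed keys."""
    keys = [float_to_sortable_key(v) for v in values]
    for shift in range(0, 64, 8):
        buckets: list[list[int]] = [[] for _ in range(256)]
        for key in keys:
            buckets[(key >> shift) & 0xFF].append(key)
        keys = [key for bucket in buckets for key in bucket]
    return [sortable_key_to_float(k) for k in keys]


sample = [3.5, -1.25, 0.0, -500_000.0, 499_999.5, 2.0]
print(radix_sort_floats(sample))
```

This sketch sorts values only. A candidate for `sortValues` would also need to carry each row's original index alongside its key so that the index stays aligned with the sorted values.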

Parent selection by operator:

- **Exploitation** — pick the best scorer; break ties by picking the most recent.
- **Exploration** — pick the island with the fewest members (or a brand-new island number if all are full), then either start from its best member or from scratch.
- **Crossover** — pick two parents on **different islands**. Bias toward one elite (top quartile) and one diverse (any island with a distinct feature-cell — see [Feature dimensions](#feature-dimensions)).
- **Migration** — pick one donor island (the source of the technique) and one recipient island (where the technique will be grafted in). The parent you actually edit is on the recipient island.
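
The first two selection rules can be sketched as follows. The `Candidate` record and both function names are hypothetical, candidates are assumed to carry a numeric fitness (entries rejected with a null fitness would be filtered out first), and breaking island ties by lowest island number is an assumption the playbook does not state.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    id: str         # e.g. "c007"
    island: int     # 0 .. num_islands - 1
    fitness: float  # tsb/pandas ratio, lower is better
    gen: int        # iteration in which the candidate was evaluated


def pick_exploitation_parent(population: list[Candidate]) -> Candidate:
    """Best (lowest) fitness; ties go to the most recent generation."""
    return min(population, key=lambda c: (c.fitness, -c.gen))


def pick_exploration_island(population: list[Candidate], num_islands: int) -> int:
    """Island with the fewest members; ties go to the lowest island number."""
    counts = Counter(c.island for c in population)
    return min(range(num_islands), key=lambda i: (counts[i], i))


population = [
    Candidate("c001", island=0, fitness=1.80, gen=1),
    Candidate("c002", island=1, fitness=1.20, gen=2),
    Candidate("c003", island=1, fitness=1.20, gen=3),
]
print(pick_exploitation_parent(population).id)  # c003: tied on fitness, more recent
print(pick_exploration_island(population, 5))   # 2: first island with no members
```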

### Step 4. Apply the operator

Frame your reasoning using the matching prompt template:

- Exploitation or Exploration → `strategy/prompts/mutation.md`
- Crossover or Migration → `strategy/prompts/crossover.md`

Before writing any code, state (in your visible reasoning):

1. Chosen operator + why.
2. Parent(s) picked — their IDs, island, score, and a one-line summary of each parent's approach.
3. What specifically you're changing, and your hypothesis for *why* it should improve the fitness.
4. Validity pre-check — walk through why the proposed candidate will satisfy each invariant:
   - Existing tests for `sortValues` will pass (numeric + NaN, string, ascending/descending, both `naPosition` values, empty Series).
   - Public signature unchanged: `sortValues(ascending = true, naPosition: "first" | "last" = "last"): Series<T>`.
   - No new runtime dependency added to `package.json`.
   - No `any`, no `as`, no `@ts-ignore`.
   - Index alignment preserved — every output value is paired with the original index of the input row it came from.
5. Novelty check: confirm this is not a near-duplicate of an existing population member or of anything in the state file's 🚧 Foreclosed Avenues.

### Step 5. Implement

Edit only the files listed in `program.md`'s Target section. The diff style for this program is **minimal diff**: `series.ts` is a large file and only the body of `sortValues` (plus, occasionally, a small private helper added immediately above it) should change. Do not reformat unrelated parts of the file.

### Step 6. Evaluate

Run the evaluation command from `program.md`. Parse the `fitness` field from the JSON output (along with `tsb_mean_ms` and `pandas_mean_ms` for the population entry).

### Step 7. Update the population

Regardless of whether the iteration is accepted or rejected at the branch level, the candidate has been tried and should be recorded in the population — the population is a memory of what's been explored, not just what's been kept.

Append a new entry to the `## 🧬 Population` subsection in the state file using the schema below. Then enforce these caps:

- **Population cap**: `population_size` from `code/config.yaml` (default 40). If exceeded, evict the *worst* member in the most-crowded feature cell (MAP-Elites style — never evict the best of any cell).
- **Elite archive**: the top `archive_size` from `code/config.yaml` (default 10) by fitness are always preserved regardless of cell crowding.
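
The two caps interact, so it can help to see them applied together. The sketch below picks a single eviction victim. `Member` and `pick_eviction` are hypothetical names, and falling back to the next most crowded cell when every member of the most crowded one is protected is an assumption the playbook does not spell out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Member:
    id: str
    cell: tuple[str, str]  # (storage, algorithm) feature cell
    fitness: float         # lower is better


def pick_eviction(
    population: list[Member], population_size: int, archive_size: int
) -> Optional[Member]:
    """Return the member to evict, or None if the cap is not exceeded."""
    if len(population) <= population_size:
        return None

    # Elite archive: the top archive_size members by fitness are never evicted.
    elites = {m.id for m in sorted(population, key=lambda m: m.fitness)[:archive_size]}

    cells: dict[tuple[str, str], list[Member]] = defaultdict(list)
    for m in population:
        cells[m.cell].append(m)

    # Visit cells from most to least crowded, skipping protected members.
    for members in sorted(cells.values(), key=len, reverse=True):
        ranked = sorted(members, key=lambda m: m.fitness)
        evictable = [m for m in ranked[1:] if m.id not in elites]  # ranked[0] is the cell's best
        if evictable:
            return evictable[-1]  # worst unprotected member of this cell
    return None  # every member is protected; leave the population as it is


population = [
    Member("c001", ("boxed-pairs", "comparison"), 1.80),
    Member("c002", ("boxed-pairs", "comparison"), 1.60),
    Member("c003", ("boxed-pairs", "comparison"), 2.10),
    Member("c004", ("parallel-typed-arrays", "comparison"), 1.10),
]
victim = pick_eviction(population, population_size=3, archive_size=1)
print(victim.id if victim else None)
```

In this example the `boxed-pairs` cell is the most crowded, `c002` is its best member and is therefore protected, and `c003` is the worst of the rest, so it is the one selected.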

### Step 8. Fold through to the default loop

Continue with the normal autoloop Step 5 (Accept or Reject → commit / discard, update state file's Machine State, Iteration History, Lessons Learned, etc.) as defined in the workflow. The only additional requirements from AlphaEvolve are:

- The Iteration History entry must include `operator`, `parent_id(s)`, `island`, and `fitness` fields (in addition to the normal status/change/metric/notes).
- Lessons Learned additions should be phrased as *transferable heuristics* about the problem space, not as reports of what this iteration did. (E.g. "Indirect sort over `Uint32Array` indices beats object-pair sort above n≈10k" — not "Iteration 17 tried indirect sort.")

## Feature dimensions

MAP-Elites partitions the population into **feature cells**. Each candidate is described by a small tuple of qualitative features, and the population keeps the best candidate per cell — this is what creates diversity pressure even when many candidates have similar fitness.

For this program, use these feature dimensions:

- **Dimension 1 — Storage**: `boxed-pairs` / `parallel-typed-arrays` / `packed-typed-array` / `wasm-buffer`
- **Dimension 2 — Algorithm class**: `comparison` / `non-comparison` / `hybrid`

When evaluating a candidate, classify it into one cell per dimension. The combined `(storage, algorithm)` tuple is its **feature cell**. Record the cell in the population entry (see schema).

## Population schema

The population lives in the state file `tsb-perf-evolve.md` on the `memory/autoloop` branch as a subsection. Use this exact layout so maintainers can read and edit it:

```markdown
## 🧬 Population

> 🤖 *Managed by the AlphaEvolve strategy. One entry per candidate that has been evaluated (accepted or rejected). Newest first.*

### Candidate <id> · island <n> · fitness <score> · gen <iter>

- **Operator**: exploitation / exploration / crossover / migration
- **Parent(s)**: [<id1>, <id2>]
- **Feature cell**: <storage-bucket> · <algorithm-bucket>
- **Approach**: <one-line summary of the technique>
- **Status**: ✅ accepted / ❌ rejected
- **Notes**: <what worked or didn't, anything worth remembering — e.g. "tsb=12.3ms / pandas=8.7ms / ratio=1.41">

Code:

\`\`\`typescript
<the candidate sortValues body, or a diff against parent if too large to inline>
\`\`\`

---
```

Identifiers:
- `<id>` is `c{NNN}` zero-padded, monotonically increasing across the program's lifetime.
- `<n>` is the island number (0-indexed, 0..4 for this program).
- `<score>` is the raw `fitness` (the tsb/pandas ms ratio).
- `<iter>` is the iteration number from the Machine State table.

When evicting members under the population cap, **never** delete an entry — instead, prepend a strikethrough header (`### ~~Candidate c042~~ (evicted, gen 87)`) and remove the entire `Code:` block (both the `Code:` label and the surrounding triple-backtick `typescript` code fence) to keep the file size bounded. The metadata stays so future iterations can see what was tried.
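
The identifier format and the eviction rewrite can be sketched as two small helpers. Both names are hypothetical. The sketch assumes the strikethrough header replaces the entry's existing `### Candidate …` line, and that the stored entry uses real (unescaped) code fences.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fenced block


def format_candidate_id(n: int) -> str:
    """Format a candidate number as c{NNN}, zero-padded to three digits."""
    return f"c{n:03d}"


def mark_evicted(entry: str, gen: int) -> str:
    """Strike through the entry header and drop its Code: block, keeping the metadata."""
    # Rewrite the "### Candidate <id> ..." header into the strikethrough form.
    entry = re.sub(
        r"^### Candidate (c\d+).*$",
        lambda m: f"### ~~Candidate {m.group(1)}~~ (evicted, gen {gen})",
        entry,
        count=1,
        flags=re.MULTILINE,
    )
    # Remove the "Code:" label together with its fenced typescript block.
    code_block = re.compile(
        r"Code:\n+" + re.escape(FENCE) + r"typescript\n.*?\n" + re.escape(FENCE) + r"\n*",
        re.DOTALL,
    )
    return code_block.sub("", entry)


entry = "\n".join([
    "### Candidate c042 · island 1 · fitness 1.41 · gen 60",
    "",
    "- **Operator**: exploitation",
    "- **Status**: rejected",
    "",
    "Code:",
    "",
    FENCE + "typescript",
    "return this;",
    FENCE,
    "",
    "---",
])

print(format_candidate_id(42))
print(mark_evicted(entry, gen=87))
```

Keeping the metadata bullets while dropping only the code keeps the state file bounded in size, as the rule above intends.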
