# Autonomous Performance Tuning via the Algorithm Engineering Cycle

This program turns an autonomous agent into a performance engineer that follows the Algorithm Engineering (AE) methodology: a cycle of design, analysis, implementation, and experimental evaluation, driven by falsifiable hypotheses (cf. Sanders, "Algorithm Engineering - An Attempt at a Definition").

```
                   ┌─────────────────┐
                   │ Realistic Model │
                   │  of the system  │
                   └────────┬────────┘
                            │
                            ▼
 ┌──────────┐          ┌────────┐          ┌─────────────┐
 │ Analysis │◄─────────│ Design │─────────►│ Falsifiable │
 │          │          │        │          │ Hypothesis  │
 └────┬─────┘          └────────┘          └──────┬──────┘
      │ (induction)                               │
      │ ◄──────────────────────────────────       │
      ▼                                           ▼
 ┌────────────────┐                        ┌──────────────┐
 │  Performance   │                        │  Implement   │
 │   Guarantees   │                        │  the change  │
 │  (deduction)   │                        └──────┬───────┘
 └────────────────┘                               │
                                                  ▼
        ┌─────────────────────────────────────────────────┐
        │                   Experiment                    │
        │       Run, measure, compare to baseline         │
        └────────────────────────┬────────────────────────┘
                                 │
                         ┌───────▼───────┐
                         │   Evaluate    │
                         │ keep/discard  │
                         └───────┬───────┘
                                 │
                      (loop back to Design)
```

## Overview

The agent receives:
- A **target program** to optimize (the code under test).
- A **metric** to improve (e.g. latency, throughput, val_bpb, memory usage).
- A **benchmark harness** that produces reproducible measurements.
- A **time budget** per experiment.

The agent then runs an autonomous loop, applying the AE cycle to systematically improve performance. Each iteration produces a falsifiable hypothesis ("changing X will improve metric Y by roughly Z"), implements it, runs the experiment, and either keeps or discards the change based on the result.

## Setup

To set up a new tuning session, work with the user to:

1. **Agree on a run tag**: propose a tag based on today's date (e.g. `mar21`). The branch `autotune/<tag>` must not already exist.
2. **Create the branch**: `git checkout -b autotune/<tag>` from the current main/master.
3. **Read the in-scope files**: Understand the full context:
   - The **target file(s)** the agent may modify.
   - The **benchmark harness** (read-only) that defines the metric and evaluation procedure.
   - Any **configuration or constants** that are fixed.
4. **Identify constraints**: Understand what the agent can and cannot change (see below).
5. **Verify the benchmark works**: Run the benchmark once to confirm it produces output.
6. **Initialize results.tsv**: Create `results.tsv` with just the header row.
7. **Confirm and go**: Confirm the setup looks good with the user, then begin.
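
The branch and log bootstrap (steps 1, 2, and 6) can be sketched as shell commands. This is a sketch in a throwaway sandbox repo; the tag `mar21` and the benchmark command are placeholders for whatever the user supplies, and a real session runs these in the project repository instead.

```shell
# Sandboxed sketch of steps 1, 2, and 6. The tag "mar21" is a placeholder;
# a real session runs in the project repo, not a mktemp sandbox.
cd "$(mktemp -d)" && git init -q
git config user.email "agent@example.com" && git config user.name "autotune"
git commit -q --allow-empty -m "baseline"

tag="mar21"
if git rev-parse --verify "autotune/${tag}" >/dev/null 2>&1; then
  echo "branch autotune/${tag} already exists; pick another tag" >&2
else
  git checkout -q -b "autotune/${tag}"     # step 2: create the branch
fi

# step 6: results.tsv starts as just the header row
printf 'commit\tmetric_value\tresource_usage\tstatus\thypothesis\n' > results.tsv
```

Using `git rev-parse --verify` up front enforces the "must not already exist" rule before any work happens on the branch.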

## Constraints

The user defines these per project. The agent must respect them strictly.

**What the agent CAN do:**
- Modify the designated target file(s). Everything within them is fair game: algorithms, data structures, parameters, control flow, memory layout, parallelism, etc.

**What the agent CANNOT do:**
- Modify the benchmark harness or evaluation code.
- Install new packages or add dependencies beyond what is already available.
- Modify the metric definition or measurement methodology.
- Change the time budget or input data.

## The AE Cycle in Practice

Each iteration of the loop implements one full AE cycle:

### 1. Realistic Model (Understand the System)

Before proposing changes, the agent must have an accurate mental model of:
- The target program's architecture and hot paths.
- The hardware/runtime environment (CPU, GPU, memory hierarchy, I/O).
- Where time and resources are actually spent (profiling data, prior experiments).
- Which assumptions from previous iterations still hold.

The model should be updated after every experiment. When experiments contradict expectations, the model is wrong and must be revised, not the experiment.

### 2. Design (Formulate a Hypothesis)

Propose a specific, falsifiable hypothesis. Good hypotheses are:
- **Specific**: "Increasing batch size from 32 to 64 will improve throughput by ~15% because the GPU is underutilized" (not "make it faster").
- **Falsifiable**: There must be a conceivable experimental outcome that would disprove it.
- **Grounded**: Based on the current model, prior experimental results, or established algorithmic knowledge.
- **Minimal**: Change one thing at a time when possible, to isolate effects.

Hypotheses can come from:
- **Induction**: Patterns observed in prior experiments (e.g. "larger batches consistently helped, so an even larger batch might help further").
- **Creative insight**: Novel algorithmic ideas, known techniques from the literature, or architectural redesigns.
- **Analysis**: Deductive reasoning about algorithmic complexity, cache behavior, parallelism, etc.

### 3. Analysis (Predict Before Measuring)

Before implementing, reason about the expected outcome:
- What is the expected direction and magnitude of improvement?
- What could go wrong (OOM, numerical instability, regression on other metrics)?
- Is the complexity cost worth the expected gain? (Simplicity criterion: all else equal, simpler is better.)

This step prevents wasted experiments and builds understanding even when hypotheses fail.

### 4. Implementation (Make the Change)

Implement the change in the target file(s). Principles:
- **Minimal diff**: Change only what the hypothesis requires.
- **Correctness first**: The program must still produce correct results.
- **Commit before running**: Every experiment has a corresponding git commit, so changes are traceable and revertable.

### 5. Experiment (Run and Measure)

Run the benchmark and collect results:
- Redirect all output to a log file to avoid flooding the agent's context.
- Extract the key metric(s) from the log.
- If the run crashes, diagnose from the log tail.
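
The run-and-measure step above can be sketched as follows. Here `./bench.sh` is a stub standing in for the real benchmark command, and `primary_metric` stands in for the real metric name; an empty grep result is the crash signal.

```shell
# Sketch of run-and-measure. ./bench.sh is a stub standing in for the real
# benchmark command; "primary_metric" stands in for the real metric name.
cd "$(mktemp -d)"
printf '#!/bin/sh\necho "primary_metric: 0.9979"\n' > bench.sh && chmod +x bench.sh

./bench.sh > run.log 2>&1                    # redirect everything to the log
metric=$(grep "^primary_metric:" run.log | awk '{print $2}')
if [ -z "$metric" ]; then
  tail -n 50 run.log                         # crashed: diagnose from the tail
fi
```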

### 6. Evaluate (Keep or Discard)

Compare the result to the current best:
- **Improvement**: Keep the change. The branch advances.
- **No improvement or regression**: Discard. `git reset --hard` back to the previous best commit.
- **Crash**: Attempt a quick fix if it's trivial (typo, missing import). Otherwise log it as a crash and move on.

Apply the **simplicity criterion** when evaluating:
- A tiny improvement that adds significant complexity is not worth keeping.
- An equal result with simpler code is a win; keep it.
- Removing code while maintaining performance is a great outcome.
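
One practical wrinkle in the comparison: shell test brackets cannot compare floating-point metrics, so a sketch of the keep/discard decision can delegate the numeric test to awk. The metric values below are placeholders, and lower-is-better is assumed (as for val_bpb or latency):

```shell
# Hypothetical keep/discard comparison; lower is assumed better (as for
# val_bpb). [ ] cannot compare floats, so awk does the numeric test.
best=0.997900
new=1.005000
if awk -v n="$new" -v b="$best" 'BEGIN { exit !(n < b) }'; then
  decision="keep"
else
  decision="discard"    # in the repo this is followed by a git reset
fi
```

For a higher-is-better metric such as throughput, the comparison flips to `n > b`.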

## Output Format

The benchmark should print a summary with at least the primary metric. The agent extracts key values via grep. Example:

```
grep "^primary_metric:" run.log
```

## Logging Results

Every experiment is logged to `results.tsv` (tab-separated). The TSV has a header row and these columns:

```
commit	metric_value	resource_usage	status	hypothesis
```

1. **commit**: git commit hash (short, 7 chars)
2. **metric_value**: the primary metric achieved (use 0.000000 for crashes)
3. **resource_usage**: peak resource consumption, e.g. memory in GB (use 0.0 for crashes)
4. **status**: `keep`, `discard`, or `crash`
5. **hypothesis**: the falsifiable hypothesis that motivated this experiment

Example:

```
commit	metric_value	resource_usage	status	hypothesis
a1b2c3d	0.997900	44.0	keep	baseline
b2c3d4e	0.993200	44.2	keep	doubling LR will reduce val_bpb because current LR undershoots the loss basin
c3d4e5f	1.005000	44.0	discard	GeLU activation will improve gradient flow in early layers
d4e5f6g	0.000000	0.0	crash	doubling model width will improve capacity (OOM)
```

Note: do not commit `results.tsv`; leave it untracked.
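
Appending a row is safest with `printf`, whose `\t` escapes keep the file strictly tab-separated (`echo` handles tabs inconsistently across shells). The values below are the baseline row from the example above:

```shell
# Append one experiment row; printf's \t keeps the file strictly tab-separated.
cd "$(mktemp -d)"
printf 'commit\tmetric_value\tresource_usage\tstatus\thypothesis\n' > results.tsv
printf '%s\t%s\t%s\t%s\t%s\n' \
  "a1b2c3d" "0.997900" "44.0" "keep" "baseline" >> results.tsv
```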

## The Experiment Loop

The experiment runs on a dedicated branch (e.g. `autotune/mar21`).

**The first run** is always the unmodified baseline.

LOOP FOREVER (DO NOT STOP EVER):

1. **Model**: Review the current state: branch, recent results, what has worked and what hasn't. Update your mental model.
2. **Hypothesize**: Formulate a specific, falsifiable hypothesis for the next change.
3. **Analyze**: Predict the expected outcome and assess whether the experiment is worth running.
4. **Implement**: Modify the target file(s) according to the hypothesis.
5. **Commit**: `git commit` the change.
6. **Experiment**: Run the benchmark: `<run_command> > run.log 2>&1` (redirect everything; do NOT use tee or let output flood your context).
7. **Measure**: Extract results: `grep "^<metric>:" run.log`
8. **Evaluate**:
   - If the grep output is empty, the run crashed. Run `tail -n 50 run.log` to read the error. Attempt a fix if trivial; otherwise give up on this hypothesis.
   - If the metric improved: keep the commit; the branch advances.
   - If the metric is equal or worse: `git reset --hard` back to the previous best commit.
9. **Log**: Record the result in `results.tsv`.
10. **Reflect**: What did this result tell you? Update your model. Did it confirm, refute, or refine your hypothesis? Use induction from the accumulated results to generate the next hypothesis.
11. **Go to 1**.
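
Steps 4-8 of one iteration can be sketched end to end in a sandbox repo. Everything here is a placeholder: the target file `target.py`, the metric name `val_bpb`, the baseline value, and the faked run output (a real iteration runs the actual benchmark instead of echoing a number):

```shell
# Sandboxed sketch of steps 4-8 for one iteration. target.py, val_bpb, the
# baseline value 0.997900, and the faked run output are all placeholders.
cd "$(mktemp -d)" && git init -q
git config user.email "agent@example.com" && git config user.name "autotune"
echo "batch_size = 32" > target.py
git add target.py && git commit -qm "baseline"
best_commit=$(git rev-parse --short=7 HEAD)
best_metric="0.997900"

echo "batch_size = 64" > target.py                     # 4. implement
git commit -qam "hypothesis: batch 64 fills idle GPU"  # 5. commit
echo "val_bpb: 1.005000" > run.log                     # 6. (faked) run
metric=$(grep "^val_bpb:" run.log | awk '{print $2}')  # 7. measure
if awk -v n="$metric" -v b="$best_metric" 'BEGIN { exit !(n < b) }'; then
  best_commit=$(git rev-parse --short=7 HEAD)          # 8. improved: keep
else
  git reset -q --hard "$best_commit"                   # 8. worse: discard
fi
```

Because the faked metric (1.005000) is worse than the baseline, this iteration ends with the repository reset to the best commit, which is exactly the discard path of step 8.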

## Operational Rules

**Timeout**: Each experiment should complete within the configured time budget (plus startup overhead). If a run exceeds 2x the budget, kill it and treat it as a failure.
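
One way to enforce the 2x-budget kill, assuming GNU coreutils' `timeout(1)` is available; the 1-second budget and the deliberately overrunning stub are illustrative only:

```shell
# Enforcing the 2x-budget kill with timeout(1) from GNU coreutils. The
# 1-second budget and the deliberately overrunning stub are illustrative.
cd "$(mktemp -d)"
printf '#!/bin/sh\nsleep 10\n' > bench.sh && chmod +x bench.sh
budget_s=1

rc=0
timeout "$((budget_s * 2))" ./bench.sh > run.log 2>&1 || rc=$?
if [ "$rc" -eq 124 ]; then      # exit status 124 means timeout fired
  status="crash"                # log it as a failed experiment
fi
```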

**Crashes**: Use judgment. Fix trivial bugs (typo, missing import) and re-run. If the idea is fundamentally broken, skip it, log "crash", and move on.

**NEVER STOP**: Once the loop begins, do NOT pause to ask the user whether you should continue. The user may be away and expects autonomous operation. If you run out of ideas, think harder:
- Re-read the target code and benchmark for new angles.
- Combine near-misses from previous experiments.
- Try more radical architectural changes.
- Revisit discarded ideas with modifications.
- Look at algorithmic alternatives for the core computation.

The loop runs until the user manually interrupts it.

**Simplicity criterion**: All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it. Removing code for equal or better results is a great outcome. Weigh the complexity cost against the improvement magnitude.

**Resource usage** is a soft constraint. Some increase is acceptable for meaningful metric gains, but it should not blow up dramatically.

## Philosophical Foundation

This methodology mirrors Popper's scientific method as applied to algorithm engineering (Sanders, 2009):

- **Falsifiable hypotheses** drive every experiment. "It might be faster" is not a hypothesis. "Replacing the O(n^2) inner loop with a hash-based lookup will reduce latency by ~40% for inputs >10k elements" is.
- **Induction** from experiments feeds back into new hypotheses. Each result, whether positive or negative, refines understanding.
- **Reproducibility** is ensured by git commits, deterministic benchmarks, and logged results.
- **The cycle never ends**: there is always another hypothesis to test, another angle to explore, another simplification to try. Algorithm engineering is not a linear process but a continuous spiral of refinement.