| layout | default |
|---|---|
| title | Chapter 8: Customization and Scaling |
| nav_order | 8 |
| parent | autoresearch Tutorial |
| format_version | v2 |
| why | The baseline autoresearch configuration assumes an H100 with 80 GB VRAM. Most practitioners have smaller GPUs, different operating systems, or want to run multiple agents in parallel. This chapter explains how to adapt every layer of the system for your actual hardware. |
| mental_model | autoresearch is parameterized by hardware: model size, batch size, and gradient accumulation can all be tuned to fit any GPU. The key invariant is TIME_BUDGET=300s; everything else is negotiable. |

The reference configuration in train.py is tuned for a single H100 SXM 80 GB. Run unchanged on other hardware, it fails in different ways:
- An RTX 3090 (24 GB): runs out of memory immediately
- An A10 (24 GB): runs out of memory immediately
- A MacBook with M3 Max: Flash Attention 3 is not available
- A Windows machine: path and library issues
- Two H100s: only uses one GPU
This chapter provides concrete modifications for each scenario. The guiding principle: TIME_BUDGET=300s is sacred. Everything else can be changed.
GPU memory is the binding constraint. The function below gives a rough estimate of the training memory a candidate configuration needs, which you can compare against your GPU's VRAM:

    def estimate_memory_gb(n_layer, n_embd, n_head, block_size, batch_size, grad_accum):
        """
        Rough estimate of GPU memory (GB) for training.
        Accounts for: parameters, optimizer states, activations, K/V tensors.
        Gradients and framework overhead are not included.
        grad_accum does not change the peak: micro-batches run one at a time.
        """
        # Parameters (float32 in optimizer, bf16 in forward)
        params = n_layer * (
            4 * n_embd * n_embd +    # Q, K, V, O projections
            8 * n_embd * n_embd      # MLP (4× hidden × 2 matrices)
        ) + n_embd * 50257           # embedding / LM head (counted once)
        param_gb = params * 4 / 1e9  # float32

        # Optimizer states (AdamW: 2× params, Muon: 1× params); use the larger
        optimizer_gb = param_gb * 2.0

        # Activations: roughly 12 * n_layer * batch_size * block_size * n_embd values (bf16)
        activation_gb = 12 * n_layer * batch_size * block_size * n_embd * 2 / 1e9

        # K/V tensors kept for the backward pass:
        # 2 * n_layer * n_kv_head * block_size * head_dim * batch_size values (bf16)
        head_dim = n_embd // n_head
        n_kv_head = max(1, n_head // 3)  # assuming GQA with 3× reduction
        kv_gb = 2 * n_layer * n_kv_head * block_size * head_dim * batch_size * 2 / 1e9

        total = param_gb + optimizer_gb + activation_gb + kv_gb
        return total, {
            'params': param_gb, 'optimizer': optimizer_gb,
            'activations': activation_gb, 'kv_cache': kv_gb
        }

    # Example: check if a config fits in 24 GB
    total, breakdown = estimate_memory_gb(
        n_layer=8, n_embd=512, n_head=8,
        block_size=512, batch_size=4, grad_accum=4
    )
    print(f"Estimated memory: {total:.1f} GB")
    for k, v in breakdown.items():
        print(f"  {k}: {v:.1f} GB")

graph TD
GPU{GPU VRAM} -->|80 GB H100| H100[H100 Config<br/>n_layer=12, n_embd=768<br/>block_size=1024, batch=8]
GPU -->|40 GB A100| A100[A100 Config<br/>n_layer=10, n_embd=640<br/>block_size=1024, batch=4]
GPU -->|24 GB RTX 4090| RTX4090[RTX 4090 Config<br/>n_layer=8, n_embd=512<br/>block_size=512, batch=4]
GPU -->|16 GB RTX 4080| RTX4080[RTX 4080 Config<br/>n_layer=6, n_embd=384<br/>block_size=512, batch=2]
GPU -->|8 GB RTX 3070| RTX3070[RTX 3070 Config<br/>n_layer=4, n_embd=256<br/>block_size=256, batch=2]
GPU -->|Apple MPS| MPS[M-Series Config<br/>n_layer=6, n_embd=384<br/>No FA3, block_size=512]
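
The decision tree above can also be written as a lookup table. The sketch below is illustrative only: the tier values are copied from the diagram, while the names `CONFIG_TIERS` and `pick_config` are assumptions and do not appear in train.py. The Apple MPS branch is left out because it is selected by device type rather than by VRAM.

```python
# Hypothetical helper; tier values are copied from the diagram above.
CONFIG_TIERS = [
    # (minimum VRAM in GB, settings)
    (80, {"n_layer": 12, "n_embd": 768, "block_size": 1024, "batch_size": 8}),
    (40, {"n_layer": 10, "n_embd": 640, "block_size": 1024, "batch_size": 4}),
    (24, {"n_layer": 8, "n_embd": 512, "block_size": 512, "batch_size": 4}),
    (16, {"n_layer": 6, "n_embd": 384, "block_size": 512, "batch_size": 2}),
    (8, {"n_layer": 4, "n_embd": 256, "block_size": 256, "batch_size": 2}),
]


def pick_config(vram_gb):
    """Return a copy of the settings for the largest tier that fits in vram_gb."""
    for min_vram, settings in CONFIG_TIERS:
        if vram_gb >= min_vram:
            return dict(settings)
    raise ValueError(f"no tier defined below {CONFIG_TIERS[-1][0]} GB")


print(pick_config(24))
```

A card that sits between two tiers, such as a 48 GB GPU, falls back to the next smaller tier, which leaves headroom at the cost of some capacity.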

    # train.py modifications for RTX 4090
    from dataclasses import dataclass

    @dataclass
    class GPTConfig:
        vocab_size: int = 50257
        block_size: int = 512           # ↓ from 1024 (memory)
        n_layer: int = 8                # ↓ from 12
        n_head: int = 8                 # ↓ from 12
        n_kv_head: int = 2              # ↓ from 4 (more aggressive GQA)
        n_embd: int = 512               # ↓ from 768
        WINDOW_PATTERN: str = "SSSL"
        SHORT_WINDOW: int = 64          # ↓ from 128 (scales with block_size)
        use_value_residual: bool = True
        dropout: float = 0.0
        logit_softcap: float = 15.0
        use_squared_relu: bool = True

    # Training constants for RTX 4090
    BATCH_SIZE = 8          # physical micro-batch
    GRAD_ACCUM_STEPS = 8    # logical batch = 64 sequences × 512 tokens = 32768 tokens
    TIME_BUDGET = 300       # NEVER CHANGE THIS

Flash Attention 3 is CUDA-only. For MPS, use PyTorch's built-in scaled_dot_product_attention:

    # train.py modifications for Apple MPS
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Detect device
    if torch.cuda.is_available():
        device = torch.device('cuda')
    elif torch.backends.mps.is_available():
        device = torch.device('mps')
    else:
        device = torch.device('cpu')

    # Replace Flash Attention 3 with SDPA
    class CausalSelfAttentionMPS(nn.Module):
        def forward(self, x, x0, cos, sin):
            # ... (same Q, K, V projection, RoPE, QK-norm as before) ...
            # Use PyTorch SDPA instead of flash_attn
            # SDPA expects (batch, heads, seq, head_dim); flash_attn uses
            # (batch, seq, heads, head_dim), so transpose q, k, v if needed
            # MPS supports SDPA with causal mask
            attn_output = F.scaled_dot_product_attention(
                q, k, v,
                attn_mask=None,
                is_causal=True,
                # Note: sliding window not natively supported on MPS
                # Use full attention for all layers on MPS
            )
            return attn_output

MPS-specific changes:
- Replace flash_attn_varlen_func with F.scaled_dot_product_attention
- Remove the sliding window for S-layers (MPS SDPA does not support window_size)
- Use torch.float32 instead of torch.bfloat16 (MPS bfloat16 support is partial)
- Reduce batch size and model size (MPS unified memory is slower than CUDA HBM)
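
The device, dtype, and attention choices listed above can be collected in one place. This is a minimal sketch, not code from train.py: `Runtime` and `select_runtime` are hypothetical names, and the float32 CPU fallback is an assumption, since the chapter only specifies the CUDA and MPS cases.

```python
from typing import NamedTuple


class Runtime(NamedTuple):
    device: str     # "cuda", "mps", or "cpu"
    dtype: str      # name of the torch dtype to train in
    attention: str  # "flash_attn" or "sdpa"


def select_runtime(cuda_available: bool, mps_available: bool) -> Runtime:
    """Map device availability to the settings described in this chapter."""
    if cuda_available:
        return Runtime("cuda", "bfloat16", "flash_attn")
    if mps_available:
        # MPS: no Flash Attention 3, and float32 instead of bfloat16
        return Runtime("mps", "float32", "sdpa")
    # Assumption: CPU uses the same conservative settings as MPS
    return Runtime("cpu", "float32", "sdpa")


print(select_runtime(cuda_available=False, mps_available=True))
```

In a real script the two flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`; they are passed in here so the selection logic can be checked without a GPU.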

    # pyproject.toml for MPS
    [project]
    dependencies = [
        "torch>=2.2.0",      # remove ==2.9.1 CUDA requirement
        # remove flash-attn (CUDA only)
        "rustbpe",
        "tiktoken",
        "pyarrow",
        "huggingface-hub",
        "numpy",
    ]

Windows requires a few path and library adjustments:

    # Fix path separators in prepare.py
    import pathlib
    import torch

    DATA_DIR = pathlib.Path("data")  # not str "data/"; use pathlib throughout

    # Fix multiprocessing for Windows
    if __name__ == '__main__':
        # The __main__ guard is required on Windows: worker processes are
        # spawned (there is no fork) and re-import this module on startup
        torch.multiprocessing.set_start_method('spawn', force=True)
        main()

Flash Attention 3 on Windows requires WSL2 or a native CUDA build with a specific Visual Studio toolchain. The community maintains a WSL2 setup guide in the GitHub discussions.
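
To illustrate why the pathlib change matters, the sketch below joins the same components under POSIX and Windows rules. The pure path classes are used so that the example behaves identically on any operating system; `shard_path` and the file name `shard_00000.parquet` are made up for this illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath


def shard_path(data_dir, shard_name):
    """Join with the / operator so the separator comes from the path class."""
    return data_dir / shard_name


# Same call, different separator depending on the platform's path flavour
print(shard_path(PurePosixPath("data"), "shard_00000.parquet"))    # data/shard_00000.parquet
print(shard_path(PureWindowsPath("data"), "shard_00000.parquet"))  # data\shard_00000.parquet
```

In prepare.py itself, `pathlib.Path` picks the right flavour automatically, so no separator is ever hard-coded.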
For AMD GPUs (MI250X, MI300X, RX 7900 XTX):

    # Install ROCm-compatible PyTorch
    pip install torch --index-url https://download.pytorch.org/whl/rocm6.1

    # train.py: replace flash_attn with hipBLASLt-backed SDPA
    # AMD GPUs support torch.nn.functional.scaled_dot_product_attention
    # with flash attention implementation via ROCm
    # The flash-attn package has a ROCm fork:
    #   pip install flash-attn-rocm (community maintained)
    # Or use SDPA which is automatically accelerated on ROCm:
    attn_output = torch.nn.functional.scaled_dot_product_attention(
        q, k, v, is_causal=True
    )

For learning and experimentation on modest hardware, use a "tiny" configuration:

    # Tiny configuration: runs on any GPU with 8+ GB
    from dataclasses import dataclass

    @dataclass
    class GPTConfig:
        vocab_size: int = 50257
        block_size: int = 256
        n_layer: int = 4
        n_head: int = 4
        n_kv_head: int = 1              # MQA (multi-query attention)
        n_embd: int = 256
        WINDOW_PATTERN: str = "SL"      # alternating short/full
        SHORT_WINDOW: int = 32
        use_value_residual: bool = False  # disable for very small models
        dropout: float = 0.0
        logit_softcap: float = 15.0
        use_squared_relu: bool = True

    BATCH_SIZE = 4
    GRAD_ACCUM_STEPS = 4

This configuration uses ~2 GB peak memory and runs at ~200k tokens/second on an RTX 3070. It is suitable for validating experiment ideas before running the full configuration overnight.
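
As a worked example of what the fixed time budget buys with this configuration, the arithmetic below converts the quoted throughput into optimizer steps. It is an upper bound under simplifying assumptions: compilation warm-up and evaluation time are ignored, and the throughput figure is the approximate one quoted above.

```python
BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 4
BLOCK_SIZE = 256
TIME_BUDGET = 300            # seconds
TOKENS_PER_SECOND = 200_000  # approximate RTX 3070 throughput quoted above

# Tokens consumed by one optimizer step (all micro-batches combined)
tokens_per_step = BATCH_SIZE * GRAD_ACCUM_STEPS * BLOCK_SIZE
# Tokens processed if the whole budget were spent training
total_tokens = TOKENS_PER_SECOND * TIME_BUDGET
steps_in_budget = total_tokens // tokens_per_step

print(tokens_per_step)  # 4096
print(total_tokens)     # 60000000
print(steps_in_budget)  # 14648
```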
For users with multiple GPUs (2× A100, 4× H100, etc.):

    # train.py additions for DDP
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def setup_distributed():
        """Initialize the distributed process group."""
        dist.init_process_group(backend='nccl')
        rank = dist.get_rank()
        world_size = dist.get_world_size()
        # Single node: the global rank doubles as the local GPU index.
        # On multiple nodes, use the LOCAL_RANK variable set by torchrun instead.
        torch.cuda.set_device(rank)
        return rank, world_size

    # Launch command:
    # torchrun --nproc_per_node=4 train.py
    rank, world_size = setup_distributed()
    device = torch.device(f'cuda:{rank}')
    model = GPT(config).to(device)
    model = DDP(model, device_ids=[rank])

    # In training loop: data is sharded across GPUs
    # Each GPU processes a different micro-batch
    # Gradients are automatically reduced across GPUs by DDP

    # LR scales linearly with world_size (linear scaling rule)
    max_lr = 3e-4 * world_size

    # Effective batch size scales with world_size
    effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS * world_size

With 4× H100:
- Effective batch size: 4× larger
- Throughput: ~3.8× (some communication overhead)
- Steps per 300s: ~3.8× more
- val_bpb typically 5–10% better than single GPU
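
The scaling arithmetic behind these figures can be checked in isolation. In the sketch below, `scale_for_ddp` is a hypothetical helper; the base learning rate of 3e-4 comes from the snippet above, while the batch size and accumulation steps are illustrative values rather than settings taken from train.py.

```python
def scale_for_ddp(base_lr, batch_size, grad_accum_steps, world_size):
    """Apply the linear scaling rule when going from 1 GPU to world_size GPUs."""
    return {
        # Learning rate grows in proportion to the number of GPUs
        "max_lr": base_lr * world_size,
        # Sequences contributing to each optimizer step, summed over GPUs
        "effective_batch": batch_size * grad_accum_steps * world_size,
    }


scaled = scale_for_ddp(base_lr=3e-4, batch_size=8, grad_accum_steps=4, world_size=4)
print(scaled)
```

Linear scaling is a rule of thumb, so it is worth treating the scaled learning rate as a starting point and letting the agent tune it from there.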
autoresearch's branch-based design enables multiple agents to run simultaneously without conflicts:
graph TD
REPO[autoresearch repo] --> A1[Agent 1<br/>branch: autoresearch/architecture]
REPO --> A2[Agent 2<br/>branch: autoresearch/optimizer]
REPO --> A3[Agent 3<br/>branch: autoresearch/scaling]
A1 -->|modifies train.py| T1[train.py: architectural changes]
A2 -->|modifies train.py| T2[train.py: optimizer changes]
A3 -->|modifies train.py| T3[train.py: scaling changes]
T1 -->|appends| R1[results.tsv on agent 1 machine]
T2 -->|appends| R2[results.tsv on agent 2 machine]
T3 -->|appends| R3[results.tsv on agent 3 machine]
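
Since every agent in the diagram appends to its own results.tsv, choosing a branch to merge means comparing those files. The sketch below ranks branches by their best val_bpb (lower is better). It assumes a tab-separated file with a header row containing `commit` and `val_bpb` columns; that layout, the helper `best_row`, and the sample rows are all hypothetical, so adjust them to the real file.

```python
import csv
import io


def best_row(tsv_text, metric="val_bpb"):
    """Return the row with the lowest metric value from tab-separated text."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return min(rows, key=lambda row: float(row[metric]))


# Made-up file contents standing in for results.tsv copied from two machines
agent_files = {
    "autoresearch/architecture": "commit\tval_bpb\nabc1234\t1.9100\ndef5678\t1.8950\n",
    "autoresearch/optimizer": "commit\tval_bpb\n0a1b2c3\t1.9020\n",
}

# Best row per branch, sorted so the strongest branch comes first
ranking = sorted(
    ((branch, best_row(text)) for branch, text in agent_files.items()),
    key=lambda item: float(item[1]["val_bpb"]),
)
for branch, row in ranking:
    print(branch, row["commit"], row["val_bpb"])
```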
Because each agent works on its own branch and results.tsv is untracked, there are
zero conflicts between agents. In the morning, merge the insights:

    # Collect all results
    git fetch origin
    git log --oneline origin/autoresearch/architecture | head -20
    git log --oneline origin/autoresearch/optimizer | head -20

    # Merge the best result into main
    git checkout main
    git merge origin/autoresearch/architecture  # or whichever branch has the best val_bpb

    # Or cherry-pick specific improvements
    git cherry-pick <best_commit_from_each_branch>

When running on different hardware, update program.md to include hardware-specific constraints:

    # autoresearch program

    ## Hardware Context
    - GPU: RTX 4090 (24 GB VRAM)
    - Current baseline: val_bpb=1.9234 (24 GB config)
    - OOM threshold: memory_gb > 20 (leave 4 GB headroom)

    ## Hardware-Specific Rules
    - If memory_gb > 20: git reset immediately (approaching OOM)
    - BATCH_SIZE must remain 4 (fixed for this GPU)
    - Do NOT increase block_size beyond 512 (OOM risk)
    - Attention backend: Flash Attention 3 kernels are built for Hopper-class GPUs, so verify which backend is installed on this card before relying on it

    ## Adjusted Config
    Config fields you may change: n_layer (4-10), n_embd (384-640), n_head (4-10),
    n_kv_head (1-4), WINDOW_PATTERN, SHORT_WINDOW (32-128), logit_softcap, use_squared_relu

    Config fields you MUST NOT change: block_size=512, BATCH_SIZE=4, TIME_BUDGET=300

To use a different dataset instead of climbmix-400b:

    # In prepare.py: swap the dataset source
    # The only requirement: a dataset with a 'text' column in parquet format
    from huggingface_hub import snapshot_download

    # Instead of climbmix:
    DATASET_NAME = "your-org/your-dataset"
    snapshot_download(
        repo_id=DATASET_NAME,
        repo_type="dataset",
        local_dir=DATA_DIR,
        allow_patterns=["*.parquet"],
    )

The tokenizer should be retrained on the new dataset:

    # In prepare.py: retrain BPE on your data
    # The BPE trainer is dataset-agnostic
    train_tokenizer(stream_texts(DATA_DIR), vocab_size=50257)

The autoresearch community has produced several notable extensions:

| Fork / Extension | Target Hardware | Key Changes |
|---|---|---|
| autoresearch-mps | macOS M-series | Replaced FA3 with SDPA, MPS device support |
| autoresearch-windows | Windows + CUDA | WSL2 setup, path fixes, spawn multiprocessing |
| autoresearch-amd | AMD ROCm | ROCm PyTorch, hipBLASLt attention |
| autoresearch-multi | Multi-GPU DDP | torchrun launcher, linear LR scaling |
| autoresearch-small | Consumer GPUs | Tiny/small configs for 8–24 GB GPUs |
| autoresearch-long | Long context | 4k–8k context with full sliding window |
The default evaluate_bpb uses a single validation set. For more robust evaluation:

    # In prepare.py: multiple evaluation domains
    import numpy as np

    def evaluate_bpb_multi(model, device, T):
        """
        Evaluate on multiple domains for a more complete picture.
        Returns a dict of domain -> val_bpb, plus an 'average' entry.
        load_domain_validation and _evaluate_bpb_on_tokens are helpers
        you need to provide for your own validation splits.
        """
        results = {}
        for domain in ['web', 'books', 'code', 'math']:
            val_tokens = load_domain_validation(domain)
            bpb = _evaluate_bpb_on_tokens(model, device, T, val_tokens)
            results[domain] = bpb
        # Average is computed before the 'average' key is added to the dict
        results['average'] = float(np.mean(list(results.values())))
        return results

Modify the output format in train.py to match what the agent greps:

    # Extended output format
    print(
        f"val_bpb={results['average']:.4f} | "
        f"val_bpb_web={results['web']:.4f} | "
        f"val_bpb_code={results['code']:.4f} | "
        f"memory_gb={memory_gb:.1f} | steps={total_steps}"
    )

Update program.md to grep for the composite metric:

    ## Success Criterion
    Primary metric: val_bpb (the average across domains)
    Also log: val_bpb_web, val_bpb_code for domain-specific tracking

graph TD
TUNE[Performance Tuning] --> T1{torch.compile enabled?}
T1 -->|No| EN_COMPILE[Add: model = torch.compile model]
T1 -->|Yes| T2{gc.freeze called?}
T2 -->|No| EN_GC[Add: gc.freeze before training loop]
T2 -->|Yes| T3{bfloat16 autocast?}
T3 -->|No| EN_BF16[Add: torch.autocast device_type=cuda dtype=torch.bfloat16]
T3 -->|Yes| T4{Flash Attention 3?}
T4 -->|No, CUDA available| EN_FA3[Install flash-attn, use flash_attn_varlen_func]
T4 -->|Yes or MPS| T5{Batch size maximized?}
T5 -->|No| EN_BATCH[Increase BATCH_SIZE until near OOM, then reduce by 10%]
T5 -->|Yes| DONE[Tuning complete]
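
The last step of the tuning flow, raising the batch size until it nearly runs out of memory and then backing off by 10%, can be automated. The sketch below is one possible approach, not part of autoresearch: `find_batch_size` is a hypothetical helper, the doubling search is an assumption about how to "increase", and `fits` is a caller-supplied probe that reports whether a given batch size trains without error.

```python
def find_batch_size(fits, start=1, limit=1024, safety=0.10):
    """Double the batch size while fits(b) is True, then back off by `safety`."""
    if not fits(start):
        raise ValueError("starting batch size does not fit")
    best = start
    # Keep doubling until the next size fails or exceeds the limit
    while best * 2 <= limit and fits(best * 2):
        best *= 2
    # Leave headroom below the largest size that worked
    return max(1, int(best * (1 - safety)))


# Stand-in probe: pretend anything up to 37 sequences fits in memory
print(find_batch_size(lambda b: b <= 37))  # 28
```

In practice `fits` would run one forward and backward pass at the given size and return False when the framework raises an out-of-memory error.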
| Scenario | Key Changes | Expected Performance |
|---|---|---|
| H100 80 GB (reference) | None — use defaults | val_bpb ~1.83, ~100 exp/night |
| A100 40 GB | n_embd=640, batch=4 | val_bpb ~1.86, ~95 exp/night |
| RTX 4090 24 GB | n_embd=512, block_size=512 | val_bpb ~1.90, ~90 exp/night |
| RTX 4080 16 GB | n_embd=384, block_size=512, batch=2 | val_bpb ~1.94, ~85 exp/night |
| Apple M3 Max | No FA3, MPS device, float32 | val_bpb ~1.96, ~40 exp/night |
| 4× H100 (DDP) | torchrun, lr×4, batch×4 | val_bpb ~1.78, ~100 exp/night |
| Multi-agent (3×) | Separate branches, separate machines | 3× experiments/night |
| AMD MI300X | ROCm PyTorch, hipBLASLt | val_bpb ~1.83 (comparable to H100) |
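
The experiments-per-night figures in the table follow from the fixed 300-second budget. The sketch below shows the arithmetic; the length of the night and the per-experiment overhead (startup, evaluation, git operations) are assumptions chosen for illustration, and `experiments_per_night` is a hypothetical helper.

```python
def experiments_per_night(night_hours, time_budget_s=300, overhead_s=0):
    """Whole experiments that fit in a night at a fixed per-run time budget."""
    return int(night_hours * 3600 // (time_budget_s + overhead_s))


print(experiments_per_night(8))                 # 96
print(experiments_per_night(9, overhead_s=30))  # 98
```

Slower hardware mostly lowers the quality reached per run rather than the run count, because every run is cut off at the same budget; the count drops only when per-run overhead grows.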
autoresearch distills an important insight about ML research: the bottleneck is not GPU compute — it is research iteration speed. By eliminating the human from the experiment loop, it turns a single GPU into a research engine that can explore 100 architectural hypotheses overnight.
The design principles that make this work are universal:
- Fix the evaluation (prepare.py is immutable)
- Fix the comparison unit (TIME_BUDGET=300s always)
- Use existing infrastructure (git for versioning, grep for parsing)
- Encode the protocol completely (program.md leaves no gaps)
- Prefer simplicity (the simplicity criterion shapes search)
These principles apply beyond autoresearch: any autonomous research agent benefits from clear evaluation metrics, comparable measurement units, minimal infrastructure, complete protocols, and a bias toward simplicity.
The ~70,000 GitHub stars suggest the community recognizes something genuine here: a minimum viable research agent that works, written in ~1000 lines of Python and one Markdown file.