Commit 72fb180

PR-K1.E-vast: GPU runner for NIAH validation harness
Stacked on PR-K1.E (#74). The K1.E Python harness already supports CUDA
via --device auto/cuda; this PR adds the vast.ai-side reviewer aid and
restores the generic vast provisioning script that PR-R1c introduced
(since closed, but the infrastructure stays useful).

Files:

scripts/review_pr_k1e_on_vast.sh (180 lines)

  vast.ai (CUDA) reviewer aid for the K1.E NIAH validation. Two modes:

  * Single-context (default): one context length per run, same
    signature as the Mac M4 reviewer (scripts/review_pr_k1e_on_mac.sh)
    so the JSON outputs are directly comparable.
  * Multi-context scan (MULTI_CONTEXT=1): evaluates the same
    configurations across a ladder of context lengths in one
    invocation. Default ladder: ~1k, ~4k, ~16k tokens. Custom ladder
    via CONTEXT_LADDER env. Produces one JSON per context length so
    downstream analysis can plot recall vs context for each verifier
    configuration.

  Acceptance signals, same as Mac:

  * v0.3 recall ~0.17 at 1k+ context (regression sanity vs the
    2026-06-06 A/B benchmark)
  * v0.4 recall close to oracle (within 5pp; ADR 0008 §11.8 gate (a))
  * v0.4 >> v0.3 (target >= +50pp)

  Time budget on H100 (80 GB):

  * 2k context, 30 samples, 3 configs: ~5-8 min
  * 4k: ~10-15 min
  * 16k: ~30-45 min
  * 64k: ~60-90 min
  * 100k: ~90-150 min

  The default multi-context scan (1k, 4k, 16k) takes ~45-60 min on
  H100. That is sufficient to validate the v0.4 ≥ 95% recall claim
  across the relevant range without going to the absolute scaling tail
  (100k requires ~10 GB just for the oracle KV cache, which rules out
  smaller GPUs).

scripts/research/run_on_vast.sh (carried from PR-R1c, made generic)

  Generic vast.ai-side Python runner: provisions a venv with CUDA
  torch + transformers, verifies the GPU is visible to torch, then
  invokes a configurable Python script with the forwarded arguments.

  Key changes vs the original PR-R1c version:

  * Hardcoded scripts/research/cross_attn_toy_prototype.py replaced
    with the KAKEYA_VAST_SCRIPT env var (the default is still the toy,
    for backward compatibility with the closed PR-R1c reviewer
    scripts; reviewer aids like K1.E export the env var to point at
    their own runner).
  * Removed the implicit --device auto append; the underlying scripts
    have their own device defaults, so the runner stays
    argument-agnostic.
  * Header reframed: no longer ADR-0011-specific; documented as
    reusable infrastructure.

Pre-flight (do once on the vast host):

    git fetch && git checkout main && git pull
    export HF_TOKEN=hf_xxx

Single-context run:

    bash scripts/review_pr_k1e_on_vast.sh

Multi-context scan:

    MULTI_CONTEXT=1 bash scripts/review_pr_k1e_on_vast.sh

Custom ladder:

    MULTI_CONTEXT=1 CONTEXT_LADDER='80 320 1280 5000' \
      bash scripts/review_pr_k1e_on_vast.sh

Stacking notes: the logical base is PR #74 (K1.E). After #74 lands on
main, this PR's diff shrinks to just these two files. Order of review:
#71 -> #72 -> #73 -> #74 -> this PR.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
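The two quantitative acceptance signals (v0.4 within 5pp of the oracle, and v0.4 at least +50pp above v0.3) can be checked mechanically once the three recall values have been read off a report. The following is a minimal sketch only: the helper name check_acceptance is hypothetical and not part of this commit, the recalls are assumed to be passed as fractions in [0, 1], and "within 5pp" is read here as an absolute difference. The v0.3 ≈ 0.17 sanity check is left out because the commit message does not state a tolerance for it.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the commit.
# Usage: check_acceptance ORACLE_RECALL V03_RECALL V04_RECALL
# Recalls are fractions in [0, 1]. Returns 0 when both gates hold.
check_acceptance() {
  local oracle="$1" v03="$2" v04="$3"
  awk -v o="$oracle" -v a="$v03" -v b="$v04" 'BEGIN {
    gap = (o - b) * 100            # oracle vs v0.4, percentage points
    if (gap < 0) gap = -gap        # assumption: "within" means absolute
    lift = (b - a) * 100           # v0.4 minus v0.3, percentage points
    near = (gap  <= 5.0  + 1e-9)   # gate: within 5pp of the oracle
    big  = (lift >= 50.0 - 1e-9)   # gate: at least +50pp over v0.3
    printf "oracle_gap_pp=%.1f lift_pp=%.1f\n", gap, lift
    exit !(near && big)
  }'
}

# Example with made-up recall values (not results from any run):
check_acceptance 1.0 0.17 0.97 && echo "acceptance gates hold"
```

The small epsilon keeps boundary cases such as exactly 5pp from failing because of floating-point rounding.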
1 parent 1162dd7 commit 72fb180

2 files changed

Lines changed: 359 additions & 0 deletions

scripts/research/run_on_vast.sh

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
#!/usr/bin/env bash
# Linux / NVIDIA (vast.ai) runner: generic GPU provisioning + Python
# script invocation for project research scripts. Originally written
# for the ADR 0011 toy (PR-R1c, since closed) but kept on main as
# reusable infrastructure: PR-K1.E and beyond use this same runner
# for their vast.ai-side reviewer aids.
#
# Compatibility: developed/validated on an H200 (compute capability
# 9.0, CUDA 13.0); also works on H100 / A100 / L4 / A10G with the
# same wheel channel (cu128 forward-compatible). Run on the vast
# host with the repo synced there.
#
# It is intentionally self-contained and idempotent:
#
#   1. Creates / reuses a venv at .venv-vast.
#   2. Installs a CUDA-enabled torch + transformers stack (pinned to the
#      project's transformers 4.x line; see requirements.txt).
#   3. Verifies the GPU is visible to torch.
#   4. Runs the script named by KAKEYA_VAST_SCRIPT (default:
#      scripts/research/cross_attn_toy_prototype.py) once, forwarding
#      every argument except --setup-only straight through to it.
#
# The toy's default model (google/gemma-3-1b-it) is gated on HuggingFace.
# Export HF_TOKEN (or HUGGING_FACE_HUB_TOKEN) before running; the script
# refuses to start without one rather than failing 401 mid-download
# (ADR 0008 §6.2: no silent fallback).
#
# Usage (run ON the vast host, repo synced there):
#
#   # one full run, defaults (2000 steps, capacity-bumped):
#   HF_TOKEN=hf_xxx bash scripts/research/run_on_vast.sh \
#       --output results/research/cross_attn_toy_vast_full.json
#
#   # just provision the venv (used by review_pr_r1c_on_vast.sh before
#   # it launches two runs in parallel):
#   HF_TOKEN=hf_xxx bash scripts/research/run_on_vast.sh --setup-only

set -euo pipefail

repo_root="$(cd "$(dirname "$0")/../.." && pwd)"
cd "$repo_root"
venv_dir="${repo_root}/.venv-vast"

# Default torch CUDA wheel channel. cu128/cu126 wheels run fine against
# newer drivers (forward-compatible); override with KAKEYA_TORCH_INDEX
# if the host needs a different channel.
TORCH_INDEX="${KAKEYA_TORCH_INDEX:-https://download.pytorch.org/whl/cu128}"

log() { echo "[run_on_vast] $*" >&2; }

ensure_token() {
  if [[ -z "${HF_TOKEN:-}" && -n "${HUGGING_FACE_HUB_TOKEN:-}" ]]; then
    export HF_TOKEN="$HUGGING_FACE_HUB_TOKEN"
  fi
  if [[ -z "${HF_TOKEN:-}" ]]; then
    cat >&2 <<'EOF'
[run_on_vast] HF_TOKEN is not set, but the toy's default model
[run_on_vast] (google/gemma-3-1b-it) is GATED on HuggingFace. Export a
[run_on_vast] token that has accepted the Gemma license:
[run_on_vast] export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxx
[run_on_vast] then re-run. (ADR 0008 §6.2 forbids silent fallbacks.)
EOF
    exit 4
  fi
  export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
}

ensure_gpu_present() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    log "nvidia-smi not found — this script targets a CUDA GPU host."
    exit 1
  fi
  nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap \
    --format=csv,noheader >&2
}

pick_python() {
  for cmd in python3.12 python3.11 python3.13 python3.10 python3; do
    if command -v "$cmd" >/dev/null 2>&1; then echo "$cmd"; return 0; fi
  done
  log "no compatible Python (3.10-3.13) found"; exit 1
}

ensure_venv() {
  local py="$1"
  if [[ ! -d "$venv_dir" ]]; then
    log "creating venv at $venv_dir using $py"
    "$py" -m venv "$venv_dir"
  else
    log "reusing venv at $venv_dir"
  fi
  # shellcheck disable=SC1091
  source "$venv_dir/bin/activate"
  python -m pip install --upgrade pip --quiet
}

install_stack() {
  if python -c "import torch" 2>/dev/null && \
     python -c "import transformers" 2>/dev/null; then
    log "torch + transformers already importable; skipping install"
    return 0
  fi
  log "installing CUDA torch from $TORCH_INDEX"
  pip install --quiet "torch>=2.4,<3.0" --index-url "$TORCH_INDEX"
  log "installing transformers/accelerate stack (4.x pin)"
  pip install --quiet \
    "transformers>=4.45,<5.0" \
    "accelerate>=0.34" \
    "safetensors>=0.4" \
    "huggingface_hub>=0.24" \
    "numpy>=1.26"
}

verify_torch_cuda() {
  python - <<'PY'
import sys
import torch
ok = torch.cuda.is_available()
print(f"[run_on_vast] torch={torch.__version__} cuda_available={ok} "
      f"cuda={torch.version.cuda}", file=sys.stderr)
if ok:
    print(f"[run_on_vast] device0={torch.cuda.get_device_name(0)}",
          file=sys.stderr)
else:
    # This is a hard failure (exit 5), not a warning: the runner does
    # not continue on CPU.
    print("[run_on_vast] ERROR: torch cannot see the GPU; refusing to "
          "fall back to CPU (it would be extremely slow).", file=sys.stderr)
    sys.exit(5)
import transformers
print(f"[run_on_vast] transformers={transformers.__version__}",
      file=sys.stderr)
PY
}

provision() {
  ensure_gpu_present
  local py; py="$(pick_python)"
  ensure_venv "$py"
  install_stack
  verify_torch_cuda
}

main() {
  ensure_token

  local setup_only=0
  local fwd=()
  for arg in "$@"; do
    if [[ "$arg" == "--setup-only" ]]; then
      setup_only=1
    else
      fwd+=("$arg")
    fi
  done

  provision

  if [[ "$setup_only" == "1" ]]; then
    log "setup-only complete; venv ready at $venv_dir"
    return 0
  fi

  # Pick the Python script to run. KAKEYA_VAST_SCRIPT env var is
  # the explicit override; reviewer aids set it to point at their
  # own runner (e.g., scripts/research/k1e_niah_validation.py).
  # The default keeps backward compatibility with the original
  # ADR 0011 toy reviewer scripts (PR-R1c, since closed) so
  # historical reproducibility is preserved.
  local script="${KAKEYA_VAST_SCRIPT:-scripts/research/cross_attn_toy_prototype.py}"
  if [[ ! -f "$script" ]]; then
    log "script $script not found in repo (cwd=$PWD); set "
    log "KAKEYA_VAST_SCRIPT=path/to/your_script.py to override"
    exit 6
  fi

  log "launching $script: ${fwd[*]:-<defaults>}"
  # ${fwd[@]+...} expands to nothing when fwd is empty; a bare
  # "${fwd[@]}" trips `set -u` on bash older than 4.4.
  PYTHONPATH=".:sdks/python" python "$script" \
    ${fwd[@]+"${fwd[@]}"}
}

main "$@"
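The flag handling in main() can be exercised without a GPU or a venv. What follows is a stand-alone sketch under stated assumptions, not the runner itself: split_args, SETUP_ONLY and FWD are illustrative names that do not appear in the commit. It shows that --setup-only is consumed by the runner while every other argument is forwarded unchanged, including arguments that contain spaces.

```bash
#!/usr/bin/env bash
set -u

# Hypothetical stand-alone copy of the flag split performed in main().
# Sets the globals SETUP_ONLY (0 or 1) and FWD (array of forwarded args).
split_args() {
  SETUP_ONLY=0
  FWD=()
  local arg
  for arg in "$@"; do
    if [[ "$arg" == "--setup-only" ]]; then
      SETUP_ONLY=1          # consumed by the runner, never forwarded
    else
      FWD+=("$arg")         # forwarded verbatim to the Python script
    fi
  done
}

split_args --setup-only --output "results/my report.json"
printf 'setup_only=%s nargs=%s\n' "$SETUP_ONLY" "${#FWD[@]}"
# ${FWD[@]+"${FWD[@]}"} expands to nothing when FWD is empty, which avoids
# an "unbound variable" error under `set -u` on bash older than 4.4.
printf '<%s>\n' ${FWD[@]+"${FWD[@]}"}
```

Because the array is expanded with quoted "${FWD[@]}" semantics, the path containing a space stays a single argument.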

scripts/review_pr_k1e_on_vast.sh

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
#!/usr/bin/env bash
# vast.ai (CUDA) reviewer aid for PR-K1.E: GPU acceleration of the
# NIAH validation harness.
#
# Same K1.E harness as the Mac M4 reviewer
# (scripts/review_pr_k1e_on_mac.sh), but routed through the existing
# vast provisioning machinery (scripts/research/run_on_vast.sh) and
# tuned for CUDA-class hardware. Two modes:
#
#   * Single-context (default): evaluate one context length per run.
#     Useful for fast iteration during development.
#
#   * Multi-context scan (MULTI_CONTEXT=1): evaluate the same model
#     and configurations across several context lengths in one
#     invocation, producing a recall-vs-context-length curve. This
#     is the form that empirically validates the ADR 0008 §11.8
#     gate (a) target ("≥ 95 % at 100 k") AND demonstrates how v0.4
#     scales relative to v0.3 sink+window AND the full-attention
#     oracle.
#
# Time budget on a vast.ai NVIDIA H100 (80 GB):
#
#   * 2 k context, 30 samples, all 3 configs:    ~5-8 min
#   * 4 k context, 30 samples, all 3 configs:    ~10-15 min
#   * 16 k context, 30 samples, all 3 configs:   ~30-45 min
#   * 64 k context, 20 samples:                  ~60-90 min
#   * 100 k context, 20 samples:                 ~90-150 min
#
# Multi-context scan default (1 k → 4 k → 16 k) runs in ~45-60 min.
# Default targets H100; on A100 80 GB add ~50-100 % for compute-bound
# v0.4 forwards. Smaller GPUs (A10G 24 GB) cap out around 16 k tokens
# for the oracle config but can still run v0.4 at any size (sustained
# memory is constant in context length by design).
#
# Acceptance signals, same as the Mac reviewer:
#
#   * v0.3 recall ≈ 0.17 at 1 k+ context (matches the
#     2026-06-06 A/B benchmark; sanity that the regression
#     reproduces)
#   * v0.4 recall close to oracle (within 5 pp; ADR 0008 §11.8
#     gate (a) at the run's context length)
#   * v0.4 ≫ v0.3 (target ≥ +50 pp; ADR 0008 §11.5 §"Five
#     properties" item 2: intelligence approximates full attention)
#
# Usage:
#
#   # Setup: vast instance must be running, repo synced, HF_TOKEN exported
#   HF_TOKEN=hf_xxx bash scripts/review_pr_k1e_on_vast.sh
#
#   # Larger single-context run:
#   HAYSTACK_MIN=900 HAYSTACK_MAX=1100 N_SAMPLES=30 \
#       bash scripts/review_pr_k1e_on_vast.sh
#
#   # Multi-context scan with default ladder (70, 280, 1100 lines
#   # ≈ 1k, 4k, 16k tokens):
#   MULTI_CONTEXT=1 bash scripts/review_pr_k1e_on_vast.sh
#
#   # Custom multi-context scan (lines per context; line ≈ 14 tokens):
#   MULTI_CONTEXT=1 \
#       CONTEXT_LADDER='80 320 1280 5000' \
#       bash scripts/review_pr_k1e_on_vast.sh
#
# Env knobs:
#
#   N_SAMPLES       (default 30)  samples per (config, context length)
#   HAYSTACK_MIN    (default 60)  single-context: min padding-line count
#   HAYSTACK_MAX    (default 80)  single-context: max padding-line count
#   SINK            (default 4)
#   WINDOW          (default 64)
#   MAX_NEW_TOKENS  (default 24)
#   SEED            (default 42)
#   SKIP_V03=1      skip the v0.3 baseline
#   SKIP_V04=1      skip v0.4 (oracle-only smoke)
#   SKIP_ORACLE=1   skip the oracle (not recommended)
#   MULTI_CONTEXT=1 enable multi-context scan
#   CONTEXT_LADDER  (default '70 280 1100'; only used when MULTI_CONTEXT=1)
#                   space-separated padding-line counts, e.g.
#                   '40 80 320 1280'; each entry n yields a haystack
#                   range of [n × 0.85, n × 1.15] for variability.

set -euo pipefail

ROOT="$(cd "$(dirname "$0")/.." && pwd)"
cd "$ROOT"

N_SAMPLES="${N_SAMPLES:-30}"
HAYSTACK_MIN="${HAYSTACK_MIN:-60}"
HAYSTACK_MAX="${HAYSTACK_MAX:-80}"
SINK="${SINK:-4}"
WINDOW="${WINDOW:-64}"
MAX_NEW_TOKENS="${MAX_NEW_TOKENS:-24}"
SEED="${SEED:-42}"
SKIP_V03="${SKIP_V03:-0}"
SKIP_V04="${SKIP_V04:-0}"
SKIP_ORACLE="${SKIP_ORACLE:-0}"
MULTI_CONTEXT="${MULTI_CONTEXT:-0}"
# Default ladder: ~1k, ~4k, ~16k tokens (line ≈ 14 tokens)
CONTEXT_LADDER="${CONTEXT_LADDER:-70 280 1100}"

stamp="$(date +%s)"
out_dir="results/research"
log_dir="${out_dir}/logs"
mkdir -p "$out_dir" "$log_dir"

flags_common=(
  --model google/gemma-3-1b-it
  --device cuda
  --n-samples "$N_SAMPLES"
  --sink-size "$SINK"
  --window-size "$WINDOW"
  --max-new-tokens "$MAX_NEW_TOKENS"
  --seed "$SEED"
)
[[ "$SKIP_V03" == "1" ]] && flags_common+=(--skip-v03)
[[ "$SKIP_V04" == "1" ]] && flags_common+=(--skip-v04)
[[ "$SKIP_ORACLE" == "1" ]] && flags_common+=(--skip-oracle)

# Tell the generic vast runner which Python script to invoke.
export KAKEYA_VAST_SCRIPT="scripts/research/k1e_niah_validation.py"

# Provision venv ONCE before any runs.
echo "==> provisioning venv (one-time)"
bash scripts/research/run_on_vast.sh --setup-only

run_one() {
  local label="$1"; local lo="$2"; local hi="$3"
  local report="${out_dir}/k1e_niah_vast_${label}_${stamp}.json"
  local log="${log_dir}/k1e_niah_vast_${label}_${stamp}.log"
  echo
  echo "==> Run $label: haystack lines [$lo, $hi]"
  echo " Report: $report"
  echo " Log: $log"
  bash scripts/research/run_on_vast.sh \
    "${flags_common[@]}" \
    --haystack-min-lines "$lo" \
    --haystack-max-lines "$hi" \
    --output "$report" \
    2>&1 | tee "$log"
  echo " -> finished $label"
}

142+
if [[ "$MULTI_CONTEXT" == "1" ]]; then
143+
echo "==> PR-K1.E NIAH validation — vast.ai CUDA, multi-context scan"
144+
echo " Model: google/gemma-3-1b-it"
145+
echo " Samples each: $N_SAMPLES"
146+
echo " Sink x window: ${SINK} x ${WINDOW}"
147+
echo " Context ladder (padding lines): $CONTEXT_LADDER"
148+
echo " Configs: oracle + v0.3 + v0.4 (modulo skip flags)"
149+
echo
150+
151+
for n in $CONTEXT_LADDER; do
152+
# ±15 % range around target line count
153+
lo=$(( (n * 85 + 50) / 100 ))
154+
hi=$(( (n * 115 + 50) / 100 ))
155+
if [[ $lo -lt 10 ]]; then lo=10; fi
156+
if [[ $hi -lt $((lo + 1)) ]]; then hi=$((lo + 1)); fi
157+
run_one "ctx${n}" "$lo" "$hi"
158+
done
159+
160+
echo
161+
echo "==> Multi-context scan complete. Reports under:"
162+
echo " $out_dir/k1e_niah_vast_ctx*_${stamp}.json"
163+
echo " $log_dir/k1e_niah_vast_ctx*_${stamp}.log"
164+
else
165+
echo "==> PR-K1.E NIAH validation — vast.ai CUDA, single-context"
166+
echo " Model: google/gemma-3-1b-it"
167+
echo " Samples: $N_SAMPLES"
168+
echo " Haystack: [$HAYSTACK_MIN, $HAYSTACK_MAX] lines"
169+
echo " Sink x window: ${SINK} x ${WINDOW}"
170+
echo " Configs: oracle + v0.3 + v0.4 (modulo skip flags)"
171+
echo
172+
173+
run_one "single" "$HAYSTACK_MIN" "$HAYSTACK_MAX"
174+
fi
175+
176+
echo
177+
echo "Commit:"
178+
echo " git add $out_dir/k1e_niah_vast_*_${stamp}.json $log_dir/k1e_niah_vast_*_${stamp}.log"
179+
echo " git commit -m 'vast H100/A100 K1.E NIAH validation evidence'"
180+
echo " git push"
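To see what a given CONTEXT_LADDER entry turns into before spending GPU time, the ±15 % range arithmetic from the multi-context loop can be run on its own. This is a sketch, not part of the commit: ladder_range and TOKENS_PER_LINE are illustrative names, and the token figures rely on the script header's rough estimate of about 14 tokens per padding line.

```bash
#!/usr/bin/env bash
# Stand-alone sketch of the ladder arithmetic; names are hypothetical.
TOKENS_PER_LINE=14   # rough estimate taken from the script header

# ladder_range N prints "LO HI": the haystack line range for ladder entry N.
ladder_range() {
  local n="$1" lo hi
  lo=$(( (n * 85 + 50) / 100 ))     # n * 0.85, rounded to nearest integer
  hi=$(( (n * 115 + 50) / 100 ))    # n * 1.15, rounded to nearest integer
  if (( lo < 10 )); then lo=10; fi                   # floor of 10 lines
  if (( hi < lo + 1 )); then hi=$(( lo + 1 )); fi    # keep hi above lo
  echo "$lo $hi"
}

for n in 70 280 1100; do
  read -r lo hi <<<"$(ladder_range "$n")"
  echo "ctx${n}: lines [$lo, $hi], ~$(( n * TOKENS_PER_LINE )) tokens"
done
# prints:
#   ctx70: lines [60, 81], ~980 tokens
#   ctx280: lines [238, 322], ~3920 tokens
#   ctx1100: lines [935, 1265], ~15400 tokens
```

The default ladder therefore lands close to the advertised 1k, 4k and 16k token targets, and very small entries are clamped to a minimum range of [10, 11] lines.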
