PR-K1.E-vast: GPU runner for NIAH validation harness

cursoragent · FluffyAIcode · cursoragent · commit 72fb180cf865 · 2026-06-08T09:50:48.000Z
Stacked on PR-K1.E (#74). The K1.E Python harness already supports CUDA via --device auto/cuda; this PR adds the vast.ai-side reviewer aid and restores the generic vast provisioning script that PR-R1c introduced (PR-R1c has since been closed, but the infrastructure stays useful).

Files:

scripts/review_pr_k1e_on_vast.sh (180 lines)

vast.ai (CUDA) reviewer aid for the K1.E NIAH validation. Two modes:

* Single-context (default): one context length per run, same signature as the Mac M4 reviewer (scripts/review_pr_k1e_on_mac.sh) so the JSON outputs are directly comparable.
* Multi-context scan (MULTI_CONTEXT=1): evaluates the same configurations across a ladder of context lengths in one invocation. Default ladder: ~1k, ~4k, ~16k tokens. Custom ladder via the CONTEXT_LADDER env var. Produces one JSON per context length so downstream analysis can plot recall vs context for each verifier configuration.

Acceptance signals (same as Mac):

* v0.3 recall ~0.17 at 1k+ context (regression sanity vs the 2026-06-06 A/B benchmark)
* v0.4 recall close to oracle (within 5pp; ADR 0008 §11.8 gate (a))
* v0.4 >> v0.3 (target >= +50pp)

Time budget on H100 (80 GB):

* 2k context, 30 samples, 3 configs: ~5-8 min
* 4k: ~10-15 min
* 16k: ~30-45 min
* 64k: ~60-90 min
* 100k: ~90-150 min

The default multi-context scan (1k, 4k, 16k) takes ~45-60 min on H100. That is sufficient to validate the v0.4 ≥ 95% recall claim across the relevant range without going to the absolute scaling tail (100k requires ~10 GB just for the oracle KV cache, which rules out smaller GPUs).

scripts/research/run_on_vast.sh (carried from PR-R1c, made generic)

Generic vast.ai-side Python runner: provisions a venv with CUDA torch + transformers, verifies the GPU is visible to torch, then invokes a configurable Python script with the forwarded arguments.

Key changes vs the original PR-R1c version:

* Hardcoded scripts/research/cross_attn_toy_prototype.py replaced with the KAKEYA_VAST_SCRIPT env var (default still the toy for backward compat with closed PR-R1c reviewer scripts; reviewer aids like K1.E export the env to point at their own runner).
* Removed the implicit --device auto append; the underlying scripts have their own device defaults so the runner stays argument-agnostic.
* Header reframed: no longer ADR-0011-specific; documented as reusable infrastructure.

Pre-flight (do once on the vast host):

    git fetch && git checkout main && git pull
    export HF_TOKEN=hf_xxx

Single-context run:

    bash scripts/review_pr_k1e_on_vast.sh

Multi-context scan:

    MULTI_CONTEXT=1 bash scripts/review_pr_k1e_on_vast.sh

Custom ladder:

    MULTI_CONTEXT=1 CONTEXT_LADDER='80 320 1280 5000' \
        bash scripts/review_pr_k1e_on_vast.sh

Stacking notes: logical base is PR #74 (K1.E). After #74 lands on main, this PR's diff shrinks to just these two files. Order of review: #71 -> #72 -> #73 -> #74 -> this PR.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
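
The two relative acceptance signals in the commit message (v0.4 within 5pp of the oracle, and v0.4 at least 50pp above v0.3) reduce to integer comparisons in percentage points. A minimal sketch, assuming the recall values have already been read out of the report JSONs by hand; the report schema is not shown in this PR, and the helper name `check_gates` and its argument order are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): check the two relative
# acceptance signals. Arguments are recall values in whole percentage
# points (0-100), in the order: oracle, v0.3, v0.4.
check_gates() {
    local oracle="$1" v03="$2" v04="$3"
    local status=0
    # Gate (a): v0.4 must be within 5 pp of the full-attention oracle.
    if (( oracle - v04 > 5 )); then
        echo "FAIL: v0.4 is $(( oracle - v04 )) pp below oracle (limit 5)"
        status=1
    fi
    # Improvement target: v0.4 at least 50 pp above the v0.3 baseline.
    if (( v04 - v03 < 50 )); then
        echo "FAIL: v0.4 - v0.3 = $(( v04 - v03 )) pp (target >= 50)"
        status=1
    fi
    if (( status == 0 )); then echo "PASS"; fi
    return "$status"
}
```

For example, `check_gates 100 17 97` prints `PASS`, while `check_gates 100 17 60` reports both failures and returns a non-zero status.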
diff --git a/scripts/research/run_on_vast.sh b/scripts/research/run_on_vast.sh
@@ -0,0 +1,179 @@
+#!/usr/bin/env bash
+# Linux / NVIDIA (vast.ai) runner — generic GPU provisioning + Python
+# script invocation for project research scripts. Originally written
+# for the ADR 0011 toy (PR-R1c, since closed) but kept on main as
+# reusable infrastructure: PR-K1.E and beyond use this same runner
+# for their vast.ai-side reviewer aids.
+#
+# Compatibility: developed/validated on an H200 (compute capability
+# 9.0, CUDA 13.0); also works on H100 / A100 / L4 / A10G with the
+# same wheel channel (cu128 forward-compatible). Run on the vast
+# host with the repo synced there.
+#
+# It is intentionally self-contained and idempotent:
+#
+#   1. Creates / reuses a venv at .venv-vast.
+#   2. Installs a CUDA-enabled torch + transformers stack (pinned to the
+#      project's transformers 4.x line — see requirements.txt).
+#   3. Verifies the GPU is visible to torch.
+#   4. Runs the script named by KAKEYA_VAST_SCRIPT (default:
+#      scripts/research/cross_attn_toy_prototype.py) once, forwarding
+#      every argument except --setup-only straight through to it.
+#
+# The toy's default model (google/gemma-3-1b-it) is gated on HuggingFace.
+# Export HF_TOKEN (or HUGGING_FACE_HUB_TOKEN) before running; the script
+# refuses to start without one rather than failing 401 mid-download
+# (ADR 0008 §6.2: no silent fallback).
+#
+# Usage (run ON the vast host, repo synced there):
+#
+#   # one full run, defaults (2000 steps, capacity-bumped):
+#   HF_TOKEN=hf_xxx bash scripts/research/run_on_vast.sh \
+#       --output results/research/cross_attn_toy_vast_full.json
+#
+#   # just provision the venv (reviewer aids such as
+#   # review_pr_k1e_on_vast.sh do this once before launching runs):
+#   HF_TOKEN=hf_xxx bash scripts/research/run_on_vast.sh --setup-only
+
+set -euo pipefail
+
+repo_root="$(cd "$(dirname "$0")/../.." && pwd)"
+cd "$repo_root"
+venv_dir="${repo_root}/.venv-vast"
+
+# Default torch CUDA wheel channel. cu128/cu126 wheels run fine against
+# newer drivers (forward-compatible); override with KAKEYA_TORCH_INDEX
+# if the host needs a different channel.
+TORCH_INDEX="${KAKEYA_TORCH_INDEX:-https://download.pytorch.org/whl/cu128}"
+
+log() { echo "[run_on_vast] $*" >&2; }
+
+ensure_token() {
+    if [[ -z "${HF_TOKEN:-}" && -n "${HUGGING_FACE_HUB_TOKEN:-}" ]]; then
+        export HF_TOKEN="$HUGGING_FACE_HUB_TOKEN"
+    fi
+    if [[ -z "${HF_TOKEN:-}" ]]; then
+        cat >&2 <<'EOF'
+[run_on_vast] HF_TOKEN is not set, but the toy's default model
+[run_on_vast] (google/gemma-3-1b-it) is GATED on HuggingFace. Export a
+[run_on_vast] token that has accepted the Gemma license:
+[run_on_vast]     export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxx
+[run_on_vast] then re-run. (ADR 0008 §6.2 forbids silent fallbacks.)
+EOF
+        exit 4
+    fi
+    export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
+}
+
+ensure_gpu_present() {
+    if ! command -v nvidia-smi >/dev/null 2>&1; then
+        log "nvidia-smi not found — this script targets a CUDA GPU host."
+        exit 1
+    fi
+    nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap \
+        --format=csv,noheader >&2
+}
+
+pick_python() {
+    for cmd in python3.12 python3.11 python3.13 python3.10 python3; do
+        if command -v "$cmd" >/dev/null 2>&1; then echo "$cmd"; return 0; fi
+    done
+    log "no compatible Python (3.10-3.13) found"; exit 1
+}
+
+ensure_venv() {
+    local py="$1"
+    if [[ ! -d "$venv_dir" ]]; then
+        log "creating venv at $venv_dir using $py"
+        "$py" -m venv "$venv_dir"
+    else
+        log "reusing venv at $venv_dir"
+    fi
+    # shellcheck disable=SC1091
+    source "$venv_dir/bin/activate"
+    python -m pip install --upgrade pip --quiet
+}
+
+install_stack() {
+    if python -c "import torch" 2>/dev/null && \
+       python -c "import transformers" 2>/dev/null; then
+        log "torch + transformers already importable; skipping install"
+        return 0
+    fi
+    log "installing CUDA torch from $TORCH_INDEX"
+    pip install --quiet "torch>=2.4,<3.0" --index-url "$TORCH_INDEX"
+    log "installing transformers/accelerate stack (4.x pin)"
+    pip install --quiet \
+        "transformers>=4.45,<5.0" \
+        "accelerate>=0.34" \
+        "safetensors>=0.4" \
+        "huggingface_hub>=0.24" \
+        "numpy>=1.26"
+}
+
+verify_torch_cuda() {
+    python - <<'PY'
+import sys
+import torch
+ok = torch.cuda.is_available()
+print(f"[run_on_vast] torch={torch.__version__} cuda_available={ok} "
+      f"cuda={torch.version.cuda}", file=sys.stderr)
+if ok:
+    print(f"[run_on_vast] device0={torch.cuda.get_device_name(0)}",
+          file=sys.stderr)
+else:
+    print("[run_on_vast] ERROR: torch cannot see the GPU; aborting "
+          "instead of falling back to CPU.", file=sys.stderr)
+    sys.exit(5)
+import transformers
+print(f"[run_on_vast] transformers={transformers.__version__}",
+      file=sys.stderr)
+PY
+}
+
+provision() {
+    ensure_gpu_present
+    local py; py="$(pick_python)"
+    ensure_venv "$py"
+    install_stack
+    verify_torch_cuda
+}
+
+main() {
+    ensure_token
+
+    local setup_only=0
+    local fwd=()
+    for arg in "$@"; do
+        if [[ "$arg" == "--setup-only" ]]; then
+            setup_only=1
+        else
+            fwd+=("$arg")
+        fi
+    done
+
+    provision
+
+    if [[ "$setup_only" == "1" ]]; then
+        log "setup-only complete; venv ready at $venv_dir"
+        return 0
+    fi
+
+    # Pick the Python script to run. KAKEYA_VAST_SCRIPT env var is
+    # the explicit override; reviewer aids set it to point at their
+    # own runner (e.g., scripts/research/k1e_niah_validation.py).
+    # The default keeps backward compatibility with the original
+    # ADR 0011 toy reviewer scripts (PR-R1c, since closed) so
+    # historical reproducibility is preserved.
+    local script="${KAKEYA_VAST_SCRIPT:-scripts/research/cross_attn_toy_prototype.py}"
+    if [[ ! -f "$script" ]]; then
+        log "script $script not found in repo (cwd=$PWD);"
+        log "set KAKEYA_VAST_SCRIPT=path/to/your_script.py to override"
+        exit 6
+    fi
+
+    log "launching $script: ${fwd[*]:-<defaults>}"
+    # ${fwd[@]+...} keeps an empty array safe under `set -u` on bash < 4.4.
+    PYTHONPATH=".:sdks/python" python "$script" \
+        ${fwd[@]+"${fwd[@]}"}
+}
+
+main "$@"
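
The runner's argument handling can be exercised without a GPU. The sketch below mirrors the loop in `main`: it strips `--setup-only`, keeps every other argument in order, and resolves the target script from `KAKEYA_VAST_SCRIPT`. The function name `plan_run` and its output format are illustrative only and are not part of the script.

```shell
#!/usr/bin/env bash
# Illustrative dry run of run_on_vast.sh's argument handling.
# Prints what the runner would do instead of provisioning or launching.
plan_run() {
    local setup_only=0
    local fwd=()
    local arg
    for arg in "$@"; do
        if [[ "$arg" == "--setup-only" ]]; then
            setup_only=1          # consumed by the runner, never forwarded
        else
            fwd+=("$arg")         # everything else passes through untouched
        fi
    done
    # Same default as the runner: the ADR 0011 toy unless overridden.
    local script="${KAKEYA_VAST_SCRIPT:-scripts/research/cross_attn_toy_prototype.py}"
    echo "setup_only=$setup_only"
    echo "script=$script"
    echo "forwarded=${fwd[*]:-<none>}"
}
```

Calling `KAKEYA_VAST_SCRIPT=scripts/research/k1e_niah_validation.py plan_run --setup-only --seed 42` reports `setup_only=1`, the K1.E script path, and `forwarded=--seed 42`. With the variable unset, the script line falls back to the toy prototype.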
diff --git a/scripts/review_pr_k1e_on_vast.sh b/scripts/review_pr_k1e_on_vast.sh
@@ -0,0 +1,180 @@
+#!/usr/bin/env bash
+# vast.ai (CUDA) reviewer aid for PR-K1.E — GPU acceleration of the
+# NIAH validation harness.
+#
+# Same K1.E harness as the Mac M4 reviewer
+# (scripts/review_pr_k1e_on_mac.sh), but routed through the existing
+# vast provisioning machinery (scripts/research/run_on_vast.sh) and
+# tuned for CUDA-class hardware. Two modes:
+#
+#   * Single-context (default): evaluate one context length per run.
+#     Useful for fast iteration during development.
+#
+#   * Multi-context scan (MULTI_CONTEXT=1): evaluate the same model
+#     and configurations across several context lengths in one
+#     invocation, producing a recall-vs-context-length curve. This
+#     is the form that empirically validates the ADR 0008 §11.8
+#     gate (a) target ("≥ 95 % at 100 k") AND demonstrates how v0.4
+#     scales relative to v0.3 sink+window AND the full-attention
+#     oracle.
+#
+# Time budget on a vast.ai NVIDIA H100 (80 GB):
+#
+#   * 2 k context, 30 samples, all 3 configs: ~5-8 min
+#   * 4 k context, 30 samples, all 3 configs: ~10-15 min
+#   * 16 k context, 30 samples, all 3 configs: ~30-45 min
+#   * 64 k context, 20 samples: ~60-90 min
+#   * 100 k context, 20 samples: ~90-150 min
+#
+# Multi-context scan default (1 k → 4 k → 16 k) runs in ~45-60 min.
+# Default targets H100; on A100 80 GB add ~50-100 % for compute-bound
+# v0.4 forwards. Smaller GPUs (A10G 24 GB) cap out around 16 k tokens
+# for the oracle config but can still run v0.4 at any size (sustained
+# memory is constant in context length by design).
+#
+# Acceptance signals — same as the Mac reviewer:
+#
+#   * v0.3 recall ≈ 0.17 at 1 k+ context (matches the
+#     2026-06-06 A/B benchmark; sanity that the regression
+#     reproduces)
+#   * v0.4 recall close to oracle (within 5 pp; ADR 0008 §11.8
+#     gate (a) at the run's context length)
+#   * v0.4 ≫ v0.3 (target ≥ +50 pp; ADR 0008 §11.5 §"Five
+#     properties" item 2 — intelligence approximates full attention)
+#
+# Usage:
+#
+#     # Setup: vast instance must be running, repo synced, HF_TOKEN exported
+#     HF_TOKEN=hf_xxx bash scripts/review_pr_k1e_on_vast.sh
+#
+#     # Larger single-context run:
+#     HAYSTACK_MIN=900 HAYSTACK_MAX=1100 N_SAMPLES=30 \
+#         bash scripts/review_pr_k1e_on_vast.sh
+#
+#     # Multi-context scan with default ladder (70, 280, 1100 lines
+#     #  ≈ 1k, 4k, 16k tokens):
+#     MULTI_CONTEXT=1 bash scripts/review_pr_k1e_on_vast.sh
+#
+#     # Custom multi-context scan (lines per context — line ≈ 14 tokens):
+#     MULTI_CONTEXT=1 \
+#     CONTEXT_LADDER='80 320 1280 5000' \
+#         bash scripts/review_pr_k1e_on_vast.sh
+#
+# Env knobs:
+#
+#   N_SAMPLES         (default 30)   samples per (config, context length)
+#   HAYSTACK_MIN      (default 60)   single-context: min padding-line count
+#   HAYSTACK_MAX      (default 80)   single-context: max padding-line count
+#   SINK              (default 4)
+#   WINDOW            (default 64)
+#   MAX_NEW_TOKENS    (default 24)
+#   SEED              (default 42)
+#   SKIP_V03=1                       skip the v0.3 baseline
+#   SKIP_V04=1                       skip v0.4 (oracle-only smoke)
+#   SKIP_ORACLE=1                    skip the oracle (not recommended)
+#   MULTI_CONTEXT=1                  enable multi-context scan
+#   CONTEXT_LADDER    (default '70 280 1100'; only used when MULTI_CONTEXT=1)
+#                                    space-separated padding-line counts;
+#                                    each entry yields a haystack range of
+#                                    [n × 0.85, n × 1.15] for variability.
+
+set -euo pipefail
+
+ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+cd "$ROOT"
+
+N_SAMPLES="${N_SAMPLES:-30}"
+HAYSTACK_MIN="${HAYSTACK_MIN:-60}"
+HAYSTACK_MAX="${HAYSTACK_MAX:-80}"
+SINK="${SINK:-4}"
+WINDOW="${WINDOW:-64}"
+MAX_NEW_TOKENS="${MAX_NEW_TOKENS:-24}"
+SEED="${SEED:-42}"
+SKIP_V03="${SKIP_V03:-0}"
+SKIP_V04="${SKIP_V04:-0}"
+SKIP_ORACLE="${SKIP_ORACLE:-0}"
+MULTI_CONTEXT="${MULTI_CONTEXT:-0}"
+# Default ladder: ~1k, ~4k, ~16k tokens (line ≈ 14 tokens)
+CONTEXT_LADDER="${CONTEXT_LADDER:-70 280 1100}"
+
+stamp="$(date +%s)"
+out_dir="results/research"
+log_dir="${out_dir}/logs"
+mkdir -p "$out_dir" "$log_dir"
+
+flags_common=(
+    --model google/gemma-3-1b-it
+    --device cuda
+    --n-samples "$N_SAMPLES"
+    --sink-size "$SINK"
+    --window-size "$WINDOW"
+    --max-new-tokens "$MAX_NEW_TOKENS"
+    --seed "$SEED"
+)
+[[ "$SKIP_V03"    == "1" ]] && flags_common+=(--skip-v03)
+[[ "$SKIP_V04"    == "1" ]] && flags_common+=(--skip-v04)
+[[ "$SKIP_ORACLE" == "1" ]] && flags_common+=(--skip-oracle)
+
+# Tell the generic vast runner which Python script to invoke.
+export KAKEYA_VAST_SCRIPT="scripts/research/k1e_niah_validation.py"
+
+# Provision venv ONCE before any runs.
+echo "==> provisioning venv (one-time)"
+bash scripts/research/run_on_vast.sh --setup-only
+
+run_one() {
+    local label="$1"; local lo="$2"; local hi="$3"
+    local report="${out_dir}/k1e_niah_vast_${label}_${stamp}.json"
+    local log="${log_dir}/k1e_niah_vast_${label}_${stamp}.log"
+    echo
+    echo "==> Run $label: haystack lines [$lo, $hi]"
+    echo "    Report: $report"
+    echo "    Log:    $log"
+    bash scripts/research/run_on_vast.sh \
+        "${flags_common[@]}" \
+        --haystack-min-lines "$lo" \
+        --haystack-max-lines "$hi" \
+        --output "$report" \
+        2>&1 | tee "$log"
+    echo "    -> finished $label"
+}
+
+if [[ "$MULTI_CONTEXT" == "1" ]]; then
+    echo "==> PR-K1.E NIAH validation — vast.ai CUDA, multi-context scan"
+    echo "    Model:        google/gemma-3-1b-it"
+    echo "    Samples each: $N_SAMPLES"
+    echo "    Sink x window: ${SINK} x ${WINDOW}"
+    echo "    Context ladder (padding lines): $CONTEXT_LADDER"
+    echo "    Configs:      oracle + v0.3 + v0.4 (modulo skip flags)"
+    echo
+
+    for n in $CONTEXT_LADDER; do
+        # ±15 % range around target line count
+        lo=$(( (n * 85 + 50) / 100 ))
+        hi=$(( (n * 115 + 50) / 100 ))
+        if [[ $lo -lt 10 ]]; then lo=10; fi
+        if [[ $hi -lt $((lo + 1)) ]]; then hi=$((lo + 1)); fi
+        run_one "ctx${n}" "$lo" "$hi"
+    done
+
+    echo
+    echo "==> Multi-context scan complete. Reports under:"
+    echo "    $out_dir/k1e_niah_vast_ctx*_${stamp}.json"
+    echo "    $log_dir/k1e_niah_vast_ctx*_${stamp}.log"
+else
+    echo "==> PR-K1.E NIAH validation — vast.ai CUDA, single-context"
+    echo "    Model:        google/gemma-3-1b-it"
+    echo "    Samples:      $N_SAMPLES"
+    echo "    Haystack:     [$HAYSTACK_MIN, $HAYSTACK_MAX] lines"
+    echo "    Sink x window: ${SINK} x ${WINDOW}"
+    echo "    Configs:      oracle + v0.3 + v0.4 (modulo skip flags)"
+    echo
+
+    run_one "single" "$HAYSTACK_MIN" "$HAYSTACK_MAX"
+fi
+
+echo
+echo "Commit:"
+echo "    git add $out_dir/k1e_niah_vast_*_${stamp}.json $log_dir/k1e_niah_vast_*_${stamp}.log"
+echo "    git commit -m 'vast H100/A100 K1.E NIAH validation evidence'"
+echo "    git push"
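
The ±15 % haystack range used in the multi-context loop is plain integer arithmetic with round-to-nearest, a floor of 10 lines, and a guarantee that the upper bound exceeds the lower. A small sketch that reproduces it for the default ladder and adds a rough token estimate; the 14 tokens-per-line figure is the script's own approximation, and the function name `ladder_range` is illustrative:

```shell
#!/usr/bin/env bash
# Illustrative re-statement of the range computation in the
# MULTI_CONTEXT loop of review_pr_k1e_on_vast.sh.
ladder_range() {
    local n="$1"
    local lo=$(( (n * 85 + 50) / 100 ))    # n * 0.85, rounded to nearest
    local hi=$(( (n * 115 + 50) / 100 ))   # n * 1.15, rounded to nearest
    if (( lo < 10 )); then lo=10; fi               # never fewer than 10 lines
    if (( hi < lo + 1 )); then hi=$(( lo + 1 )); fi # keep the range non-empty
    # Token estimate assumes ~14 tokens per padding line.
    echo "$n lines -> [$lo, $hi] lines, ~$(( n * 14 )) tokens"
}

for n in 70 280 1100; do
    ladder_range "$n"
done
```

For the default ladder this gives ranges of [60, 81], [238, 322], and [935, 1265] lines, or roughly 980, 3920, and 15400 tokens at the target line counts.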