Add speed profiler (#431)

officialgr · web-flow · commit c0beb2af30a5 · 2026-06-27T00:35:12.000+02:00
* Add report-only GPU speed profiling workflow

* Add Lucebox speed profiler

* Document speed profiler workflow

* Fix review findings: losslessness gate, temp dir, nsys tokens

- Losslessness gate now aggregates across all prompts (was only
  checking the first). A gate that can silently miss divergences
  on prompts 2+ is worse than no gate.
- Temp dir changed to mkdtemp() per run (unique path, no shared
  /tmp/lucebox_profile). Low real-world risk since CI serializes
  via the concurrency group, but a clean fix regardless.
- nsys per-token metrics now normalise by actual generated token
  count rather than requested n_gen. Note: per-token figures
  remain directional until the NVTX decode-window refactor (the
  nsys pass still covers load + prefill + decode together).
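
A minimal sketch of the aggregation described in the first bullet, assuming each prompt's output is available as a list of token ids keyed by prompt name. The function names and the input shape are hypothetical, not taken from `profile.py`; only the result keys (`lossless`, `prompts_failed`, `first_divergence`) mirror what the workflow's annotation step reads from `profile.json`.

```python
from typing import Dict, List, Optional


def first_divergence(a: List[int], b: List[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One stream is a strict prefix of the other: they diverge where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def aggregate_gate(spec: Dict[str, List[int]], ar: Dict[str, List[int]]) -> dict:
    """Compare every prompt (not just the first) and collect all failures."""
    failed: List[str] = []
    first: Optional[int] = None
    for name in sorted(spec):
        idx = first_divergence(spec[name], ar[name])
        if idx is not None:
            failed.append(name)
            if first is None:
                first = idx  # assumption: report the first failing prompt's index
    return {"lossless": not failed, "prompts_failed": failed, "first_divergence": first}


if __name__ == "__main__":
    # Hypothetical token streams: p1 matches, p2 diverges at index 2.
    spec = {"p1": [1, 2, 3], "p2": [1, 2, 4]}
    ar = {"p1": [1, 2, 3], "p2": [1, 2, 3]}
    print(aggregate_gate(spec, ar))
```

Recording failures by prompt name is what keeps a divergence on a later prompt from being masked by a clean first prompt.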

* Fix review findings: clock reset, summary guard

- Added "Reset GPU clocks" step (nvidia-smi -rgc, if: always())
  so the lock set at job start never persists on the shared runner
  between PRs.
- Guarded `cat server/profile.md` with a file-existence check so
  the summary step doesn't error when the profiler run fails.

* Clean up mkdtemp run dir on exit

The unique per-run temp dir (added to fix the earlier race) was never
removed, so repeated CI runs would grow /tmp unbounded. Register an
atexit cleanup that runs on success or crash; the kept artifacts
(json/md/nsys-rep) are written outside the temp dir before cleanup.
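
The cleanup can be sketched roughly as follows, assuming the run directory comes from `tempfile.mkdtemp()` and is removed through an `atexit` handler. `make_run_dir` and its prefix are illustrative names, not the profiler's actual code.

```python
import atexit
import shutil
import tempfile
from pathlib import Path


def make_run_dir(prefix: str = "lucebox_profile_") -> Path:
    """Create a unique per-run scratch dir and schedule its removal at exit."""
    run_dir = Path(tempfile.mkdtemp(prefix=prefix))
    # ignore_errors=True so a half-written or already-removed dir never
    # turns cleanup itself into a failure.
    atexit.register(shutil.rmtree, run_dir, ignore_errors=True)
    return run_dir


if __name__ == "__main__":
    scratch = make_run_dir()
    (scratch / "intermediate.txt").write_text("temporary data\n")
    # Kept artifacts (json/md/nsys-rep) would be written outside `scratch`.
    print(f"scratch dir: {scratch}")
```

`atexit` handlers run on normal interpreter exit and after an unhandled exception, but not when the process is killed by a signal Python does not handle or exits through `os._exit()`, so a hard kill can still leave a directory behind.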

* Fix CI: install profiler Python deps before running

The speed-profile job built the binaries but invoked profile.py with the
bare system python3, which lacks `transformers` (used to tokenize prompts).
Install it into an isolated venv so the shared runner's Python isn't
polluted and the full project env (torch, etc.) isn't dragged in for a
tokenizer-only dependency.

* Fix speed profile model and pin deps

* Update speed profile workflow for model paths and checks

* Update profile.py

* Revise speed profile documentation with new parameters

Updated parameters and CI details for speed profiling, including losslessness gate and local run instructions.

* Enhance speed profiling workflow with regression checks

* Quote speed profile baseline args safely

Build the optional baseline arguments as a Bash array and expand them with "${baseline_arg[@]}" so baseline paths containing spaces or shell metacharacters are passed to the profiler as intended.

* Report partial nsys parsing failures

Treat missing kernel-level nsys stats as a profiling error even when other nsys sections parse successfully, preventing zero kernel metrics from being rendered as real data. Surface non-critical nsys section failures as Markdown warnings while preserving valid kernel metrics.
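
One way to express that split, as a sketch only: the section names (`kernels`, `memcpy`, `cuda_api`) and the dict-of-rows input are assumptions for illustration, not the structure `profile.py` actually uses.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical name for the section whose absence invalidates the kernel metrics.
CRITICAL_SECTIONS = ("kernels",)


def classify_nsys_sections(
    sections: Dict[str, Optional[list]],
) -> Tuple[List[str], List[str]]:
    """Split missing/empty nsys sections into hard errors and soft warnings."""
    errors: List[str] = []
    warnings: List[str] = []
    # Always check critical sections, even if they are absent from the dict.
    names = list(CRITICAL_SECTIONS) + [n for n in sections if n not in CRITICAL_SECTIONS]
    for name in names:
        if sections.get(name):  # parsed and non-empty
            continue
        msg = f"nsys section '{name}' missing or empty"
        if name in CRITICAL_SECTIONS:
            errors.append(msg)    # never render zero kernel metrics as real data
        else:
            warnings.append(msg)  # surface it, but keep the valid kernel metrics
    return errors, warnings


if __name__ == "__main__":
    parsed = {"kernels": [{"name": "gemm", "time_ns": 1200}], "cuda_api": None}
    print(classify_nsys_sections(parsed))
```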

* Render nsys warnings with errors

* Fix formatting in speed-profile.md

* Use 128 generated tokens in speed profiler

* Report run-to-run (within-prompt) stddev in profiler headline

The headline ± pooled every per-rep sample across all prompts, so the large,
repeatable prompt-to-prompt throughput spread leaked into the stddev and
overstated run-to-run noise (e.g. decode 42.04 ± 12.14, and AL ± 1.43 even
though per-prompt AL is bit-identical across reps). It also inflated the
baseline 1σ band, making real regressions easier to dismiss as "overlap".

Compute the headline stddev/rsd as the pooled within-prompt stddev so it
reflects actual measurement noise. Prompt-to-prompt spread stays visible via
the min–max line and the per-prompt table. The NOISY verdict already used
per-prompt RSD and is unchanged.
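
To make the difference concrete, here is a small sketch using the standard pooled-variance formula, sqrt(Σ(nᵢ−1)·sᵢ² / Σ(nᵢ−1)). Whether `profile.py` uses exactly this weighting is an assumption, and the function names and sample values are illustrative only.

```python
import math
import statistics
from typing import Dict, List


def pooled_within_stddev(samples: Dict[str, List[float]]) -> float:
    """Pooled within-prompt stddev: per-prompt variances weighted by their dof."""
    weighted_var = 0.0
    dof = 0
    for reps in samples.values():
        if len(reps) < 2:
            continue  # a single rep carries no run-to-run information
        weighted_var += (len(reps) - 1) * statistics.variance(reps)
        dof += len(reps) - 1
    return math.sqrt(weighted_var / dof) if dof else 0.0


def naive_stddev(samples: Dict[str, List[float]]) -> float:
    """Old behaviour: pool every rep across prompts, mixing in prompt spread."""
    flat = [x for reps in samples.values() for x in reps]
    return statistics.stdev(flat) if len(flat) > 1 else 0.0


if __name__ == "__main__":
    # Made-up decode tok/s: tight within each prompt, far apart between prompts.
    decode = {"p1": [30.0, 30.2, 29.8], "p2": [54.0, 54.2, 53.8]}
    print(f"naive  : {naive_stddev(decode):.2f}")          # dominated by prompt spread
    print(f"pooled : {pooled_within_stddev(decode):.2f}")  # run-to-run noise only
```

With these made-up samples the naive figure is dozens of times larger than the pooled one, which is the same effect that widened the baseline 1σ band.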

* Fix formatting in speed-profile.md

* Enhance speed-profile documentation with baseline comparison

Expanded explanation of profiler functionality and usage, including baseline comparison steps.
diff --git a/.github/workflows/speed-profile.yml b/.github/workflows/speed-profile.yml
@@ -0,0 +1,211 @@
+name: Speed Profile
+
+# Report-only speed profile for the inference engine. Runs on the self-hosted
+# RTX 3090 (lucebox3) on PRs that touch the engine or the optimizations, and on
+# manual dispatch. It NEVER blocks a PR (continue-on-error: true) — it publishes a
+# report to the run summary + uploads the JSON / markdown / nsys trace as artifacts.
+#
+# Why report-only: perf has run-to-run variance (thermals, clocks, scheduling).
+# Gating a merge on a noisy absolute number produces false failures. We surface the
+# trend first; a soft threshold can come later once a baseline + variance band exist.
+
+on:
+  pull_request:
+    branches: [main]
+    paths:
+      - 'server/**'
+      - 'optimizations/**'
+      - '.github/workflows/speed-profile.yml'
+  workflow_dispatch:
+
+# Share the single physical 3090 with the existing gpu-tests job so two runs never
+# fight over the GPU. cancel-in-progress=false: let a queued profile finish rather
+# than killing it mid-measurement.
+concurrency:
+  group: lucebox3-gpu-runner
+  cancel-in-progress: false
+
+jobs:
+  speed-profile:
+    name: Speed profile (self-hosted RTX 3090, sm_86)
+    runs-on: [self-hosted, gpu, sm86]
+    timeout-minutes: 30
+    continue-on-error: true   # report-only: a slow/failed profile must not block the PR
+
+    # Model paths live on the runner, not in the repo (multi-GB weights). They are
+    # overridable via repo variables so the runner owner can point at whatever is
+    # staged without editing this workflow. Prefer the repo-documented Qwen3.6 GGUF
+    # draft path; keep the older LUCEBOX_SPEED_PROFILE_* variable names as aliases
+    # so existing repo settings continue to work.
+    env:
+      MODELS: ${{ vars.LUCEBOX_MODELS_DIR || '/opt/models' }}
+      TARGET_MODEL: ${{ vars.LUCEBOX_TARGET_MODEL || vars.LUCEBOX_SPEED_PROFILE_TARGET || 'Qwen3.6-27B-Q4_K_M.gguf' }}
+      DRAFT_MODEL: ${{ vars.LUCEBOX_DRAFT_MODEL || vars.LUCEBOX_SPEED_PROFILE_DRAFT || 'draft/dflash-draft-3.6-q4_k_m.gguf' }}
+      TOKENIZER: ${{ vars.LUCEBOX_TOKENIZER || vars.LUCEBOX_SPEED_PROFILE_TOKENIZER || 'Qwen/Qwen3.6-27B' }}
+
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          submodules: recursive
+          token: ${{ secrets.SUBMODULE_PAT || secrets.GITHUB_TOKEN }}
+
+      - name: GPU info (and pin clocks to cut variance, if permitted)
+        run: |
+          nvidia-smi --query-gpu=name,driver_version,memory.total,power.limit --format=csv
+          # Locking clocks makes the numbers comparable run-to-run. Safe to skip if the
+          # runner user can't run nvidia-smi -lgc; the profiler still records the power cap.
+          sudo nvidia-smi -lgc 1395 2>/dev/null || echo "clock lock not permitted; continuing"
+
+      - name: Check model weights are staged on the runner
+        id: models
+        run: |
+          # The weights are staged on the self-hosted runner out of band. If they are
+          # absent the engine binary aborts with a cryptic gguf "No such file" error, so
+          # check up front and SKIP cleanly instead — this job is report-only, and a
+          # missing model on the runner is an environment issue, not a PR defect.
+          target="$MODELS/$TARGET_MODEL"
+          draft="$MODELS/$DRAFT_MODEL"
+          present=true
+          missing_list=""
+          for f in "$target" "$draft"; do
+            if [ ! -f "$f" ]; then
+              present=false
+              missing_list="${missing_list}  - \`$f\`"$'\n'
+            fi
+          done
+          echo "present=$present" >> "$GITHUB_OUTPUT"
+          if [ "$present" = "false" ]; then
+            {
+              echo "## 🏎️ Speed profile — skipped (model weights not on runner)"
+              echo ""
+              echo "The profiler needs the target + draft weights staged on the self-hosted"
+              echo "runner, but these file(s) were not found:"
+              echo ""
+              printf '%s' "$missing_list"
+              echo ""
+              echo "Stage the weights at those paths, or set the repo variables"
+              echo "\`LUCEBOX_MODELS_DIR\` / \`LUCEBOX_TARGET_MODEL\` / \`LUCEBOX_DRAFT_MODEL\`"
+              echo "to point at where they live. Legacy \`LUCEBOX_SPEED_PROFILE_*\` variables also work."
+              echo "This job is report-only, so the PR is not blocked."
+              echo ""
+              echo "Model files found under \`$MODELS\`:"
+              find "$MODELS" -maxdepth 4 -type f \( -name '*.gguf' -o -name '*.safetensors' \) -print 2>/dev/null | sort | sed 's/^/  - /' || true
+            } >> "$GITHUB_STEP_SUMMARY"
+            echo "::warning title=Speed profile skipped::Model weights not found under $MODELS — see the run summary."
+          fi
+
+      - name: Build engine binaries (sm_86, Release)
+        if: steps.models.outputs.present == 'true'
+        run: |
+          cd server
+          cmake -B build \
+            -DCMAKE_CUDA_ARCHITECTURES="86" \
+            -DDFLASH27B_ENABLE_BSA=OFF \
+            -DDFLASH27B_FA_ALL_QUANTS=OFF \
+            -DCMAKE_BUILD_TYPE=Release
+          cmake --build build --target test_dflash test_generate -j"$(nproc)"
+
+      - name: Install profiler Python deps (isolated, pinned venv)
+        if: steps.models.outputs.present == 'true'
+        run: |
+          cd server
+          # The profiler only needs a tokenizer, so we use a tiny isolated venv
+          # rather than installing into the shared runner's Python or pulling the
+          # full project env (torch, datasets, ...). Keep versions pinned so
+          # benchmark setup is reproducible and resilient to upstream releases.
+          python3 -m venv .profiler-venv
+          .profiler-venv/bin/pip install --quiet --upgrade pip==25.1.1
+          .profiler-venv/bin/pip install --quiet --require-virtualenv \
+            transformers==4.52.4 \
+            tokenizers==0.21.1 \
+            sentencepiece==0.2.0 \
+            tiktoken==0.9.0 \
+            protobuf==6.31.1
+
+      - name: Run speed profiler
+        if: steps.models.outputs.present == 'true'
+        run: |
+          cd server
+          # Use a committed baseline if one is staged so the report can flag a
+          # regression. Seed it once from a green `main` run's profile.json artifact
+          # (see docs/specs/speed-profile.md); without it the delta is simply skipped.
+          # The path is overridable so the runner owner can point elsewhere.
+          baseline="${LUCEBOX_SPEED_BASELINE:-scripts/speed-baseline.json}"
+          baseline_arg=()
+          if [ -f "$baseline" ]; then
+            baseline_arg=(--baseline "$baseline" --regress-pct "${LUCEBOX_SPEED_REGRESS_PCT:-0.10}")
+            echo "Regression check against baseline: $baseline"
+          else
+            echo "No baseline at $baseline — regression flagging disabled this run."
+          fi
+          # Use 128 generated tokens per prompt by default: long enough to reduce
+          # startup/noise effects while keeping the serialized 3090 queue bounded.
+          # The nsys pass is a separate short profiled run; tok/s comes from clean passes.
+          # Run 5 timing reps by default so the report can distinguish real deltas from
+          # thermal/clock jitter; repo variables can trim this for temporary smoke runs.
+          .profiler-venv/bin/python scripts/profile.py \
+            --target "$MODELS/$TARGET_MODEL" \
+            --draft  "$MODELS/$DRAFT_MODEL" \
+            --tokenizer "$TOKENIZER" \
+            --n-gen "${LUCEBOX_SPEED_N_GEN:-128}" --budget 22 \
+            --reps "${LUCEBOX_SPEED_REPS:-5}" \
+            --noise-rsd-pct "${LUCEBOX_SPEED_NOISE_RSD_PCT:-0.05}" \
+            --nsys --check-lossless \
+            "${baseline_arg[@]}" \
+            --out-json profile.json --out-md profile.md
+        env:
+          LUCEBOX_SPEED_BASELINE: ${{ vars.LUCEBOX_SPEED_BASELINE || '' }}
+          LUCEBOX_SPEED_REGRESS_PCT: ${{ vars.LUCEBOX_SPEED_REGRESS_PCT || '' }}
+          LUCEBOX_SPEED_N_GEN: ${{ vars.LUCEBOX_SPEED_N_GEN || '' }}
+          LUCEBOX_SPEED_REPS: ${{ vars.LUCEBOX_SPEED_REPS || '' }}
+          LUCEBOX_SPEED_NOISE_RSD_PCT: ${{ vars.LUCEBOX_SPEED_NOISE_RSD_PCT || '' }}
+
+      - name: Publish report to the run summary
+        if: always() && steps.models.outputs.present == 'true'
+        run: |
+          if [ -f server/profile.md ]; then
+            { echo "## 🏎️ Speed profile"; echo ""; cat server/profile.md; } >> "$GITHUB_STEP_SUMMARY"
+          else
+            echo "Profiler produced no report (the run failed earlier — see logs)." >> "$GITHUB_STEP_SUMMARY"
+          fi
+
+      - name: Flag losslessness / regressions (annotations, non-blocking)
+        if: always() && steps.models.outputs.present == 'true'
+        run: |
+          [ -f server/profile.json ] || exit 0
+          # Report-only: emit warnings, never fail. A losslessness FAIL means the fast
+          # path changed the output and it is NOT run-to-run noise (AR agreed with
+          # itself) — worth triaging (real bug vs batched-verify FP). Inconclusive
+          # prompts (engine intrinsically nondeterministic) are NOT failures.
+          python3 - <<'PY'
+          import json
+          d = json.load(open("server/profile.json"))
+          ll, reg, noise = d.get("lossless", {}), d.get("regression", {}), d.get("summary", {}).get("noise", {})
+          if ll and not ll.get("lossless", True):
+              print(f"::warning title=Losslessness::spec-decode output differs from greedy AR on "
+                    f"{','.join(ll.get('prompts_failed', []))} (first token #{ll.get('first_divergence')}); "
+                    f"not run-to-run noise — triage bug vs batched-verify FP.")
+          if reg.get("regressed"):
+              print(f"::warning title=Speed regression::{','.join(reg.get('metrics', []))} moved past "
+                    f"±{reg.get('threshold_pct',0)*100:.0f}% vs baseline {reg.get('baseline_commit','?')}.")
+          if noise.get("noisy"):
+              print(f"::warning title=Noisy speed profile::{','.join(noise.get('metrics', []))} exceeded "
+                    f"the relative stddev threshold ({noise.get('threshold_rsd', 0)*100:.1f}%). "
+                    "Treat small deltas as below the profiler detection threshold.")
+          PY
+
+      - name: Reset GPU clocks
+        if: always()
+        run: sudo nvidia-smi -rgc 2>/dev/null || true
+
+      - name: Upload artifacts (json + markdown + nsys trace)
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: speed-profile-${{ github.run_id }}
+          path: |
+            server/profile.json
+            server/profile.md
+            server/profile.nsys-rep
+          if-no-files-found: warn
+          retention-days: 30
diff --git a/docs/specs/speed-profile.md b/docs/specs/speed-profile.md
@@ -0,0 +1,147 @@
+# Lucebox speed profiler (MVP)
+
+A small, CI-runnable profiler for the inference engine. It measures forward-pass
+speed on one GPU and produces a report that shows **where the time goes**, so a
+reviewer can see at a glance whether a PR moved the needle and where the next
+optimization margin is.
+
+It runs the engine **binaries directly** — `test_dflash` (the speculative / DFlash
+decode path) and `test_generate` (the plain autoregressive baseline) — with **no
+HTTP**, so the numbers reflect compute, not server/network noise. Both are CMake
+build targets under `server/build/` and can be overridden with `--df-bin` /
+`--ar-bin` (or the `DFLASH_BIN` / `DFLASH_BIN_AR` environment variables).
+
+## What it reports
+
+Three layers, coarse to fine:
+
+1. **Headline latency/throughput** — prefill time, model-side TTFT estimate, decode
+   tok/s, ms/token, plus the speculative-decoding **acceptance length (AL)** and
+   accept %. Repeated timing passes report **mean ± stddev**, not a single-point
+   estimate, so reviewers can tell whether a small delta is real or just jitter.
+   (`AL` is how many tokens the target commits per draft+verify step; decode
+   throughput ≈ `AL / step_time`.)
+2. **Per-step phase breakdown** — `draft_compute`, `draft_logits`, `verify_compute`,
+   … from the engine's own timers. Tells you *which phase* dominates a step.
+3. **Kernel-level (nsys)** — top CUDA kernels by GPU time, **kernel launches per
+   token** (kernel-fusion signal), **host↔device copy time per token** (CPU/GPU-overlap
+   signal), and sync-heavy CUDA APIs (CPU-stall signal). Tells you *why* a phase is slow.
+
+## Parameters (and why they are fixed defaults)
+
+The defaults mirror the **shipping config** so the numbers are production-representative,
+and they stay fixed so every run is comparable over time:
+
+- **`--budget 22`** — DDTree speculation budget = how many draft positions are verified
+  per target pass. 22 is the `dflash_server` default. Bigger = a bigger bet (higher
+  potential acceptance length) but more draft+verify cost; tuning it is a separate sweep,
+  not the CI job.
+- **`--n-gen 128`** — requested generated tokens per prompt. This is long enough
+  to amortize startup costs and reduce very-short-generation bias, while still
+  keeping the shared 3090 CI queue bounded. The argument is passed as `<n_gen>`
+  to both `test_dflash` and `test_generate`, where it controls the response length
+  generated for each benchmark prompt.
+- **`--reps 5`** — repeats each prompt enough times to expose run-to-run GPU
+  variance (thermals, clock boosting, scheduler jitter), then reports mean and
+  stddev for the headline metrics. Use `--reps 3` for a faster smoke profile when
+  needed, but PR comparisons should prefer 5+.
+- **`--noise-rsd-pct 0.05`** — report-only noise threshold. If any tracked
+  headline metric has relative stddev above 5% (the flag takes a fraction, so
+  0.05 = 5%), the markdown calls the profile **NOISY** and tells reviewers to
+  treat small deltas as below the profiler detection threshold.
+
+**Rule:** keep these consistent. A delta vs the baseline is only a valid regression
+signal if both runs used the same config — if you ever change a parameter, re-seed the
+baseline (you cannot compare across configs). When baseline and current 1σ intervals
+overlap and the delta is smaller than `--regress-pct`, the report marks that row as
+**noisy / overlap** instead of inviting reviewers to chase a ghost regression. All of
+these states are warnings only; the profiler remains report-only.
+
+## Losslessness gate (and why a bit-exact compare is too strict on its own)
+
+The gate checks that greedy speculative decode produces the same token stream as
+greedy autoregressive (AR) decode — a lossless-spec-decode claim should never change
+the model's output.
+
+A naive bit-exact compare **flags false failures**, and the engine itself explains
+why: the target sees draft tokens as a *batch* in the verify step but one-at-a-time in
+AR decode, and different GEMM shapes reduce in a different order. IEEE FP is
+non-associative, so when the top-2 logits sit within epsilon the argmax tie can flip —
+one token diverges and everything after it follows. See
+`server/src/qwen35/qwen35_backend.cpp` ("different GPU batch sizes → FP-nondeterministic
+state divergence → different greedy output") and `server/eval/README.md`, which runs an
+identical `baseline_2` config precisely because "cache-induced divergence and intrinsic
+noise are indistinguishable."
+
+So the gate runs a **determinism control**: a second, identical-config AR pass (reusing
+the AR baseline run — no extra GPU cost). For each prompt:
+
+| AR vs AR (control) | DFlash vs AR | verdict |
+|---|---|---|
+| identical | identical | **PASS** |
+| identical | diverges | **FAIL** — output changed and it is not run-to-run noise (triage needed) |
+| diverges | (either) | **inconclusive** — engine is intrinsically nondeterministic here, can't judge |
+
+The gate fails **only** on the middle row, which separates a real output change from
+a too-strict check: run-to-run noise is no longer flagged (it becomes *inconclusive*).
+A FAIL means the fast path genuinely changed the output, but that alone does not prove
+a logic bug: it can still be the batched-verify FP effect above (verify scores draft
+tokens as a batch, AR one at a time). Classifying a FAIL as bug-vs-FP needs the **logit
+gap** at the first mismatch (near-tie = FP, clear gap = bug), a follow-up the binaries
+don't emit yet. CI surfaces a FAIL as a non-blocking `::warning::` for triage; it stays report-only.
+
+## CI settings
+
+The `Speed Profile` workflow uses the same profiler defaults as the local recipe:
+`--n-gen 128`, `--reps 5`, and `--noise-rsd-pct 0.05`. Runner owners can
+temporarily override those values with repo variables `LUCEBOX_SPEED_N_GEN`,
+`LUCEBOX_SPEED_REPS`, and `LUCEBOX_SPEED_NOISE_RSD_PCT`, but PR-to-PR comparisons
+should keep them fixed.
+
+## Run it locally
+
+```bash
+cd server
+python3 scripts/profile.py \
+  --target /opt/models/Qwen3.6-27B-Q4_K_M.gguf \
+  --draft  /opt/models/draft/dflash-draft-3.6-q4_k_m.gguf \
+  --tokenizer Qwen/Qwen3.6-27B \
+  --n-gen 128 --budget 22 --reps 5 --noise-rsd-pct 0.05 \
+  --nsys --check-lossless \
+  --baseline scripts/speed-baseline.json --regress-pct 0.10 \
+  --out-json profile.json --out-md profile.md
+```
+
+## Comparing against a baseline
+
+The profiler is **report-only**, but it can diff the current run against a saved
+profile so reviewers see a single regression table instead of two separate reports.
+The comparison is a JSON round-trip:
+
+1. **Capture a baseline.** Run the profiler on the reference commit and keep its
+   JSON output:
+
+   ```bash
+   python3 scripts/profile.py ... --out-json scripts/speed-baseline.json
+   ```
+
+   Commit `scripts/speed-baseline.json` so every later run compares against the same
+   reference. Re-seed it whenever you change a profiler parameter (`--budget`,
+   `--n-gen`, `--reps`, …): you cannot compare across configs.
+
+2. **Compare a later run.** Point `--baseline` at that file and set the regression
+   threshold:
+
+   ```bash
+   python3 scripts/profile.py ... \
+     --baseline scripts/speed-baseline.json --regress-pct 0.10
+   ```
+
+3. **Read the delta.** The report adds a **"Delta vs baseline"** table with, per
+   headline metric, `baseline ± σ`, `now ± σ`, the absolute Δ and Δ%. A row is
+   flagged as a regression only when the move exceeds `--regress-pct` **and** the
+   baseline/current 1σ intervals do **not** overlap — a delta inside the noise band
+   is marked **noisy / overlap** instead, so reviewers don't chase jitter.
+
+Both runs must use the same parameters for the delta to be a valid signal (see the
+**Rule** under *Parameters* above).
diff --git a/server/scripts/profile.py b/server/scripts/profile.py