
Commit c0beb2a

Add speed profiler (#431)
* Add report-only GPU speed profiling workflow
* Add Lucebox speed profiler
* Document speed profiler workflow
* Fix review findings: losslessness gate, temp dir, nsys tokens
  - Losslessness gate now aggregates across all prompts (was only checking the first). A gate that can silently miss divergences on prompts 2+ is worse than no gate.
  - Temp dir changed to mkdtemp() per run (unique path, no shared /tmp/lucebox_profile). Low real-world risk since CI serializes via the concurrency group, but a clean fix regardless.
  - nsys per-token metrics now normalise by actual generated token count rather than requested n_gen. Note: per-token figures remain directional until the NVTX decode-window refactor (the nsys pass still covers load + prefill + decode together).
* Fix review findings: clock reset, summary guard
  - Added "Reset GPU clocks" step (nvidia-smi -rgc, if: always()) so the lock set at job start never persists on the shared runner between PRs.
  - Guarded `cat server/profile.md` with a file-existence check so the summary step doesn't error when the profiler run fails.
* Clean up mkdtemp run dir on exit
  The unique per-run temp dir (added to fix the earlier race) was never removed, so repeated CI runs would grow /tmp unbounded. Register an atexit cleanup that runs on success or crash; the kept artifacts (json/md/nsys-rep) are written outside the temp dir before cleanup.
* Fix CI: install profiler Python deps before running
  The speed-profile job built the binaries but invoked profile.py with the bare system python3, which lacks `transformers` (used to tokenize prompts). Install it into an isolated venv so the shared runner's Python isn't polluted and the full project env (torch, etc.) isn't dragged in for a tokenizer-only dependency.
* Fix speed profile model and pin deps
* Update speed profile workflow for model paths and checks
* Update profile.py
* Revise speed profile documentation with new parameters
  Updated parameters and CI details for speed profiling, including losslessness gate and local run instructions.
* Enhance speed profiling workflow with regression checks
* Quote speed profile baseline args safely
  Build the optional baseline arguments as a Bash array and expand them with "${baseline_arg[@]}" so baseline paths containing spaces or shell metacharacters are passed to the profiler as intended.
* Report partial nsys parsing failures
  Treat missing kernel-level nsys stats as a profiling error even when other nsys sections parse successfully, preventing zero kernel metrics from being rendered as real data. Surface non-critical nsys section failures as Markdown warnings while preserving valid kernel metrics.
* Render nsys warnings with errors
* Fix formatting in speed-profile.md
* Use 128 generated tokens in speed profiler
* Report run-to-run (within-prompt) stddev in profiler headline
  The headline ± pooled every per-rep sample across all prompts, so the large, repeatable prompt-to-prompt throughput spread leaked into the stddev and overstated run-to-run noise (e.g. decode 42.04 ± 12.14, and AL ± 1.43 even though per-prompt AL is bit-identical across reps). It also inflated the baseline 1σ band, making real regressions easier to dismiss as "overlap". Compute the headline stddev/rsd as the pooled within-prompt stddev so it reflects actual measurement noise. Prompt-to-prompt spread stays visible via the min–max line and the per-prompt table. The NOISY verdict already used per-prompt RSD and is unchanged.
* Fix formatting in speed-profile.md
* Enhance speed-profile documentation with baseline comparison
  Expanded explanation of profiler functionality and usage, including baseline comparison steps.
1 parent 26ca0dc commit c0beb2a

3 files changed

Lines changed: 1180 additions & 0 deletions


Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
name: Speed Profile

# Report-only speed profile for the inference engine. Runs on the self-hosted
# RTX 3090 (lucebox3) on PRs that touch the engine or the optimizations, and on
# manual dispatch. It NEVER blocks a PR (continue-on-error: true) — it publishes a
# report to the run summary + uploads the JSON / markdown / nsys trace as artifacts.
#
# Why report-only: perf has run-to-run variance (thermals, clocks, scheduling).
# Gating a merge on a noisy absolute number produces false failures. We surface the
# trend first; a soft threshold can come later once a baseline + variance band exist.

on:
  pull_request:
    branches: [main]
    paths:
      - 'server/**'
      - 'optimizations/**'
      - '.github/workflows/speed-profile.yml'
  workflow_dispatch:

# Share the single physical 3090 with the existing gpu-tests job so two runs never
# fight over the GPU. cancel-in-progress=false: let a queued profile finish rather
# than killing it mid-measurement.
concurrency:
  group: lucebox3-gpu-runner
  cancel-in-progress: false

jobs:
  speed-profile:
    name: Speed profile (self-hosted RTX 3090, sm_86)
    runs-on: [self-hosted, gpu, sm86]
    timeout-minutes: 30
    continue-on-error: true   # report-only: a slow/failed profile must not block the PR

    # Model paths live on the runner, not in the repo (multi-GB weights). They are
    # overridable via repo variables so the runner owner can point at whatever is
    # staged without editing this workflow. Prefer the repo-documented Qwen3.6 GGUF
    # draft path; keep the older LUCEBOX_SPEED_PROFILE_* variable names as aliases
    # so existing repo settings continue to work.
    env:
      MODELS: ${{ vars.LUCEBOX_MODELS_DIR || '/opt/models' }}
      TARGET_MODEL: ${{ vars.LUCEBOX_TARGET_MODEL || vars.LUCEBOX_SPEED_PROFILE_TARGET || 'Qwen3.6-27B-Q4_K_M.gguf' }}
      DRAFT_MODEL: ${{ vars.LUCEBOX_DRAFT_MODEL || vars.LUCEBOX_SPEED_PROFILE_DRAFT || 'draft/dflash-draft-3.6-q4_k_m.gguf' }}
      TOKENIZER: ${{ vars.LUCEBOX_TOKENIZER || vars.LUCEBOX_SPEED_PROFILE_TOKENIZER || 'Qwen/Qwen3.6-27B' }}

    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive
          token: ${{ secrets.SUBMODULE_PAT || secrets.GITHUB_TOKEN }}

      - name: GPU info (and pin clocks to cut variance, if permitted)
        run: |
          nvidia-smi --query-gpu=name,driver_version,memory.total,power.limit --format=csv
          # Locking clocks makes the numbers comparable run-to-run. Safe to skip if the
          # runner user can't run nvidia-smi -lgc; the profiler still records the power cap.
          sudo nvidia-smi -lgc 1395 2>/dev/null || echo "clock lock not permitted; continuing"

      - name: Check model weights are staged on the runner
        id: models
        run: |
          # The weights are staged on the self-hosted runner out of band. If they are
          # absent the engine binary aborts with a cryptic gguf "No such file" error, so
          # check up front and SKIP cleanly instead — this job is report-only, and a
          # missing model on the runner is an environment issue, not a PR defect.
          target="$MODELS/$TARGET_MODEL"
          draft="$MODELS/$DRAFT_MODEL"
          present=true
          missing_list=""
          for f in "$target" "$draft"; do
            if [ ! -f "$f" ]; then
              present=false
              missing_list="${missing_list} - \`$f\`"$'\n'
            fi
          done
          echo "present=$present" >> "$GITHUB_OUTPUT"
          if [ "$present" = "false" ]; then
            {
              echo "## 🏎️ Speed profile — skipped (model weights not on runner)"
              echo ""
              echo "The profiler needs the target + draft weights staged on the self-hosted"
              echo "runner, but these file(s) were not found:"
              echo ""
              printf '%s' "$missing_list"
              echo ""
              echo "Stage the weights at those paths, or set the repo variables"
              echo "\`LUCEBOX_MODELS_DIR\` / \`LUCEBOX_TARGET_MODEL\` / \`LUCEBOX_DRAFT_MODEL\`"
              echo "to point at where they live. Legacy \`LUCEBOX_SPEED_PROFILE_*\` variables also work."
              echo "This job is report-only, so the PR is not blocked."
              echo ""
              echo "Available draft candidates under \`$MODELS\`:"
              find "$MODELS" -maxdepth 4 -type f \( -name '*.gguf' -o -name '*.safetensors' \) -print 2>/dev/null | sort | sed 's/^/ - /' || true
            } >> "$GITHUB_STEP_SUMMARY"
            echo "::warning title=Speed profile skipped::Model weights not found under $MODELS — see the run summary."
          fi

      - name: Build engine binaries (sm_86, Release)
        if: steps.models.outputs.present == 'true'
        run: |
          cd server
          cmake -B build \
            -DCMAKE_CUDA_ARCHITECTURES="86" \
            -DDFLASH27B_ENABLE_BSA=OFF \
            -DDFLASH27B_FA_ALL_QUANTS=OFF \
            -DCMAKE_BUILD_TYPE=Release
          cmake --build build --target test_dflash test_generate -j"$(nproc)"

      - name: Install profiler Python deps (isolated, pinned venv)
        if: steps.models.outputs.present == 'true'
        run: |
          cd server
          # The profiler only needs a tokenizer, so we use a tiny isolated venv
          # rather than installing into the shared runner's Python or pulling the
          # full project env (torch, datasets, ...). Keep versions pinned so
          # benchmark setup is reproducible and resilient to upstream releases.
          python3 -m venv .profiler-venv
          .profiler-venv/bin/pip install --quiet --upgrade pip==25.1.1
          .profiler-venv/bin/pip install --quiet --require-virtualenv \
            transformers==4.52.4 \
            tokenizers==0.21.1 \
            sentencepiece==0.2.0 \
            tiktoken==0.9.0 \
            protobuf==6.31.1

      - name: Run speed profiler
        if: steps.models.outputs.present == 'true'
        run: |
          cd server
          # Use a committed baseline if one is staged so the report can flag a
          # regression. Seed it once from a green `main` run's profile.json artifact
          # (see docs/specs/speed-profile.md); without it the delta is simply skipped.
          # The path is overridable so the runner owner can point elsewhere.
          baseline="${LUCEBOX_SPEED_BASELINE:-scripts/speed-baseline.json}"
          baseline_arg=()
          if [ -f "$baseline" ]; then
            baseline_arg=(--baseline "$baseline" --regress-pct "${LUCEBOX_SPEED_REGRESS_PCT:-0.10}")
            echo "Regression check against baseline: $baseline"
          else
            echo "No baseline at $baseline — regression flagging disabled this run."
          fi
          # Use 128 generated tokens per prompt by default: long enough to reduce
          # startup/noise effects while keeping the serialized 3090 queue bounded.
          # The nsys pass adds a separate short profiled run; tok/s is measured on clean passes. Run 5
          # timing reps by default so the report can distinguish real deltas from
          # thermal/clock jitter; repo variables can trim this for temporary smoke runs.
          .profiler-venv/bin/python scripts/profile.py \
            --target "$MODELS/$TARGET_MODEL" \
            --draft "$MODELS/$DRAFT_MODEL" \
            --tokenizer "$TOKENIZER" \
            --n-gen "${LUCEBOX_SPEED_N_GEN:-128}" --budget 22 \
            --reps "${LUCEBOX_SPEED_REPS:-5}" \
            --noise-rsd-pct "${LUCEBOX_SPEED_NOISE_RSD_PCT:-0.05}" \
            --nsys --check-lossless \
            "${baseline_arg[@]}" \
            --out-json profile.json --out-md profile.md
        env:
          LUCEBOX_SPEED_BASELINE: ${{ vars.LUCEBOX_SPEED_BASELINE || '' }}
          LUCEBOX_SPEED_REGRESS_PCT: ${{ vars.LUCEBOX_SPEED_REGRESS_PCT || '' }}
          LUCEBOX_SPEED_N_GEN: ${{ vars.LUCEBOX_SPEED_N_GEN || '' }}
          LUCEBOX_SPEED_REPS: ${{ vars.LUCEBOX_SPEED_REPS || '' }}
          LUCEBOX_SPEED_NOISE_RSD_PCT: ${{ vars.LUCEBOX_SPEED_NOISE_RSD_PCT || '' }}

      - name: Publish report to the run summary
        if: always() && steps.models.outputs.present == 'true'
        run: |
          if [ -f server/profile.md ]; then
            { echo "## 🏎️ Speed profile"; echo ""; cat server/profile.md; } >> "$GITHUB_STEP_SUMMARY"
          else
            echo "Profiler produced no report (the run failed earlier — see logs)." >> "$GITHUB_STEP_SUMMARY"
          fi

      - name: Flag losslessness / regressions (annotations, non-blocking)
        if: always() && steps.models.outputs.present == 'true'
        run: |
          [ -f server/profile.json ] || exit 0
          # Report-only: emit warnings, never fail. A losslessness FAIL means the fast
          # path changed the output and it is NOT run-to-run noise (AR agreed with
          # itself) — worth triaging (real bug vs batched-verify FP). Inconclusive
          # prompts (engine intrinsically nondeterministic) are NOT failures.
          python3 - <<'PY'
          import json
          d = json.load(open("server/profile.json"))
          ll, reg, noise = d.get("lossless", {}), d.get("regression", {}), d.get("summary", {}).get("noise", {})
          if ll and not ll.get("lossless", True):
              print(f"::warning title=Losslessness::spec-decode output differs from greedy AR on "
                    f"{','.join(ll.get('prompts_failed', []))} (first token #{ll.get('first_divergence')}); "
                    f"not run-to-run noise — triage bug vs batched-verify FP.")
          if reg.get("regressed"):
              print(f"::warning title=Speed regression::{','.join(reg.get('metrics', []))} moved past "
                    f"±{reg.get('threshold_pct',0)*100:.0f}% vs baseline {reg.get('baseline_commit','?')}.")
          if noise.get("noisy"):
              print(f"::warning title=Noisy speed profile::{','.join(noise.get('metrics', []))} exceeded "
                    f"the relative stddev threshold ({noise.get('threshold_rsd', 0)*100:.1f}%). "
                    "Treat small deltas as below the profiler detection threshold.")
          PY

      - name: Reset GPU clocks
        if: always()
        run: sudo nvidia-smi -rgc 2>/dev/null || true

      - name: Upload artifacts (json + markdown + nsys trace)
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: speed-profile-${{ github.run_id }}
          path: |
            server/profile.json
            server/profile.md
            server/profile.nsys-rep
          if-no-files-found: warn
          retention-days: 30

docs/specs/speed-profile.md

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
# Lucebox speed profiler (MVP)

A small, CI-runnable profiler for the inference engine. It measures forward-pass
speed on one GPU and produces a report that shows **where the time goes**, so a
reviewer can see at a glance whether a PR moved the needle and where the next
optimization margin is.

It runs the engine **binaries directly**: `test_dflash` (the speculative / DFlash
decode path) and `test_generate` (the plain autoregressive baseline), with **no
HTTP**, so the numbers reflect compute, not server/network noise. Both are CMake
build targets under `server/build/` and can be overridden with `--df-bin` /
`--ar-bin` (or the `DFLASH_BIN` / `DFLASH_BIN_AR` environment variables).

## What it reports

Three layers, coarse to fine:

1. **Headline latency/throughput**: prefill time, model-side TTFT estimate, decode
   tok/s, ms/token, plus the speculative-decoding **acceptance length (AL)** and
   accept %. Repeated timing passes report **mean ± stddev**, not a single-point
   estimate, so reviewers can tell whether a small delta is real or just jitter.
   (`AL` is how many tokens the target commits per draft+verify step; decode
   throughput ≈ `AL / step_time`.)
2. **Per-step phase breakdown**: `draft_compute`, `draft_logits`, `verify_compute`,
   … from the engine's own timers. Tells you *which phase* dominates a step.
3. **Kernel-level (nsys)**: top CUDA kernels by GPU time, **kernel launches per
   token** (kernel-fusion signal), **host↔device copy time per token** (CPU/GPU-overlap
   signal), and sync-heavy CUDA APIs (CPU-stall signal). Tells you *why* a phase is slow.
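
The headline **mean ± stddev** is meant to reflect run-to-run noise rather than prompt-to-prompt spread. The sketch below shows one way to compute that, assuming the textbook pooled within-prompt sample stddev. The commit message names the technique, but the exact formula in `scripts/profile.py` is not shown here, and the function name and sample numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def pooled_within_prompt_stddev(samples_by_prompt):
    """Pooled sample stddev, measuring each rep against its own prompt's mean.

    samples_by_prompt maps a prompt id to that prompt's per-rep measurements.
    Prompts with fewer than two reps contribute no degrees of freedom.
    """
    sum_sq = 0.0  # squared deviations from each prompt's own mean
    dof = 0       # total degrees of freedom: sum of (n_i - 1)
    for reps in samples_by_prompt.values():
        if len(reps) < 2:
            continue
        m = mean(reps)
        sum_sq += sum((x - m) ** 2 for x in reps)
        dof += len(reps) - 1
    return math.sqrt(sum_sq / dof) if dof else 0.0


# Hypothetical decode tok/s: two prompts with very different throughput,
# each stable across its own reps.
decode_tok_s = {
    "prompt_a": [30.0, 30.5, 29.5],
    "prompt_b": [54.0, 54.5, 53.5],
}
all_samples = [x for reps in decode_tok_s.values() for x in reps]

headline_mean = mean(all_samples)                           # mean over every sample
run_to_run_sd = pooled_within_prompt_stddev(decode_tok_s)   # noise within prompts only
naive_sd = stdev(all_samples)                               # also absorbs prompt spread
```

With these numbers the naive stddev is dominated by the gap between the two prompts, while the pooled within-prompt value stays at the size of the per-rep jitter. That gap is what the min–max line and the per-prompt table are for.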

## Parameters (and why they are fixed defaults)

The defaults mirror the **shipping config** so the numbers are production-representative,
and they stay fixed so every run is comparable over time:

- **`--budget 22`**: DDTree speculation budget = how many draft positions are verified
  per target pass. 22 is the `dflash_server` default. Bigger = a bigger bet (higher
  potential acceptance length) but more draft+verify cost; tuning it is a separate sweep,
  not the CI job.
- **`--n-gen 128`**: requested generated tokens per prompt. This is long enough
  to amortize startup costs and reduce very-short-generation bias, while still
  keeping the shared 3090 CI queue bounded. The argument is passed as `<n_gen>`
  to both `test_dflash` and `test_generate`, where it controls the response length
  generated for each benchmark prompt.
- **`--reps 5`**: repeats each prompt enough times to expose run-to-run GPU
  variance (thermals, clock boosting, scheduler jitter), then reports mean and
  stddev for the headline metrics. Use `--reps 3` for a faster smoke profile when
  needed, but PR comparisons should prefer 5+.
- **`--noise-rsd-pct 0.05`**: report-only noise threshold. If any tracked
  headline metric has relative stddev above 5%, the markdown calls the profile
  **NOISY** and tells reviewers to treat small deltas as below the profiler
  detection threshold.
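
To make the `--noise-rsd-pct` threshold concrete, here is a minimal sketch of a relative-stddev check. It assumes RSD is the sample stddev divided by the mean, computed over one prompt's reps. The helper names and metric names are hypothetical, not taken from `scripts/profile.py`.

```python
from statistics import mean, stdev


def relative_stddev(samples):
    """Sample stddev divided by |mean|; 0.0 when it cannot be computed."""
    if len(samples) < 2:
        return 0.0
    m = mean(samples)
    return stdev(samples) / abs(m) if m else 0.0


def noisy_metrics(reps_by_metric, threshold_rsd=0.05):
    """Names of metrics whose RSD exceeds the threshold (0.05 means 5%)."""
    return sorted(
        name for name, reps in reps_by_metric.items()
        if relative_stddev(reps) > threshold_rsd
    )


# Hypothetical reps for one prompt: steady decode, jittery prefill.
reps_by_metric = {
    "decode_tok_s": [40.0, 41.0, 39.0, 40.0, 40.0],
    "prefill_ms": [100.0, 120.0, 90.0, 110.0, 80.0],
}
flagged = noisy_metrics(reps_by_metric)
profile_is_noisy = bool(flagged)
```

Here decode varies by under 2% and stays quiet, while prefill varies by roughly 16% and would mark the profile **NOISY**.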

**Rule:** keep these consistent. A delta vs the baseline is only a valid regression
signal if both runs used the same config; if you ever change a parameter, re-seed the
baseline (you cannot compare across configs). When baseline and current 1σ intervals
overlap and the delta is smaller than `--regress-pct`, the report marks that row as
**noisy / overlap** instead of inviting reviewers to chase a ghost regression. All of
these states are warnings only; the profiler remains report-only.

## Losslessness gate (and why a bit-exact compare is too strict on its own)

The gate checks that greedy speculative decode produces the same token stream as
greedy autoregressive (AR) decode: a lossless-spec-decode claim should never change
the model's output.

A naive bit-exact compare **flags false failures**, and the engine itself explains
why: the target sees draft tokens as a *batch* in the verify step but one-at-a-time in
AR decode, and different GEMM shapes reduce in a different order. IEEE FP is
non-associative, so when the top-2 logits sit within epsilon the argmax tie can flip:
one token diverges and everything after it follows. See
`server/src/qwen35/qwen35_backend.cpp` ("different GPU batch sizes → FP-nondeterministic
state divergence → different greedy output") and `server/eval/README.md`, which runs an
identical `baseline_2` config precisely because "cache-induced divergence and intrinsic
noise are indistinguishable."

So the gate runs a **determinism control**: a second, identical-config AR pass (it reuses
the AR baseline run, so there is no extra GPU cost). For each prompt:

| AR vs AR (control) | DFlash vs AR | verdict |
|---|---|---|
| identical | identical | **PASS** |
| identical | diverges | **FAIL**: output changed and it is not run-to-run noise (triage needed) |
| diverges | (either) | **inconclusive**: engine is intrinsically nondeterministic here, can't judge |

The gate fails **only** on the middle row, which answers "real bug or too-strict check?":
it no longer flags run-to-run noise (that becomes *inconclusive*). A FAIL means the fast
path genuinely changed the output, but that alone does not prove a logic bug: it can be
the batched-verify FP effect above (verify scores draft tokens as a batch vs AR
one-at-a-time). Classifying a FAIL as bug-vs-FP needs the **logit gap** at the first
mismatch (near-tie = FP, clear gap = bug), a follow-up the binaries don't emit yet. CI
surfaces a FAIL as a non-blocking `::warning::` for triage; it stays report-only.
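
The verdict table can be sketched as a small function over three token streams per prompt. This is an illustration under stated assumptions, not the code in `scripts/profile.py`. The `lossless` and `prompts_failed` keys mirror what the workflow's annotation step reads from `profile.json`; the function names and the `verdicts` key are hypothetical.

```python
def first_divergence(a, b):
    """Index of the first differing token, or None if the streams are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one stream is a strict prefix of the other
    return None


def verdict(ar_tokens, ar_control_tokens, dflash_tokens):
    """Apply the table: the AR-vs-AR control is checked before DFlash-vs-AR."""
    if first_divergence(ar_tokens, ar_control_tokens) is not None:
        return "inconclusive"  # AR disagrees with itself, so nothing can be judged
    if first_divergence(ar_tokens, dflash_tokens) is not None:
        return "FAIL"          # AR is stable, yet the fast path changed the output
    return "PASS"


def gate(streams_by_prompt):
    """Aggregate over every prompt, not just the first one."""
    verdicts = {pid: verdict(*s) for pid, s in streams_by_prompt.items()}
    failed = sorted(pid for pid, v in verdicts.items() if v == "FAIL")
    return {"lossless": not failed, "prompts_failed": failed, "verdicts": verdicts}


# Hypothetical token ids, ordered as (AR, AR control, DFlash) per prompt.
result = gate({
    "p0": ([1, 2, 3], [1, 2, 3], [1, 2, 3]),
    "p1": ([1, 2, 3], [1, 2, 3], [1, 9, 3]),
    "p2": ([1, 2, 3], [1, 2, 4], [1, 2, 3]),
})
```

Only `p1` fails the gate in this example; `p2` is reported as inconclusive because the control pass already diverged.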

## CI settings

The `Speed Profile` workflow uses the same profiler defaults as the local recipe:
`--n-gen 128`, `--reps 5`, and `--noise-rsd-pct 0.05`. Runner owners can
temporarily override those values with repo variables `LUCEBOX_SPEED_N_GEN`,
`LUCEBOX_SPEED_REPS`, and `LUCEBOX_SPEED_NOISE_RSD_PCT`, but PR-to-PR comparisons
should keep them fixed.
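
The workflow passes each repo variable through as an empty string when it is unset, and the Bash `${VAR:-default}` expansion treats empty the same as unset. The sketch below mirrors that rule in Python for anyone scripting around the profiler. The `setting` helper is hypothetical, and the profiler itself takes these values as CLI flags rather than reading the environment.

```python
import os


def setting(name, default, cast=str, env=None):
    """Return cast(env[name]), or default when the variable is unset or empty."""
    env = os.environ if env is None else env
    raw = env.get(name, "")
    return cast(raw) if raw != "" else default


# Hypothetical environment: one real override, one variable passed through empty.
example_env = {"LUCEBOX_SPEED_REPS": "3", "LUCEBOX_SPEED_N_GEN": ""}

n_gen = setting("LUCEBOX_SPEED_N_GEN", 128, int, example_env)                  # empty, so default
reps = setting("LUCEBOX_SPEED_REPS", 5, int, example_env)                      # overridden
noise_rsd = setting("LUCEBOX_SPEED_NOISE_RSD_PCT", 0.05, float, example_env)   # unset, so default
```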

## Run it locally

```bash
cd server
python3 scripts/profile.py \
  --target /opt/models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft /opt/models/draft/dflash-draft-3.6-q4_k_m.gguf \
  --tokenizer Qwen/Qwen3.6-27B \
  --n-gen 128 --budget 22 --reps 5 --noise-rsd-pct 0.05 \
  --nsys --check-lossless \
  --baseline scripts/speed-baseline.json --regress-pct 0.10 \
  --out-json profile.json --out-md profile.md
```

## Comparing against a baseline

The profiler is **report-only**, but it can diff the current run against a saved
profile so reviewers see a single regression table instead of two separate reports.
The comparison is a JSON round-trip:

1. **Capture a baseline.** Run the profiler on the reference commit and keep its
   JSON output:

   ```bash
   python3 scripts/profile.py ... --out-json scripts/speed-baseline.json
   ```

   Commit `scripts/speed-baseline.json` so every later run compares against the same
   reference. Re-seed it whenever you change a profiler parameter (`--budget`,
   `--n-gen`, `--reps`, …): you cannot compare across configs.

2. **Compare a later run.** Point `--baseline` at that file and set the regression
   threshold:

   ```bash
   python3 scripts/profile.py ... \
     --baseline scripts/speed-baseline.json --regress-pct 0.10
   ```

3. **Read the delta.** The report adds a **"Delta vs baseline"** table with, per
   headline metric, `baseline ± σ`, `now ± σ`, the absolute Δ and Δ%. A row is
   flagged as a regression only when the move exceeds `--regress-pct` **and** the
   baseline/current 1σ intervals do **not** overlap; a delta inside the noise band
   is marked **noisy / overlap** instead, so reviewers don't chase jitter.

Both runs must use the same parameters for the delta to be a valid signal (see the
**Rule** under *Parameters* above).
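
The flagging rule in step 3 can be sketched as follows. This assumes a symmetric threshold (the CI annotation words it as "moved past ±") and fills in the cases the text leaves open: any row whose intervals overlap is labelled noisy, and a small move with separated intervals is labelled ok. The function names and status strings are hypothetical.

```python
def intervals_overlap(mean_a, sd_a, mean_b, sd_b):
    """True when [mean_a ± sd_a] and [mean_b ± sd_b] share at least one point."""
    return (mean_a - sd_a) <= (mean_b + sd_b) and (mean_b - sd_b) <= (mean_a + sd_a)


def classify_delta(base_mean, base_sd, now_mean, now_sd, regress_pct=0.10):
    """Return the absolute and relative delta plus a status for one metric row."""
    delta = now_mean - base_mean
    delta_pct = delta / base_mean if base_mean else 0.0
    overlap = intervals_overlap(base_mean, base_sd, now_mean, now_sd)
    if abs(delta_pct) > regress_pct and not overlap:
        status = "flagged"          # large move, and the 1-sigma bands are disjoint
    elif overlap:
        status = "noisy / overlap"  # assumption: any overlapping row lands here
    else:
        status = "ok"               # assumption: small move, disjoint bands
    return {"delta": delta, "delta_pct": delta_pct, "status": status}


# Hypothetical decode tok/s rows: (baseline mean, sd, current mean, sd).
big_drop = classify_delta(42.0, 0.5, 36.0, 0.6)    # about -14%, bands disjoint
jitter = classify_delta(42.0, 1.5, 40.5, 1.2)      # about -4%, bands overlap
small_move = classify_delta(42.0, 0.2, 41.0, 0.2)  # about -2%, bands disjoint
```

Only the first row is flagged. The second is the case the report calls **noisy / overlap**, and the third is a real but sub-threshold move.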
