Commit 9addf61

feat(guardrails): Benchmark analysis in-progress
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
1 parent c48f52c commit 9addf61

11 files changed

Lines changed: 826 additions & 54 deletions

.github/workflows/ci.yaml

Lines changed: 81 additions & 3 deletions
@@ -1072,10 +1072,21 @@ jobs:
10721072
path: web/packages/studio/playwright-report/
10731073

10741074
benchmark-guardrails:
1075-
name: Guardrails plugin benchmark
1075+
# Run the two benchmark variants as parallel matrix jobs, each with its
1076+
# own NMP instance. This isolates them from each other (no shared mocks,
1077+
# no cross-talk on :8080) and roughly halves wall-clock vs. the previous
1078+
# single-job sequential layout. The `benchmark-guardrails-analyze` job
1079+
# below merges both artifacts and prints the with-vs-without comparison.
1080+
name: Guardrails plugin benchmark (${{ matrix.variant }})
10761081
if: github.event_name == 'workflow_dispatch'
10771082
runs-on: ubuntu-latest
10781083
timeout-minutes: 30
1084+
strategy:
1085+
# Don't cancel the other variant if one fails; the partial artifact
1086+
# is still useful for diagnosing what went wrong.
1087+
fail-fast: false
1088+
matrix:
1089+
variant: [with-guardrails, without-guardrails]
10791090
steps:
10801091
- name: Checkout nemo-platform
10811092
uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3
@@ -1102,15 +1113,81 @@ jobs:
11021113
PYTORCH_DEPS: cpu
11031114
- name: Run benchmark sweep
11041115
working-directory: nemo-platform
1105-
run: make benchmark-guardrails
1116+
# Pin both variants to the same `--run-id` so when the analyze job
1117+
# downloads both artifacts into one `runs/` parent, they merge into
1118+
# a single run directory the analyzer can read normally.
1119+
run: |
1120+
make benchmark-guardrails BENCHMARK_ARGS="\
1121+
--variant ${{ matrix.variant }} \
1122+
--run-id ci-${{ github.run_id }}-${{ github.run_attempt }}"
11061123
env:
11071124
NEMO_GUARDRAILS_REPO_ROOT: ${{ github.workspace }}/NeMo-Guardrails
11081125
_TYPER_FORCE_DISABLE_TERMINAL: "1"
11091126
- name: Upload benchmark artifacts
11101127
if: always()
11111128
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
11121129
with:
1113-
name: benchmark-guardrails-results
1130+
# Per-variant artifact name; GHA disallows two artifacts with the
1131+
# same name in one workflow run.
1132+
name: benchmark-guardrails-results-${{ matrix.variant }}
1133+
retention-days: 30
1134+
path: |
1135+
nemo-platform/plugins/nemo-guardrails/benchmarks/artifacts/runs/
1136+
1137+
benchmark-guardrails-analyze:
1138+
# Joins the two parallel matrix jobs above, downloads both artifacts into
1139+
# one merged `runs/<id>/` tree, and prints the with-vs-without
1140+
# comparison table. No regression gate yet: this exists to produce
1141+
# CI-collected numbers we can use to seed a baseline file later.
1142+
name: Guardrails benchmark analysis
1143+
needs: [benchmark-guardrails]
1144+
if: github.event_name == 'workflow_dispatch' && !cancelled()
1145+
runs-on: ubuntu-latest
1146+
timeout-minutes: 10
1147+
steps:
1148+
- name: Checkout nemo-platform
1149+
uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3
1150+
with:
1151+
path: nemo-platform
1152+
- name: Download with-guardrails artifact
1153+
uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c # v8.0.1
1154+
with:
1155+
name: benchmark-guardrails-results-with-guardrails
1156+
path: nemo-platform/plugins/nemo-guardrails/benchmarks/artifacts/runs/
1157+
- name: Download without-guardrails artifact
1158+
uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c # v8.0.1
1159+
with:
1160+
name: benchmark-guardrails-results-without-guardrails
1161+
path: nemo-platform/plugins/nemo-guardrails/benchmarks/artifacts/runs/
1162+
- name: Install uv
1163+
uses: astral-sh/setup-uv@37802adc94f370d6bfd71619e3f0bf239e1f3b78 # v7.6.0
1164+
with:
1165+
working-directory: nemo-platform
1166+
python-version: "3.11"
1167+
enable-cache: true
1168+
- name: Bootstrap Python environment
1169+
working-directory: nemo-platform
1170+
run: make bootstrap-python
1171+
env:
1172+
PYTORCH_DEPS: cpu
1173+
- name: Print benchmark comparison
1174+
working-directory: nemo-platform
1175+
# Both matrix jobs above use the same `--run-id`, so there should be
1176+
# exactly one merged run directory under `runs/`. `ls -td ... | head -1`
1177+
# is defensive in case that stops being true (e.g. an artifact
1178+
# leaks in from a previous workflow attempt).
1179+
run: |
1180+
RUN_DIR=$(ls -td plugins/nemo-guardrails/benchmarks/artifacts/runs/*/ | head -1)
1181+
echo "Analyzing run directory: $RUN_DIR"
1182+
uv run --frozen --package nemo-guardrails-plugin --extra bench \
1183+
python -m nemo_guardrails_plugin.benchmarks.analyze "$RUN_DIR"
1184+
- name: Upload merged benchmark artifacts
1185+
if: always()
1186+
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
1187+
with:
1188+
# Single merged artifact so collecting baseline samples is a matter
1189+
# of downloading one thing per workflow run rather than two.
1190+
name: benchmark-guardrails-results-merged
11141191
retention-days: 30
11151192
path: |
11161193
nemo-platform/plugins/nemo-guardrails/benchmarks/artifacts/runs/
@@ -1262,6 +1339,7 @@ jobs:
12621339
- web-studio-deps
12631340
- web-studio-e2e
12641341
- benchmark-guardrails
1342+
- benchmark-guardrails-analyze
12651343
- opa-policy-test
12661344
if: always()
12671345
runs-on: ubuntu-latest
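
The run-directory selection in the `Print benchmark comparison` step (`ls -td ... | head -1`) picks the most recently modified directory under `runs/`. A minimal Python sketch of the same selection follows; the helper name `newest_run_dir` is hypothetical and not part of the benchmark harness.

```python
from pathlib import Path


def newest_run_dir(runs_root: Path) -> Path:
    """Return the most recently modified run directory under runs_root.

    Hypothetical helper mirroring `ls -td runs/*/ | head -1`; it is not part
    of the benchmark harness.
    """
    # Like the shell glob, ignore hidden entries and plain files.
    run_dirs = [
        p for p in runs_root.iterdir() if p.is_dir() and not p.name.startswith(".")
    ]
    if not run_dirs:
        raise FileNotFoundError(f"no run directories under {runs_root}")
    # Newest modification time wins, matching `ls -t` ordering.
    return max(run_dirs, key=lambda p: p.stat().st_mtime)
```

Raising on an empty `runs/` tree is a design choice of this sketch: the shell pipeline can instead end up handing an empty `$RUN_DIR` to the analyzer.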
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
# Mock LLM configurations

These `.env` files configure the upstream `benchmark.mock_llm_server.run_server`
(from the `NeMo-Guardrails` checkout) for the IGW guardrails benchmark.

We keep our own copies (instead of pointing at the upstream checkout's
`benchmark/mock_llm_server/configs/`) so:

- We can change mock latency without touching the upstream repo. The original
  motivation was tuning `E2E_LATENCY_*` to isolate NMP+middleware overhead
  from mandatory NIM sleep (see the benchmark README for the full rationale).
- The exact mock behavior we benchmarked against is versioned alongside the
  results, so historical numbers stay reproducible even if upstream changes
  its defaults.

Initial contents are a verbatim copy of the upstream files:

- `app-llm.env` ← upstream `meta-llama-3.3-70b-instruct.env`
- `content-safety-llm.env` ← upstream `nvidia-llama-3.1-nemoguard-8b-content-safety.env`

Update either file to change mock behavior for the next benchmark run.
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
MODEL="meta/llama-3.3-70b-instruct"
UNSAFE_PROBABILITY=0.0
UNSAFE_TEXT="I can't help with that. Is there anything else I can assist you with?"
SAFE_TEXT="I can provide information and help with a wide range of topics, from science and history to entertainment and culture. I can also help with language-related tasks, such as translation and text summarization. However, I can't assist with requests that involve harm or illegal activities."
# End-to-end latency
E2E_LATENCY_MIN_SECONDS=4.0
E2E_LATENCY_MAX_SECONDS=4.0
E2E_LATENCY_MEAN_SECONDS=4.0
E2E_LATENCY_STD_SECONDS=0.0
# Streaming latency: Time to First Token (TTFT)
TTFT_MIN_SECONDS=0.3
TTFT_MAX_SECONDS=0.3
TTFT_MEAN_SECONDS=0.3
TTFT_STD_SECONDS=0.0
# Streaming latency: Chunk Latency (ITL)
CHUNK_LATENCY_MIN_SECONDS=0.015
CHUNK_LATENCY_MAX_SECONDS=0.015
CHUNK_LATENCY_MEAN_SECONDS=0.015
CHUNK_LATENCY_STD_SECONDS=0.0
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
MODEL="nvidia/llama-3.1-nemoguard-8b-content-safety"
UNSAFE_PROBABILITY=0.0
UNSAFE_TEXT="{\"User Safety\": \"unsafe\", \"Response Safety\": \"unsafe\", \"Safety Categories\": \"Violence, Criminal Planning/Confessions\"}"
SAFE_TEXT="{\"User Safety\": \"safe\", \"Response Safety\": \"safe\"}"
# End-to-end latency
E2E_LATENCY_MIN_SECONDS=0.5
E2E_LATENCY_MAX_SECONDS=0.5
E2E_LATENCY_MEAN_SECONDS=0.5
E2E_LATENCY_STD_SECONDS=0.0
# Streaming latency: Time to First Token (TTFT)
TTFT_MIN_SECONDS=0.2
TTFT_MAX_SECONDS=0.2
TTFT_MEAN_SECONDS=0.2
TTFT_STD_SECONDS=0.0
# Streaming latency: Chunk Latency (ITL)
CHUNK_LATENCY_MIN_SECONDS=0.015
CHUNK_LATENCY_MAX_SECONDS=0.015
CHUNK_LATENCY_MEAN_SECONDS=0.015
CHUNK_LATENCY_STD_SECONDS=0.0
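
Each latency group in these `.env` files carries a min, max, mean and standard deviation. One plausible reading is a normal draw clamped to the min/max bounds. The sketch below illustrates that reading only: the clamped-normal behavior and the function name `sample_latency_seconds` are assumptions, not taken from the mock server's source.

```python
import random


def sample_latency_seconds(min_s: float, max_s: float, mean_s: float, std_s: float) -> float:
    """Draw one latency: normal(mean, std) clamped to [min, max].

    ASSUMPTION: this clamped-normal interpretation of the *_MIN/_MAX/_MEAN/_STD
    settings is illustrative and not confirmed against the mock server.
    """
    draw = random.gauss(mean_s, std_s)
    return min(max_s, max(min_s, draw))


# E2E latency values from the two .env files above.
app_llm_s = sample_latency_seconds(4.0, 4.0, 4.0, 0.0)
content_safety_s = sample_latency_seconds(0.5, 0.5, 0.5, 0.0)
print(app_llm_s, content_safety_s)
```

Under this reading, setting min = max = mean with a standard deviation of 0 makes every mock response take a fixed time, which is what lets the benchmark attribute anything above that fixed time to the path under test.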
Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
# Local baseline results — 2026-06-16

Three back-to-back runs of `make benchmark-guardrails` on a local MacBook Pro
(Apple Silicon), no other heavy workloads running. Goal: characterize the
run-to-run variance of the new with-guardrails / without-guardrails harness so
we can decide what's gateable in CI.

## Hardware / setup

- Host: MacBook Pro, Apple Silicon, on AC power
- NMP, mocks, shim: all on localhost
- Mock LLM config: in-repo defaults (`plugins/nemo-guardrails/benchmarks/configs/mock_llm/`)
  - app LLM: 4.0s e2e latency, std 0
  - content-safety LLM: 0.5s e2e latency, std 0
- AIPerf sweep: concurrency `[1, 2, 4, 8, 16, 32, 64]`, `benchmark_duration: 60s`,
  `warmup_request_count: 10`, non-streaming chat completions
- Mock workers: 4 (default)
- Three runs in the same afternoon, NMP data dir reused across runs

## Run inventory

| Run | Run dir | Notes |
|---|---|---|
| 1 | `20260616_123851` | first run after the with/without harness change |
| 2 | `20260616_145058` | identical config |
| 3 | `20260616_152834` | identical config |

All three runs completed with 7/7 sweeps passing per variant, exit code 0.

## Δp50 (with-guardrails − without-guardrails), milliseconds

This is the headline metric: how much wall-clock time the guardrails middleware
adds on top of the bare NMP+IGW path, including the two content-safety LLM
round-trips that the rails trigger but whose latency belongs to the
content-safety mock rather than to the middleware itself.

| Run      |  c=1 |  c=2 |  c=4 |  c=8 | c=16 | c=32 |  c=64 |
|----------|-----:|-----:|-----:|-----:|-----:|-----:|------:|
| Run 1    | 1029 | 1071 | 1068 | 1104 | 1145 | 1260 |   778 |
| Run 2    | 1027 | 1062 | 1096 | 1105 | 1226 | 1256 | -2896 |
| Run 3    | 1030 | 1062 | 1079 | 1070 | 1118 | 1201 | -2077 |
| **mean** | **1029** | **1065** | **1081** | **1093** | **1163** | **1239** | **-1398** |
| range    |    3 |    9 |   28 |   35 |  108 |   59 |  3674 |
| range %  | 0.3% | 0.8% | 2.6% | 3.2% | 9.3% | 4.8% |   n/a |
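
The mean, range and range % rows can be recomputed from the per-run values. The sketch below does that for the Δp50 table; the function name `summarize` is illustrative, and treating range % as undefined when the sign flips between runs is an assumption about why c=64 is listed as n/a.

```python
from statistics import mean

# Δp50 values (ms) copied from the table above, keyed by concurrency.
DELTA_P50_MS = {
    1: [1029, 1027, 1030],
    2: [1071, 1062, 1062],
    4: [1068, 1096, 1079],
    8: [1104, 1105, 1070],
    16: [1145, 1226, 1118],
    32: [1260, 1256, 1201],
    64: [778, -2896, -2077],
}


def summarize(samples):
    """Return (rounded mean, range, range as % of mean or None)."""
    avg = mean(samples)
    spread = max(samples) - min(samples)
    # ASSUMPTION: a relative range is not meaningful when the sign flips.
    sign_flip = min(samples) < 0 < max(samples)
    pct = None if sign_flip else 100 * spread / abs(avg)
    return round(avg), spread, pct


for c, samples in DELTA_P50_MS.items():
    avg, spread, pct = summarize(samples)
    pct_text = "n/a" if pct is None else f"{pct:.1f}%"
    print(f"c={c:<2} mean={avg:>6} range={spread:>5} range%={pct_text}")
```

Range here is max minus min across the three runs, and range % divides that by the mean of the same three values.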

## with-guardrails p50 (absolute), milliseconds

Useful as a sanity check that nothing catastrophic shifted in the absolute
numbers: even if Δp50 stays steady, both variants could slow down together.

| Run      |  c=1 |  c=2 |  c=4 |  c=8 | c=16 | c=32 |  c=64 |
|----------|-----:|-----:|-----:|-----:|-----:|-----:|------:|
| Run 1    | 5049 | 5101 | 5114 | 5152 | 5201 | 5318 |  6164 |
| Run 2    | 5048 | 5093 | 5125 | 5137 | 5255 | 5279 |  5614 |
| Run 3    | 5050 | 5094 | 5123 | 5146 | 5163 | 5250 |  5486 |
| **mean** | **5049** | **5096** | **5121** | **5145** | **5206** | **5282** | **5755** |
| range    |    2 |    8 |   11 |   15 |   92 |   68 |   678 |
| range %  | 0.0% | 0.2% | 0.2% | 0.3% | 1.8% | 1.3% | 11.8% |

## without-guardrails p50 (absolute), milliseconds

For completeness. This is the variant that's wildly unstable at c=64.

| Run      |  c=1 |  c=2 |  c=4 |  c=8 | c=16 | c=32 | c=64 |
|----------|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
| Run 1    | 4020 | 4030 | 4045 | 4048 | 4056 | 4058 | 5386 |
| Run 2    | 4020 | 4031 | 4029 | 4032 | 4029 | 4023 | 8510 |
| Run 3    | 4020 | 4032 | 4044 | 4076 | 4045 | 4049 | 7563 |
| **mean** | **4020** | **4031** | **4039** | **4052** | **4043** | **4043** | **7153** |
| range    |    0 |    2 |   16 |   44 |   27 |   35 | 3124 |

The app mock sleeps for exactly 4.0s. The ~20–80 ms above 4000 across c=1–c=32
is pure NMP+IGW+shim overhead. At c=64 the app mock (4 workers) appears to
saturate and requests queue: keeping 64 requests in flight at 4.0s each needs
16 RPS, whereas the stable c=32 rows only demonstrate about 8 RPS
(32 in flight / ~4.04s).
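
Little's law (throughput = requests in flight / time each request spends in the system) gives a rough consistency check on the saturation explanation. The sketch below applies it to the mean without-guardrails p50 values from the table; using p50 as a stand-in for mean latency is an approximation, and the function name is illustrative.

```python
def implied_throughput_rps(in_flight: int, latency_ms: float) -> float:
    """Little's law: throughput = requests in flight / time spent per request."""
    return in_flight / (latency_ms / 1000.0)


APP_MOCK_SLEEP_MS = 4000
# Mean without-guardrails p50 (ms) per concurrency, from the table above.
WITHOUT_P50_MEAN_MS = {1: 4020, 2: 4031, 4: 4039, 8: 4052, 16: 4043, 32: 4043, 64: 7153}

for c, p50_ms in WITHOUT_P50_MEAN_MS.items():
    # APPROXIMATION: p50 is used in place of the mean latency.
    achieved = implied_throughput_rps(c, p50_ms)
    # Rate the mock would have to sustain for nothing to queue at this concurrency.
    needed = implied_throughput_rps(c, APP_MOCK_SLEEP_MS)
    print(f"c={c:<2} achieved~{achieved:5.2f} RPS  needed without queueing={needed:5.2f} RPS")
```

Under that approximation the c=32 row works out to roughly 7.9 RPS and the c=64 row to roughly 8.9 RPS, against the 16 RPS that 64 in-flight requests would need with no queueing.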

## p90 — informational only

p90 is much noisier than p50 across runs. Not gateable with three samples.

### Δp90, milliseconds

| Run   |  c=1 |  c=2 |  c=4 |  c=8 | c=16 | c=32 | c=64 |
|-------|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
| Run 1 | 1039 | 1099 | 1162 | 1025 |  911 |  604 | 3009 |
| Run 2 | 1028 | 1115 | 1160 | 1262 |  783 |  641 | 1015 |
| Run 3 | 1023 | 1076 | 1189 | 1085 | 1209 |   18 | 1998 |

## Observations

### What's stable enough to gate on

**c=1, 2, 4, 8.** The Δp50 ranges are 3–35 ms, well under any tolerance we'd
realistically write. The absolute with-guardrails p50 is even tighter (2–15 ms
across three runs). This is the regime where the harness is genuinely measuring
what we want: NMP+middleware overhead on top of fixed-latency mocks.

### What's borderline

**c=16.** Δp50 range is 9.3%. Gateable with a generous tolerance (~10%+) but
adds limited signal beyond c=8.

### What's not gateable

**c=32.** 4.8% Δp50 range. Still bounded, but the 59 ms run-to-run spread is
well above the 3–35 ms seen at c=1–c=8, and the absolute with-guardrails p50
wobbles too (68 ms range).

**c=64.** Unusable. Δp50 swings from +778 to −2896 across three runs.
The likely root cause is the app mock's 4-worker saturation at this load level:
the without-guardrails path fires app requests as fast as it can and the mock
queues unpredictably. The with-guardrails path's CS-mock work paces requests
enough to hide most of this. This is a test-rig artifact, not an NMP behavior.

### Side observation: middleware overhead is small

Of the ~1029 ms Δp50 at c=1:
- ~1000 ms is the two content-safety mock round-trips (0.5s each, mandatory).
- ~29 ms is the middleware's *own* work (rails orchestration, request/response
  shaping, etc.) plus bare NMP+IGW overhead delta vs. without-guardrails.

The without-guardrails baseline of ~4020 ms at c=1 against a 4000 ms mock means
**bare NMP+IGW+shim overhead is ~20 ms** at idle.
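
The decomposition above is plain subtraction, and it can be checked against the with-guardrails p50 at c=1. A minimal sketch using only numbers quoted in this document (the variable names are illustrative):

```python
# All inputs are values quoted in this document, in milliseconds.
APP_MOCK_SLEEP_MS = 4000
CS_MOCK_SLEEP_MS = 500
CS_ROUND_TRIPS = 2
WITHOUT_P50_C1_MS = 4020  # mean without-guardrails p50 at c=1
DELTA_P50_C1_MS = 1029    # mean Δp50 at c=1

# Time above the app mock's sleep on the bare path.
bare_overhead_ms = WITHOUT_P50_C1_MS - APP_MOCK_SLEEP_MS
# Mandatory sleep spent in the content-safety mock.
cs_sleep_ms = CS_ROUND_TRIPS * CS_MOCK_SLEEP_MS
# What remains of Δp50 once the content-safety sleep is removed.
middleware_ms = DELTA_P50_C1_MS - cs_sleep_ms
# Recombine the parts to cross-check against the with-guardrails table.
with_p50_c1_ms = APP_MOCK_SLEEP_MS + bare_overhead_ms + cs_sleep_ms + middleware_ms

print(f"bare overhead: {bare_overhead_ms} ms")
print(f"content-safety sleep: {cs_sleep_ms} ms")
print(f"middleware remainder: {middleware_ms} ms")
print(f"reconstructed with-guardrails p50: {with_p50_c1_ms} ms")
```

The reconstructed total matches the 5049 ms mean with-guardrails p50 reported for c=1.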

## Recommendation for the CI gate

Based on the variance data above:

| Concurrency | Gate Δp50? | Gate absolute with-guardrails p50? | Notes |
|---|---|---|---|
| 1 | yes | yes | tightest signal |
| 2 | yes | yes | |
| 4 | yes | yes | |
| 8 | yes | yes | |
| 16 | informational | informational | record but don't fail |
| 32 | informational | informational | record but don't fail |
| 64 | exclude | exclude | mock saturation, not gateable |

Proposed tolerance bands (`max(absolute_ms, relative_%)`):
- Δp50: `max(±100 ms, ±5%)`
- with-guardrails p50: `max(±150 ms, ±3%)`

The ±100 ms Δp50 floor is about 3× the largest Δp50 range observed locally at
the gated concurrencies (35 ms at c=8), and the ±150 ms with-guardrails floor
is about 10× the corresponding 15 ms range. That leaves headroom for CI
hardware being noisier than a quiet laptop.
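
A `max(absolute_ms, relative_%)` band can be evaluated as follows. This is a sketch under stated assumptions: the relative part is taken as a percentage of the baseline value, and the function name `within_band` is hypothetical rather than part of the analyzer.

```python
def within_band(observed_ms: float, baseline_ms: float, abs_tol_ms: float, rel_tol_pct: float) -> bool:
    """True if observed_ms lies within max(abs_tol_ms, rel_tol_pct% of baseline) of baseline_ms.

    ASSUMPTION: the relative tolerance is measured against the baseline value.
    """
    allowed_ms = max(abs_tol_ms, abs(baseline_ms) * rel_tol_pct / 100.0)
    return abs(observed_ms - baseline_ms) <= allowed_ms


# Baselines are the local c=1 means from the tables above.
DELTA_P50_BASELINE_MS = 1029
WITH_P50_BASELINE_MS = 5049

# Δp50 band: 5% of 1029 ms is about 51 ms, so the 100 ms floor applies.
print(within_band(1100, DELTA_P50_BASELINE_MS, abs_tol_ms=100, rel_tol_pct=5))
# with-guardrails band: 3% of 5049 ms is about 151 ms, just above the 150 ms floor.
print(within_band(5300, WITH_P50_BASELINE_MS, abs_tol_ms=150, rel_tol_pct=3))
```

Taking the larger of the two tolerances keeps the band from collapsing on small baselines (where a percentage is only a few milliseconds) while still scaling with larger ones.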

## Open questions / followups

- **Local baselines won't transfer to CI hardware.** These numbers should seed
  the baseline file but be replaced once we have N runs from the actual CI
  runner class.
- **Three samples is a small N.** Worth one more local run (Run 4) before we
  treat the means above as canonical, but the c=1–c=8 numbers are unlikely
  to budge meaningfully.
- **c=64 instability is downstream of NMP.** Hypothesis: the app mock's 4
  workers saturate at concurrency 64, which needs 16 RPS against a 4.0s sleep
  (64 / 4.0s). Easy to test by running with `--mock-workers 16`. Not blocking
  the gate work since c=64 is excluded anyway.

plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/aiperf_runner.py

Lines changed: 10 additions & 2 deletions
@@ -60,12 +60,15 @@ def prepare_runtime_aiperf_config(
6060
template_path: Path,
6161
runtime_config_path: Path,
6262
aiperf_output_dir: Path,
63+
model_ref: str | None = None,
6364
) -> dict[str, Any]:
6465
"""Materialize the AIPerf config this run will use.
6566
6667
Reads the checked-in ``template_path`` config, overrides its
67-
``output_base_dir`` to point inside the current run's directory, and writes
68-
the result to ``runtime_config_path``. AIPerf is later invoked with
68+
``output_base_dir`` to point inside the current run's directory, optionally
69+
overrides ``base_config.model`` (so the same template can target multiple
70+
VirtualModels in one harness invocation), and writes the result to
71+
``runtime_config_path``. AIPerf is later invoked with
6972
``--config-file <runtime_config_path>`` so every artifact lands under a
7073
separate per-run directory.
7174
@@ -82,6 +85,11 @@ def prepare_runtime_aiperf_config(
8285
# Point AIPerf's output_base_dir at this run's directory so its results
8386
# nest under our per-run artifacts tree.
8487
config["output_base_dir"] = str(aiperf_output_dir)
88+
if model_ref is not None:
89+
base_config = config.get("base_config")
90+
if not isinstance(base_config, dict):
91+
raise ValueError(f"Expected `base_config` mapping in {template_path}, got {type(base_config).__name__}")
92+
base_config["model"] = model_ref
8593
runtime_config_path.parent.mkdir(parents=True, exist_ok=True)
8694
runtime_config_path.write_text(yaml.safe_dump(config, sort_keys=False), encoding="utf-8")
8795
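
The `model_ref` override added in this hunk can be exercised in isolation. The following is a minimal sketch, assuming a hypothetical helper `with_model_override` and made-up model references; it works on plain dicts and leaves out the YAML reading and writing that the real function performs.

```python
from __future__ import annotations

import copy
from typing import Any


def with_model_override(template: dict[str, Any], model_ref: str | None) -> dict[str, Any]:
    """Return a copy of template with base_config.model replaced when model_ref is set.

    Hypothetical stand-in for the override step; not the plugin's function.
    """
    config = copy.deepcopy(template)  # leave the shared template untouched
    if model_ref is not None:
        base_config = config.get("base_config")
        if not isinstance(base_config, dict):
            raise ValueError(
                f"Expected `base_config` mapping, got {type(base_config).__name__}"
            )
        base_config["model"] = model_ref
    return config


# Illustrative template and model references (both made up for this sketch).
template = {"output_base_dir": "unset", "base_config": {"model": "template-default"}}
configs = {
    ref: with_model_override(template, ref)
    for ref in ("example/with-guardrails", "example/without-guardrails")
}
for ref, cfg in configs.items():
    print(ref, "->", cfg["base_config"]["model"])
```

Copying the template before overriding is a choice made for this sketch so that one template can be reused for several model references in a single process.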
