| 1 | +# Local baseline results — 2026-06-16 |
| 2 | + |
| 3 | +Three back-to-back runs of `make benchmark-guardrails` on a local MacBook Pro |
| 4 | +(Apple Silicon), no other heavy workloads running. Goal: characterize the |
| 5 | +run-to-run variance of the new with-guardrails / without-guardrails harness so |
| 6 | +we can decide what's gateable in CI. |
| 7 | + |
| 8 | +## Hardware / setup |
| 9 | + |
| 10 | +- Host: MacBook Pro, Apple Silicon, on AC power |
| 11 | +- NMP, mocks, shim: all on localhost |
| 12 | +- Mock LLM config: in-repo defaults (`plugins/nemo-guardrails/benchmarks/configs/mock_llm/`) |
| 13 | + - app LLM: 4.0s e2e latency, std 0 |
| 14 | + - content-safety LLM: 0.5s e2e latency, std 0 |
| 15 | +- AIPerf sweep: concurrency `[1, 2, 4, 8, 16, 32, 64]`, `benchmark_duration: 60s`, |
| 16 | + `warmup_request_count: 10`, non-streaming chat completions |
| 17 | +- Mock workers: 4 (default) |
| 18 | +- Three runs in the same afternoon, NMP data dir reused across runs |
| 19 | + |
| 20 | +## Run inventory |
| 21 | + |
| 22 | +| Run | Run dir | Notes | |
| 23 | +|---|---|---| |
| 24 | +| 1 | `20260616_123851` | first run after the with/without harness change | |
| 25 | +| 2 | `20260616_145058` | identical config | |
| 26 | +| 3 | `20260616_152834` | identical config | |
| 27 | + |
| 28 | +All three runs completed with 7/7 sweeps passing per variant, exit code 0. |
| 29 | + |
| 30 | +## Δp50 (with-guardrails − without-guardrails), milliseconds |
| 31 | + |
| 32 | +This is the headline metric: how much wall-clock time the guardrails middleware |
| 33 | +adds on top of the bare NMP+IGW path, including the two content-safety LLM |
| 34 | +round-trips that the rails trigger but the content-safety mock serves. |
| 35 | + |
| 36 | +| Run | c=1 | c=2 | c=4 | c=8 | c=16 | c=32 | c=64 | |
| 37 | +|---------|-----:|-----:|-----:|-----:|-----:|-----:|--------:| |
| 38 | +| Run 1 | 1029 | 1071 | 1068 | 1104 | 1145 | 1260 | 778 | |
| 39 | +| Run 2 | 1027 | 1062 | 1096 | 1105 | 1226 | 1256 | −2896 |
| 40 | +| Run 3 | 1030 | 1062 | 1079 | 1070 | 1118 | 1201 | −2077 |
| 41 | +| **mean**| **1029** | **1065** | **1081** | **1093** | **1163** | **1239** | **−1398** | |
| 42 | +| range | 3 | 9 | 28 | 35 | 108 | 59 | 3674 | |
| 43 | +| range % | 0.3% | 0.8% | 2.6% | 3.2% | 9.3% | 4.8% | n/a | |
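
The summary rows can be reproduced from the per-run values. The following is a minimal sketch, not part of the harness: the names `DELTA_P50` and `summarize` are hypothetical, and range % is taken as range divided by the mean, which matches the figures in the table.

```python
from statistics import mean

CONCURRENCIES = [1, 2, 4, 8, 16, 32, 64]

# Δp50 per run in ms, copied from the table above.
DELTA_P50 = [
    [1029, 1071, 1068, 1104, 1145, 1260, 778],
    [1027, 1062, 1096, 1105, 1226, 1256, -2896],
    [1030, 1062, 1079, 1070, 1118, 1201, -2077],
]


def summarize(runs):
    """Return {concurrency: (mean_ms, range_ms, range_pct)} across runs."""
    summary = {}
    for i, c in enumerate(CONCURRENCIES):
        values = [run[i] for run in runs]
        avg = mean(values)
        spread = max(values) - min(values)
        # Range % is only meaningful when every delta is positive;
        # at c=64 the sign flips between runs, so report None (the "n/a" cell).
        pct = round(100 * spread / avg, 1) if min(values) > 0 else None
        summary[c] = (round(avg), spread, pct)
    return summary


if __name__ == "__main__":
    for c, (avg, spread, pct) in summarize(DELTA_P50).items():
        print(f"c={c:>2}: mean={avg} ms, range={spread} ms, range%={pct}")
```

Running the same function over the absolute p50 tables gives their mean and range rows as well.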
| 44 | + |
| 45 | +## with-guardrails p50 (absolute), milliseconds |
| 46 | + |
| 47 | +Useful as a sanity check that nothing catastrophic shifted in the absolute |
| 48 | +numbers — even if Δp50 stays steady, both variants could slow down together. |
| 49 | + |
| 50 | +| Run | c=1 | c=2 | c=4 | c=8 | c=16 | c=32 | c=64 | |
| 51 | +|---------|-----:|-----:|-----:|-----:|-----:|-----:|-----:| |
| 52 | +| Run 1 | 5049 | 5101 | 5114 | 5152 | 5201 | 5318 | 6164 | |
| 53 | +| Run 2 | 5048 | 5093 | 5125 | 5137 | 5255 | 5279 | 5614 | |
| 54 | +| Run 3 | 5050 | 5094 | 5123 | 5146 | 5163 | 5250 | 5486 | |
| 55 | +| **mean**| **5049** | **5096** | **5121** | **5145** | **5206** | **5282** | **5755** | |
| 56 | +| range | 2 | 8 | 11 | 15 | 92 | 68 | 678 | |
| 57 | +| range % | 0.0% | 0.2% | 0.2% | 0.3% | 1.8% | 1.3% | 11.8%| |
| 58 | + |
| 59 | +## without-guardrails p50 (absolute), milliseconds |
| 60 | + |
| 61 | +For completeness. This is the variant that is unstable at c=64 (3124 ms range across three runs). |
| 62 | + |
| 63 | +| Run | c=1 | c=2 | c=4 | c=8 | c=16 | c=32 | c=64 | |
| 64 | +|---------|-----:|-----:|-----:|-----:|-----:|-----:|-----:| |
| 65 | +| Run 1 | 4020 | 4030 | 4045 | 4048 | 4056 | 4058 | 5386 | |
| 66 | +| Run 2 | 4020 | 4031 | 4029 | 4032 | 4029 | 4023 | 8510 | |
| 67 | +| Run 3 | 4020 | 4032 | 4044 | 4076 | 4045 | 4049 | 7563 | |
| 68 | +| **mean**| **4020** | **4031** | **4039** | **4052** | **4043** | **4043** | **7153** | |
| 69 | +| range | 0 | 2 | 16 | 44 | 27 | 35 | 3124 | |
| 70 | + |
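| 71 | +The app mock sleeps for exactly 4.0s, so the ~20–80 ms above 4000 across c=1–c=32 is NMP+IGW+shim overhead. |
| 72 | +At c=64 the p50 jumps to 5.4–8.5 s, which means requests are queueing; the suspected bottleneck is the 4 app-mock workers. The ceiling itself is unmeasured: one request per worker would cap the mock at 4 × (1 req / 4 s) = 1 RPS, |
| 73 | +yet c=32 sustains roughly 32 / 4.04 s ≈ 8 RPS with no visible queueing, so each worker must be serving several requests concurrently and, if the mock is the bottleneck, its limit sits somewhere between 32 and 64 in flight. |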
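
A quick way to sanity-check the saturation reasoning is to estimate the sustained request rate at each concurrency with Little's law (throughput ≈ in-flight requests / latency). This is a rough sketch under two assumptions: that AIPerf keeps exactly `c` requests in flight for the whole sweep, and that p50 is a reasonable stand-in for mean latency. The helper name `implied_rps` is hypothetical.

```python
# Mean without-guardrails p50 per concurrency, in ms (from the table above).
WITHOUT_P50_MS = {1: 4020, 2: 4031, 4: 4039, 8: 4052, 16: 4043, 32: 4043, 64: 7153}


def implied_rps(concurrency: int, latency_ms: float) -> float:
    """Little's law for a closed loop: throughput = in-flight requests / latency."""
    return concurrency / (latency_ms / 1000.0)


for c, p50 in WITHOUT_P50_MS.items():
    print(f"c={c:>2}: ~{implied_rps(c, p50):.1f} req/s")
```

With the means above, c=32 comes out near 7.9 req/s with no visible queueing, which is a useful lower bound on the rig's capacity when testing the saturation hypothesis.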
| 74 | + |
| 75 | +## p90 — informational only |
| 76 | + |
| 77 | +p90 is much noisier than p50 across runs. Not gateable with three samples. |
| 78 | + |
| 79 | +### Δp90, milliseconds |
| 80 | + |
| 81 | +| Run | c=1 | c=2 | c=4 | c=8 | c=16 | c=32 | c=64 | |
| 82 | +|-------|-----:|-----:|-----:|-----:|-----:|-----:|------:| |
| 83 | +| Run 1 | 1039 | 1099 | 1162 | 1025 | 911 | 604 | 3009 | |
| 84 | +| Run 2 | 1028 | 1115 | 1160 | 1262 | 783 | 641 | 1015 | |
| 85 | +| Run 3 | 1023 | 1076 | 1189 | 1085 | 1209 | 18 | 1998 | |
| 86 | + |
| 87 | +## Observations |
| 88 | + |
| 89 | +### What's stable enough to gate on |
| 90 | + |
| 91 | +**c=1, 2, 4, 8.** The Δp50 ranges are 3–35 ms, well under any tolerance we'd |
| 92 | +realistically write. The absolute with-guardrails p50 is even tighter (2–15 ms |
| 93 | +across three runs). This is the regime where the harness is genuinely measuring |
| 94 | +what we want: NMP+middleware overhead on top of fixed-latency mocks. |
| 95 | + |
| 96 | +### What's borderline |
| 97 | + |
| 98 | +**c=16.** Δp50 range is 9.3%. Gateable with a generous tolerance (~10%+) but |
| 99 | +adds limited signal beyond c=8. |
| 100 | + |
| 101 | +### What's not gateable |
| 102 | + |
| 103 | +**c=32.** ~5% Δp50 range (59 ms). Still bounded, but the run-to-run spread is |
| 104 | +well above the 3–35 ms seen at c=1–c=8, and the absolute numbers wobble too (68 ms range on with-guardrails p50). |
| 105 | + |
| 106 | +**c=64.** Unusable. Δp50 swings from +778 to −2896 across three runs. |
| 107 | +The likely cause is saturation of the app mock's 4 workers at this load level: the |
| 108 | +without-guardrails path fires app requests as fast as it can and the mock queues |
| 109 | +unpredictably. The with-guardrails path's CS-mock work paces requests enough to |
| 110 | +hide most of this. This looks like a test-rig artifact, not an NMP behavior. |
| 111 | + |
| 112 | +### Side observation: middleware overhead is small |
| 113 | + |
| 114 | +Of the ~1029 ms Δp50 at c=1: |
| 115 | +- ~1000 ms is the two content-safety mock round-trips (0.5s each, mandatory). |
| 116 | +- ~29 ms is the middleware's *own* work (rails orchestration, request/response |
| 117 | +  shaping, etc.) plus any difference in bare NMP+IGW overhead relative to the without-guardrails path. |
| 118 | + |
| 119 | +The without-guardrails baseline of ~4020 ms at c=1 against a 4000 ms mock means |
| 120 | +**bare NMP+IGW+shim overhead is ~20 ms** at idle. |
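
The split above is plain subtraction against the mock latencies. A small sketch, assuming the in-repo mock defaults listed under "Hardware / setup" (two 500 ms content-safety round-trips, 4000 ms app mock); the function name `decompose` is hypothetical.

```python
def decompose(delta_p50_ms: float, without_p50_ms: float,
              cs_round_trips: int = 2, cs_latency_ms: float = 500.0,
              app_latency_ms: float = 4000.0) -> dict:
    """Split measured p50 values into mock time and residual overhead (ms)."""
    mock_time = cs_round_trips * cs_latency_ms
    return {
        # Time spent waiting on the content-safety mock.
        "cs_mock_ms": mock_time,
        # What remains of the delta after removing mandatory mock time.
        "middleware_ms": delta_p50_ms - mock_time,
        # Baseline path overhead on top of the app mock's fixed sleep.
        "base_overhead_ms": without_p50_ms - app_latency_ms,
    }


print(decompose(delta_p50_ms=1029, without_p50_ms=4020))
```

For the c=1 means this yields 1000 ms of mock time, 29 ms of middleware residual, and 20 ms of baseline overhead.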
| 121 | + |
| 122 | +## Recommendation for the CI gate |
| 123 | + |
| 124 | +Based on the variance data above: |
| 125 | + |
| 126 | +| Concurrency | Gate Δp50? | Gate absolute with-guardrails p50? | Notes | |
| 127 | +|---|---|---|---| |
| 128 | +| 1 | yes | yes | tightest signal | |
| 129 | +| 2 | yes | yes | | |
| 130 | +| 4 | yes | yes | | |
| 131 | +| 8 | yes | yes | | |
| 132 | +| 16 | informational | informational | record but don't fail | |
| 133 | +| 32 | informational | informational | record but don't fail | |
| 134 | +| 64 | exclude | exclude | mock saturation, not gateable | |
| 135 | + |
| 136 | +Proposed tolerance bands (`max(absolute_ms, relative_%)`): |
| 137 | +- Δp50: `max(±100 ms, ±5%)` |
| 138 | +- with-guardrails p50: `max(±150 ms, ±3%)` |
| 139 | + |
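| 140 | +At the gated concurrencies (c=1–c=8) the Δp50 band works out to ±100 ms, ~3× the largest observed range (35 ms); the with-guardrails band works out to roughly ±150 ms (3% of ~5100 ms), ~10× the largest observed range (15 ms). |
| 141 | +Both leave headroom for CI hardware being noisier than a quiet laptop. |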
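
One way to read the `max(absolute_ms, relative_%)` notation is as a symmetric band around a stored baseline, where the wider of the two tolerances wins. The sketch below is an illustration under that assumption, not the harness's actual gate: `within_band`, `BANDS`, and `GATED` are hypothetical names, and applying the percentage to the baseline (rather than to the measurement) is a design choice made here.

```python
# (absolute tolerance in ms, relative tolerance in %) per gated metric.
BANDS = {
    "delta_p50": (100.0, 5.0),
    "with_guardrails_p50": (150.0, 3.0),
}

# Concurrency levels proposed for hard gating; 16 and 32 are informational.
GATED = (1, 2, 4, 8)


def within_band(measured_ms: float, baseline_ms: float,
                abs_tol_ms: float, rel_tol_pct: float) -> bool:
    """True if measured is within max(abs_tol, rel_tol% of baseline) of baseline."""
    allowed = max(abs_tol_ms, abs(baseline_ms) * rel_tol_pct / 100.0)
    return abs(measured_ms - baseline_ms) <= allowed


# Example: a c=8 Δp50 of 1104 ms against the 1093 ms local mean passes;
# a value of 1260 ms (the c=32 level) would fail the same band.
print(within_band(1104, 1093, *BANDS["delta_p50"]))
print(within_band(1260, 1093, *BANDS["delta_p50"]))
```

Because the band is symmetric, it also flags results that come in unexpectedly fast, which can be useful for catching a rail that silently stopped running.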
| 142 | + |
| 143 | +## Open questions / followups |
| 144 | + |
| 145 | +- **Local baselines won't transfer to CI hardware.** These numbers should seed |
| 146 | + the baseline file but be replaced once we have N runs from the actual CI |
| 147 | + runner class. |
| 148 | +- **Three samples is a small N.** Worth one more local run (Run 4) before we |
| 149 | + treat the means above as canonical, but the c=1–c=8 numbers are unlikely |
| 150 | + to budge meaningfully. |
| 151 | +- **c=64 instability is downstream of NMP.** Hypothesis: app mock's 4 workers |
| 152 | +  saturate at concurrency 64 (ceiling unmeasured; c=32 still sustains ~8 RPS). Easy to test by |
| 153 | + running with `--mock-workers 16`. Not blocking the gate work since c=64 is |
| 154 | + excluded anyway. |