-
Notifications
You must be signed in to change notification settings - Fork 6
feat(guardrails): Update benchmark job; add latency analysis step #359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
9addf61
f78ffa8
78b7730
2acf55b
f0fef0a
98f5d45
8a55a3a
dcaf12f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -15,9 +15,11 @@ benchmark modules with `PYTHONPATH` pointed at that checkout. | |||||
| plugins/nemo-guardrails/benchmarks/ | ||||||
| configs/ | ||||||
| nmp_igw_guardrails_sweep_concurrency.yaml # AIPerf sweep template | ||||||
| mock_llm/ # in-repo mock LLM env files | ||||||
| artifacts/ # per-run outputs (gitignored) | ||||||
| plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/ | ||||||
| run.py # entrypoint: `python -m nemo_guardrails_plugin.benchmarks.run` | ||||||
| analyze.py # post-run analysis; checks latencies against baseline values | ||||||
| paths.py # filesystem layout | ||||||
| constants.py # workspace / VM / provider names | ||||||
| processes.py # subprocess supervision (process groups + ExitStack) | ||||||
|
|
@@ -181,15 +183,76 @@ plugins/nemo-guardrails/benchmarks/artifacts/runs/<timestamp>/ | |||||
|
|
||||||
| ## CI | ||||||
|
|
||||||
| A `benchmark-guardrails` job in `.github/workflows/ci.yaml` checks out both | ||||||
| this repo and `NVIDIA/NeMo-Guardrails`, runs `make bootstrap-python` and | ||||||
| `make benchmark-guardrails`, and uploads the per-run artifacts directory | ||||||
| (`logs/`, `generated/`, `aiperf_results/`) on success or failure. | ||||||
| Two jobs in `.github/workflows/ci.yaml`: | ||||||
|
|
||||||
| Pass/fail is driven by the harness's exit code, which is non-zero if `aiperf` | ||||||
| itself exits non-zero or any sweep returns a non-zero exit code. No latency | ||||||
| thresholds are enforced — those can be layered on later by a separate | ||||||
| analyzer that reads the per-sweep CSVs. | ||||||
| - `guardrails-benchmark` — matrix of two parallel jobs, one per variant | ||||||
| (`with-guardrails`, `without-guardrails`), each on its own NMP instance. | ||||||
| Uploads per-variant artifacts (`logs/`, `generated/`, `aiperf_results/`). | ||||||
| - `guardrails-benchmark-analyze` — joins the two matrix jobs, downloads both | ||||||
| artifacts, prints a side-by-side comparison via | ||||||
| `nemo_guardrails_plugin.benchmarks.analyze`, and runs the baseline check | ||||||
| (see below). Fails the build on a latency regression beyond tolerance. The | ||||||
| analyzer is stdlib-only by design, so this job runs on the runner's stock | ||||||
| `python3` without bootstrapping the uv workspace. | ||||||
|
|
||||||
| ### Baseline and gating | ||||||
|
|
||||||
| CI compares the run's delta_p50 (with-guardrails minus without-guardrails | ||||||
| p50, in ms) against a checked-in baseline. The baseline lives as | ||||||
| module-level constants in: | ||||||
|
|
||||||
| ```text | ||||||
| plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/analyze.py | ||||||
| ``` | ||||||
|
|
||||||
| Why only delta_p50 (and not absolute with-guardrails p50)? delta_p50 | ||||||
| isolates the middleware's contribution — shared CI runner noise cancels | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This is not quite true because in CI, the two jobs could run on separate runners (machines). If runnerA happens to be slow and B fast, the delta is inflated (and vice versa). So the inter-runner noise is actually not cancelled. |
||||||
| across the two variants. | ||||||
|
|
||||||
| #### Baseline constants | ||||||
|
|
||||||
| - `CONCURRENCIES_TO_VALIDATE: list[int]` — concurrency levels to gate on. | ||||||
| Other levels still appear in the analyzer's output tables, but pass/fail | ||||||
| is decided only by these. | ||||||
| - `DEFAULT_DELTA_P50_TOLERANCE_MS: int` — default tolerance (in ms) applied | ||||||
| to every validated concurrency. A check fails when | ||||||
| `|observed - baseline| > tolerance`. | ||||||
| - `DELTA_P50_TOLERANCE_OVERRIDES_MS: dict[int, int]` — per-concurrency | ||||||
| tolerance overrides (in ms). Levels without an override fall back to the | ||||||
| default. | ||||||
| - `DELTA_P50_BASELINE_BY_CONCURRENCY: dict[int, int]` — expected delta_p50 | ||||||
| (in ms) per concurrency level. Edit by hand when a real change shifts | ||||||
| the numbers. | ||||||
|
|
||||||
| Worked example: at c=16 the override is 200 ms, so a run with observed | ||||||
| delta_p50 = 1689 (diff +199 from baseline 1390) passes; observed | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the math is off... |
||||||
| delta_p50 = 1691 (diff +201) fails. | ||||||
|
|
||||||
| Notes on the current values: | ||||||
|
|
||||||
| - c=16 and c=32 use wider tolerances than the default because their | ||||||
| absolute delta_p50 is larger. Over time, we can tighten these values | ||||||
| if latencies in CI produce less variance. | ||||||
| - Any change to mock-LLM latencies, the guardrails config, or the runner | ||||||
| class invalidates the current baseline values. The benchmark should be | ||||||
| re-run in CI several tiems to establish updated baseline values. | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
|
||||||
| #### Running the analyzer locally | ||||||
|
|
||||||
| ```bash | ||||||
| python3 plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/analyze.py \ | ||||||
| plugins/nemo-guardrails/benchmarks/artifacts/runs/<run-id> | ||||||
| ``` | ||||||
|
|
||||||
| Local runs print both tables and the baseline-check table. | ||||||
| CI passes `--strict` to make any out-of-tolerance check fail the job. | ||||||
|
|
||||||
| #### Updating the baseline | ||||||
|
|
||||||
| When a real change shifts the numbers (ex. a deliberate middleware change, | ||||||
| a mock-LLM config change, or a runner-class change), edit the constants at | ||||||
| the top of `analyze.py` by hand and reference the PR / CI run that | ||||||
| justifies it in the commit. | ||||||
|
|
||||||
| ## Cleanup | ||||||
|
|
||||||
|
|
||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| # Mock LLM configurations | ||
|
|
||
| These `.env` files configure the behavior of the mock LLMs, used by the upstream | ||
| `nemo-guardrails` library's `benchmark.mock_llm_server.run_server`. | ||
|
|
||
| The library stores these files, but we keep our own copies so: | ||
|
|
||
| - We can change mock latency without touching the upstream repo. | ||
| - The exact mock behavior we benchmarked against is versioned alongside the | ||
| results, so historical numbers stay reproducible even if upstream changes | ||
| its defaults. | ||
|
|
||
| Mapping to upstream files: | ||
| - `app-llm.env` ← upstream `meta-llama-3.3-70b-instruct.env` | ||
| - `content-safety-llm.env` ← upstream `nvidia-llama-3.1-nemoguard-8b-content-safety.env` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| MODEL="meta/llama-3.3-70b-instruct" | ||
| UNSAFE_PROBABILITY=0.0 | ||
| UNSAFE_TEXT="I can't help with that. Is there anything else I can assist you with?" | ||
| SAFE_TEXT="I can provide information and help with a wide range of topics, from science and history to entertainment and culture. I can also help with language-related tasks, such as translation and text summarization. However, I can't assist with requests that involve harm or illegal activities." | ||
| # End-to-end latency | ||
| E2E_LATENCY_MIN_SECONDS=4.0 | ||
| E2E_LATENCY_MAX_SECONDS=4.0 | ||
| E2E_LATENCY_MEAN_SECONDS=4.0 | ||
| E2E_LATENCY_STD_SECONDS=0.0 | ||
| # Streaming latency: Time to First Token (TTFT) | ||
| TTFT_MIN_SECONDS=0.3 | ||
| TTFT_MAX_SECONDS=0.3 | ||
| TTFT_MEAN_SECONDS=0.3 | ||
| TTFT_STD_SECONDS=0.0 | ||
| # Streaming latency: Chunk Latency (ITL) | ||
| CHUNK_LATENCY_MIN_SECONDS=0.015 | ||
| CHUNK_LATENCY_MAX_SECONDS=0.015 | ||
| CHUNK_LATENCY_MEAN_SECONDS=0.015 | ||
| CHUNK_LATENCY_STD_SECONDS=0.0 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| MODEL="nvidia/llama-3.1-nemoguard-8b-content-safety" | ||
| UNSAFE_PROBABILITY=0.0 | ||
| UNSAFE_TEXT="{\"User Safety\": \"unsafe\", \"Response Safety\": \"unsafe\", \"Safety Categories\": \"Violence, Criminal Planning/Confessions\"}" | ||
| SAFE_TEXT="{\"User Safety\": \"safe\", \"Response Safety\": \"safe\"}" | ||
| # End-to-end latency | ||
| E2E_LATENCY_MIN_SECONDS=0.5 | ||
| E2E_LATENCY_MAX_SECONDS=0.5 | ||
| E2E_LATENCY_MEAN_SECONDS=0.5 | ||
| E2E_LATENCY_STD_SECONDS=0.0 | ||
| # Streaming latency: Time to First Token (TTFT) | ||
| TTFT_MIN_SECONDS=0.2 | ||
| TTFT_MAX_SECONDS=0.2 | ||
| TTFT_MEAN_SECONDS=0.2 | ||
| TTFT_STD_SECONDS=0.0 | ||
| # Streaming latency: Chunk Latency (ITL) | ||
| CHUNK_LATENCY_MIN_SECONDS=0.015 | ||
| CHUNK_LATENCY_MAX_SECONDS=0.015 | ||
| CHUNK_LATENCY_MEAN_SECONDS=0.015 | ||
| CHUNK_LATENCY_STD_SECONDS=0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.