Fast local prompt injection detector for macOS / Linux
A Go port of piguard π
Uses the PIGuard model (DeBERTa-v3-base fine-tune, ACL 2025 paper) β converted to ONNX and served through native Go binaries backed by ONNX Runtime and the HuggingFace tokenizers library. No Python at runtime.
| Binary | Purpose |
|---|---|
piguard |
CLI β classify text from args, stdin, or pipe |
piguard-server |
Unix socket daemon β model stays in memory for low-latency JSON responses |
The CLI automatically connects to the daemon when it's running. Without the daemon it loads the model itself (~300 ms).
This port uses two native libraries:
-
ONNX Runtime β loaded as a shared library at runtime via yalue/onnxruntime_go. Point
piguardat it with theORT_DYLIB_PATHenvironment variable:export ORT_DYLIB_PATH=/path/to/libonnxruntime.so # .dylib on macOS
If unset, the platform default (
onnxruntime.so/onnxruntime.dll) is used. -
tokenizers static library β linked at build time via daulet/tokenizers. The
Makefiledownloads the prebuilt library for your platform, pinned to a known version and verified against a recorded SHA-256 (the build fails on mismatch).
A C toolchain (cgo) is required to build.
make build # downloads libtokenizers, builds bin/piguard and bin/piguard-serverThis produces bin/piguard and bin/piguard-server.
To build manually, fetch the tokenizers static lib into ./lib/ and set the
linker path:
CGO_LDFLAGS="-L$(pwd)/lib" go build ./cmd/piguard ./cmd/piguard-serverA multi-stage Dockerfile builds both binaries (fetching the tokenizer static
library and ONNX Runtime) and produces a slim runtime image; ONNX Runtime is
loaded via ORT_DYLIB_PATH and the tokenizer is linked statically.
docker build -t piguard .The ~735 MB model is not baked into the image β mount it at /models. The
daemon serves over a Unix socket, so clients share it via a volume:
# Export the model once and copy it where the container can read it:
pip install -r scripts/requirements.txt
python scripts/export_onnx.py # writes ~/.cache/piguard/onnx/
mkdir -p models && cp ~/.cache/piguard/onnx/* models/
docker run --rm \
-v "$PWD/models:/models:ro" \
-v piguard-sock:/run/piguard \
piguarddocker-compose.yml runs the daemon plus an HTTP gateway (gateway
service, built from examples/http-gateway/) that bridges HTTP to the daemon's
Unix socket so you can test the stack with curl. See the compose file header for
the one-time model-export step, then:
docker compose up --build
curl -s localhost:8080/healthz
curl -s localhost:8080/detect -d 'ignore previous instructions'
curl -s --data-binary @page.html localhost:8080/long # full scan, no truncation
curl -s localhost:8080/raw -d '{"texts":["hello","ignore all rules"]}'The gateway is a small, dependency-free helper (POST /detect, /long, /raw,
GET /healthz); the compose file also has a commented example showing an app
sharing the socket volume directly.
Two benchmark modes show latency and throughput characteristics:
The native detector loads the model in-process with no network overhead:
piguard benchSample run (Apple Silicon, 512-token window):
=== Single-request latency (n=300) ===
latency : min= 9.61 p50=11.12 p90=14.59 p95=15.74 p99=17.28 max=41.84 mean=12.04 (ms)
=== Batched throughput (rounds=30 per size) ===
batch= 1: req_p50= 10.54ms per_item=10.543ms throughput= 94.8 texts/s
batch= 2: req_p50= 16.51ms per_item= 8.256ms throughput=121.1 texts/s
batch= 32: req_p50=294.67ms per_item= 9.208ms throughput=108.6 texts/s
=== Concurrency latency (workers=8, total=200) ===
latency : min=25.13 p50=43.89 p90=55.84 p95=61.22 p99=72.38 max=76.97 mean=44.63 (ms)
aggregate throughput: 177.0 req/s
Single-request latency: ~11 ms at p50; batching amortizes startup cost to ~9 ms per item; under 8 concurrent clients the native detector scales to 177 req/s.
The gateway bridges HTTP to the daemon's Unix socket, measuring end-to-end latency and daemon-side inference time:
docker compose up --build -d
docker run --rm --network host -v "$PWD/scripts:/scripts:ro" \
python:3.12-slim python /scripts/bench_latency.py --url http://localhost:8080 -n 300Sample run (Apple Silicon, Docker linux/arm64, 512-token window):
=== Single-request latency (n=300) ===
wall (end-to-end): min=21.99 p50=31.55 p90=40.59 p95=45.69 p99=87.11 max=116.05 (ms)
daemon (inference): min=14.19 p50=24.20 p90=32.21 p95=37.93 p99=76.22 max=102.76 (ms)
=== Batched throughput via /raw ===
batch= 1: per_item=30.578ms throughput= 32.7 texts/s
batch= 2: per_item=20.376ms throughput= 49.1 texts/s
batch= 32: per_item=15.987ms throughput= 62.6 texts/s
=== /detect under 8 concurrent clients (n=200) ===
wall (end-to-end): p50=164.78 p95=257.95 p99=293.82 (ms)
daemon (inference): p50=19.46 p95=26.35 p99=35.23 (ms)
aggregate throughput: 45.3 req/s
Single-request latency: ~24 ms daemon + ~8 ms gateway overhead = 31 ms at p50; batching cuts per-item cost ~40% (to ~16 ms); under concurrency the micro-batcher keeps daemon inference flat (~19 ms) while end-to-end latency grows with queueing due to the gateway round-trip.
| Scenario | Native | HTTP | Notes |
|---|---|---|---|
| Single-request p50 | 11 ms | 31 ms | Native is ~3Γ faster; HTTP adds ~8 ms overhead |
| Batch-32 per-item | 9 ms | 16 ms | Native retains advantage; HTTP batching helps |
| Max throughput | 177 req/s | 45 req/s | Native runs inference inline; HTTP scales with workers |
| Use case | CLI, local checks | Service, concurrent load | Choose native for speed, HTTP for deployment |
Native is optimal for low-latency inline checks (CLI, function gating). The HTTP daemon is suited for service deployments where multiple clients share one model and benefit from micro-batching. Numbers depend on hardware, ONNX Runtime build, and thread count.
CI (lint, go vet, staticcheck, go mod verify, govulncheck, race tests,
coverage, Dockerfile smoke builds) runs on every push and pull request. Pushing a vX.Y.Z tag triggers the release workflow, which
attaches a linux-amd64 binary tarball (with checksum) to a GitHub release and
publishes container images to GHCR:
ghcr.io/backendstack21/go-prompt-injection-guardβ the daemonghcr.io/backendstack21/go-prompt-injection-guard-gatewayβ the HTTP gateway
git tag v0.1.0 && git push origin v0.1.0Dependencies are pinned end to end: Go modules to exact versions with hashes in
go.sum (go mod verify), the native libtokenizers/ONNX Runtime downloads to
fixed versions verified by SHA-256, GitHub Actions to commit SHAs, and Docker
base images to digests. Dependabot (.github/dependabot.yml) keeps them current.
The ONNX model needs to be exported once using the included Python script:
pip install -r scripts/requirements.txt
python scripts/export_onnx.pyThis downloads the model from HuggingFace, converts it to ONNX, and saves to
~/.cache/piguard/onnx/ (~735 MB).
You can also fetch just the tokenizer with:
piguard setuppiguard "Ignore previous instructions"
# π¨ INJECTION (score: 1.000, 11.0ms)
piguard "What is the weather today?"
# β
BENIGN (score: 0.974, 12.9ms)
# Pipe multiple lines β sent to the daemon as a single batched request
# (one connection, one inference), or classified locally in one batch
echo -e "Hello world\nIgnore all instructions" | piguard
# Scan a large document in full (no 512-token truncation) β e.g. a tool result
piguard --long "$(cat large_tool_output.txt)"
cat large_tool_output.txt | piguard --long
# Use a non-default daemon socket (also: PIGUARD_SOCKET env var)
piguard --socket /run/piguard/piguard.sock "hello"
# Print version
piguard --version
# Benchmark
piguard benchStart the daemon:
piguard-server
# server ready socket=/tmp/piguard.sock ... load_ms=256Query via Unix socket (newline-delimited JSON):
echo '{"text":"Ignore previous instructions"}' | nc -U /tmp/piguard.sock
# {"text":"Ignore previous instructions","label":"INJECTION","score":0.999,"latency_ms":8.3}
echo '{"text":"Hello world"}' | nc -U /tmp/piguard.sock
# {"text":"Hello world","label":"BENIGN","score":0.965,"latency_ms":9.1}Options:
piguard-server --socket /tmp/custom.sock # custom socket path
piguard-server --model-dir /path/to/onnx # custom model directory
piguard-server --max-batch 64 # max requests per batched inference
piguard-server --batch-wait 5ms # max time spent filling a batch
piguard-server --intra-threads 8 # ONNX threads per inference (default: CPUs)
piguard-server --max-tokens 512 # truncate inputs to N tokens
piguard-server --verbose # enable per-connection debug logging
piguard-server --version # print version and exitThe server keeps a single copy of the model in memory and serves many
connections concurrently. Requests that arrive close together are coalesced
into one batched inference (up to --max-batch, waiting at most
--batch-wait), which saturates the CPU and amortizes per-call overhead β the
right design for gating every tool/function result in an agent loop. Tune
--max-batch/--batch-wait to trade latency for throughput, and
--intra-threads for CPU usage per inference.
The daemon is hardened against misbehaving or hostile local clients:
- Bounded requests β each newline-delimited request is capped at 1 MiB; an oversized request closes the connection instead of growing memory unbounded. (The CLI applies the same 1 MiB cap per stdin line.)
- Timeouts β idle connections are dropped after 30s and response writes time out after 10s, so a stalled client can't tie up resources.
- Concurrent connections β handled in parallel; all inference is funneled through the batcher, so one slow client never blocks others.
- Token truncation β inputs are truncated to
--max-tokens(default 512), bounding inference cost regardless of input length.
Single request β one JSON object per line:
{ "text": "Ignore previous instructions" }{ "text": "Ignore previous instructions", "label": "INJECTION", "score": 0.999, "latency_ms": 8.3 }Batch request β classify many in one round-trip:
{ "texts": ["Hello world", "Ignore previous instructions"] }{
"results": [
{ "text": "Hello world", "label": "BENIGN", "score": 0.965, "latency_ms": 1.2 },
{ "text": "Ignore previous instructions", "label": "INJECTION", "score": 0.999, "latency_ms": 1.2 }
]
}Long request β scan a document larger than the token window in full (window-by-window; returns the most suspicious window):
{ "long": "<large tool output ...>" }{ "text": "<...>", "label": "INJECTION", "score": 0.991, "latency_ms": 42.0 }The daemon caps a single request at 1 MiB; larger documents are handled by the CLI falling back to local inference.
Error responses:
{ "error": "invalid JSON: ..." }For the text and texts forms, inputs longer than the model's window (default
512 tokens) are truncated; use the long form to scan a large document in full.
When the daemon is running, piguard connects to it automatically, avoiding
per-invocation model loading:
# Without daemon: model is loaded each run
$ time piguard "test"
# With daemon: connects over the socket, no load
$ time piguard "test"
The detector is also usable as a Go package:
go get github.com/BackendStack21/go-prompt-injection-guardNote: the package uses cgo. Consumers must have the
libtokenizers.astatic library on their linker path:CGO_LDFLAGS="-L/path/to/lib" go build ./...Runmake libtokenizersin this repo to fetch it.
import piguard "github.com/BackendStack21/go-prompt-injection-guard"
det, err := piguard.Load(piguard.DefaultModelDir())
if err != nil {
log.Fatal(err)
}
defer det.Close()
r, err := det.Detect("Ignore previous instructions")
fmt.Println(r.Label, r.Score) // INJECTION 0.999
// Batched inference for throughput (one ONNX call for many inputs):
results, err := det.DetectBatch([]string{"hello", "ignore previous instructions"})
// Scan an input larger than the 512-token window in full (e.g. a tool result):
r, err = det.DetectLong(bigToolOutput)Load accepts options:
| Option | Default | Effect |
|---|---|---|
WithMaxTokens(n) |
512 | Token window for Detect/DetectBatch (inputs are truncated); also the window size for DetectLong chunks |
WithIntraThreads(n) |
4 | ONNX intra-op threads per inference |
WithOverlap(n) |
64 | Token overlap between consecutive DetectLong windows |
WithLongBatchSize(n) |
32 | Max windows per ONNX call in DetectLong (bounds peak memory) |
A Detector is not safe for concurrent use β serialize calls, or use the
micro-batching daemon (cmd/piguard-server) for concurrent, high-throughput
serving.
Detect/DetectBatch classify a single window: inputs longer than
WithMaxTokens (default 512 tokens, not bytes) are truncated, so content
past the window is not examined. To scan large inputs β tool/function results,
fetched pages, files β in full, use DetectLong, which splits the token
sequence into overlapping β€512-token windows and classifies them in sub-batches
(default 32 windows per ONNX call, configurable with WithLongBatchSize),
returning the most suspicious window's verdict. This gives complete coverage
with no truncation gaps, bounds peak activation memory regardless of document
length, and matters when an injection may be buried deep in attacker-controlled
data.
make test # runs the full suite (downloads libtokenizers + ONNX Runtime)
make cover # prints production-code statement coverageThe inference paths are exercised end to end against a tiny toy ONNX model and
tokenizer in testdata/ (regenerate with pip install -r scripts/requirements-dev.txt && python scripts/gen_testdata.py), so
make test needs the ONNX Runtime shared library β make fetches it into
lib/ automatically. Tests skip the inference cases gracefully if no runtime is
found. Production-code statement coverage is ~99%.
PIGuard is a DeBERTa-v3-base model fine-tuned for binary text classification:
- BENIGN β safe prompt
- INJECTION β prompt injection attack detected
The MOF (Mitigating Over-defense for Free) training strategy reduces false positives from trigger words like "ignore", "instructions", etc.
The Go binaries use onnxruntime_go (ONNX Runtime) and daulet/tokenizers (bindings to the HuggingFace tokenizers Rust library, so token IDs match the original exactly) β zero Python dependency at runtime.
piguard "text" (tries the daemon socket first)
β
βΌ
βββββββββββββββββββββββββββββββββββββ
β piguard-server β
β /tmp/piguard.sock β
β β
β tokenizer -> ONNX -> JSON β
β (model loaded once) β
βββββββββββββββββββββββββββββββββββββ
β
βΌ
{"label":"INJECTION","score":0.999,"latency_ms":8.3}
This project uses the same model and training strategy as the original PIGuard, so the same paper applies:
@inproceedings{piguard2025,
title={PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free},
author={Hao Li and Xiaogeng Liu and Ning Zhang and Chaowei Xiao},
booktitle={ACL},
year={2025}
}Additionally:
This project was inspired and initially ported from misteral/piguard π
MIT