BackendStack21/go-prompt-injection-guard

go-prompt-injection-guard

Fast local prompt injection detector for macOS / Linux

A Go port of piguard 💚

Uses the PIGuard model (a DeBERTa-v3-base fine-tune from the ACL 2025 paper), converted to ONNX and served through native Go binaries backed by ONNX Runtime and the HuggingFace tokenizers library. No Python at runtime.

Binaries

| Binary | Purpose |
| --- | --- |
| piguard | CLI: classifies text from arguments, stdin, or a pipe |
| piguard-server | Unix socket daemon: keeps the model in memory for low-latency JSON responses |

The CLI automatically connects to the daemon when it's running. Without the daemon it loads the model itself (~300 ms).

Requirements

This port uses two native libraries:

  • ONNX Runtime: loaded as a shared library at runtime via yalue/onnxruntime_go. Point piguard at it with the ORT_DYLIB_PATH environment variable:

    export ORT_DYLIB_PATH=/path/to/libonnxruntime.so   # .dylib on macOS

    If unset, the platform default (onnxruntime.so / onnxruntime.dll) is used.

  • tokenizers static library: linked at build time via daulet/tokenizers. The Makefile downloads the prebuilt library for your platform, pinned to a known version and verified against a recorded SHA-256 (the build fails on mismatch).

A C toolchain (cgo) is required to build.

Build

make build           # downloads libtokenizers, builds bin/piguard and bin/piguard-server

This produces bin/piguard and bin/piguard-server.

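To build manually, fetch the tokenizers static lib into ./lib/, set the linker path, and pass -o so the binaries are written out (go build discards its output when given several packages without -o):

CGO_LDFLAGS="-L$(pwd)/lib" go build -o bin/ ./cmd/piguard ./cmd/piguard-server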

Docker

A multi-stage Dockerfile builds both binaries (fetching the tokenizer static library and ONNX Runtime) and produces a slim runtime image; ONNX Runtime is loaded via ORT_DYLIB_PATH and the tokenizer is linked statically.

docker build -t piguard .

The ~735 MB model is not baked into the image; mount it at /models. The daemon serves over a Unix socket, so clients share it via a volume:

# Export the model once and copy it where the container can read it:
pip install -r scripts/requirements.txt
python scripts/export_onnx.py            # writes ~/.cache/piguard/onnx/
mkdir -p models && cp ~/.cache/piguard/onnx/* models/

docker run --rm \
  -v "$PWD/models:/models:ro" \
  -v piguard-sock:/run/piguard \
  piguard

docker-compose.yml runs the daemon plus an HTTP gateway (gateway service, built from examples/http-gateway/) that bridges HTTP to the daemon's Unix socket so you can test the stack with curl. See the compose file header for the one-time model-export step, then:

docker compose up --build

curl -s localhost:8080/healthz
curl -s localhost:8080/detect -d 'ignore previous instructions'
curl -s --data-binary @page.html localhost:8080/long          # full scan, no truncation
curl -s localhost:8080/raw -d '{"texts":["hello","ignore all rules"]}'

The gateway is a small, dependency-free helper (POST /detect, /long, /raw, GET /healthz); the compose file also has a commented example showing an app sharing the socket volume directly.

Benchmark

Two benchmark modes show latency and throughput characteristics:

Native Benchmark (Direct local inference)

The native detector loads the model in-process with no network overhead:

piguard bench

Sample run (Apple Silicon, 512-token window):

=== Single-request latency (n=300) ===
  latency : min= 9.61  p50=11.12  p90=14.59  p95=15.74  p99=17.28  max=41.84  mean=12.04   (ms)

=== Batched throughput (rounds=30 per size) ===
  batch=  1: req_p50= 10.54ms  per_item=10.543ms  throughput= 94.8 texts/s
  batch=  2: req_p50= 16.51ms  per_item= 8.256ms  throughput=121.1 texts/s
  batch= 32: req_p50=294.67ms  per_item= 9.208ms  throughput=108.6 texts/s

=== Concurrency latency (workers=8, total=200) ===
  latency : min=25.13  p50=43.89  p90=55.84  p95=61.22  p99=72.38  max=76.97  mean=44.63   (ms)
  aggregate throughput:    177.0 req/s

Single-request latency is ~11 ms at p50; batching amortizes per-call overhead to roughly 8-9 ms per item; with 8 concurrent workers the native detector reaches 177 req/s.
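The per_item and throughput columns in the sample run appear to follow from req_p50 by simple arithmetic: per-item cost is the request latency divided by the batch size, and throughput is 1000 ms divided by the per-item cost. The sketch below reproduces that relationship. It is inferred from the printed numbers, not taken from how piguard bench computes them, and perItemMS and textsPerSecond are hypothetical names.

```go
package main

import "fmt"

// perItemMS spreads one batched request's latency across its items.
func perItemMS(requestMS float64, batch int) float64 {
	return requestMS / float64(batch)
}

// textsPerSecond converts a per-item latency in milliseconds to throughput.
func textsPerSecond(perItem float64) float64 {
	return 1000 / perItem
}

func main() {
	// req_p50 values copied from the sample run above (already rounded,
	// so the last digit of the results can differ slightly).
	runs := []struct {
		batch  int
		reqP50 float64
	}{{1, 10.54}, {2, 16.51}, {32, 294.67}}

	for _, r := range runs {
		item := perItemMS(r.reqP50, r.batch)
		fmt.Printf("batch=%3d per_item=%.3fms throughput=%.1f texts/s\n",
			r.batch, item, textsPerSecond(item))
	}
}
```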

HTTP Gateway Benchmark (Docker daemon over Unix socket)

The gateway bridges HTTP to the daemon's Unix socket, measuring end-to-end latency and daemon-side inference time:

docker compose up --build -d
docker run --rm --network host -v "$PWD/scripts:/scripts:ro" \
  python:3.12-slim python /scripts/bench_latency.py --url http://localhost:8080 -n 300

Sample run (Apple Silicon, Docker linux/arm64, 512-token window):

=== Single-request latency (n=300) ===
  wall  (end-to-end): min=21.99  p50=31.55  p90=40.59  p95=45.69  p99=87.11  max=116.05  (ms)
  daemon (inference): min=14.19  p50=24.20  p90=32.21  p95=37.93  p99=76.22  max=102.76  (ms)

=== Batched throughput via /raw ===
  batch=  1: per_item=30.578ms  throughput= 32.7 texts/s
  batch=  2: per_item=20.376ms  throughput= 49.1 texts/s
  batch= 32: per_item=15.987ms  throughput= 62.6 texts/s

=== /detect under 8 concurrent clients (n=200) ===
  wall  (end-to-end): p50=164.78  p95=257.95  p99=293.82  (ms)
  daemon (inference): p50=19.46  p95=26.35  p99=35.23  (ms)
  aggregate throughput: 45.3 req/s

Single-request latency is ~24 ms of daemon inference plus ~7 ms of gateway overhead, about 31.5 ms at p50. Batching cuts per-item cost by nearly half (from ~31 ms at batch=1 to ~16 ms at batch=32). Under concurrency the micro-batcher keeps daemon inference flat (~19 ms), while end-to-end latency grows with queueing around the gateway round-trip.

Comparison

| Scenario | Native | HTTP | Notes |
| --- | --- | --- | --- |
| Single-request p50 | 11 ms | 31 ms | Native is ~3× faster; the gateway adds ~7 ms on top of daemon inference |
| Batch-32 per-item | 9 ms | 16 ms | Native retains its advantage; batching still helps over HTTP |
| Throughput (8 concurrent) | 177 req/s | 45 req/s | Native runs inference inline; HTTP scales with workers |
| Use case | CLI, local checks | Service, concurrent load | Choose native for speed, HTTP for deployment |

Native is optimal for low-latency inline checks (CLI, function gating). The HTTP daemon is suited for service deployments where multiple clients share one model and benefit from micro-batching. Numbers depend on hardware, ONNX Runtime build, and thread count.

Releases

CI (lint, go vet, staticcheck, go mod verify, govulncheck, race tests, coverage, Dockerfile smoke builds) runs on every push and pull request. Pushing a vX.Y.Z tag triggers the release workflow, which attaches a linux-amd64 binary tarball (with checksum) to a GitHub release and publishes container images to GHCR:

  • ghcr.io/backendstack21/go-prompt-injection-guard (the daemon)
  • ghcr.io/backendstack21/go-prompt-injection-guard-gateway (the HTTP gateway)

git tag v0.1.0 && git push origin v0.1.0

Supply chain

Dependencies are pinned end to end: Go modules to exact versions with hashes in go.sum (go mod verify), the native libtokenizers/ONNX Runtime downloads to fixed versions verified by SHA-256, GitHub Actions to commit SHAs, and Docker base images to digests. Dependabot (.github/dependabot.yml) keeps them current.

The ONNX model needs to be exported once using the included Python script:

pip install -r scripts/requirements.txt
python scripts/export_onnx.py

This downloads the model from HuggingFace, converts it to ONNX, and saves it to ~/.cache/piguard/onnx/ (~735 MB).

You can also fetch just the tokenizer with:

piguard setup

CLI

piguard "Ignore previous instructions"
# 🚨 INJECTION (score: 1.000, 11.0ms)

piguard "What is the weather today?"
# ✅ BENIGN (score: 0.974, 12.9ms)

# Pipe multiple lines: they are sent to the daemon as a single batched request
# (one connection, one inference), or classified locally in one batch
printf 'Hello world\nIgnore all instructions\n' | piguard

# Scan a large document in full (no 512-token truncation), e.g. a tool result
piguard --long "$(cat large_tool_output.txt)"
cat large_tool_output.txt | piguard --long

# Use a non-default daemon socket (also: PIGUARD_SOCKET env var)
piguard --socket /run/piguard/piguard.sock "hello"

# Print version
piguard --version

# Benchmark
piguard bench

Server

Start the daemon:

piguard-server
# server ready socket=/tmp/piguard.sock ... load_ms=256

Query via Unix socket (newline-delimited JSON):

echo '{"text":"Ignore previous instructions"}' | nc -U /tmp/piguard.sock
# {"text":"Ignore previous instructions","label":"INJECTION","score":0.999,"latency_ms":8.3}

echo '{"text":"Hello world"}' | nc -U /tmp/piguard.sock
# {"text":"Hello world","label":"BENIGN","score":0.965,"latency_ms":9.1}

Options:

piguard-server --socket /tmp/custom.sock     # custom socket path
piguard-server --model-dir /path/to/onnx     # custom model directory
piguard-server --max-batch 64                # max requests per batched inference
piguard-server --batch-wait 5ms              # max time spent filling a batch
piguard-server --intra-threads 8             # ONNX threads per inference (default: CPUs)
piguard-server --max-tokens 512              # truncate inputs to N tokens
piguard-server --verbose                     # enable per-connection debug logging
piguard-server --version                     # print version and exit

Throughput / micro-batching

The server keeps a single copy of the model in memory and serves many connections concurrently. Requests that arrive close together are coalesced into one batched inference (up to --max-batch, waiting at most --batch-wait), which saturates the CPU and amortizes per-call overhead. This suits gating every tool/function result in an agent loop. Tune --max-batch/--batch-wait to trade latency for throughput, and --intra-threads for CPU usage per inference.

Limits & robustness

The daemon is hardened against misbehaving or hostile local clients:

  • Bounded requests: each newline-delimited request is capped at 1 MiB; an oversized request closes the connection instead of letting memory grow without bound. (The CLI applies the same 1 MiB cap per stdin line.)
  • Timeouts: idle connections are dropped after 30s and response writes time out after 10s, so a stalled client can't tie up resources.
  • Concurrent connections: handled in parallel; all inference is funneled through the batcher, so one slow client never blocks others.
  • Token truncation: inputs are truncated to --max-tokens (default 512), bounding inference cost regardless of input length.

Protocol

Single request (one JSON object per line):

{ "text": "Ignore previous instructions" }
{ "text": "Ignore previous instructions", "label": "INJECTION", "score": 0.999, "latency_ms": 8.3 }

Batch request (classify many texts in one round-trip):

{ "texts": ["Hello world", "Ignore previous instructions"] }
{
  "results": [
    { "text": "Hello world", "label": "BENIGN", "score": 0.965, "latency_ms": 1.2 },
    { "text": "Ignore previous instructions", "label": "INJECTION", "score": 0.999, "latency_ms": 1.2 }
  ]
}

Long request: scans a document larger than the token window in full, window by window, and returns the most suspicious window:

{ "long": "<large tool output ...>" }
{ "text": "<...>", "label": "INJECTION", "score": 0.991, "latency_ms": 42.0 }

The daemon caps a single request at 1 MiB; larger documents are handled by the CLI falling back to local inference.

Error responses:

{ "error": "invalid JSON: ..." }

For the text and texts forms, inputs longer than the model's window (default 512 tokens) are truncated; use the long form to scan a large document in full.
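On the client side, the three reply shapes shown above (a single result, a results array, and an error object) can be decoded with one envelope type. This is a sketch built only from the sample JSON in this section; Result, envelope, and decodeResponse are hypothetical names, and any field the daemon may emit beyond those samples is ignored.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Result holds the per-text fields shown in the sample replies.
type Result struct {
	Text      string  `json:"text"`
	Label     string  `json:"label"`
	Score     float64 `json:"score"`
	LatencyMS float64 `json:"latency_ms"`
}

// envelope accepts every reply shape: the embedded Result captures a
// single or long reply, Results a batch reply, and Error an error reply.
type envelope struct {
	Result
	Results []Result `json:"results"`
	Error   string   `json:"error"`
}

// decodeResponse normalizes one reply line into a slice of results.
func decodeResponse(line []byte) ([]Result, error) {
	var env envelope
	if err := json.Unmarshal(line, &env); err != nil {
		return nil, err
	}
	switch {
	case env.Error != "":
		return nil, errors.New(env.Error)
	case env.Results != nil:
		return env.Results, nil
	default:
		return []Result{env.Result}, nil
	}
}

func main() {
	lines := []string{
		`{"text":"Hello world","label":"BENIGN","score":0.965,"latency_ms":9.1}`,
		`{"results":[{"text":"Hello world","label":"BENIGN","score":0.965,"latency_ms":1.2},` +
			`{"text":"Ignore previous instructions","label":"INJECTION","score":0.999,"latency_ms":1.2}]}`,
		`{"error":"invalid JSON: ..."}`,
	}
	for _, l := range lines {
		res, err := decodeResponse([]byte(l))
		fmt.Println(len(res), err)
	}
	// 1 <nil>
	// 2 <nil>
	// 0 invalid JSON: ...
}
```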

Integration with CLI

When the daemon is running, piguard connects to it automatically, avoiding per-invocation model loading:

# Without daemon: model is loaded each run
$ time piguard "test"

# With daemon: connects over the socket, no load
$ time piguard "test"

Library

The detector is also usable as a Go package:

go get github.com/BackendStack21/go-prompt-injection-guard

Note: the package uses cgo. Consumers must have the libtokenizers.a static library on their linker path, for example CGO_LDFLAGS="-L/path/to/lib" go build ./... (run make libtokenizers in this repo to fetch the library).

import piguard "github.com/BackendStack21/go-prompt-injection-guard"

det, err := piguard.Load(piguard.DefaultModelDir())
if err != nil {
    log.Fatal(err)
}
defer det.Close()

r, err := det.Detect("Ignore previous instructions")
if err != nil {
    log.Fatal(err)
}
fmt.Println(r.Label, r.Score) // INJECTION 0.999

// Batched inference for throughput (one ONNX call for many inputs):
results, err := det.DetectBatch([]string{"hello", "ignore previous instructions"})
if err != nil {
    log.Fatal(err)
}

// Scan an input larger than the 512-token window in full (e.g. a tool result):
r, err = det.DetectLong(bigToolOutput)
if err != nil {
    log.Fatal(err)
}

Load accepts options:

| Option | Default | Effect |
| --- | --- | --- |
| WithMaxTokens(n) | 512 | Token window for Detect/DetectBatch (inputs are truncated); also the window size for DetectLong chunks |
| WithIntraThreads(n) | 4 | ONNX intra-op threads per inference |
| WithOverlap(n) | 64 | Token overlap between consecutive DetectLong windows |
| WithLongBatchSize(n) | 32 | Max windows per ONNX call in DetectLong (bounds peak memory) |

A Detector is not safe for concurrent use: serialize calls, or use the micro-batching daemon (cmd/piguard-server) for concurrent, high-throughput serving.

Large inputs (> 512 tokens)

Detect/DetectBatch classify a single window: inputs longer than WithMaxTokens (default 512 tokens, not bytes) are truncated, so content past the window is not examined. To scan large inputs (tool/function results, fetched pages, files) in full, use DetectLong, which splits the token sequence into overlapping ≤512-token windows and classifies them in sub-batches (default 32 windows per ONNX call, configurable with WithLongBatchSize), returning the most suspicious window's verdict. This gives complete coverage with no truncation gaps, bounds peak activation memory regardless of document length, and matters when an injection may be buried deep in attacker-controlled data.
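The windowing idea can be made concrete with a few lines of index arithmetic. The sketch below is an illustration under assumptions, not the library's code: windows and mostSuspicious are hypothetical helpers, the scorer is a stand-in that pretends an injection sits at token 1000, and "most suspicious" is taken here to mean the highest injection score among windows.

```go
package main

import "fmt"

// windows returns [start, end) token ranges that cover n tokens with
// windows of at most size tokens, each overlapping the previous one by
// overlap tokens.
func windows(n, size, overlap int) [][2]int {
	if n <= 0 || size <= 0 || overlap < 0 || overlap >= size {
		return nil
	}
	step := size - overlap
	var out [][2]int
	for start := 0; ; start += step {
		end := start + size
		if end >= n {
			out = append(out, [2]int{start, n}) // last window ends at the final token
			break
		}
		out = append(out, [2]int{start, end})
	}
	return out
}

// mostSuspicious scores windows in sub-batches of at most batch windows
// and returns the index and score of the highest-scoring window.
func mostSuspicious(wins [][2]int, batch int, score func([][2]int) []float64) (int, float64) {
	if batch < 1 {
		batch = 1
	}
	best, bestScore := -1, -1.0
	for i := 0; i < len(wins); i += batch {
		end := i + batch
		if end > len(wins) {
			end = len(wins)
		}
		for j, s := range score(wins[i:end]) {
			if s > bestScore {
				best, bestScore = i+j, s
			}
		}
	}
	return best, bestScore
}

func main() {
	wins := windows(1200, 512, 64)
	fmt.Println(wins) // [[0 512] [448 960] [896 1200]]

	// Stand-in scorer: high score only for the window containing token 1000.
	score := func(ws [][2]int) []float64 {
		out := make([]float64, len(ws))
		for i, w := range ws {
			if w[0] <= 1000 && 1000 < w[1] {
				out[i] = 0.99
			} else {
				out[i] = 0.01
			}
		}
		return out
	}
	idx, s := mostSuspicious(wins, 32, score)
	fmt.Println(idx, s) // 2 0.99
}
```

With a 512-token window and a 64-token overlap the stride is 448 tokens, so a 1200-token input needs three windows, and plain truncation would have examined only the first and missed the content at token 1000.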

Testing

make test     # runs the full suite (downloads libtokenizers + ONNX Runtime)
make cover    # prints production-code statement coverage

The inference paths are exercised end to end against a tiny toy ONNX model and tokenizer in testdata/ (regenerate with pip install -r scripts/requirements-dev.txt && python scripts/gen_testdata.py), so make test needs the ONNX Runtime shared library; make fetches it into lib/ automatically. Tests skip the inference cases gracefully if no runtime is found. Production-code statement coverage is ~99%.

How it works

PIGuard is a DeBERTa-v3-base model fine-tuned for binary text classification:

  • BENIGN: safe prompt
  • INJECTION: prompt injection attack detected

The MOF (Mitigating Over-defense for Free) training strategy reduces false positives from trigger words like "ignore", "instructions", etc.

The Go binaries use onnxruntime_go (ONNX Runtime) and daulet/tokenizers (bindings to the HuggingFace tokenizers Rust library, so token IDs match the original exactly), with zero Python dependency at runtime.

Architecture

   piguard "text"   (tries the daemon socket first)
        │
        ▼
   ┌───────────────────────────────────┐
   │  piguard-server                   │
   │  /tmp/piguard.sock                │
   │                                   │
   │  tokenizer  ->  ONNX  ->  JSON    │
   │  (model loaded once)              │
   └───────────────────────────────────┘
        │
        ▼
   {"label":"INJECTION","score":0.999,"latency_ms":8.3}

References

This project uses the same model and training strategy as the original PIGuard, so the same paper applies:

@inproceedings{piguard2025,
  title={PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free},
  author={Hao Li and Xiaogeng Liu and Ning Zhang and Chaowei Xiao},
  booktitle={ACL},
  year={2025}
}

Additionally:

This project was inspired by, and initially ported from, misteral/piguard 💚

License

MIT
