# Project: Agent Action Guard

Agent Action Guard classifies proposed AI agent actions as safe or harmful and blocks or flags harmful actions. This repository provides the model, dataset, integration helpers, and example MCP-compatible tooling to enable runtime action screening in agent loops.

- Repository URL: https://github.com/Pro-GenAI/Agent-Action-Guard

## Why it matters

- Helps prevent autonomous agents from executing harmful, unethical, or risky operations.
- Provides a reproducible benchmark (AgentHarmBench) and dataset (HarmActions) for evaluating agent safety.
- Ships a lightweight model that is easy to integrate into MCP or custom agent frameworks.

## Quick Usage (for agents)

1. Install the package (recommended in a venv):

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install agent-action-guard
```

2. Start or configure an embedding server if you use the vector features (see `USAGE.md`).

3. In your agent runtime, call the convenience API to check each action before executing it:

```python
from agent_action_guard import is_action_harmful

is_harmful, confidence = is_action_harmful(action_dict)
if is_harmful:
    # block, log, or escalate to a human reviewer
    raise RuntimeError("Harmful action blocked")
```
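
The check above can be wrapped into a reusable guard. The sketch below assumes only the `(is_harmful, confidence)` return shape shown in the snippet; `guarded_execute`, its injected `classify`/`execute` callables, and the threshold are hypothetical illustrations, not part of the published API.

```python
# Minimal guard wrapper (sketch). `classify` and `execute` are injected
# callables so the pattern can be exercised without the real model;
# names and the threshold default are assumptions, not the library API.
def guarded_execute(action, classify, execute, threshold=0.5):
    """Run execute(action) only if the classifier clears the action."""
    is_harmful, confidence = classify(action)
    if is_harmful and confidence >= threshold:
        raise PermissionError(
            f"Blocked action {action.get('tool')!r} (confidence={confidence:.2f})"
        )
    return execute(action)
```

Injecting the classifier as a callable keeps the wrapper testable and lets you swap in `is_action_harmful` at integration time.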

## Key Files & Structure

- `agent_action_guard/` — implementation package (classifier, runtime helpers, dataset loaders).
- `training/` — training scripts and dataset artifacts used to produce the classifier.
- `examples/` — sample integrations and MCP server examples.
- `tests/` — unit tests validating core behavior.
- `USAGE.md` — detailed usage examples and environment setup.
- `README.md` — project overview, demos, and citations.

## Architecture Overview

- Input: a proposed agent action (a structured dict describing the tool call, intent, and parameters).
- Preprocessing: optional embedding plus metadata normalization.
- Classifier: a lightweight neural network (PyTorch / ONNX) that outputs harmful/safe logits and a confidence score.
- Policy: a decision layer in the agent runtime that blocks, allows, or requests human approval.

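The policy layer described above can be sketched as a small decision function. The `Decision` enum and the threshold values below are illustrative assumptions, not values taken from the repository.

```python
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ESCALATE = "escalate"  # request human approval before proceeding

def decide(is_harmful, confidence, block_at=0.8, escalate_at=0.5):
    """Map classifier output to a runtime decision (thresholds are examples)."""
    if not is_harmful:
        return Decision.ALLOW
    if confidence >= block_at:
        return Decision.BLOCK
    if confidence >= escalate_at:
        return Decision.ESCALATE
    return Decision.ALLOW  # low-confidence flag: allow, but worth logging
```

Tuning `block_at` and `escalate_at` trades autonomy against safety for a given deployment.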
## Development & CI

- Formatting and linting: `make format` and `make lint` (targets defined in the `Makefile`).
- Tests: run `pytest` (configured via `pytest.ini`) to execute the test cases in the [`tests/`](tests/) directory.

## Guidance for AI agents reading this repo

- Use `USAGE.md` and `examples/` for integration patterns rather than reproducing code.
- Prefer the runtime API `is_action_harmful()` for decision-making.
- Respect the model's limitations: the classifier is trained on a limited dataset, so combine it with rule-based checks in high-risk systems.

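Layering a hard rule-based check in front of the model, as the last point recommends, can be sketched as follows. The denylist patterns, field names, and helper functions are hypothetical examples, not code from this repository.

```python
import re

# Hypothetical denylist; patterns and the "parameters" field name are
# illustrative, not taken from the repository.
DENY_PATTERNS = [
    re.compile(r"\brm\s+-rf\s+/"),          # destructive filesystem wipe
    re.compile(r"\bDROP\s+TABLE\b", re.I),  # destructive SQL
]

def rule_based_harmful(action):
    """True if any denylist pattern matches the action's parameter text."""
    text = str(action.get("parameters", ""))
    return any(p.search(text) for p in DENY_PATTERNS)

def check_action(action, model_check):
    """Harmful if either the hard rules or the model (injected callable) flag it."""
    if rule_based_harmful(action):
        return True
    is_harmful, _confidence = model_check(action)
    return is_harmful
```

Rules run first so that known-bad operations are blocked deterministically, regardless of model confidence.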
## Where to look next (quick links)

- Full details and demo: [README.md](README.md)
- Integration and examples: [USAGE.md](USAGE.md) and `examples/`
- Implementation: `agent_action_guard/`
- Training scripts & dataset: `training/`