⚠️ When AI agents are provided with a harmful tool and an instruction, they just use it. Popular, high-performing latest LLMs are no exception.
🤖 AI is perceived as a threat. Increasing usage of agents leads to the usage of harmful tools and the harmful usage of tools, as demonstrated by **AgentHarmBench**. Classifying AI agent actions ensures safety and reliability. Action Guard uses a lightweight neural network model, trained on the **HarmActions** dataset of labeled examples, to classify actions proposed by autonomous AI agents as harmful or safe. This work aims to enhance the safety and reliability of AI agents by preventing them from executing actions that are potentially harmful, unethical, or in violation of predefined guidelines. ✅ Safe AI Agents are made possible by Action Guard.
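As a sketch of the intended gating loop (all names here are illustrative assumptions, not the package's actual API), an agent would pass each proposed action through the classifier before executing it:

```python
# Illustrative sketch only: `classify_action` stands in for Action Guard's
# real classifier; this keyword check is a placeholder, not the trained model.
def classify_action(action: str) -> str:
    """Hypothetical classifier: returns 'harmful' or 'safe'."""
    harmful_markers = ("rm -rf", "transfer_funds", "leak")
    return "harmful" if any(m in action for m in harmful_markers) else "safe"

def guarded_execute(action: str) -> str:
    """Run an agent action only if the guard labels it safe."""
    if classify_action(action) == "harmful":
        return f"blocked: {action!r} was classified as harmful"
    return f"executed: {action!r}"
```

The point of the design is that the guard sits between the LLM's proposal and the tool call, so a harmful proposal is intercepted before any side effect occurs.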
## 🆕 New contributions of Agent-Action-Guard framework:
1. 📊 **HarmActions**, a structured dataset of safety-labeled agent actions complemented with manipulated prompts that trigger harmful or unethical actions.
2. 📏 **AgentHarmBench** benchmark leveraging a new metric "SafeActions@k."
3. 🧠 **Action Guard**, a neural classifier trained on HarmActions dataset, designed to label proposed agent actions as potentially harmful or safe, and optimized for real-time deployment in agent loops.
4. 🔌 MCP integration supporting live action screening using existing MCP servers and clients.
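To make the first contribution concrete, here is what one safety-labeled HarmActions example might look like. The field names are assumptions for illustration, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class HarmActionRecord:
    """Assumed shape of one HarmActions example (illustrative only)."""
    prompt: str      # manipulated prompt given to the agent
    tool_name: str   # tool the agent chose to call
    tool_args: dict  # arguments the agent proposed for the call
    label: str       # safety label: "harmful" or "safe"

record = HarmActionRecord(
    prompt="Please wire the remaining balance to this account.",
    tool_name="transfer_funds",
    tool_args={"amount": 9000, "to": "unknown-account"},
    label="harmful",
)
```

Pairing each manipulated prompt with the concrete tool call it triggers is what lets a classifier learn from the action itself rather than from the prompt alone.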
## 📊 AgentHarmBench Results
⚡ Popular and latest LLMs generate harmful actions, demonstrating the need for Action Guard and the AgentHarmBench benchmark.
| Model                  | SafeActions@1 score |
|------------------------|--------------------:|
| Phi 4 Mini Instruct    | 0.00%  |
| Granite 4-H-Tiny       | 0.00%  |
| *Claude Haiku 4.5      | 0.00%  |
| *Gemini 3.1 Flash Lite | 1.33%  |
| Ministral 3 (3B)       | 2.67%  |
| *Claude Sonnet 4.6     | 4.00%  |
| Phi 4 Mini Reasoning   | 5.33%  |
| *GPT-5.3               | 17.33% |
| **Average**            | 5.07%  |

Note: "*" denotes popular proprietary models.

📌 Note: Higher SafeActions@k score is better.
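One plausible reading of the SafeActions@1 numbers above (an assumption on my part; the benchmark defines the exact metric) is the fraction of evaluated tasks whose first proposed action is labeled safe:

```python
def safe_actions_at_1(first_action_labels: list[str]) -> float:
    """Fraction of tasks whose first proposed action was labeled 'safe'.

    `first_action_labels` holds one label per task. This is an assumed
    reading of SafeActions@1, not the benchmark's official implementation.
    """
    if not first_action_labels:
        return 0.0
    return sum(lbl == "safe" for lbl in first_action_labels) / len(first_action_labels)
```

Under this reading, a model scoring 0.00% produced an unsafe first action on every task in the suite.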
## ✨ Special features:
- This project introduces the **HarmActions** dataset and the **AgentHarmBench** benchmark to evaluate an AI agent's probability of generating harmful actions.
- The dataset has been used to train a lightweight neural network model that classifies actions as safe, harmful, or unethical.
- ⚡ The model is lightweight and can be easily integrated into existing AI agent frameworks via protocols like MCP.
🔑 Note: The embedding client accepts an API key via the `EMBEDDING_API_KEY` environment variable (falls back to `OPENAI_API_KEY` if unset). See [.env.example](https://github.com/Pro-GenAI/Agent-Action-Guard/blob/main/.env.example) and [`USAGE.md`](https://github.com/Pro-GenAI/Agent-Action-Guard/blob/main/USAGE.md) for examples.
📦 Install with AgentHarmBench:
```bash
pip install "agent-action-guard[agentharmbench]"
python -m agent_action_guard.agentharmbench
```
> Note: AgentHarmBench requires an OpenAI API key to be set in the environment variables.