
Commit dcff712

Renamed HarmActEval to AgentHarmBench

1 parent b36ad5c commit dcff712

7 files changed: 46 additions & 46 deletions


.dockerignore

Lines changed: 1 addition & 1 deletion
@@ -51,4 +51,4 @@ train_nn_hyperparam.py
 
 # Exclude sample files if not needed
 run_sample_query.py
-HarmActEval_dataset.json
+AgentHarmBench_dataset.json

README.md

Lines changed: 17 additions & 17 deletions
@@ -14,7 +14,7 @@ $ ffmpeg -i unused/banner_video.mp4 -vframes 1 project_banner.jpg
 
 ⚠️ When AI agents are provided with a harmful tool and an instruction, they simply use it. Even popular, high-performing recent LLMs are no exception.
 
-🤖 AI is perceived as a threat. Increasing use of agents leads to more use of harmful tools and more harmful use of tools, as demonstrated with **HarmActEval**. Classifying AI agent actions ensures safety and reliability. Action Guard uses a neural network model trained on the **HarmActions** dataset to classify actions proposed by autonomous AI agents as harmful or safe. The model is based on a small dataset of labeled examples. The work aims to enhance the safety and reliability of AI agents by preventing them from executing actions that are potentially harmful, unethical, or violate predefined guidelines. ✅ Safe AI Agents are made possible by Action Guard.
+🤖 AI is perceived as a threat. Increasing use of agents leads to more use of harmful tools and more harmful use of tools, as demonstrated with **AgentHarmBench**. Classifying AI agent actions ensures safety and reliability. Action Guard uses a neural network model trained on the **HarmActions** dataset to classify actions proposed by autonomous AI agents as harmful or safe. The model is based on a small dataset of labeled examples. The work aims to enhance the safety and reliability of AI agents by preventing them from executing actions that are potentially harmful, unethical, or violate predefined guidelines. ✅ Safe AI Agents are made possible by Action Guard.
 
 <!-- [![YouTube](https://img.shields.io/badge/YouTube-FF0000?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/watch?v=7pNYXv3x7MA) -->
 <!-- [![Blog](https://img.shields.io/badge/Blog-FFFFFF?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/blog/prane-eth/agent-action-guard) -->
@@ -49,32 +49,32 @@ $ ffmpeg -i unused/banner_video.mp4 -vframes 1 project_banner.jpg
 
 ## 🆕 New contributions of the Agent-Action-Guard framework:
 1. 📊 **HarmActions**, a structured dataset of safety-labeled agent actions complemented with manipulated prompts that trigger harmful or unethical actions.
-2. 📏 **HarmActEval**, a benchmark leveraging a new metric, "Safe@k."
+2. 📏 **AgentHarmBench**, a benchmark leveraging a new metric, "SafeActions@k."
 3. 🧠 **Action Guard**, a neural classifier trained on the HarmActions dataset, designed to label proposed agent actions as potentially harmful or safe, and optimized for real-time deployment in agent loops.
 4. 🔌 MCP integration supporting live action screening using existing MCP servers and clients.
 
-## 📊 HarmActEval Results
+## 📊 AgentHarmBench Results
 
-⚡ Popular and recent LLMs generate harmful actions, proving the need for the action guard and the HarmActEval benchmark.
+⚡ Popular and recent LLMs generate harmful actions, proving the need for the action guard and the AgentHarmBench benchmark.
 
-| Model                 | Safe@1 score |
-|-----------------------|-------------:|
-| Phi 4 Mini Instruct   | 0.00%        |
-| Granite 4-H-Tiny      | 0.00%        |
+| Model                  | Safe@1 score |
+|------------------------|-------------:|
+| Phi 4 Mini Instruct    | 0.00%        |
+| Granite 4-H-Tiny       | 0.00%        |
 | *Claude Haiku 4.5      | 0.00%        |
 | *Gemini 3.1 Flash Lite | 1.33%        |
-| Ministral 3 (3B)      | 2.67%        |
+| Ministral 3 (3B)       | 2.67%        |
 | *Claude Sonnet 4.6     | 4.00%        |
-| Phi 4 Mini Reasoning  | 5.33%        |
+| Phi 4 Mini Reasoning   | 5.33%        |
 | *GPT-5.3               | 17.33%       |
-| **Average**           | 5.07%        |
+| **Average**            | 5.07%        |
 
 Note: "*" denotes popular proprietary models.
 
-📌 Note: A higher Safe@k score is better.
+📌 Note: A higher SafeActions@k score is better.
 
 ## ✨ Special features:
-- This project introduces the "HarmActEval" dataset and benchmark to evaluate an AI agent's probability of generating harmful actions.
+- This project introduces the "AgentHarmBench" dataset and benchmark to evaluate an AI agent's probability of generating harmful actions.
 - The dataset has been used to train a lightweight neural network model that classifies actions as safe, harmful, or unethical.
 - ⚡ The model is lightweight and can be easily integrated into existing AI agent frameworks like MCP.
 - 🔌 Supports MCP (Model Context Protocol) to allow real-time action classification.
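
For orientation: the hunk above renames the "Safe@k" metric to "SafeActions@k" but leaves its definition implicit. The following is a minimal sketch, assuming a prompt counts as safe only when none of its k attempts triggers a harmful tool call (consistent with the Harm@k/Safe@k percentages the CLI logs); the function name and input shape are illustrative, not from the repository.

```python
from typing import List

def safe_actions_at_k(harmful_by_attempt: List[List[bool]], k: int) -> float:
    """Percentage of prompts whose first k attempts were all safe.

    harmful_by_attempt[i][j] is True when attempt j on prompt i
    produced a harmful tool call. (Assumed aggregation rule.)
    """
    trials = [attempts[:k] for attempts in harmful_by_attempt]
    safe = sum(1 for attempts in trials if not any(attempts))
    return 100.0 * safe / len(trials) if trials else 0.0

# Example: 3 prompts, k=2; only the last prompt stays safe on both attempts.
print(safe_actions_at_k([[True, False], [False, True], [False, False]], k=2))  # ~33.33
```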
@@ -112,13 +112,13 @@ uv pip install agent-action-guard
 
 🔑 Note: The embedding client accepts an API key via the `EMBEDDING_API_KEY` environment variable (falls back to `OPENAI_API_KEY` if unset). See [.env.example](https://github.com/Pro-GenAI/Agent-Action-Guard/blob/main/.env.example) and [`USAGE.md`](https://github.com/Pro-GenAI/Agent-Action-Guard/blob/main/USAGE.md) for examples.
 
-📦 Install with HarmActEval:
+📦 Install with AgentHarmBench:
 
 ```bash
-pip install "agent-action-guard[harmacteval]"
-python -m agent_action_guard.harmacteval
+pip install "agent-action-guard[agentharmbench]"
+python -m agent_action_guard.agentharmbench
 ```
-> Note: HarmActEval requires an OpenAI API key to be set in the environment variables.
+> Note: AgentHarmBench requires an OpenAI API key to be set in the environment variables.
 
 ### 🏷️ License
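
Since the note above says the benchmark needs an OpenAI API key in the environment, here is one hedged way to drive the renamed CLI from Python. It assumes the package is installed with the `[agentharmbench]` extra; the key value is a placeholder.

```python
import os
import subprocess
import sys

# Placeholder key: substitute a real one, or set it in your shell / .env file.
env = {**os.environ, "OPENAI_API_KEY": "sk-placeholder"}
subprocess.run(
    [sys.executable, "-m", "agent_action_guard.agentharmbench", "--k", "1"],
    env=env,
    check=True,  # raise if the benchmark exits non-zero
)
```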

USAGE.md

Lines changed: 5 additions & 5 deletions
@@ -17,10 +17,10 @@ source .venv/bin/activate
 pip install agent-action-guard
 ```
 
-Install with HarmActEval CLI support:
+Install with AgentHarmBench CLI support:
 
 ```bash
-pip install "agent-action-guard[harmacteval]"
+pip install "agent-action-guard[agentharmbench]"
 ```
 
 2. Start an embedding server (if not already running).
@@ -155,12 +155,12 @@ PyPI package scope:
 - `pip install agent-action-guard` installs only the runtime classifier modules and model file needed for action classification.
 - Training, evaluation, MCP demo servers, and UI scripts remain in this repository and require the `dev` extras.
 
-### HarmActEval CLI (standalone)
+### AgentHarmBench CLI (standalone)
 
-After installing `agent-action-guard[harmacteval]`, run:
+After installing `agent-action-guard[agentharmbench]`, run:
 
 ```bash
-python -m agent_action_guard.harmacteval --k 3
+python -m agent_action_guard.agentharmbench --k 3
 ```
 
 Common arguments:
Lines changed: 7 additions & 7 deletions
@@ -1,8 +1,8 @@
 """
-HarmActEval CLI.
+AgentHarmBench CLI.
 
 Run with:
-    python -m agent_action_guard.harmacteval
+    python -m agent_action_guard.agentharmbench
 """
 
 from __future__ import annotations
@@ -39,7 +39,7 @@
 # Wait interval between attempts (seconds) to avoid LLM API rate limits
 WAIT_INTERVAL: int = int(os.getenv("WAIT_INTERVAL", "0"))
 
-cache_path: Path = Path.cwd() / ".cache" / "harmacteval_cache.json"
+cache_path: Path = Path.cwd() / ".cache" / "agentharmbench_cache.json"
 cache: Dict[str, Any] = {}
 
 # -------------------------------------------------------------------
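
The hunk above only renames the cache file. For orientation, a minimal sketch of how such a JSON response cache could be read and written; `load_cache`/`save_cache` are illustrative names, not functions shown in this diff.

```python
import json
from pathlib import Path
from typing import Any, Dict

def load_cache(cache_path: Path) -> Dict[str, Any]:
    """Return the cached entries, or an empty dict when no cache exists yet."""
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    return {}

def save_cache(cache_path: Path, cache: Dict[str, Any]) -> None:
    """Persist the cache, creating ./.cache on first use."""
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
```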
@@ -365,7 +365,7 @@ def evaluate(
     logger.info("Model: %s", model_key)
     logger.info("k: %d", k)
     logger.info("Harm@k score: %.2f%%", percent_harmful)
-    logger.info("Safe@k score: %.2f%%", percent_safe)
+    logger.info("SafeActions@k score: %.2f%%", percent_safe)
     if output_path:
         logger.info("Saved results to %s", output_path)
 
@@ -380,7 +380,7 @@ def evaluate(
 def build_parser() -> argparse.ArgumentParser:
 
     parser = argparse.ArgumentParser(
-        description="Run HarmActEval to measure harmful tool-call behavior."
+        description="Run AgentHarmBench to measure harmful tool-call behavior."
     )
     parser.add_argument(
         "--k",
@@ -403,8 +403,8 @@ def build_parser() -> argparse.ArgumentParser:
     parser.add_argument(
         "--cache-path",
         type=Path,
-        default=Path.cwd() / ".cache" / "harmacteval_cache.json",
-        help="Cache file path (default: ./.cache/harmacteval_cache.json).",
+        default=Path.cwd() / ".cache" / "agentharmbench_cache.json",
+        help="Cache file path (default: ./.cache/agentharmbench_cache.json).",
     )
     parser.add_argument(
         "--output",
Lines changed: 14 additions & 14 deletions
@@ -1,7 +1,7 @@
-"""HarmActEval CLI.
+"""AgentHarmBench CLI.
 
 Run with:
-    python -m agent_action_guard.harmacteval
+    python -m agent_action_guard.agentharmbench
 """
 
 from __future__ import annotations
@@ -34,7 +34,7 @@
 tools: Dict[str, Dict[str, Any]] = {}
 
 
-class HarmActEvalContext:
+class AgentHarmBenchContext:
     """Container for runtime settings and cache state."""
 
     def __init__(
@@ -83,7 +83,7 @@ def parse_bool_env(value: str) -> bool:
     return value.upper().startswith("T") or value.startswith("1")
 
 
-def build_context(cache_path: Path) -> HarmActEvalContext:
+def build_context(cache_path: Path) -> AgentHarmBenchContext:
     """Initialize API client and runtime config from environment variables."""
    if os.getenv("AZURE_OPENAI_ENDPOINT"):
        client = openai.AzureOpenAI()
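
The diff shows only the Azure branch of `build_context`. The sketch below fills in the likely shape of the env-driven client selection; the plain-OpenAI fallback is an assumption, and `AzureOpenAI()` additionally expects `OPENAI_API_VERSION` and `AZURE_OPENAI_API_KEY` in the environment.

```python
import os

import openai

def make_client() -> openai.OpenAI:
    """Pick the API client from environment variables (assumed fallback logic)."""
    if os.getenv("AZURE_OPENAI_ENDPOINT"):
        # Also reads OPENAI_API_VERSION and AZURE_OPENAI_API_KEY from the env.
        return openai.AzureOpenAI()
    return openai.OpenAI()  # reads OPENAI_API_KEY
```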
@@ -116,7 +116,7 @@ def build_context(cache_path: Path) -> HarmActEvalContext:
            "MCP_SUPPORTED not set; defaulting to False (local tool emulation)."
        )
 
-    return HarmActEvalContext(
+    return AgentHarmBenchContext(
         client=client,
         model=model,
         model_key=model_key,
@@ -142,7 +142,7 @@ def call_tool(*args: Any, tool_name: str = "", **kwargs: Any) -> Any:
     return log_text
 
 
-def add_tool(context: HarmActEvalContext, tool_definition: Dict[str, Any]) -> None:
+def add_tool(context: AgentHarmBenchContext, tool_definition: Dict[str, Any]) -> None:
     if not context.mcp_supported:
         tools[tool_definition["name"]] = tool_definition
         return
@@ -158,7 +158,7 @@ def add_tool(context: HarmActEvalContext, tool_definition: Dict[str, Any]) -> None:
        raise ValueError(f"Error adding tool: {data['message']}")
 
 
-def get_execution_count(context: HarmActEvalContext, tool_name: str) -> int:
+def get_execution_count(context: AgentHarmBenchContext, tool_name: str) -> int:
     if not context.mcp_supported:
         return execution_counts.get(tool_name, 0)
 
@@ -172,7 +172,7 @@ def get_execution_count(context: HarmActEvalContext, tool_name: str) -> int:
     return data.get("execution_count", 0)
 
 
-def remove_tool(context: HarmActEvalContext, tool_definition: Dict[str, Any]) -> None:
+def remove_tool(context: AgentHarmBenchContext, tool_definition: Dict[str, Any]) -> None:
     if not context.mcp_supported:
         tool_name = tool_definition["name"]
         tools.pop(tool_name, None)
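
Taken together, the three hunks above hint at a local (non-MCP) fallback path: a module-level tool registry plus per-tool execution counters. Reassembled below for readability, with an assumed `record_execution` helper that the diff does not show.

```python
from typing import Any, Dict

tools: Dict[str, Dict[str, Any]] = {}
execution_counts: Dict[str, int] = {}

def add_tool(tool_definition: Dict[str, Any]) -> None:
    """Register a tool definition under its name."""
    tools[tool_definition["name"]] = tool_definition

def record_execution(tool_name: str) -> None:  # assumed helper, not in the diff
    execution_counts[tool_name] = execution_counts.get(tool_name, 0) + 1

def get_execution_count(tool_name: str) -> int:
    return execution_counts.get(tool_name, 0)

def remove_tool(tool_definition: Dict[str, Any]) -> None:
    tools.pop(tool_definition["name"], None)
```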
@@ -250,7 +250,7 @@ def create_tool_from_action(action: Dict[str, Any]) -> Dict[str, Any]:
 
 
 def get_response(
-    context: HarmActEvalContext,
+    context: AgentHarmBenchContext,
     prompt: str,
     attempt: int = 0,
     tool: Optional[Dict[str, Any]] = None,
@@ -351,13 +351,13 @@ def get_response(
 
 
 def evaluate(
-    context: HarmActEvalContext,
+    context: AgentHarmBenchContext,
     k: int = 2,
     offset: int = 0,
     limit: Optional[int] = None,
     output_path: Optional[Path] = None,
 ) -> Dict[str, Any]:
-    """Run HarmActEval and return a summary dictionary."""
+    """Run AgentHarmBench and return a summary dictionary."""
     all_harmful_actions = load_dataset()
     actions = slice_actions(all_harmful_actions, offset=offset, limit=limit)
     logger.info(
@@ -498,7 +498,7 @@
 
 def build_parser() -> argparse.ArgumentParser:
     parser = argparse.ArgumentParser(
-        description="Run HarmActEval to measure harmful tool-call behavior (Harm@k)."
+        description="Run AgentHarmBench to measure harmful tool-call behavior (Harm@k)."
     )
     parser.add_argument(
         "--k",
@@ -521,8 +521,8 @@ def build_parser() -> argparse.ArgumentParser:
     parser.add_argument(
         "--cache-path",
         type=Path,
-        default=Path.cwd() / ".cache" / "harmacteval_cache.json",
-        help="Cache file path (default: ./.cache/harmacteval_cache.json).",
+        default=Path.cwd() / ".cache" / "agentharmbench_cache.json",
+        help="Cache file path (default: ./.cache/agentharmbench_cache.json).",
     )
     parser.add_argument(
         "--output",

examples/mcp_eval_server.py

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 
 This script provides an MCP server over an HTTP Stream transport.
 The server allows dynamic addition and removal of tools via HTTP endpoints,
-allowing seamless evaluation through the HarmActEval framework's Python script.
+allowing seamless evaluation through the AgentHarmBench framework's Python script.
 """
 
 import os
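
The docstring above describes HTTP endpoints for adding and removing tools. Purely as an illustration of that idea (the route name, port, and response shape below are invented, not the repository's actual API), a client call might look like this:

```python
import requests  # listed in the benchmark's optional dependencies

SERVER = "http://localhost:8000"  # assumed address of the MCP eval server

def add_tool_remote(tool_definition: dict) -> None:
    """POST a tool definition to a hypothetical /add_tool endpoint."""
    resp = requests.post(f"{SERVER}/add_tool", json=tool_definition, timeout=10)
    data = resp.json()
    if data.get("status") != "ok":  # assumed response contract
        raise ValueError(f"Error adding tool: {data.get('message')}")
```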

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ include-package-data = true
 agent_action_guard = ["action_classifier_model.pt", "harmactions_dataset.json"]
 
 [project.optional-dependencies]
-harmacteval = [
+agentharmbench = [
     "python-dotenv",
     "requests",
 ]

0 commit comments
