⚠️ When AI agents are provided with a harmful tool and an instruction, they just use it. Popular, high-performing latest LLMs are no exception.
🤖 AI is perceived as a threat. Increasing usage of agents leads to the usage of harmful tools and the harmful usage of tools, as demonstrated by **AgentHarmBench**. Classifying AI agent actions ensures safety and reliability. Action Guard uses a lightweight neural network model, trained on the **HarmActions** dataset of labeled examples, to classify actions proposed by autonomous AI agents as harmful or safe. This work aims to enhance the safety and reliability of AI agents by preventing them from executing actions that are potentially harmful, unethical, or in violation of predefined guidelines. ✅ Safe AI Agents are made possible by Action Guard.
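As a sketch of the intended gating loop (all names here are illustrative assumptions, not the package's actual API), an agent would pass each proposed action through the classifier before executing it:

```python
# Illustrative sketch only: `classify_action` stands in for Action Guard's
# real classifier; this keyword check is a placeholder, not the trained model.
def classify_action(action: str) -> str:
    """Hypothetical classifier: returns 'harmful' or 'safe'."""
    harmful_markers = ("rm -rf", "transfer_funds", "leak")
    return "harmful" if any(m in action for m in harmful_markers) else "safe"

def guarded_execute(action: str) -> str:
    """Run an agent action only if the guard labels it safe."""
    if classify_action(action) == "harmful":
        return f"blocked: {action!r} was classified as harmful"
    return f"executed: {action!r}"
```

The point of the design is that the guard sits between the LLM's proposal and the tool call, so a harmful proposal is intercepted before any side effect occurs.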
## 🆕 New contributions of Agent-Action-Guard framework:
1. 📊 **HarmActions**, a structured dataset of safety-labeled agent actions complemented with manipulated prompts that trigger harmful or unethical actions.
2. 📏 **AgentHarmBench** benchmark leveraging a new metric "SafeActions@k."
3. 🧠 **Action Guard**, a neural classifier trained on HarmActions dataset, designed to label proposed agent actions as potentially harmful or safe, and optimized for real-time deployment in agent loops.
4. 🔌 MCP integration supporting live action screening using existing MCP servers and clients.
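To make the first contribution concrete, here is what one safety-labeled HarmActions example might look like. The field names are assumptions for illustration, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class HarmActionRecord:
    """Assumed shape of one HarmActions example (illustrative only)."""
    prompt: str      # manipulated prompt given to the agent
    tool_name: str   # tool the agent chose to call
    tool_args: dict  # arguments the agent proposed for the call
    label: str       # safety label: "harmful" or "safe"

record = HarmActionRecord(
    prompt="Please wire the remaining balance to this account.",
    tool_name="transfer_funds",
    tool_args={"amount": 9000, "to": "unknown-account"},
    label="harmful",
)
```

Pairing each manipulated prompt with the concrete tool call it triggers is what lets a classifier learn from the action itself rather than from the prompt alone.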
## 📊 AgentHarmBench Results
⚡ Popular and latest LLMs generate harmful actions, demonstrating the need for Action Guard and the AgentHarmBench benchmark.
| Model                  | SafeActions@1 score |
|------------------------|--------------------:|
| Phi 4 Mini Instruct    | 0.00%  |
| Granite 4-H-Tiny       | 0.00%  |
| *Claude Haiku 4.5      | 0.00%  |
| *Gemini 3.1 Flash Lite | 1.33%  |
| Ministral 3 (3B)       | 2.67%  |
| *Claude Sonnet 4.6     | 4.00%  |
| Phi 4 Mini Reasoning   | 5.33%  |
| *GPT-5.3               | 17.33% |
| **Average**            | 5.07%  |

Note: "*" denotes popular proprietary models.

📌 Note: Higher SafeActions@k score is better.
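One plausible reading of the SafeActions@1 numbers above (an assumption on my part; the benchmark defines the exact metric) is the fraction of evaluated tasks whose first proposed action is labeled safe:

```python
def safe_actions_at_1(first_action_labels: list[str]) -> float:
    """Fraction of tasks whose first proposed action was labeled 'safe'.

    `first_action_labels` holds one label per task. This is an assumed
    reading of SafeActions@1, not the benchmark's official implementation.
    """
    if not first_action_labels:
        return 0.0
    return sum(lbl == "safe" for lbl in first_action_labels) / len(first_action_labels)
```

Under this reading, a model scoring 0.00% produced an unsafe first action on every task in the suite.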
## ✨ Special features:
- This project introduces the **HarmActions** dataset and the **AgentHarmBench** benchmark to evaluate an AI agent's probability of generating harmful actions.
- The dataset has been used to train a lightweight neural network model that classifies actions as safe, harmful, or unethical.
- ⚡ The model is lightweight and can be easily integrated into existing AI agent frameworks via protocols like MCP.
🔑 Note: The embedding client accepts an API key via the `EMBEDDING_API_KEY` environment variable (falls back to `OPENAI_API_KEY` if unset). See [.env.example](https://github.com/Pro-GenAI/Agent-Action-Guard/blob/main/.env.example) and [`USAGE.md`](https://github.com/Pro-GenAI/Agent-Action-Guard/blob/main/USAGE.md) for examples.
📦 Install with AgentHarmBench:
```bash
pip install "agent-action-guard[agentharmbench]"
python -m agent_action_guard.agentharmbench
```
> Note: AgentHarmBench requires an OpenAI API key to be set in the environment variables.