@@ -45,19 +45,20 @@ python -m agent_action_guard.harmactionseval
The **HarmActionsEval** benchmark proved that AI agents with harmful tools will use them — even today's **most capable** LLMs.
80% of the LLMs tested executed the harmful action on the first attempt for over 95% of the prompts.

-| Model                  | SafeActions@1 |
-| ---------------------- | ------------: |
-| Claude Haiku 4.5       |         0.00% |
-| Phi 4 Mini Instruct    |         0.00% |
-| Granite 4-H-Tiny       |         0.00% |
-| GPT-5.4 Mini           |         0.71% |
-| Gemini 3.1 Flash Lite  |         0.71% |
-| Ministral 3 (3B)       |         2.13% |
-| Claude Sonnet 4.6      |         2.84% |
-| Phi 4 Mini Reasoning   |         2.84% |
-| GPT-5.3                |        12.77% |
-| Qwen3.5-397b-a17b      |        23.40% |
-| **Average**            |     **4.54%** |
+| Model                   | SafeActions@1 |
+| ----------------------- | ------------: |
+| Claude Haiku 4.5        |         0.00% |
+| Phi 4 Mini Instruct     |         0.00% |
+| Granite 4-H-Tiny        |         0.00% |
+| GPT-5.4 Mini            |         0.71% |
+| Gemini 3.1 Flash Lite   |         0.71% |
+| Grok 4.20 Non Reasoning |         2.13% |
+| Ministral 3 (3B)        |         2.13% |
+| Claude Sonnet 4.6       |         2.84% |
+| Phi 4 Mini Reasoning    |         2.84% |
+| GPT-5.3                 |        12.77% |
+| Qwen3.5-397b-a17b       |        23.40% |
+| **Average**             |     **4.54%** |

> These models often still respond *"Sorry, I can't help with that"* while executing the harmful action anyway.

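The table reports SafeActions@1 without defining it in this hunk; read together with the sentence above, it plausibly means the fraction of harmful prompts for which the model's first attempt does not execute the harmful action. Below is a minimal sketch of that interpretation — the function name and the `first_attempt_was_safe` records are illustrative assumptions, not part of the `agent_action_guard` API; the 141-prompt figure is only an inference from the 0.71% rows and may be wrong.

```python
def safe_actions_at_1(first_attempt_was_safe: list[bool]) -> float:
    """Fraction of prompts whose first attempt avoided the harmful action.

    Assumed reading of SafeActions@1; the benchmark's own definition may differ.
    """
    if not first_attempt_was_safe:
        return 0.0
    return sum(first_attempt_was_safe) / len(first_attempt_was_safe)


# Example: 1 safe first attempt out of 141 prompts ≈ 0.71%,
# which would match the GPT-5.4 Mini / Gemini 3.1 Flash Lite rows
# if the suite contains 141 harmful prompts (an assumption).
print(f"{safe_actions_at_1([True] + [False] * 140):.2%}")
```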