**Autonomous pentesting agent using feedback-driven iteration**
- Achieves ~78% on XBOW benchmarks with fully local execution and model-agnostic architecture.
+ Achieves **~80%** on the full XBOW validation benchmark with **Kimi K2.5** at **~US$122** total API cost for that end-to-end run, with a model-agnostic architecture that supports other deployable LLMs.

@@ -54,7 +54,7 @@ Deadend CLI is an autonomous web application penetration testing agent that uses
  - ADaPT-based architecture with supervisor-subagent hierarchy
- **Benchmark results:** 78% on XBOW validation suite (76/98 challenges), including blind SQL injection exploits where other agents achieved 0%.
+ **Benchmark results:** **~80%** on the XBOW validation suite with **Kimi K2.5** at **~US$122** total cost for the full benchmark run, including blind SQL injection exploits where other agents achieved 0%.

  [Read the architecture breakdown in our technical article →](https://xoxruns.medium.com/feedback-driven-iteration-and-fully-local-webapp-pentesting-ai-agent-achieving-78-on-xbow-199ef719bf01)
@@ -75,16 +75,9 @@ The agent uses a two-phase approach (reconnaissance → exploitation) with a sup
  > **Note**: To visualize the benchmark results properly, install an ANSI colors extension (e.g., [ANSI Colors](https://marketplace.visualstudio.com/items?itemName=iliazeus.vscode-ansi) for VS Code) to render the rich output.

- Evaluated on XBOW's 104-challenge validation suite (black-box mode, January 2026):
+ Evaluated on XBOW's 104-challenge validation suite (black-box mode, January 2026).
  | Cyber-AutoAgent | 85% (latest Cyber-AutoAgent score, October 2025) <s>81%</s> | AWS Bedrock | 0% |
- | **Deadend CLI** | **78%** | **Fully local** | **33%** |
- | MAPTA | 76.9% | External APIs | 0% |
-
- **Models tested:** Claude Sonnet 4.5 (~78%), Kimi K2 Thinking (~69%)
+ **Latest model results:** Kimi K2.5 (~80%, ~US$122 for the full 104-challenge XBOW validation run); **GLM-5 (Zhipu AI)** is also very strong in practice.
@@ -105,13 +98,18 @@ The following models have been tested with Deadend CLI. Compatibility and perfor
  **Moonshot AI**
  - **Models**: `Kimi-K2-Thinking`, `Kimi-K2.5`
  - **Status**: Works excellently across all features
- - **Notes**: Reliable performance at every step of the workflow
+ - **Notes**: Reliable performance at every step of the workflow. **Kimi K2.5** achieved **~80%** on the full XBOW validation benchmark at **~US$122** total cost for that run.

  **Anthropic**
  - **Models**: Claude Sonnet 4.5, Claude 3 Opus, Claude 3 Haiku
  - **Status**: Powerful models with excellent results
  - **Notes**: Properly extracts results and token usage information. Recommended for production use.

+ **Zhipu AI**
+ - **Models**: `GLM-5` (and related GLM series where supported)
+ - **Status**: Works very well with Deadend CLI
+ - **Notes**: **GLM-5** from Zhipu AI performs very well in this workflow and is among the standouts alongside Kimi and Claude for reasoning and tool use.
+
  **DeepSeek**
  - **Models**: DeepSeek models via various providers
  - **Status**: Functional but with limitations
@@ -122,7 +120,7 @@ The following models have been tested with Deadend CLI. Compatibility and perfor
  - **Status**: Under investigation
  - **Notes**: Some issues observed with tool execution via LiteLLM. Requires further investigation before definitive compatibility assessment.

- > **Tip**: For best results, we recommend using Moonshot AI (Kimi models) or Anthropic (Claude) models, which have been thoroughly tested and show excellent compatibility with all Deadend CLI features.
+ > **Tip**: For best results, we recommend Moonshot AI (Kimi), Anthropic (Claude), or **Zhipu AI (GLM-5)**, all thoroughly exercised on Deadend CLI and strong across the workflow.

  ## 🔧 Custom Pentesting Tools
@@ -407,7 +405,7 @@ The CLI interface reads from `settings.json` to determine which model to use by
  ### Stable (v0.1.0)
  - ✅ New architecture
- - ✅ XBOW benchmark evaluation (78%)
+ - ✅ XBOW benchmark evaluation (~80% with Kimi K2.5, ~US$122 for the full suite)
  - ✅ Custom sandboxed tools
  - ✅ Multi-model support with LiteLLM
  - ✅ Two-phase execution (recon + exploitation)
@@ -428,7 +426,7 @@ The CLI interface reads from `settings.json` to determine which model to use by
  ### Future roadmap
- The current architecture proves competitive autonomous pentesting (78%) is achievable without cloud dependencies. Next challenges:
+ The current architecture proves competitive autonomous pentesting is achievable on XBOW at **~80%** with **Kimi K2.5** (**~US$122** for the full validation run). Next challenges:

  - **Open-Source Models**: Achieve 75%+ with Llama/Qwen (eliminate proprietary dependencies)
  - **Hybrid Testing**: Add AST analysis for white-box code inspection
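
As a rough sanity check on the headline figures quoted in this change, the approximate pass rate and total cost can be converted into per-challenge quantities. The sketch below is illustrative only: it assumes the ~80% and ~US$122 figures both refer to the full 104-challenge suite, and the function names are hypothetical helpers, not part of Deadend CLI.

```python
# Back-of-the-envelope arithmetic on the reported benchmark figures.
# Inputs are the approximate values quoted in the README change;
# the helper names are hypothetical and exist only for this sketch.

def solved_count(pass_rate: float, total_challenges: int) -> int:
    """Approximate number of solved challenges implied by a pass rate."""
    return round(pass_rate * total_challenges)


def cost_per_challenge(total_cost_usd: float, total_challenges: int) -> float:
    """Average spend per attempted challenge."""
    return total_cost_usd / total_challenges


def cost_per_solved(total_cost_usd: float, pass_rate: float, total_challenges: int) -> float:
    """Average spend per solved challenge (assumes at least one solve)."""
    return total_cost_usd / solved_count(pass_rate, total_challenges)


if __name__ == "__main__":
    rate, cost, total = 0.80, 122.0, 104  # assumed: ~80%, ~US$122, 104 challenges
    print(f"solved (approx.): {solved_count(rate, total)}")
    print(f"cost per challenge: ${cost_per_challenge(cost, total):.2f}")
    print(f"cost per solved challenge: ${cost_per_solved(cost, rate, total):.2f}")
```

Because both inputs are approximate, the outputs should be read as order-of-magnitude estimates rather than exact costs.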