Commit 402a4a6

new results for kimi k2.5 at 80%
1 parent 251162d commit 402a4a6

291 files changed

Lines changed: 1590805 additions & 24 deletions

README.md

Lines changed: 13 additions & 15 deletions
@@ -3,7 +3,7 @@
 [![Discord - Deadend CLI](https://img.shields.io/badge/Discord-Deadend%20CLI-5865F2?logo=discord&logoColor=white)](https://discord.gg/zwUVa3E7KT)

 **Autonomous pentesting agent using feedback-driven iteration**
-Achieves ~78% on XBOW benchmarks with fully local execution and model-agnostic architecture.
+Achieves **~80%** on the full XBOW validation benchmark with **Kimi K2.5** at **~US$122** total API cost for that end-to-end run, with a model-agnostic architecture that supports other deployable LLMs.


 ![Deadend CLI](./assets/demo_gif.gif)
@@ -54,7 +54,7 @@ Deadend CLI is an autonomous web application penetration testing agent that uses
 - ADaPT-based architecture with supervisor-subagent hierarchy
 - Confidence-based decision making (fail <20%, expand 20-60%, refine 60-80%, validate >80%)

-**Benchmark results:** 78% on XBOW validation suite (76/98 challenges), including blind SQL injection exploits where other agents achieved 0%.
+**Benchmark results:** **~80%** on the XBOW validation suite with **Kimi K2.5** at **~US$122** total cost for the full benchmark run, including blind SQL injection exploits where other agents achieved 0%.

 [Read the architecture breakdown in our technical article →](https://xoxruns.medium.com/feedback-driven-iteration-and-fully-local-webapp-pentesting-ai-agent-achieving-78-on-xbow-199ef719bf01)

@@ -75,16 +75,9 @@ The agent uses a two-phase approach (reconnaissance → exploitation) with a sup

 > **Note**: To visualize the benchmark results properly, install an ANSI colors extension (e.g., [ANSI Colors](https://marketplace.visualstudio.com/items?itemName=iliazeus.vscode-ansi) for VS Code) to render the rich output.

-Evaluated on XBOW's 104-challenge validation suite (black-box mode, January 2026):
+Evaluated on XBOW's 104-challenge validation suite (black-box mode, January 2026).

-| Agent | Success Rate | Infrastructure | Blind SQLi |
-|-------|-------------|----------------|------------|
-| XBOW (proprietary) | 85% | Proprietary | ? |
-| Cyber-AutoAgent | 85% (This is the latest Cyber-Autoagent scoring for october 2025) <s>81%</s>| AWS Bedrock | 0% |
-| **Deadend CLI** | **78%** | **Fully local** | **33%** |
-| MAPTA | 76.9% | External APIs | 0% |
-
-**Models tested:** Claude Sonnet 4.5 (~78%), Kimi K2 Thinking (~69%)
+**Models latest results:** Kimi K2.5 (~80%, ~US$122 for the full 104-challenge XBOW validation run), **GLM-5 (Zhipu AI)**—also very strong in practice.

 Strong performance: XSS (91%), Business Logic (86%), SQL injection (83%), IDOR (80%)
 Perfect scores: GraphQL, SSRF, NoSQL injection, HTTP method tampering (100%)
@@ -105,13 +98,18 @@ The following models have been tested with Deadend CLI. Compatibility and perfor
 **Moonshot AI**
 - **Models**: `Kimi-K2-Thinking`, `Kimi-K2.5`
 - **Status**: Works excellently across all features
-- **Notes**: Reliable performance at every step of the workflow
+- **Notes**: Reliable performance at every step of the workflow. **Kimi K2.5** achieved **~80%** on the full XBOW validation benchmark at **~US$122** total cost for that run.

 **Anthropic**
 - **Models**: Claude Sonnet 4.5, Claude 3 Opus, Claude 3 Haiku
 - **Status**: Powerful models with excellent results
 - **Notes**: Properly extracts results and token usage information. Recommended for production use.

+**Zhipu AI**
+- **Models**: `GLM-5` (and related GLM series where supported)
+- **Status**: Works very well with Deadend CLI
+- **Notes**: **GLM-5** from Zhipu AI is **really good** for this workflow—among the standouts alongside Kimi and Claude for reasoning and tool use.
+
 **DeepSeek**
 - **Models**: DeepSeek models via various providers
 - **Status**: Functional but with limitations
@@ -122,7 +120,7 @@ The following models have been tested with Deadend CLI. Compatibility and perfor
 - **Status**: Under investigation
 - **Notes**: Some issues observed with tool execution via LiteLLM. Requires further investigation before definitive compatibility assessment.

-> **Tip**: For best results, we recommend using Moonshot AI (Kimi models) or Anthropic (Claude) models, which have been thoroughly tested and show excellent compatibility with all Deadend CLI features.
+> **Tip**: For best results, we recommend Moonshot AI (Kimi), Anthropic (Claude), or **Zhipu AI (GLM-5)**—all thoroughly exercised on Deadend CLI and strong across the workflow.

 ## 🔧 Custom Pentesting Tools

@@ -407,7 +405,7 @@ The CLI interface reads from `settings.json` to determine which model to use by

 ### Stable (v0.1.0)
 - ✅ New architecture
-- ✅ XBOW benchmark evaluation (78%)
+- ✅ XBOW benchmark evaluation (~80% with Kimi K2.5, ~US$122 for the full suite)
 - ✅ Custom sandboxed tools
 - ✅ Multi-model support with liteLLM
 - ✅ Two-phase execution (recon + exploitation)
@@ -428,7 +426,7 @@ The CLI interface reads from `settings.json` to determine which model to use by


 ### Future roadmap
-The current architecture proves competitive autonomous pentesting (78%) is achievable without cloud dependencies. Next challenges:
+The current architecture proves competitive autonomous pentesting is achievable on XBOW at **~80%** with **Kimi K2.5** (**~US$122** for the full validation run). Next challenges:

 - **Open-Source Models**: Achieve 75%+ with Llama/Qwen (eliminate proprietary dependencies)
 - **Hybrid Testing**: Add AST analysis for white-box code inspection
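
The README context in this diff describes confidence-based decision making with four bands (fail <20%, expand 20-60%, refine 60-80%, validate >80%). As a rough illustration only, that mapping could be sketched as below. The names `Action` and `choose_action` are hypothetical and not taken from Deadend CLI, and how the exact boundary values (20%, 60%, 80%) are assigned is an assumption, since the README does not say.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for the four bands named in the README."""
    FAIL = "fail"
    EXPAND = "expand"
    REFINE = "refine"
    VALIDATE = "validate"


def choose_action(confidence: float) -> Action:
    """Map a confidence score in [0.0, 1.0] to one of the four bands.

    Assumption: boundary values fall into the band that lists them as its
    lower edge, except exactly 0.80, which stays in "refine" because the
    README gives "validate" as strictly greater than 80%.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence < 0.20:
        return Action.FAIL
    if confidence < 0.60:
        return Action.EXPAND
    if confidence <= 0.80:
        return Action.REFINE
    return Action.VALIDATE
```

For example, `choose_action(0.45)` falls in the expand band and `choose_action(0.90)` in the validate band. A real agent would attach behaviour to each band; this sketch covers only the threshold lookup.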
