
Commit cf4bb20

josephgoksu and claude committed
docs: add evaluation methodology for benchmark claims
Document the +122% improvement benchmark methodology referenced on the landing page. Covers setup, per-task scores, scoring criteria, what TaskWing provides, and limitations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 6f109da commit cf4bb20

1 file changed

Lines changed: 106 additions & 0 deletions

File tree

docs/development/EVALUATION.md

@@ -0,0 +1,106 @@
# TaskWing Evaluation Methodology

## Overview

We evaluated whether injecting project-specific context via TaskWing improves the quality of LLM-generated architectural responses compared to a baseline (no context) scenario.

**Result: +122% improvement** (3.6 → 8.0 average score).

## Setup

| Parameter        | Value                                       |
|------------------|---------------------------------------------|
| **Codebase**     | Production Go/React monorepo                |
| **LLM judge**    | gpt-5-mini                                  |
| **Tasks**        | 5 architectural questions                   |
| **Scoring**      | 1–10 per task, averaged                     |
| **Conditions**   | Baseline (no context) vs TaskWing-injected  |

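The evaluation loop behind this setup is simple: ask each task under both conditions, score each answer with the judge, and average per condition. The following is a minimal sketch; `askModel` and `judgeScore` are hypothetical stand-ins for the LLM call and the gpt-5-mini judge, not TaskWing code.

```go
// Minimal sketch of the two-condition evaluation loop.
// askModel and judgeScore are hypothetical stand-ins, not TaskWing's API.
package main

import "fmt"

// askModel would query the LLM for a task, with or without TaskWing's
// MCP context attached. Placeholder implementation.
func askModel(task, condition string) string {
	return "answer to " + task + " under " + condition
}

// judgeScore would apply the 1–10 rubric via the judge model (gpt-5-mini).
// Placeholder implementation.
func judgeScore(answer string) float64 {
	return 5.0
}

func main() {
	tasks := []string{"T1", "T2", "T3", "T4", "T5"}
	for _, condition := range []string{"baseline (no context)", "taskwing-injected"} {
		total := 0.0
		for _, task := range tasks {
			total += judgeScore(askModel(task, condition))
		}
		fmt.Printf("%-22s average: %.1f\n", condition, total/float64(len(tasks)))
	}
}
```
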
## Tasks

Each task required the LLM to answer an architectural question about the codebase. Correct answers required knowing:

1. The primary language (Go, not TypeScript)
2. Correct file paths and project structure
3. Correct build/generate commands
4. Architectural patterns and constraints
5. Technology decisions and their rationale

## Results

### Per-Task Scores

| Task    | Without Context | With TaskWing | Delta    |
|---------|----------------:|--------------:|---------:|
| T1      | 6               | 8             | +2       |
| T2      | 3               | 8             | +5       |
| T3      | 3               | 8             | +5       |
| T4      | 3               | 8             | +5       |
| T5      | 3               | 8             | +5       |
| **Avg** | **3.6**         | **8.0**       | **+4.4** |

**Improvement: +122%** (8.0 / 3.6 - 1)

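For reference, the column averages and the +122% figure follow directly from the per-task scores above; a quick check:

```go
// Recomputes the column averages and the improvement percentage
// from the per-task scores in the table above.
package main

import "fmt"

func avg(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	baseline := []float64{6, 3, 3, 3, 3} // T1–T5 without context
	taskwing := []float64{8, 8, 8, 8, 8} // T1–T5 with TaskWing

	b, t := avg(baseline), avg(taskwing)
	fmt.Printf("baseline avg: %.1f, taskwing avg: %.1f\n", b, t) // 3.6, 8.0
	fmt.Printf("improvement: %+.0f%%\n", (t/b-1)*100)            // +122%
}
```
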
### Without Context (Baseline)

The LLM without context consistently:
- Assumed TypeScript instead of Go
- Referenced nonexistent files like `src/types/openapi.ts`
- Suggested `npm run generate` instead of `make generate-api`
- Missed architectural constraints entirely

Only T1 scored above 3, likely due to generic reasoning.

### With TaskWing (Context Injected)

TaskWing's MCP integration provided the LLM with:
- **Decisions**: Technology choices and their rationale
- **Patterns**: File structure conventions and API patterns
- **Constraints**: Build requirements and deployment rules

The LLM consistently identified Go, referenced correct file paths (`internal/api/types.gen.go`), and used correct commands.

## Scoring Criteria

- **8–10**: Correct language, correct paths, correct commands, respects constraints
- **5–7**: Partially correct; right language but wrong paths, or right paths but wrong commands
- **1–4**: Wrong language or fundamentally incorrect assumptions
- **Rule**: Wrong tech stack identification = automatic score ≤ 3

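A minimal sketch of this rubric as code follows. The boolean checks are assumed to come from reading the answer (by hand or by the judge model), and the exact value within each band is left to judge discretion; the names are illustrative, not part of TaskWing.

```go
// Illustrative encoding of the scoring rubric above. Not TaskWing code;
// band boundaries mirror the criteria, exact scores are judge discretion.
package main

import "fmt"

type answerCheck struct {
	correctLanguage     bool // identified Go, not TypeScript
	correctPaths        bool // real files, e.g. internal/api/types.gen.go
	correctCommands     bool // e.g. make generate-api, not npm run generate
	respectsConstraints bool
}

func score(a answerCheck) int {
	if !a.correctLanguage {
		return 3 // rule: wrong tech stack identification caps the score at 3
	}
	switch {
	case a.correctPaths && a.correctCommands && a.respectsConstraints:
		return 8 // 8–10 band
	case a.correctPaths || a.correctCommands:
		return 5 // 5–7 band: partially correct
	default:
		return 4 // 1–4 band: fundamentally incorrect assumptions
	}
}

func main() {
	baseline := answerCheck{correctLanguage: false}
	withContext := answerCheck{correctLanguage: true, correctPaths: true, correctCommands: true, respectsConstraints: true}
	fmt.Println(score(baseline), score(withContext)) // 3 8
}
```
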
## What TaskWing Provides

During the evaluation, TaskWing injected the following context via the MCP protocol:

```
Decisions: 22 (e.g., "PostgreSQL over MongoDB", "OpenAPI codegen")
Patterns: 12 (e.g., "internal/api/handlers/ convention")
Constraints: 9 (e.g., "No .env in production — use SSM")
```

This context was extracted automatically by `taskwing bootstrap` in under 3 seconds.

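To make those counts concrete, here is one way such context items could be modeled. The struct and its field names are assumptions for illustration; this is not TaskWing's actual schema or MCP payload format.

```go
// Hypothetical shape for an injected context item. Field names and JSON
// tags are illustrative assumptions, not TaskWing's actual schema.
package main

import (
	"encoding/json"
	"fmt"
)

type contextItem struct {
	Kind      string `json:"kind"`                // "decision", "pattern", or "constraint"
	Statement string `json:"statement"`           // the fact the LLM should respect
	Rationale string `json:"rationale,omitempty"` // optional supporting reasoning
}

func main() {
	items := []contextItem{
		{Kind: "decision", Statement: "PostgreSQL over MongoDB"},
		{Kind: "pattern", Statement: "API handlers live under internal/api/handlers/"},
		{Kind: "constraint", Statement: "No .env in production; secrets come from SSM"},
	}
	out, err := json.MarshalIndent(items, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```
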
## Reproducing

1. Clone any Go or multi-language repository
2. Run `taskwing bootstrap` to extract context
3. Ask the same architectural questions with and without TaskWing's MCP server connected
4. Score responses on a 1–10 scale using the criteria above

## Limitations

- Single codebase evaluated (Go/React monorepo)
- Single LLM judge model (gpt-5-mini)
- 5 tasks may not capture all architectural reasoning scenarios
- Scores are relative — absolute quality depends on the model used

We plan to expand this evaluation to more codebases and models in future iterations.

0 commit comments
