Evaluation suite for measuring the accuracy of the Patch dark pattern detector. Supports both standard pattern detection and detailed field-level validation.
# Run detailed evals (validates severity, selectors, descriptions, etc.)
pnpm eval
# Run detailed evals using prompts/latest.txt (skips backend prompt regeneration)
pnpm eval:latest
# Run page evals only (pattern type + count validation)
pnpm eval:pages
# Run snippet evals only
pnpm eval:snippets
# Run all evals (snippets + pages)
pnpm eval:all

Note: No backend server needs to be running. Evals call the LLM directly, so just ensure backend/.env is configured with a valid LLM_API_KEY.
Validates the complete LLM output, including severity, descriptions, CSS selectors, evidence, confidence, and fix suggestions. Uses pattern matching with Jaccard similarity scoring for text fields, which must reach a minimum similarity of 0.5 to count as a match. Ideal for:
- Validating LLM output quality (descriptions, selectors, fixes)
- Regression testing after prompt or model changes
- A/B testing different prompts for accuracy
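
The Jaccard check on text fields described above can be sketched as follows. This is a minimal illustration, not the suite's actual implementation: the tokenization rule (lowercased alphanumeric word tokens) and the names `tokenize`, `jaccard`, and `textFieldMatches` are assumptions. Only the 0.5 threshold comes from the description.

```typescript
// Hypothetical helper names; the tokenization rule is an assumption.
const SIMILARITY_THRESHOLD = 0.5; // minimum similarity stated above

// Split text into a set of lowercased alphanumeric word tokens.
function tokenize(text: string): Set<string> {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set(words);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, defined here as 1 when both sets are empty.
function jaccard(a: string, b: string): number {
  const setA = tokenize(a);
  const setB = tokenize(b);
  if (setA.size === 0 && setB.size === 0) return 1;
  let intersection = 0;
  setA.forEach((token) => {
    if (setB.has(token)) intersection++;
  });
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}

// A text field counts as a match when its similarity reaches the threshold.
function textFieldMatches(expected: string, actual: string): boolean {
  return jaccard(expected, actual) >= SIMILARITY_THRESHOLD;
}
```

With this tokenization, an expected description and a reworded LLM description that share four of six distinct words score about 0.67 and would pass, while two descriptions with no words in common score 0.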
# Run detailed evals (default)
pnpm eval
# With verbose output (shows field-level matches)
pnpm eval -- -v
# With a custom/experimental prompt
pnpm eval -- -S prompts/v4-detailed.txt
# Direct command (from within evals/)
pnpm exec tsx src/cli.ts detailed

Tests pattern type detection and counts. Use for quick validation and performance benchmarks.
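
As a rough illustration of pattern type and count validation, the sketch below checks a list of detected pattern types against per-type minimum counts and a total minimum. The field names `expected_patterns`, `min_count`, and `expected_total_min` are taken from the page case format described later in this document. The function name `checkPageCase` and the exact pass/fail rule are assumptions, not the suite's actual logic.

```typescript
interface ExpectedPattern {
  type: string;
  min_count: number;
}

interface PageCase {
  id: string;
  expected_patterns: ExpectedPattern[];
  expected_total_min: number;
}

// Hypothetical check: returns a list of human-readable failures (empty means pass).
function checkPageCase(testCase: PageCase, detectedTypes: string[]): string[] {
  // Count how many times each pattern type was detected.
  const counts = new Map<string, number>();
  detectedTypes.forEach((t) => counts.set(t, (counts.get(t) ?? 0) + 1));

  const failures: string[] = [];
  // Each expected type must appear at least min_count times.
  testCase.expected_patterns.forEach((expected) => {
    const found = counts.get(expected.type) ?? 0;
    if (found < expected.min_count) {
      failures.push(`${expected.type}: expected at least ${expected.min_count}, found ${found}`);
    }
  });
  // The total number of detections must reach expected_total_min.
  if (detectedTypes.length < testCase.expected_total_min) {
    failures.push(`total: expected at least ${testCase.expected_total_min}, found ${detectedTypes.length}`);
  }
  return failures;
}
```

Under this assumed rule, extra detections of other types do not fail a case; they only count toward the total.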
# Run page evals
pnpm eval:pages
# Run snippet evals
pnpm eval:snippets
# Run all (snippets + pages)
pnpm eval:all
# Filter by pattern category (run from within evals/)
pnpm exec tsx src/cli.ts run --type snippets -c Confirmshaming -v

| Option | Description | Default |
|---|---|---|
| -o, --output | Output JSON path | auto-generated |
| --no-output | Disable file output | - |
| -s, --system-prompt | Custom system prompt (inline) | - |
| -S, --system-prompt-file | Load system prompt from file | - |
| --timeout | Request timeout in milliseconds | 900000 |
| -v, --verbose | Show detailed test output | false |

| Option | Description | Default |
|---|---|---|
| -t, --type | Eval type: all, snippets, pages | pages |
| -c, --category | Test specific pattern (e.g., Confirmshaming) | all |
| -o, --output | Output JSON path | auto-generated |
| --no-output | Disable file output | - |
| -s, --system-prompt | Custom system prompt (inline) | - |
| -S, --system-prompt-file | Load system prompt from file | - |
| --timeout | Request timeout in milliseconds | 900000 |
| -v, --verbose | Show detailed test output | false |

Other commands: list, categories
Test different system prompts to optimise detection accuracy. Save prompts in prompts/; results are automatically stored in results/ for comparison.
Current prompt: prompts/latest.txt always contains the most recent prompt and is used by pnpm eval:latest. Historical versions (v1–v5) are preserved in prompts/ for comparison and A/B testing.
# Run evals with the latest prompt (no backend prompt regeneration)
pnpm eval:latest
# Verbose output with the latest prompt
pnpm eval:latest -- -v
# A/B test a specific historical prompt
pnpm eval -- -S prompts/v4-detailed.txt
# Verbose field-level comparison
pnpm eval -- -S prompts/v4-detailed.txt -v
# Test an experimental prompt with standard (snippet/page) evals
pnpm exec tsx src/cli.ts run -S prompts/v3-detailed.txt

Add a snippet test case to dataset/snippets/<pattern>.json:
{
  "id": "confirmshaming_009",
  "html": "<button>No, I hate discounts</button>",
  "expected_patterns": ["Confirmshaming"],
  "description": "Example description"
}

Add the page HTML to fixtures/ and define the test case in dataset/pages/complex-pages.json:
{
  "id": "my_test",
  "fixture_path": "fixtures/my-page.html",
  "expected_patterns": [{ "type": "Fake Urgency", "min_count": 1 }],
  "expected_total_min": 1
}

Create a detailed test case in dataset/pages/<name>.json (any JSON file in pages/ with description/evidence fields on patterns is treated as detailed):
{
  "id": "test_detailed",
  "fixture_path": "fixtures/test.html",
  "strict_matching": false,
  "expected_patterns": [
    {
      "type": "Fake Urgency",
      "severity": "medium",
      "description": "Countdown timer creates false urgency",
      "evidence": "<span id=\"timer-1\">Deal ends in [MM:SS]</span>",
      "confidence": 1,
      "fixes": [
        {
          "selector": "#timer-1",
          "strategy": "hide",
          "fix": ""
        }
      ]
    }
  ]
}

Severity guidance (align with the system prompt):
- "high": Direct financial harm (hidden subscriptions, deceptive checkouts)
- "medium": Pressure without direct financial harm (timers, scarcity badges, confirmshaming)
- "low": Mild/borderline (minor visual tricks, informational friction)
Evidence guidance: Use the text or HTML element as a user would see it in the rendered page, not JavaScript source code.
10 recognized dark pattern types: Confirmshaming, Fake Urgency, Fake Scarcity, Fake Social Proof, Hidden Costs, Trick Wording, Nagging, Hidden Subscription, Preselection, Visual Interference
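
A dataset entry can be sanity-checked against the severity levels and pattern types listed above before running evals. The sketch below is illustrative only: `validateExpectedPattern` and the `ExpectedDetailedPattern` interface are hypothetical names, and it assumes type names must match the list exactly as spelled.

```typescript
// The 10 recognized pattern types and 3 severity levels, as listed in this document.
const PATTERN_TYPES = [
  "Confirmshaming", "Fake Urgency", "Fake Scarcity", "Fake Social Proof",
  "Hidden Costs", "Trick Wording", "Nagging", "Hidden Subscription",
  "Preselection", "Visual Interference",
];
const SEVERITIES = ["high", "medium", "low"];

// Hypothetical minimal shape of an expected pattern entry.
interface ExpectedDetailedPattern {
  type: string;
  severity?: string; // only present on detailed cases
}

// Returns a list of problems found in the entry (empty means it looks valid).
function validateExpectedPattern(p: ExpectedDetailedPattern): string[] {
  const errors: string[] = [];
  if (PATTERN_TYPES.indexOf(p.type) === -1) {
    errors.push(`unknown pattern type: ${p.type}`);
  }
  if (p.severity !== undefined && SEVERITIES.indexOf(p.severity) === -1) {
    errors.push(`unknown severity: ${p.severity}`);
  }
  return errors;
}
```

A check like this catches typos such as "Fake urgency" or "critical" in a dataset file before they show up as spurious eval failures.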
Standard: Precision, Recall, F1 Score, per-category breakdown, latency (avg, p50, p95)
Detailed: Pattern accuracy, field-level match rates (type, severity, description, evidence, selector, confidence), similarity scores
Results are saved as JSON in results/, with the full config and per-case details.
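
For reference, the standard metrics listed above can be computed from true positive, false positive, and false negative counts as sketched below. How the suite counts matches and which percentile method it uses are not specified here, so the nearest-rank percentile and the names `prf` and `percentile` are assumptions.

```typescript
interface Prf {
  precision: number;
  recall: number;
  f1: number;
}

// Precision, recall, and F1 from raw counts; each is defined as 0 when its denominator is 0.
function prf(tp: number, fp: number, fn: number): Prf {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Nearest-rank percentile (assumed method): the smallest sample such that
// at least p percent of all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}
```

For example, `prf(8, 2, 2)` gives precision, recall, and F1 of 0.8 each (up to floating-point rounding), and `percentile(latenciesMs, 95)` would give a p95 latency for a hypothetical `latenciesMs` array of per-case timings.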