Patch Dark Pattern Detector Evals

Evaluation suite for measuring the accuracy of the Patch dark pattern detector. Supports both standard pattern detection and detailed field-level validation.

Quick Start

# Run detailed evals (validates severity, selectors, descriptions, etc.)
pnpm eval

# Run detailed evals using prompts/latest.txt (skips backend prompt regeneration)
pnpm eval:latest

# Run page evals only (pattern type + count validation)
pnpm eval:pages

# Run snippet evals only
pnpm eval:snippets

# Run all evals (snippets + pages)
pnpm eval:all

Note: No backend server needs to be running. Evals call the LLM directly, so you only need backend/.env configured with a valid LLM_API_KEY.

Detailed Evaluation (Default)

Validates the complete LLM output, including severity, descriptions, CSS selectors, evidence, confidence, and fix suggestions. Uses pattern matching with Jaccard similarity scoring for text fields; a text field must reach a similarity of at least 0.5 to count as a match. Ideal for:

Validating LLM output quality (descriptions, selectors, fixes)
Regression testing after prompt or model changes
A/B testing different prompts for accuracy
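To make the 0.5 threshold concrete, here is a minimal sketch of word-level Jaccard similarity. The function names and the tokenization (lowercase, split on non-alphanumerics) are assumptions for illustration; the suite's actual implementation may tokenize or normalize differently.

```typescript
// Hypothetical helper: splits text into a set of lowercase word tokens.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets.
function jaccard(a: string, b: string): number {
  const setA = tokenize(a);
  const setB = tokenize(b);
  if (setA.size === 0 && setB.size === 0) return 1; // two empty texts are identical
  const intersection = Array.from(setA).filter((t) => setB.has(t)).length;
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}

// Threshold taken from this README; the name is an assumption.
const SIMILARITY_THRESHOLD = 0.5;

function textFieldMatches(expected: string, actual: string): boolean {
  return jaccard(expected, actual) >= SIMILARITY_THRESHOLD;
}

// 4 shared tokens out of 5 distinct tokens gives 0.8, which passes.
console.log(
  jaccard("Countdown timer creates false urgency", "Countdown timer creates urgency")
);
```

Under this sketch, a description only needs to share most of its vocabulary with the expected text, so paraphrases that reuse the key terms still pass while unrelated descriptions fail.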

# Run detailed evals (default)
pnpm eval

# With verbose output (shows field-level matches)
pnpm eval -- -v

# With a custom/experimental prompt
pnpm eval -- -S prompts/v4-detailed.txt

# Direct command (from within evals/)
pnpm exec tsx src/cli.ts detailed

Standard Evaluation

Tests pattern type detection and counts. Use for quick validation and performance benchmarks.

# Run page evals
pnpm eval:pages

# Run snippet evals
pnpm eval:snippets

# Run all (snippets + pages)
pnpm eval:all

# Filter by pattern category (run from within evals/)
pnpm exec tsx src/cli.ts run --type snippets -c Confirmshaming -v

CLI Options

`detailed` Command (Default)

| Option | Description | Default |
| --- | --- | --- |
| `-o, --output` | Output JSON path | auto-generated |
| `--no-output` | Disable file output | - |
| `-s, --system-prompt` | Custom system prompt (inline) | - |
| `-S, --system-prompt-file` | Load system prompt from file | - |
| `--timeout` | Request timeout in milliseconds | 900000 |
| `-v, --verbose` | Show detailed test output | false |

`run` Command

| Option | Description | Default |
| --- | --- | --- |
| `-t, --type` | Eval type: `all`, `snippets`, `pages` | `pages` |
| `-c, --category` | Test specific pattern (e.g., `Confirmshaming`) | all |
| `-o, --output` | Output JSON path | auto-generated |
| `--no-output` | Disable file output | - |
| `-s, --system-prompt` | Custom system prompt (inline) | - |
| `-S, --system-prompt-file` | Load system prompt from file | - |
| `--timeout` | Request timeout in milliseconds | 900000 |
| `-v, --verbose` | Show detailed test output | false |

Other commands: list, categories

Prompt Engineering

Test different system prompts to optimise detection accuracy. Save prompts in prompts/; results are automatically stored in results/ for comparison.

Current prompt: prompts/latest.txt always contains the most recent prompt and is used by pnpm eval:latest. Historical versions (v1–v5) are preserved in prompts/ for comparison and A/B testing.

# Run evals with the latest prompt (no backend prompt regeneration)
pnpm eval:latest

# Verbose output with the latest prompt
pnpm eval:latest -- -v

# A/B test a specific historical prompt
pnpm eval -- -S prompts/v4-detailed.txt

# Verbose field-level comparison
pnpm eval -- -S prompts/v4-detailed.txt -v

# Test an experimental prompt with standard evals (run from within evals/; --type defaults to pages)
pnpm exec tsx src/cli.ts run -S prompts/v3-detailed.txt

Adding Test Cases

Snippet Cases (Standard)

Add to dataset/snippets/<pattern>.json:

{
  "id": "confirmshaming_009",
  "html": "<button>No, I hate discounts</button>",
  "expected_patterns": ["Confirmshaming"],
  "description": "Example description"
}

Page Cases (Standard)

Add HTML to fixtures/ and define in dataset/pages/complex-pages.json:

{
  "id": "my_test",
  "fixture_path": "fixtures/my-page.html",
  "expected_patterns": [{ "type": "Fake Urgency", "min_count": 1 }],
  "expected_total_min": 1
}
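As a rough illustration of what "pattern type + count validation" could look like for a page case, the sketch below checks a list of detected pattern types against `min_count` and `expected_total_min`. The interfaces mirror the JSON above, but `checkPageCase` and its "at least" semantics are assumptions inferred from the field names, not the suite's actual code.

```typescript
// Shapes mirror the JSON above; the checker itself is a hypothetical illustration.
interface ExpectedPattern {
  type: string;
  min_count: number;
}

interface PageCase {
  id: string;
  fixture_path: string;
  expected_patterns: ExpectedPattern[];
  expected_total_min: number;
}

// Returns human-readable failures; an empty array means the case passed.
function checkPageCase(testCase: PageCase, detectedTypes: string[]): string[] {
  // Count how many times each pattern type was detected.
  const counts: Record<string, number> = {};
  for (const t of detectedTypes) {
    counts[t] = (counts[t] ?? 0) + 1;
  }

  const failures: string[] = [];
  for (const expected of testCase.expected_patterns) {
    const found = counts[expected.type] ?? 0;
    if (found < expected.min_count) {
      failures.push(`${expected.type}: expected at least ${expected.min_count}, found ${found}`);
    }
  }
  if (detectedTypes.length < testCase.expected_total_min) {
    failures.push(`total: expected at least ${testCase.expected_total_min}, found ${detectedTypes.length}`);
  }
  return failures;
}

const myTest: PageCase = {
  id: "my_test",
  fixture_path: "fixtures/my-page.html",
  expected_patterns: [{ type: "Fake Urgency", min_count: 1 }],
  expected_total_min: 1,
};

console.log(checkPageCase(myTest, ["Fake Urgency", "Nagging"]));
console.log(checkPageCase(myTest, []));
```

Because the counts are treated as minimums in this sketch, extra detections (such as the `Nagging` entry above) do not fail the case; whether the real suite penalizes them is not covered here.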

Detailed Page Cases

Create in dataset/pages/<name>.json (any JSON file in pages/ with description/evidence fields on patterns is treated as detailed):

{
  "id": "test_detailed",
  "fixture_path": "fixtures/test.html",
  "strict_matching": false,
  "expected_patterns": [
    {
      "type": "Fake Urgency",
      "severity": "medium",
      "description": "Countdown timer creates false urgency",
      "evidence": "<span id=\"timer-1\">Deal ends in [MM:SS]</span>",
      "confidence": 1,
      "fixes": [
        {
          "selector": "#timer-1",
          "strategy": "hide",
          "fix": ""
        }
      ]
    }
  ]
}

Severity guidance (align with the system prompt):

"high": Direct financial harm – hidden subscriptions, deceptive checkouts
"medium": Pressure without direct financial harm – timers, scarcity badges, confirmshaming
"low": Mild/borderline – minor visual tricks, informational friction

Evidence guidance: Use the text or HTML element as a user would see it in the rendered page, not JavaScript source code.

Pattern Categories

10 recognized dark pattern types: Confirmshaming, Fake Urgency, Fake Scarcity, Fake Social Proof, Hidden Costs, Trick Wording, Nagging, Hidden Subscription, Preselection, Visual Interference
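When adding test cases by hand, a typo in a pattern type or severity is easy to miss. The following sketch is a hypothetical pre-flight check, not part of the suite: it validates an expected pattern against the ten types listed above and the three severity levels from the severity guidance.

```typescript
// The ten types listed in this README.
const PATTERN_TYPES: readonly string[] = [
  "Confirmshaming",
  "Fake Urgency",
  "Fake Scarcity",
  "Fake Social Proof",
  "Hidden Costs",
  "Trick Wording",
  "Nagging",
  "Hidden Subscription",
  "Preselection",
  "Visual Interference",
];

// The three severity levels from the severity guidance.
const SEVERITIES: readonly string[] = ["high", "medium", "low"];

// Hypothetical validator: returns a list of problems, empty if the entry looks fine.
function validateExpectedPattern(p: { type: string; severity?: string }): string[] {
  const problems: string[] = [];
  if (PATTERN_TYPES.indexOf(p.type) === -1) {
    problems.push(`unknown pattern type: "${p.type}"`);
  }
  // Severity only appears on detailed cases, so it is optional here.
  if (p.severity !== undefined && SEVERITIES.indexOf(p.severity) === -1) {
    problems.push(`unknown severity: "${p.severity}"`);
  }
  return problems;
}

console.log(validateExpectedPattern({ type: "Fake Urgency", severity: "medium" }));
console.log(validateExpectedPattern({ type: "Fake urgency", severity: "critical" }));
```

The comparison is deliberately case-sensitive in this sketch, so `Fake urgency` is flagged rather than silently normalized.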

Metrics

Standard: Precision, Recall, F1 Score, per-category breakdown, latency (avg, p50, p95)

Detailed: Pattern accuracy, field-level match rates (type, severity, description, evidence, selector, confidence), similarity scores

Results are saved as JSON in results/ with full config and per-case details.
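For readers less familiar with the standard metrics, here is a minimal sketch of how precision, recall, F1, and latency percentiles are conventionally computed. The function names, the true/false positive counting, and the nearest-rank percentile method are assumptions for illustration; the suite may count matches or interpolate percentiles differently.

```typescript
interface Counts {
  truePositives: number;  // expected patterns that were detected
  falsePositives: number; // detected patterns that were not expected
  falseNegatives: number; // expected patterns that were missed
}

function precisionRecallF1(c: Counts): { precision: number; recall: number; f1: number } {
  const detected = c.truePositives + c.falsePositives;
  const expected = c.truePositives + c.falseNegatives;
  // Define each ratio as 0 when its denominator is 0 to avoid NaN.
  const precision = detected === 0 ? 0 : c.truePositives / detected;
  const recall = expected === 0 ? 0 : c.truePositives / expected;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Nearest-rank percentile: the smallest value such that at least p% of samples are <= it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty list");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

const latenciesMs = [100, 200, 300, 400];
console.log(precisionRecallF1({ truePositives: 2, falsePositives: 2, falseNegatives: 2 }));
console.log(percentile(latenciesMs, 50), percentile(latenciesMs, 95));
```

With small test sets, p95 under nearest-rank is often just the slowest request, so it is worth reading alongside the average rather than on its own.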
