diff --git a/transformation-config/skills/web-analytics/config.yaml b/transformation-config/skills/web-analytics/config.yaml new file mode 100644 index 0000000..f09e3cc --- /dev/null +++ b/transformation-config/skills/web-analytics/config.yaml @@ -0,0 +1,10 @@ +type: docs-only +template: description.md +description: Audit a PostHog web analytics setup for misconfigurations +tags: [web-analytics, audit] +shared_docs: [] +variants: + - id: doctor + display_name: Web Analytics Doctor + tags: [] + docs_urls: [] diff --git a/transformation-config/skills/web-analytics/description.md b/transformation-config/skills/web-analytics/description.md new file mode 100644 index 0000000..da2fe03 --- /dev/null +++ b/transformation-config/skills/web-analytics/description.md @@ -0,0 +1,115 @@ +# PostHog Web Analytics Doctor + +This skill audits an existing PostHog project for common web-analytics misconfigurations. It is a **read-only audit**: it never writes code or PostHog config. The output is a markdown report at `posthog-web-analytics-report.md` and a structured JSON file at `posthog-web-analytics-findings.json`, both summarizing findings with severity and remediation links. + +## Reference files + +{references} + +## Guiding tenets + +1. **Read-only.** Never modify the user's source code or PostHog project settings. This skill audits; it does not remediate. Remediation is a follow-up the user runs themselves. + +2. **Quote real data.** Every finding must cite the HogQL query that produced it and at least one concrete value (host name, event count, ratio). If a query returns nothing actionable, omit the finding — don't speculate. + +3. **No fabricated severities.** Use the severity rules in `references/checks.md` exactly. If a check's data doesn't pass the threshold for `warning`, it's `info` or omitted. + +4. **Skip checks gracefully.** If a check's prerequisite query fails mid-audit (missing permission, transient error, etc.), skip that check, record it in the `skipped` array of the JSON output, and continue. The JSON file MUST still be written even if some checks didn't run. Do **not** abort unless a hard precondition fails (see Abort statuses). + +5. **Bounded query window.** See `references/checks.md` for the 7-day default and 30-day expansion rule. The window is decided once at pre-flight and reused for every check. + +## Available MCP tools + +- `mcp__posthog__query-run` (HogQL) — primary tool. Use for all event queries. +- `mcp__posthog__project-get` — fetch project settings, including `app_urls` (the user-configured authorized URLs). +- `mcp__posthog__feature-flag-get-definition` — only if a check needs flag context. +- `mcp__posthog__docs-search` — to fetch the latest doc URL for a remediation link. + +## Pre-flight + +Before running any checks, verify the project has web analytics events and decide the analysis window in a single query: + +```sql +SELECT + countIf(timestamp > now() - INTERVAL 7 DAY) AS p7, + count() AS p30 +FROM events +WHERE event = '$pageview' + AND timestamp > now() - INTERVAL 30 DAY +``` + +- If `p30 = 0`, emit `[ABORT] No web analytics events`. The wizard middleware catches `[ABORT]` and terminates the run cleanly — do not halt yourself. +- If `p7 >= 100`, use a 7-day window for every check. Otherwise use 30 days and set `expandedFromDefault: true` in the JSON output. +- If the query returns a permission error, emit `[ABORT] Insufficient permissions`. + +Do not re-query to decide the window per check — the decision is made once here and reused. + +## How to run the audit + +Run ALL checks in `references/checks.md` without pausing for user confirmation. The user will review the final report. + +The checks share no state — issue their `query-run` calls in parallel when your tool harness supports concurrent calls. Note that Checks 3 and 4 share a single combined query (see `checks.md`); run that query once and apply both pass/fail rules to the result. + +For each check: + +1. Read the check's HogQL query and pass/fail rule. +2. Run the query verbatim via `query-run`. Do not modify the query (other than swapping `INTERVAL 7 DAY` for `INTERVAL 30 DAY` if pre-flight selected the 30-day window). +3. Apply the rule. If it fires, capture: severity (per the rule), host(s) affected, raw counts, and the remediation doc URL listed in the check. +4. Report `[STATUS] Running check N: ` before each check. +5. Append the finding (or "passed") to your in-memory findings list. + +Each check is independent and required. Do not skip a check based on intermediate findings. Do not invent new checks beyond the ones listed. + +## Output + +Produce the audit results in **three places**: + +1. **Write** `posthog-web-analytics-report.md` to the project root, following the format in `references/report-format.md`. Human-readable, archival. +2. **Write** `posthog-web-analytics-findings.json` to the project root, following the schema in `references/findings-schema.md`. Machine-readable; the wizard parses this to render the structured findings screen. The file MUST exist — the wizard's report screen depends on it. +3. **Display** the same report contents in chat as plain markdown (no extra commentary, no fenced code block wrapping the whole thing — just the report). This is what the user reads directly when running the skill outside the wizard. + +Both files MUST exist. The JSON and the markdown must describe the same set of findings — they are two views of the same audit. + +The report includes: + +- A summary section: total findings by severity (critical / warning / info), checks skipped. +- One section per finding: title, severity, affected host(s), evidence (query + counts), remediation steps, doc link. +- A "Checks passed" section listing every check that found no issues, so the user knows the audit was thorough. +- A "Checks skipped" section listing any checks that couldn't run, with the reason. + +After displaying the markdown report, output one final line confirming both file paths (e.g. `Report saved to posthog-web-analytics-report.md and posthog-web-analytics-findings.json`). No further commentary, no recap. + +## Constraints + +- Do NOT modify any source files. +- Do NOT write to PostHog (no creating dashboards, insights, actions, etc.). +- Do NOT run queries against more than 30 days of data — performance budget. +- Do NOT include personally identifiable information in the report (no email addresses, no user IDs, no session IDs, no IP addresses — host names, paths, and counts only). +- Do NOT fabricate or estimate values. Only report what the queries return. +- Findings that don't meet a check's threshold are simply omitted, not labeled "OK". (Severity values are defined in `references/checks.md`.) + +## Status + +Report progress with `[STATUS]` prefixed messages: + +- Verifying web analytics events +- Running check 1: Partial reverse proxy +- Running check 2: Dark authorized URLs +- Running check 3: Pageleave coverage per host +- Running check 4: Web Vitals coverage per host +- Running check 5: Duplicate canonical URLs across hosts +- Writing audit report +- Audit complete + +## Abort statuses + +Report abort states with `[ABORT]` prefixed messages — wording must match exactly so the wizard renders the right error UI: + +- `[ABORT] No web analytics events` — pre-flight finds no `$pageview` events in the last 30 days. +- `[ABORT] Insufficient permissions` — `query-run` returns a permissions error on the pre-flight query. + +Stop all further work after emitting `[ABORT]`. + +## Framework guidelines + +{commandments} diff --git a/transformation-config/skills/web-analytics/references/checks.md b/transformation-config/skills/web-analytics/references/checks.md new file mode 100644 index 0000000..69b0fc1 --- /dev/null +++ b/transformation-config/skills/web-analytics/references/checks.md @@ -0,0 +1,171 @@ +# Web Analytics Doctor — Audit Checks + +Each check is independent. Run them in order; a failure in one does not block the others. Severity values are **fixed** — do not adjust them. + +Time window: every query below uses `INTERVAL 7 DAY`. The pre-flight in `description.md` decides once whether to use 7 days or 30 days for the entire run; if it picked 30, swap `INTERVAL 7 DAY` for `INTERVAL 30 DAY` everywhere in this file before running. + +--- + +## Check 1 — Partial reverse proxy + +**What it detects:** Some hosts in the project route via a reverse proxy (so events survive ad blockers); others go directly to PostHog (where ad blockers can drop events). This is the support-case pattern: `example.com` proxied, `go.example.com` not. + +**Query:** + +```sql +SELECT + properties.$host AS host, + countIf( + coalesce(properties.$lib_custom_api_host, '') = '' + OR coalesce(properties.$lib_custom_api_host, '') LIKE '%i.posthog.com%' + OR coalesce(properties.$lib_custom_api_host, '') LIKE '%posthog.com%' + ) AS direct, + countIf( + coalesce(properties.$lib_custom_api_host, '') != '' + AND coalesce(properties.$lib_custom_api_host, '') NOT LIKE '%posthog.com%' + ) AS proxied, + count() AS total +FROM events +WHERE event = '$pageview' + AND timestamp > now() - INTERVAL 7 DAY + AND properties.$host IS NOT NULL + AND properties.$host != '' +GROUP BY host +HAVING total >= 100 +ORDER BY total DESC +LIMIT 50 +``` + +> **HogQL NULL gotcha:** in HogQL, `NULL != ''` evaluates to TRUE (not NULL as in standard SQL), so an unwrapped `properties.$lib_custom_api_host != ''` will silently match every row where the property is missing. Always wrap the property in `coalesce(..., '')` before comparing or using `LIKE`/`NOT LIKE`. + +**Pass/fail rule:** +- **Warning** if any host has `proxied >= 1` AND any other host has `proxied = 0 AND direct >= 1`, AND the two hosts share a common registrable parent domain (see "Parent-domain heuristic" below). +- **Info** if `proxied` and `direct` hosts exist but parent domains differ — cross-environment mixing (e.g. dev vs prod) is plausible and shouldn't trigger a warning. +- Otherwise: pass. + +**Parent-domain heuristic:** strip leading `www.`, then take the rightmost two dot-separated labels (so `go.example.com` and `app.example.com` both yield `example.com`). For these compound public suffixes, take the rightmost three labels instead: `co.uk`, `com.au`, `co.jp`, `co.nz`, `com.br`, `github.io`, `vercel.app`, `netlify.app`, `pages.dev`. This is a heuristic, not a public-suffix list — when in doubt, downgrade to **Info**. + +**Evidence to include in finding:** the proxied host(s), the direct host(s), each with their event counts. + +**Remediation:** https://posthog.com/docs/advanced/proxy + +--- + +## Check 2 — Dark authorized URLs + +**What it detects:** The user configured authorized URLs in PostHog (via project settings), but one or more of those URLs has zero events in the analysis window. This often means an ad blocker is silently eating events for that domain — or the SDK was never installed there. + +**Step A — fetch authorized URLs:** call `mcp__posthog__project-get` and read `app_urls` from the response. + +**Step B — query event volume per known host:** + +```sql +SELECT + properties.$host AS host, + count() AS pageviews +FROM events +WHERE event = '$pageview' + AND timestamp > now() - INTERVAL 7 DAY +GROUP BY host +ORDER BY pageviews DESC +LIMIT 200 +``` + +**Step C — compare:** for each entry in `app_urls`, parse out the hostname and check it against the query results in memory (do not issue one query per URL). + +Define `total_project_pageviews` = sum of `pageviews` across every row returned by Step B (all hosts, not just authorized ones). + +**Pass/fail rule:** +- **Warning** for each authorized URL whose hostname is absent from Step B's results entirely. +- **Warning** for each authorized URL whose `pageviews / total_project_pageviews < 0.01`, when at least one *other* authorized URL has `pageviews / total_project_pageviews >= 0.10`. (The peer threshold avoids firing on projects where every authorized URL is low-volume.) +- Skip this check if `app_urls` is empty (note in report). + +**Evidence:** the dark host name, total project pageviews for context, and any peer authorized hosts with healthy volume. + +**Remediation:** https://posthog.com/docs/web-analytics/faq + +--- + +## Checks 3 & 4 — Pageleave and Web Vitals coverage per host + +These two checks share a single combined query — run it once and apply both pass/fail rules to the result. + +**What Check 3 detects:** A host emits `$pageview` but few or no `$pageleave` events. Pageleave drives bounce rate and session duration — without it, web analytics dashboards under-report engagement for that domain. + +**What Check 4 detects:** A host has pageviews but no `$web_vitals` events, meaning LCP/CLS/INP performance metrics aren't captured. + +**Combined query:** + +```sql +SELECT + properties.$host AS host, + countIf(event = '$pageview') AS pageviews, + countIf(event = '$pageleave') AS pageleaves, + countIf(event = '$web_vitals') AS web_vitals, + round(countIf(event = '$pageleave') / nullif(countIf(event = '$pageview'), 0), 3) AS pageleave_ratio +FROM events +WHERE event IN ('$pageview', '$pageleave', '$web_vitals') + AND timestamp > now() - INTERVAL 7 DAY + AND properties.$host IS NOT NULL + AND properties.$host != '' +GROUP BY host +HAVING pageviews >= 100 +ORDER BY pageviews DESC +LIMIT 50 +``` + +**Check 3 — Pageleave pass/fail rule:** +- **Info** for any host with `pageleaves = 0` (and `pageviews >= 100`, which the `HAVING` already enforces). +- **Warning** for any host with `pageleaves > 0 AND pageleave_ratio < 0.5` (less than half as many pageleaves as pageviews). Mutually exclusive with the Info rule above — never emit both for the same host. +- **Evidence:** host, pageview count, pageleave count, ratio. +- **Remediation:** https://posthog.com/docs/libraries/js#config — set `capture_pageleave: true`. + +**Check 4 — Web Vitals pass/fail rule:** +- **Info** (not warning — many setups intentionally skip vitals) for any host with `web_vitals = 0`. +- **Evidence:** host, pageview count. +- **Remediation:** https://posthog.com/docs/web-analytics/web-vitals + +Emit Check 3 and Check 4 as separate findings (with their own `checkId` values per `findings-schema.md`) even though they came from one query. + +--- + +## Check 5 — Duplicate canonical URLs across hosts + +**What it detects:** The same path (e.g. `/pricing`) is tracked under multiple `$host` values. This often means you have a marketing site and an app subdomain that both render the same content but get separate analytics — a sign that one is leaking or that proxy/canonical config is inconsistent. + +**Query:** + +```sql +SELECT + properties.$pathname AS path, + groupUniqArray(properties.$host) AS hosts, + count() AS pageviews +FROM events +WHERE event = '$pageview' + AND timestamp > now() - INTERVAL 7 DAY + AND properties.$host IS NOT NULL + AND properties.$host != '' + AND properties.$pathname IS NOT NULL + AND properties.$pathname != '' +GROUP BY path +HAVING length(hosts) >= 2 AND pageviews >= 100 +ORDER BY pageviews DESC +LIMIT 25 +``` + +**Pass/fail rule:** +- **Info** if any path appears under ≥ 2 hosts with combined `pageviews >= 100`. + +**Evidence:** the path, the list of hosts, combined pageview count. + +**Remediation:** https://posthog.com/docs/web-analytics/faq — review `$current_url` / `$host` capture, confirm reverse proxy and canonicalization across all domains. + +--- + +## Severity ladder + +| Severity | When to use | +| --- | --- | +| `critical` | Reserved — none of the current checks emit critical. Future checks may. | +| `warning` | Active misconfiguration likely causing data loss. Action required. | +| `info` | Possibly intentional, but worth surfacing. The user decides. | diff --git a/transformation-config/skills/web-analytics/references/findings-schema.md b/transformation-config/skills/web-analytics/references/findings-schema.md new file mode 100644 index 0000000..4678fdd --- /dev/null +++ b/transformation-config/skills/web-analytics/references/findings-schema.md @@ -0,0 +1,137 @@ +# Web Analytics Doctor — Findings JSON Schema + +The audit MUST emit `posthog-web-analytics-findings.json` at the project root with this exact shape. The wizard reads this file to render its structured findings screen — schema drift breaks the UX. + +## Top-level shape + +```json +{ + "schemaVersion": 1, + "auditedAt": "2026-05-04T15:30:00Z", + "windowDays": 7, + "expandedFromDefault": false, + "findings": [ ... ], + "passed": [ ... ], + "skipped": [ ... ] +} +``` + +| Field | Type | Description | +| --- | --- | --- | +| `schemaVersion` | integer | Always `1` for this version. Bumped if the schema changes. | +| `auditedAt` | ISO 8601 string | UTC timestamp when the audit ran. | +| `windowDays` | integer | The analysis window actually used (7 or 30). | +| `expandedFromDefault` | boolean | `true` if you expanded from 7 days because the 7-day window was sparse. | +| `findings` | array | Each entry is one warning / info / critical finding. See below. | +| `passed` | array of strings | `checkId` of each check that ran cleanly with no findings. | +| `skipped` | array | Each entry: `{checkId, reason}`. Skipped because a precondition was missing (e.g. no app_urls). | + +## Finding entry + +```json +{ + "checkId": "partial_reverse_proxy", + "severity": "warning", + "title": "Mixed proxy configuration across hosts", + "affected": ["example.com", "go.example.com"], + "evidence": [ + "example.com: 385.9k pageviews, 0 proxied", + "go.example.com: 1.0k pageviews, 100% proxied" + ], + "remediationUrl": "https://posthog.com/docs/advanced/proxy" +} +``` + +| Field | Type | Description | +| --- | --- | --- | +| `checkId` | string | Stable enum from the table below. Snake_case. | +| `severity` | `"critical"` \| `"warning"` \| `"info"` | Per the rule in `checks.md`. Do not improvise. | +| `title` | string | Short human-readable headline (≤ 80 chars). | +| `affected` | array of strings | Host names, paths, or URLs the finding concerns. May be empty for project-wide findings. | +| `evidence` | array of strings | Concrete data points. Each ≤ 120 chars. Counts above 1,000 rounded to thousands (e.g. `385.9k`). No PII. | +| `remediationUrl` | string (URL) | The doc link from `checks.md` for this check. | + +## Stable `checkId` values + +These are the enum values the wizard knows about. Use them exactly: + +- `partial_reverse_proxy` — Check 1 +- `dark_authorized_urls` — Check 2 +- `pageleave_coverage` — Check 3 +- `web_vitals_coverage` — Check 4 +- `duplicate_canonical_urls` — Check 5 + +If you add a new check in the future, pick a snake_case ID and update this table — the wizard's metadata map needs to know about it. + +## Empty / healthy project + +If no findings fire and no checks were skipped: + +```json +{ + "schemaVersion": 1, + "auditedAt": "2026-05-04T15:30:00Z", + "windowDays": 7, + "expandedFromDefault": false, + "findings": [], + "passed": [ + "partial_reverse_proxy", + "dark_authorized_urls", + "pageleave_coverage", + "web_vitals_coverage", + "duplicate_canonical_urls" + ], + "skipped": [] +} +``` + +## Worked example + +A run against a project with the partial-proxy and pageleave issues, where `dark_authorized_urls` was skipped because `app_urls` is empty: + +```json +{ + "schemaVersion": 1, + "auditedAt": "2026-05-04T15:30:00Z", + "windowDays": 30, + "expandedFromDefault": true, + "findings": [ + { + "checkId": "partial_reverse_proxy", + "severity": "warning", + "title": "Mixed proxy configuration across hosts", + "affected": ["app.example.com", "go.example.com"], + "evidence": [ + "app.example.com: 50.2k pageviews, 100% proxied", + "go.example.com: 12.8k pageviews, 0 proxied" + ], + "remediationUrl": "https://posthog.com/docs/advanced/proxy" + }, + { + "checkId": "pageleave_coverage", + "severity": "warning", + "title": "Pageleave coverage low on go.example.com", + "affected": ["go.example.com"], + "evidence": ["12.8k pageviews, 1.2k pageleaves, ratio 0.094"], + "remediationUrl": "https://posthog.com/docs/libraries/js#config" + } + ], + "passed": [ + "web_vitals_coverage", + "duplicate_canonical_urls" + ], + "skipped": [ + {"checkId": "dark_authorized_urls", "reason": "no app_urls configured"} + ] +} +``` + +## Validation + +The wizard validates this file with Zod on read. Common failures: + +- Unknown `checkId` → wizard treats as known but uses a generic title. +- Unknown `severity` → wizard rejects the entire file and shows an error. +- Missing required fields → wizard rejects the entire file. + +If you can't produce all required fields (e.g. a check failed to run), prefer the `skipped` array over emitting a malformed finding. diff --git a/transformation-config/skills/web-analytics/references/report-format.md b/transformation-config/skills/web-analytics/references/report-format.md new file mode 100644 index 0000000..0d385f5 --- /dev/null +++ b/transformation-config/skills/web-analytics/references/report-format.md @@ -0,0 +1,67 @@ +# Web Analytics Doctor — Report Format + +Write the report to `posthog-web-analytics-report.md` at the project root. + +## Required structure + +```markdown +# PostHog Web Analytics Audit + +_Audit run: _ +_Window analyzed: _ +_Project: _ + +## Summary + +- **Warnings:** N +- **Info:** N +- **Checks passed:** N +- **Checks skipped:** N + +## Findings + +### + +- **Severity:** warning | info +- **Check:** +- **Affected:** +- **Evidence:** + - + - +- **Remediation:** + +(Repeat per finding, ordered: warnings first, then info.) + +## Checks passed + +- ✓ + +## Checks skipped + +- ✗ + +## Next steps + +<2–3 sentences pointing the user at the most impactful finding to fix first, or a clean-bill-of-health note if there are no findings.> +``` + +## Tone & content rules + +- **Be specific.** Cite host names and counts. "example.com: 12,403 pageviews, 0 proxied" beats "some hosts have proxy gaps." +- **Don't suggest code changes.** This is an audit. Point at the doc URL; the user's wizard or their own engineer handles remediation. +- **No PII.** Hosts, paths, and counts only. Never include user IDs, email addresses, distinct IDs, or specific session IDs in evidence. +- **Round counts** above 1,000 to thousands (12,403 → 12.4k) to avoid implying false precision. +- **No emojis** beyond the `✓` and `✗` markers in the passed/skipped lists. +- **Order findings** by severity (warnings before info), then by impact (highest event volume first). + +## When there are zero findings + +Replace the `Findings` section with: + +```markdown +## Findings + +No issues found. Web analytics setup looks healthy across all checks. +``` + +Still include `Checks passed` so the user can see what was verified.