
Add browser-reverse skill — OpenAPI 3.1 from browser-trace captures #88

Open
derekmeegan wants to merge 2 commits into main from add-browser-reverse-skill



@derekmeegan derekmeegan commented Apr 29, 2026

Summary

browser-reverse consumes a browser-trace run directory and emits an OpenAPI 3.1 spec for the publicly-observable HTTP API of any website, plus a human-readable coverage report and per-endpoint confidence metadata. Pure offline post-processing — composes cleanly with the existing browser-trace skill rather than duplicating capture.

Pipeline (each stage is a discrete script for debuggability via --stage):

load → filter → normalize → infer → emit

Highlights

  • Path templating — UUIDs, integers, hex/base62 IDs, plus a second-pass slug detector for varying alpha segments. Multi-param paths get {id}, {id2}, etc.
  • Schema inference (lib/schema-merge.mjs) — JSON-Schema from samples with required-intersection, type unions, format hints (date-time, uri, email, uuid), and enum detection that requires meaningful repetition (not just low cardinality).
  • Component hoisting with $ref — recurses into nested object/array schemas, hoists when referenced ≥ 2 times OR when it's an object with ≥ 4 properties. Names derived from path tokens.
  • Redaction — credentials in headers (Authorization, Cookie, *-token, etc.), in body keys (password, apiKey, etc.), and value patterns (JWTs, emails, phone numbers). Replaces values with <redacted> to preserve types for inference.
  • browse network on integration — pass --bodies <path> (or stash bodies under <run>/cdp/network/bodies/, which is auto-detected) to join real response bodies into the trace by CDP requestId. Without it, the spec has request bodies but no response-body schemas (the browse cdp firehose doesn't embed bodies).
  • Cross-origin path collisions handled — when two origins serve the same (method, path), the higher-sample operation wins and other origins are recorded under x-also-served-from rather than silently dropped.
  • Honest reporting — report.md lists every endpoint with samples, statuses, confidence, and normalization flags (single-sample, single-status, mixed-content-types, divergent-response-shape, request-body-only-on-some-samples).
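
To make the path-templating bullet concrete, here is a minimal sketch. It is not the code in lib/path-template.mjs: it handles only UUID, integer, and long-hex segments, and it leaves out base62 IDs and the slug-detection second pass. The regexes, the 16-character hex threshold, and the names `looksLikeId` and `templatizePath` are assumptions made for illustration.

```javascript
// Hypothetical sketch of ID-based path templating; not the PR's implementation.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const INT_RE = /^\d+$/;
const HEX_RE = /^[0-9a-f]{16,}$/i; // assumed threshold for a "long hex ID"

// True when a single path segment looks like an opaque identifier.
function looksLikeId(segment) {
  return UUID_RE.test(segment) || INT_RE.test(segment) || HEX_RE.test(segment);
}

// Replace ID-like segments with {id}, {id2}, {id3}, ... in order of appearance.
function templatizePath(path) {
  let count = 0;
  return path
    .split('/')
    .map((segment) => {
      if (!looksLikeId(segment)) return segment;
      count += 1;
      return count === 1 ? '{id}' : `{id${count}}`;
    })
    .join('/');
}

console.log(templatizePath('/pixel/123/visitor/456/cerebro'));
// → /pixel/{id}/visitor/{id2}/cerebro
```

Numbering parameters by position keeps the names unique within one path, which OpenAPI requires for path parameters.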

Composition with browser-trace

node ../browser-trace/scripts/start-capture.mjs 9222 my-site
browse env local 9222
browse network on                                 # capture bodies (recommended)
browse open https://example.com
# ...drive flows...
cp -r "$(browse network path | jq -r .path)" .o11y/my-site/cdp/network/bodies/
browse network off
node ../browser-trace/scripts/stop-capture.mjs my-site
node ../browser-trace/scripts/bisect-cdp.mjs my-site

node scripts/discover.mjs --run .o11y/my-site

End-to-end testing

The pipeline ran cleanly against six sites, and five real bugs surfaced and were fixed during this work:

| Site | Outcome |
| --- | --- |
| Hacker News | 7 endpoints; query-param type inference (integer vs string on id based on values) |
| jsonplaceholder.typicode.com | 6 endpoints; POST/PUT body schemas, multi-status 200+404, header + body redaction |
| derekmeegan.com (Next.js) | 4 endpoints; _rsc query param, Vercel analytics body schema, mixed-content-types detection |
| browserbase.com | 39 endpoints across 14 origins; multi-param path /pixel/{id}/visitor/{id2}/cerebro, 12 components hoisted |
| browser-use.com | 23 endpoints; discovered /api/md/<slug> LLM-friendly markdown export endpoint |
| reddit.com | 20 endpoints, 30 components, 2 servers; full schema for /svc/shreddit/events (Reddit's internal telemetry, 18 nested types), live ExposeVariant GraphQL exposure capturing experiment names |

Bugs surfaced and fixed during E2E:

  1. Enum over-detection — required distinct ≤ floor(samples/2) so unique IDs don't become enums.
  2. Component hoisting silently disabled — {...}.length is undefined; rewrote to use Object.keys(...).length and recurse into nested schemas.
  3. Redaction double-counting — redactBody() called twice per body; redact once and reuse.
  4. YAML emitter producing invalid scalars — @, `, #, etc. as first character now trigger quoting (was breaking on @vercel/analytics/react).
  5. Cross-origin path collision data loss — paths.<path>.<method> is unique in OpenAPI; higher-sample winner now recorded with x-also-served-from extension.
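
Fixes 1 and 2 above can be illustrated with a short sketch. This is a reading of the rules as stated in this description, not the code in lib/schema-merge.mjs; the function names `detectEnum` and `shouldHoist` and their signatures are hypothetical.

```javascript
// Hypothetical sketch of two rules from the bug list; not the PR's code.

// Fix 1: treat a field as an enum only when values repeat meaningfully,
// i.e. the number of distinct values is at most floor(samples / 2).
function detectEnum(values) {
  const distinct = new Set(values);
  const limit = Math.floor(values.length / 2);
  if (distinct.size === 0 || distinct.size > limit) return null;
  return [...distinct];
}

// Fix 2: plain objects have no .length, so the property count must come
// from Object.keys(...). Hoist when referenced at least twice, or when the
// schema is an object with at least four properties.
function shouldHoist(schema, refCount) {
  const propCount = Object.keys(schema.properties ?? {}).length;
  return refCount >= 2 || (schema.type === 'object' && propCount >= 4);
}

console.log(detectEnum(['open', 'closed', 'open', 'closed'])); // two distinct values in four samples
console.log(detectEnum(['id1', 'id2', 'id3']));                // all unique, so not an enum
console.log(({ a: 1 }).length);                                // undefined, the root cause of bug 2
```

With this threshold a single sample can never produce an enum, because floor(1/2) is 0.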

Files

  • SKILL.md / REFERENCE.md — skill docs, file format reference, jq recipes, troubleshooting
  • scripts/discover.mjs — top-level dispatcher with --stage for partial runs
  • scripts/{load,filter,normalize,infer,emit}.mjs — pipeline stages
  • scripts/lib/{io,redact,path-template,schema-merge,yaml}.mjs — pure helpers
  • BODY-CAPTURE-LIFT.md — design doc for adding native body capture to browser-trace (alternative to the current browse network on pairing). Open question for maintainers; no code change in this PR.

Test plan

  • Run end-to-end against a public site of your choice (e.g. your own marketing site or a public docs page) following the workflow in SKILL.md
  • Verify openapi.yaml parses with a YAML library (python -c "import yaml; yaml.safe_load(open('...'))")
  • Verify openapi.json parses (jq . openapi.json)
  • Confirm report.md correctly flags low-confidence endpoints
  • Try the --bodies flag with a browse network on capture and confirm response-body schemas appear in the spec

🤖 Generated with Claude Code


Note

Medium Risk
New end-to-end pipeline that processes potentially sensitive trace data (including optional request/response bodies) and emits schemas/specs; correctness and redaction behavior are important to avoid leaking secrets or producing misleading specs.

Overview
Adds a new browser-to-api skill that post-processes a browser-trace run into a best-effort OpenAPI 3.1 spec plus report.md, confidence.json, and per-endpoint redacted samples, implemented as a discover.mjs pipeline (load→filter→normalize→infer→emit).

The pipeline pairs CDP request/response events (optionally joining full bodies from browse network on via --bodies), filters noise via include/exclude/origin rules, templatizes paths (IDs + slug inference), infers JSON Schemas with redaction, hoists repeated schemas into components, and emits YAML/JSON using an in-repo YAML writer; docs (SKILL.md, REFERENCE.md) describe flags, file formats, and troubleshooting.

Reviewed by Cursor Bugbot for commit 9446f91. Bugbot is set up for automated code reviews on this repo.

…aptures

Consumes a browser-trace run (.o11y/<run>/), pairs CDP request/response
events, templatizes paths, infers JSON schemas from samples, and emits an
OpenAPI 3.1 document with a coverage report and confidence metadata.

Pipeline: load → filter → normalize → infer → emit. Each stage is a
discrete script writing to intermediate/ for debuggability. Optional
--bodies <path> flag joins a `browse network on` capture by CDP requestId
so response bodies feed into schema inference.

E2E tested against Hacker News, jsonplaceholder, derekmeegan.com,
browserbase.com, browser-use.com, reddit.com.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread skills/browser-reverse/scripts/filter.mjs Outdated
Comment thread skills/browser-to-api/scripts/infer.mjs
Comment thread skills/browser-to-api/scripts/load.mjs

@shrey150 shrey150 left a comment


Some things to fix or address before rereviewing

Comment thread skills/browser-reverse/SKILL.md Outdated

```
browser-trace → .o11y/<run>/cdp/network/{requests,responses}.jsonl
discover-api-spec → .o11y/<run>/api-spec/openapi.yaml + report.md
```

Is this skill actually defined anywhere in this PR?

Comment thread skills/browser-reverse/SKILL.md Outdated

`discover.mjs` auto-detects `<run>/cdp/network/bodies/`. To use a body capture from elsewhere (e.g. didn't snapshot, want the live `browse network` dir), pass `--bodies <path>` explicitly.

Then deliver the artifacts to the user (`exec.sendFile()` for `openapi.yaml` and `report.md`).

exec.sendFile() is for bb not for general use right?

@derekmeegan (Contributor, Author) replied:

this was from a claude memory i think lol...

@@ -0,0 +1,118 @@
# Adding Response Body Capture to `browser-trace` — Lift Estimate

Is this plan mode slop lol

Comment thread skills/browser-reverse/package.json Outdated
@@ -0,0 +1,6 @@
{
"name": "browser-reverse",

I would still like a different name, like /discover-api or /browser-to-api or /website-to-api

Comment thread skills/browser-reverse/REFERENCE.md Outdated
@@ -0,0 +1,240 @@
# Browser Reverse — Reference

I'm not sure this is what a REFERENCE.md file should be - it should exhaustively describe all commands used by the skill, I would maybe recommend removing the Pipeline portion here

Renaming and doc cleanup (per shrey150):
- Rename skill from `browser-reverse` to `browser-to-api`. Updates SKILL.md
  frontmatter + heading, package.json, REFERENCE.md heading, the OpenAPI
  doc's `info.description`, and the report.md heading.
- Fix the stale `discover-api-spec` reference in SKILL.md's composition diagram
  (left over from an earlier rename).
- Drop `BODY-CAPTURE-LIFT.md` from the PR; it's a separate proposal.
- Remove the `exec.sendFile()` reference in SKILL.md (browserbase-internal,
  not a generic skill primitive).
- REFERENCE.md restructured to lead with the script/CLI/file-format reference
  rather than an architecture intro. Pipeline diagram dropped.

Bug fixes (per Cursor Bugbot):
- `filter.mjs`: rework precedence so `--include` actually rescues URLs that
  would be hit by a default exclude, matching the documented contract. User
  `--exclude` still wins. Added a unit-style test path.
- `infer.mjs`: skip response-body samples whose CDP status is null. Previously
  they were keyed under `"0"` but `emit.mjs` only iterates `ep.statusCodes`
  (which excludes nulls), silently discarding the body.
- `load.mjs`: fix the comment in `urlQuery()` — code is first-value-wins, not
  last-value-wins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@derekmeegan
Contributor Author

Pushed 9446f91 addressing all review comments:

@shrey150

  • ✅ Renamed skill to browser-to-api. Updates frontmatter, heading, package.json, OpenAPI info.description, and report.md heading.
  • ✅ Fixed the stale discover-api-spec reference in SKILL.md (line 17 in your comment).
  • ✅ Removed BODY-CAPTURE-LIFT.md from this PR.
  • ✅ Removed the exec.sendFile() reference.
  • ✅ Restructured REFERENCE.md to lead with the script/CLI/file-format reference; dropped the architecture pipeline diagram.

@cursor[bot]

  • filter.mjs — reworked precedence so --include rescues URLs hit by default excludes (matches the documented contract). User --exclude still wins. Verified with an inline test against app.map (sourcemap default-exclude) being rescued by --include 'app\.map'.
  • infer.mjs — skip response-body samples whose CDP status is null instead of keying under "0" and having the data silently discarded by emit.
  • load.mjs — fixed the misleading "Last value wins" comment; code is first-value-wins (which is fine for our use — we only need parameter names + a representative value for type inference).
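
The filter precedence described in the first bullet can be sketched as follows. This is an illustration under assumptions, not the code in filter.mjs: the `DEFAULT_EXCLUDES` list, the `keepUrl` name, and the option shape are hypothetical, and the sketch assumes `--include` only rescues URLs from default excludes rather than acting as an allow-list.

```javascript
// Hypothetical sketch of include/exclude precedence; not the PR's filter.mjs.
const DEFAULT_EXCLUDES = [/\.map$/]; // assumed: sourcemaps are excluded by default

function keepUrl(url, { include = [], exclude = [] } = {}) {
  // Patterns may be regex source strings (as passed on the CLI) or RegExp objects.
  const matches = (patterns) => patterns.some((p) => new RegExp(p).test(url));
  if (matches(exclude)) return false; // a user --exclude always wins
  if (matches(include)) return true;  // --include rescues from default excludes
  return !matches(DEFAULT_EXCLUDES);  // otherwise the defaults apply
}

const url = 'https://example.com/app.map';
console.log(keepUrl(url));                                                  // false
console.log(keepUrl(url, { include: ['app\\.map'] }));                      // true
console.log(keepUrl(url, { include: ['app\\.map'], exclude: ['app\\.map'] })); // false
```

Checking the user exclude first is what guarantees it overrides both the include list and the defaults.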

Branch name is still add-browser-reverse-skill (pre-rename) but the skill itself is browser-to-api everywhere.


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.



function isKeySecret(name) {
const k = String(name).toLowerCase().replace(/[_-]/g, '');
return KEY_DENY.has(k) || extraKeys.has(k);

Extra redaction keys silently fail for body matching

Medium Severity

isKeySecret normalizes the input name by stripping underscores and hyphens via .replace(/[_-]/g, ''), but extraKeys stores user-provided --redact values with only toLowerCase() applied — no underscore/hyphen stripping. A user passing --redact my_secret_key stores my_secret_key in extraKeys, but the lookup normalizes the JSON key to mysecretkey, which never matches. User-specified body key redactions containing _ or - are silently ignored, potentially leaking credentials the user explicitly asked to scrub.
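
One possible fix, sketched under assumptions, is to run the user-supplied `--redact` values through the same normalization as the lookup. `KEY_DENY` here is a small stand-in for the real deny list, and `normalizeKey` and `makeIsKeySecret` are hypothetical names; only the `toLowerCase().replace(/[_-]/g, '')` normalization comes from the snippet above.

```javascript
// Hypothetical sketch of a fix; not the PR's lib/redact.mjs.
const normalizeKey = (name) => String(name).toLowerCase().replace(/[_-]/g, '');

// Stand-in for the built-in deny list, stored in normalized form.
const KEY_DENY = new Set(['password', 'apikey']);

// Build the predicate once, normalizing the extra keys the same way as lookups.
function makeIsKeySecret(extraRedactKeys = []) {
  const extraKeys = new Set(extraRedactKeys.map(normalizeKey));
  return (name) => {
    const k = normalizeKey(name);
    return KEY_DENY.has(k) || extraKeys.has(k);
  };
}

const isKeySecret = makeIsKeySecret(['my_secret_key']);
console.log(isKeySecret('my_secret_key')); // true
console.log(isKeySecret('my-secret-key')); // true
console.log(isKeySecret('username'));      // false
```

Sharing one normalizer between storage and lookup removes the chance of the two drifting apart again.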


}
}
for (const [key, origins] of Object.entries(collisions)) {
const [m, p] = key.split(' ');

Collision key split breaks on space-containing paths

Low Severity

The collision key is built as `${m} ${ep.path}` and later destructured via key.split(' ') into [m, p]. If a path ever contained a literal space, split(' ') would produce more than two elements and p would only capture the first path segment, causing paths[p][m] to be undefined and throwing at runtime. Using indexOf(' ') to split on the first space only would be more robust.
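
The suggested `indexOf(' ')` approach might look like the sketch below. The helper name `splitCollisionKey` is hypothetical; the key format `${m} ${ep.path}` is taken from the comment above.

```javascript
// Hypothetical helper: split "METHOD /path" on the first space only,
// so a path containing spaces stays in one piece.
function splitCollisionKey(key) {
  const i = key.indexOf(' ');
  if (i === -1) throw new Error(`malformed collision key: ${key}`);
  return [key.slice(0, i), key.slice(i + 1)];
}

const key = 'GET /files/my report.pdf';
console.log(key.split(' '));         // [ 'GET', '/files/my', 'report.pdf' ]
console.log(splitCollisionKey(key)); // [ 'GET', '/files/my report.pdf' ]
```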

