
Add browser-reverse skill — OpenAPI 3.1 from browser-trace captures #88

Merged: shrey150 merged 6 commits into main from add-browser-reverse-skill on May 13, 2026
Conversation

@derekmeegan
Contributor

@derekmeegan derekmeegan commented Apr 29, 2026

Summary

browser-reverse consumes a browser-trace run directory and emits an OpenAPI 3.1 spec for the publicly-observable HTTP API of any website, plus a human-readable coverage report and per-endpoint confidence metadata. Pure offline post-processing — composes cleanly with the existing browser-trace skill rather than duplicating capture.

Pipeline (each stage is a discrete script for debuggability via --stage):

load → filter → normalize → infer → emit

Highlights

  • Path templating — UUIDs, integers, hex/base62 IDs, plus a second-pass slug detector for varying alpha segments. Multi-param paths get {id}, {id2}, etc.
  • Schema inference (lib/schema-merge.mjs) — JSON-Schema from samples with required-intersection, type unions, format hints (date-time, uri, email, uuid), and enum detection that requires meaningful repetition (not just low cardinality).
  • Component hoisting with $ref — recurses into nested object/array schemas, hoists when referenced ≥ 2 times OR when it's an object with ≥ 4 properties. Names derived from path tokens.
  • Redaction — credentials in headers (Authorization, Cookie, *-token, etc.), in body keys (password, apiKey, etc.), and value patterns (JWTs, emails, phone numbers). Replaces values with <redacted> to preserve types for inference.
  • browse network on integration — pass --bodies <path> (or stash bodies under <run>/cdp/network/bodies/, which is auto-detected) to join real response bodies into the trace by CDP requestId. Without it, the spec has request bodies but no response-body schemas (the browse cdp firehose doesn't embed bodies).
  • Cross-origin path collisions handled — when two origins serve the same (method, path), the higher-sample operation wins and other origins are recorded under x-also-served-from rather than silently dropped.
  • Honest reporting: report.md lists every endpoint with samples, statuses, confidence, and normalization flags (single-sample, single-status, mixed-content-types, divergent-response-shape, request-body-only-on-some-samples).
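
A minimal sketch of the first-pass path templating described in the Highlights, for readers who want to see the `{id}`, `{id2}` numbering concretely. The function name `templatizePath`, the three regexes, and the 16-character hex threshold are assumptions for illustration and are not taken from lib/path-template.mjs; the second-pass slug detector and base62 detection are not covered here.

```javascript
// Hypothetical helper: replace ID-like path segments with numbered placeholders.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const INT_RE = /^\d+$/;
const HEX_RE = /^[0-9a-f]{16,}$/i; // assumed threshold for "looks like a hex ID"

function templatizePath(pathname) {
  let count = 0;
  return pathname
    .split('/')
    .map((segment) => {
      const isId = UUID_RE.test(segment) || INT_RE.test(segment) || HEX_RE.test(segment);
      if (!isId) return segment;
      count += 1;
      // First parameter is {id}; later ones are {id2}, {id3}, ...
      return count === 1 ? '{id}' : `{id${count}}`;
    })
    .join('/');
}

console.log(templatizePath('/pixel/123/visitor/456/cerebro'));
// /pixel/{id}/visitor/{id2}/cerebro
```
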

Composition with browser-trace

node ../browser-trace/scripts/start-capture.mjs 9222 my-site
browse env local 9222
browse network on                                 # capture bodies (recommended)
browse open https://example.com
# ...drive flows...
cp -r "$(browse network path | jq -r .path)" .o11y/my-site/cdp/network/bodies/
browse network off
node ../browser-trace/scripts/stop-capture.mjs my-site
node ../browser-trace/scripts/bisect-cdp.mjs my-site

node scripts/discover.mjs --run .o11y/my-site

End-to-end testing

The pipeline ran clean against six sites; five real bugs surfaced and were fixed during this work:

| Site | Outcome |
| --- | --- |
| Hacker News | 7 endpoints; query-param type inference (integer vs string on id based on values) |
| jsonplaceholder.typicode.com | 6 endpoints; POST/PUT body schemas, multi-status 200+404, header + body redaction |
| derekmeegan.com (Next.js) | 4 endpoints; _rsc query param, Vercel analytics body schema, mixed-content-types detection |
| browserbase.com | 39 endpoints across 14 origins; multi-param path /pixel/{id}/visitor/{id2}/cerebro, 12 components hoisted |
| browser-use.com | 23 endpoints; discovered /api/md/<slug> LLM-friendly markdown export endpoint |
| reddit.com | 20 endpoints, 30 components, 2 servers; full schema for /svc/shreddit/events (Reddit's internal telemetry, 18 nested types), live ExposeVariant GraphQL exposure capturing experiment names |

Bugs surfaced and fixed during E2E:

  1. Enum over-detection — required distinct ≤ floor(samples/2) so unique IDs don't become enums.
  2. Component hoisting silently disabled — {...}.length is undefined; rewrote to use Object.keys(...).length and recurse into nested schemas.
  3. Redaction double-counting — redactBody() called twice per body; redact once and reuse.
  4. YAML emitter producing invalid scalars — @, `, #, etc. as first character now trigger quoting (was breaking on @vercel/analytics/react).
  5. Cross-origin path collision data loss — paths.<path>.<method> is unique in OpenAPI; higher-sample winner now recorded with x-also-served-from extension.
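
To make items 1 and 2 concrete, here is a small sketch of the enum rule (distinct ≤ floor(samples/2)) and of the `.length` pitfall on plain objects. `looksLikeEnum` and `propertyCount` are hypothetical names; the real logic in lib/schema-merge.mjs and the hoisting code may apply further conditions.

```javascript
// Item 1: only treat a field as an enum when values repeat meaningfully.
function looksLikeEnum(values) {
  const samples = values.length;
  const distinct = new Set(values).size;
  return samples > 0 && distinct <= Math.floor(samples / 2);
}

// Item 2: plain objects have no .length, so count keys explicitly.
function propertyCount(properties) {
  return Object.keys(properties).length;
}

console.log(looksLikeEnum(['open', 'closed', 'open', 'closed'])); // true
console.log(looksLikeEnum(['a1', 'b2', 'c3']));                   // false (unique IDs)
console.log({ a: {}, b: {} }.length);                             // undefined
console.log(propertyCount({ a: {}, b: {} }));                     // 2
```
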

Files

  • SKILL.md / REFERENCE.md — skill docs, file format reference, jq recipes, troubleshooting
  • scripts/discover.mjs — top-level dispatcher with --stage for partial runs
  • scripts/{load,filter,normalize,infer,emit}.mjs — pipeline stages
  • scripts/lib/{io,redact,path-template,schema-merge,yaml}.mjs — pure helpers
  • BODY-CAPTURE-LIFT.md — design doc for adding native body capture to browser-trace (alternative to the current browse network on pairing). Open question for maintainers; no code change in this PR.

Test plan

  • Run end-to-end against a public site of your choice (e.g. your own marketing site or a public docs page) following the workflow in SKILL.md
  • Verify openapi.yaml parses with a YAML library (python -c "import yaml; yaml.safe_load(open('...'))")
  • Verify openapi.json parses (jq . openapi.json)
  • Confirm report.md correctly flags low-confidence endpoints
  • Try the --bodies flag with a browse network on capture and confirm response-body schemas appear in the spec

🤖 Generated with Claude Code


Note

Medium Risk
Mostly additive (new browser-to-api skill), but it introduces a non-trivial inference/emission pipeline that could generate incorrect specs or leak sensitive data if redaction misses app-specific secrets.

Overview
Adds a new browser-to-api skill that post-processes a browser-trace run into a best-effort OpenAPI 3.1 spec plus artifacts (index.html report, report.md, confidence.json, and a generated client.mjs).

Implements a 5-stage Node (stdlib-only) pipeline (load/filter/normalize/infer/emit) including optional joining of browse network on request/response bodies by CDP requestId, URL templating + noise filtering, multiplexed endpoint decomposition (e.g., GraphQL operationName), schema inference with redaction, and OpenAPI emission with component schema hoisting and cross-origin collision handling.

Reviewed by Cursor Bugbot for commit cf3e72b. Bugbot is set up for automated code reviews on this repo. Configure here.

…aptures

Consumes a browser-trace run (.o11y/<run>/), pairs CDP request/response
events, templatizes paths, infers JSON schemas from samples, and emits an
OpenAPI 3.1 document with a coverage report and confidence metadata.

Pipeline: load → filter → normalize → infer → emit. Each stage is a
discrete script writing to intermediate/ for debuggability. Optional
--bodies <path> flag joins a `browse network on` capture by CDP requestId
so response bodies feed into schema inference.

E2E tested against Hacker News, jsonplaceholder, derekmeegan.com,
browserbase.com, browser-use.com, reddit.com.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread skills/browser-reverse/scripts/filter.mjs Outdated
Comment thread skills/browser-to-api/scripts/infer.mjs
Comment thread skills/browser-to-api/scripts/load.mjs
Contributor

@shrey150 shrey150 left a comment


Some things to fix or address before rereviewing

Comment thread skills/browser-reverse/SKILL.md Outdated

```
browser-trace → .o11y/<run>/cdp/network/{requests,responses}.jsonl
discover-api-spec → .o11y/<run>/api-spec/openapi.yaml + report.md
```

Is this skill actually defined anywhere in this PR?

Comment thread skills/browser-reverse/SKILL.md Outdated

`discover.mjs` auto-detects `<run>/cdp/network/bodies/`. To use a body capture from elsewhere (e.g. didn't snapshot, want the live `browse network` dir), pass `--bodies <path>` explicitly.

Then deliver the artifacts to the user (`exec.sendFile()` for `openapi.yaml` and `report.md`).

exec.sendFile() is for bb not for general use right?

Contributor Author


this was from a claude memory i think lol...

@@ -0,0 +1,118 @@
# Adding Response Body Capture to `browser-trace` — Lift Estimate

Is this plan mode slop lol

Comment thread skills/browser-reverse/package.json Outdated
@@ -0,0 +1,6 @@
{
"name": "browser-reverse",

I would still like a different name, like /discover-api or /browser-to-api or /website-to-api

Comment thread skills/browser-reverse/REFERENCE.md Outdated
@@ -0,0 +1,240 @@
# Browser Reverse — Reference

I'm not sure this is what a REFERENCE.md file should be - it should exhaustively describe all commands used by the skill. I would maybe recommend removing the Pipeline portion here

Renaming and doc cleanup (per shrey150):
- Rename skill from `browser-reverse` to `browser-to-api`. Updates SKILL.md
  frontmatter + heading, package.json, REFERENCE.md heading, the OpenAPI
  doc's `info.description`, and the report.md heading.
- Fix the stale `discover-api-spec` reference in SKILL.md's composition diagram
  (left over from an earlier rename).
- Drop `BODY-CAPTURE-LIFT.md` from the PR; it's a separate proposal.
- Remove the `exec.sendFile()` reference in SKILL.md (browserbase-internal,
  not a generic skill primitive).
- REFERENCE.md restructured to lead with the script/CLI/file-format reference
  rather than an architecture intro. Pipeline diagram dropped.

Bug fixes (per Cursor Bugbot):
- `filter.mjs`: rework precedence so `--include` actually rescues URLs that
  would be hit by a default exclude, matching the documented contract. User
  `--exclude` still wins. Added a unit-style test path.
- `infer.mjs`: skip response-body samples whose CDP status is null. Previously
  they were keyed under `"0"` but `emit.mjs` only iterates `ep.statusCodes`
  (which excludes nulls), silently discarding the body.
- `load.mjs`: fix the comment in `urlQuery()` — code is first-value-wins, not
  last-value-wins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@derekmeegan
Contributor Author

Pushed 9446f91 addressing all review comments:

@shrey150

  • ✅ Renamed skill to browser-to-api. Updates frontmatter, heading, package.json, OpenAPI info.description, and report.md heading.
  • ✅ Fixed the stale discover-api-spec reference in SKILL.md (line 17 in your comment).
  • ✅ Removed BODY-CAPTURE-LIFT.md from this PR.
  • ✅ Removed the exec.sendFile() reference.
  • ✅ Restructured REFERENCE.md to lead with the script/CLI/file-format reference; dropped the architecture pipeline diagram.

@cursor[bot]

  • filter.mjs — reworked precedence so --include rescues URLs hit by default excludes (matches the documented contract). User --exclude still wins. Verified with an inline test against app.map (sourcemap default-exclude) being rescued by --include 'app\.map'.
  • infer.mjs — skip response-body samples whose CDP status is null instead of keying under "0" and having the data silently discarded by emit.
  • load.mjs — fixed the misleading "Last value wins" comment; code is first-value-wins (which is fine for our use — we only need parameter names + a representative value for type inference).
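
For reviewers, the filter precedence described above can be sketched roughly as follows. This assumes `--include` only rescues URLs from the default excludes (it is not treated as an allowlist here); `shouldKeep` and `DEFAULT_EXCLUDES` are hypothetical names, and the single sourcemap pattern stands in for whatever filter.mjs actually ships.

```javascript
// Assumed default exclude list; the real one in filter.mjs is longer.
const DEFAULT_EXCLUDES = [/\.map$/];

function shouldKeep(url, { include = [], exclude = [] } = {}) {
  // 1. A user --exclude always wins.
  if (exclude.some((re) => re.test(url))) return false;
  // 2. A user --include rescues a URL from the default excludes.
  if (include.some((re) => re.test(url))) return true;
  // 3. Otherwise the default excludes apply.
  return !DEFAULT_EXCLUDES.some((re) => re.test(url));
}

const url = 'https://example.com/static/app.map';
console.log(shouldKeep(url));                                                // false
console.log(shouldKeep(url, { include: [/app\.map/] }));                     // true
console.log(shouldKeep(url, { include: [/app\.map/], exclude: [/static/] })); // false
```
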

Branch name is still add-browser-reverse-skill (pre-rename) but the skill itself is browser-to-api everywhere.


function isKeySecret(name) {
const k = String(name).toLowerCase().replace(/[_-]/g, '');
return KEY_DENY.has(k) || extraKeys.has(k);

Extra redaction keys silently fail for body matching

Medium Severity

isKeySecret normalizes the input name by stripping underscores and hyphens via .replace(/[_-]/g, ''), but extraKeys stores user-provided --redact values with only toLowerCase() applied — no underscore/hyphen stripping. A user passing --redact my_secret_key stores my_secret_key in extraKeys, but the lookup normalizes the JSON key to mysecretkey, which never matches. User-specified body key redactions containing _ or - are silently ignored, potentially leaking credentials the user explicitly asked to scrub.
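
One possible fix, sketched under assumptions: normalize the `--redact` values with the same function used for the lookup, so both sides agree. `normalizeKey` and `makeIsKeySecret` are hypothetical names, and the two-entry `KEY_DENY` set is a stand-in for the real deny list.

```javascript
// Shared normalization: lowercase, then strip underscores and hyphens.
const normalizeKey = (name) => String(name).toLowerCase().replace(/[_-]/g, '');

// Assumed subset of the built-in deny list, already in normalized form.
const KEY_DENY = new Set(['password', 'apikey']);

function makeIsKeySecret(redactFlags = []) {
  // Normalize user-supplied keys once, at construction time.
  const extraKeys = new Set(redactFlags.map(normalizeKey));
  return function isKeySecret(name) {
    const k = normalizeKey(name);
    return KEY_DENY.has(k) || extraKeys.has(k);
  };
}

const isKeySecret = makeIsKeySecret(['my_secret_key']);
console.log(isKeySecret('my_secret_key')); // true
console.log(isKeySecret('My-Secret-Key')); // true
console.log(isKeySecret('username'));      // false
```
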


Reviewed by Cursor Bugbot for commit 9446f91. Configure here.

Comment thread skills/browser-to-api/scripts/emit.mjs Outdated
shrey150 and others added 2 commits May 12, 2026 16:53
normalize.mjs:
- Auto-classify endpoints as api/noise/page and drop non-API traffic
  (tracking, analytics, bot defense, session plumbing, HTML page renders)
- Detect multiplexed endpoints (GraphQL operationName, JSON-RPC method,
  query param dispatch) and decompose into separate logical operations
- Typically drops 60-80% of captured traffic as noise

emit.mjs:
- Generate client.mjs — zero-dependency ES module wrapping each discovered
  operation as an async function with JSDoc param types
- For GraphQL/APQ endpoints, embeds persisted query hashes and wires up
  the full request shape so callers just pass variables
- Extract required headers from trace (CSRF tokens, custom headers) and
  include them in client defaults
- Task-oriented report.md with quick-start import, curl examples,
  variables tables, and response samples per operation

On OpenTable trace: 27 raw endpoints → 9 named operations, zero noise.
Generated client with autocomplete(), restaurantsAvailability(), etc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
const NOISE_PATH_PATTERNS = [
// Tracking / analytics / telemetry
/\/track(ing)?[\/\b]/i, /\/pixel/i, /\/beacon/i, /\/log[\/\b]/i,
/\/impression/i, /\/pageview/i, /\/click[\/\b]/i,

Regex \b in character class matches backspace, not word boundary

Medium Severity

Several NOISE_PATH_PATTERNS use [\/\b] intending to match a slash or word boundary. However, \b inside a character class [...] matches the backspace character (U+0008), not a word boundary assertion. This affects patterns like /\/track(ing)?[\/\b]/i, /\/log[\/\b]/i, /\/click[\/\b]/i, and /\/experiment[\/\b]/i. Paths ending with these segments (e.g., /api/track, /api/log) without a trailing slash will not be classified as noise and will leak through into the discovered spec.
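
The difference is easy to reproduce. The sketch below compares the quoted pattern with one possible replacement, `(?:\/|$)`, which assumes the regex is tested against a pathname without a query string; other fixes (for example a lookahead) would work too.

```javascript
// As quoted above: inside [...], \b is a backspace (U+0008), not a word boundary.
const broken = /\/track(ing)?[\/\b]/i;
// Assumed fix: require a slash or the end of the pathname after the segment.
const fixed = /\/track(ing)?(?:\/|$)/i;

for (const p of ['/api/track', '/api/track/', '/api/tracking/v1', '/api/tracker']) {
  console.log(p, broken.test(p), fixed.test(p));
}
// /api/track false true
// /api/track/ true true
// /api/tracking/v1 true true
// /api/tracker false false
```
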


Reviewed by Cursor Bugbot for commit dc07d29. Configure here.

if (hash) lines.push(` ${op.operationName}: '${hash}',`);
}
lines.push(`};\n`);
}

HASHES constant redeclared for multiple persisted-query endpoints

Medium Severity

When multiple GraphQL parent paths use persisted queries, the generated client.mjs emits const HASHES = {...} once per parent path. Since all declarations are at the module's top-level scope, the second const HASHES declaration produces a JavaScript SyntaxError, making the entire generated client unusable. The variable name needs to be unique per parent path (e.g., incorporating the path or a counter).
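
A sketch of one way to derive a unique constant name per parent path, combining the path with a counter as suggested. `hashesConstName` is a hypothetical helper, not something in emit.mjs.

```javascript
// Build a valid, unique identifier such as HASHES_API_GRAPHQL from a parent path.
function hashesConstName(parentPath, used) {
  const slug = parentPath.replace(/[^A-Za-z0-9]+/g, '_').replace(/^_+|_+$/g, '');
  const base = slug ? `HASHES_${slug.toUpperCase()}` : 'HASHES';
  let name = base;
  // Append a counter if two paths collapse to the same slug.
  for (let i = 2; used.has(name); i += 1) name = `${base}_${i}`;
  used.add(name);
  return name;
}

const used = new Set();
console.log(hashesConstName('/graphql', used));     // HASHES_GRAPHQL
console.log(hashesConstName('/api/graphql', used)); // HASHES_API_GRAPHQL
console.log(hashesConstName('/graphql/', used));    // HASHES_GRAPHQL_2
```
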


Reviewed by Cursor Bugbot for commit dc07d29. Configure here.


// Regular REST endpoints
for (const ep of regular) {
const fnName = makeOpId(ep).replace(/^(get|post|put|patch|delete)_/, (_, m) => m);

Replace callback keeps method name instead of stripping prefix

Low Severity

The replace callback (_, m) => m returns the captured HTTP method name (get, post, etc.) instead of an empty string. This replaces get_ with get (only removing the underscore), producing function names like getv1_items_id instead of the intended v1_items_id. The replacement string should be '' to properly strip the method prefix.
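
The two behaviours side by side, assuming operation IDs of the form `get_v1_items_id` as in the example above (`asWritten` and `stripMethodPrefix` are illustrative names):

```javascript
const METHOD_PREFIX = /^(get|post|put|patch|delete)_/;

// As written: the callback returns the captured method, so only "_" is dropped.
const asWritten = (opId) => opId.replace(METHOD_PREFIX, (_, m) => m);
// Suggested: replace the whole prefix with an empty string.
const stripMethodPrefix = (opId) => opId.replace(METHOD_PREFIX, '');

console.log(asWritten('get_v1_items_id'));         // getv1_items_id
console.log(stripMethodPrefix('get_v1_items_id')); // v1_items_id
```

One caveat with stripping the prefix entirely: two methods on the same path (say GET and POST on /v1/items) would then map to the same function name, so the generator would need some other way to tell them apart.
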


Reviewed by Cursor Bugbot for commit dc07d29. Configure here.

lines.push(` 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36',`);
for (const [k, v] of Object.entries(observedHeaders)) {
lines.push(` '${k}': '${v}',`);
}

Unescaped dynamic values injected into generated JavaScript source

Medium Severity

Header values from HTTP traces are interpolated directly into single-quoted JavaScript string literals in the generated client.mjs without escaping. If any observed header value contains a single quote (or backslash, newline, etc.), the generated code will have a syntax error and fail to parse. Values need to be escaped, e.g., using JSON.stringify().
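
A minimal sketch of the `JSON.stringify()` approach: both the key and the value are emitted as JSON string literals, which are also valid JavaScript string literals. `headerLine` is a hypothetical helper name.

```javascript
// Emit one property line for the generated client's default headers.
function headerLine(key, value) {
  return `  ${JSON.stringify(key)}: ${JSON.stringify(value)},`;
}

console.log(headerLine('x-client', "it's"));
// prints:   "x-client": "it's",
```

Because the quoting is delegated to `JSON.stringify()`, single quotes, backslashes, and newlines in observed header values no longer break the generated source.
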


Reviewed by Cursor Bugbot for commit dc07d29. Configure here.

shrey150 and others added 2 commits May 13, 2026 17:31
Generates index.html with:
- Summary stats (operations, endpoint, protocol, sample count)
- Expandable cards per operation with variables table, client usage,
  request body, and response example
- Full generated client.mjs embedded at the bottom

The Swagger UI was a poor fit — 10 identical green POST bars for a
single GraphQL endpoint with bracket-syntax paths that aren't even
valid OpenAPI. The HTML report shows what actually matters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
emit.mjs already generates index.html as the primary visual output —
update SKILL.md to match and remove the dead open-swagger-ui.mjs script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.

There are 8 total unresolved issues (including 5 from previous reviews).


Reviewed by Cursor Bugbot for commit 5eee2ab. Configure here.

<div class="card" id="op-${i}">
<div class="card-header" onclick="this.parentElement.classList.toggle('open')">
<div class="card-title">
<span class="method">POST</span>

HTML report hardcodes "POST" for every endpoint card

High Severity

The method badge in buildHtmlReport is hardcoded to POST for every endpoint card, regardless of the actual HTTP method. ep.method is available and used correctly on line 659, but line 650 emits a static POST string. Every GET, PUT, DELETE, and PATCH endpoint in the generated index.html will incorrectly display as POST.
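
A small sketch of what the badge could look like when driven by `ep.method`. `methodBadge` and `escapeHtml` are hypothetical helpers; whether buildHtmlReport already has an escaping utility is not visible from this diff.

```javascript
// Escape text before interpolating it into HTML.
const escapeHtml = (s) =>
  String(s)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');

// Render the badge from the endpoint's actual method instead of a literal "POST".
function methodBadge(ep) {
  return `<span class="method">${escapeHtml(String(ep.method).toUpperCase())}</span>`;
}

console.log(methodBadge({ method: 'get' }));
// <span class="method">GET</span>
```
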


Reviewed by Cursor Bugbot for commit 5eee2ab. Configure here.

writeJson(path.join(outDir, 'confidence.json'), confidence);

// report.md
const redaction = readJson(intermediatePath(outDir, 'redaction-stats.json'), { headers: 0, bodyKeys: 0, bodyValues: 0 });

Redaction stats loaded and passed but never used

Low Severity

The redaction variable is read from disk via readJson on line 297 and passed to buildReport, where it's destructured as a parameter but never referenced in the function body. This is dead code — the redaction statistics (header count, body key count, body value count) are computed and persisted but never displayed in the report.


Reviewed by Cursor Bugbot for commit 5eee2ab. Configure here.


// Mixed types — fall back to a typed union via "type" array (OpenAPI 3.1 / draft 2020-12 OK).
const out = { type: nullable ? [...nonNull, 'null'] : nonNull };
return out;

All-null fields produce invalid empty type array schema

Medium Severity

When all samples for a field are null, toSchema falls through every branch to the mixed-types case and produces { type: [] } — an empty type array. This is because nonNull is empty and nullable is false (it requires nonNull.length > 0). The correct output for an always-null field is { type: 'null' }. This produces an invalid JSON Schema fragment in the emitted OpenAPI spec.
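
A sketch of the type-union step with the all-null case handled explicitly. `typeUnionSchema` is a hypothetical stand-in for the relevant branch of toSchema, taking the list of JSON type names observed for a field; returning `{}` for a field with no samples is an assumption, not something the PR specifies.

```javascript
// Build a JSON Schema "type" from the observed JSON type names of a field.
function typeUnionSchema(observedTypes) {
  const distinct = [...new Set(observedTypes)];
  if (distinct.length === 0) return {}; // no samples: leave unconstrained (assumption)
  const nonNull = distinct.filter((t) => t !== 'null');
  // Every sample was null: emit { type: 'null' } rather than an empty array.
  if (nonNull.length === 0) return { type: 'null' };
  const types = distinct.includes('null') ? [...nonNull, 'null'] : nonNull;
  return { type: types.length === 1 ? types[0] : types };
}

console.log(JSON.stringify(typeUnionSchema(['null', 'null'])));      // {"type":"null"}
console.log(JSON.stringify(typeUnionSchema(['string', 'null'])));    // {"type":["string","null"]}
console.log(JSON.stringify(typeUnionSchema(['string', 'integer']))); // {"type":["string","integer"]}
```
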


Reviewed by Cursor Bugbot for commit 5eee2ab. Configure here.

@shrey150 shrey150 self-requested a review May 13, 2026 22:33
@shrey150 shrey150 merged commit e338848 into main May 13, 2026
1 check passed