feat(browser): remove silent html truncation, add --as json #1102

Merged
jackwener merged 2 commits into main from feat/browser-get-html-json on Apr 20, 2026

@jackwener
Owner

Summary

Per the #opencli-browser discussion (follow-up to #1100), `browser get html` had two agent-hostile defaults:

  1. Silent 50000-char truncation. Agents got half a DOM with no signal. On any page with more than ~50KB of HTML (basically all of them), extraction logic silently saw partial data.
  2. Raw HTML string only. Agents had to re-parse the output to extract structure.

Changes

| Behavior | Before | After |
| --- | --- | --- |
| Default output | `outerHTML.slice(0, 50000)`, no marker | Full `outerHTML`, no cap |
| Opt-in cap | | `--max <n>`; truncation prepends `<!-- opencli: truncated N of M chars ... -->` |
| Structured output | | `--as json` returns `{selector, matched, tree}`; `tree` is a tree of `{tag, attrs, text, children}` |
| Selector mismatch | stdout `(empty)` (indistinguishable from empty element) | structured `{error:{code:"selector_not_found"}}` + exit 1 |
| Bad `--as` / `--max` | | `invalid_format` / `invalid_max` error codes |

--as json shape

{
  "selector": ".hero",
  "matched": 1,
  "tree": {
    "tag": "div",
    "attrs": { "class": "hero", "id": "x" },
    "text": "Hello",
    "children": [
      { "tag": "span", "attrs": {}, "text": "world", "children": [] }
    ]
  }
}

  • `matched` is the full `querySelectorAll` count, not just 1, so agents know when more elements exist beyond the first
  • `text` is the element's direct text children, concatenated and whitespace-collapsed. Element children carry their own `text`; the ordering between text and element children is not preserved (agents that need ordering should use raw HTML mode)
  • `attrs` pass through untouched
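
The envelope and the `text` rule above can be sketched in TypeScript as follows. This is only an illustration, not the code in `src/browser/html-tree.ts`: `StubNode`, `serialize`, and `buildEnvelope` are hypothetical names, the stub stands in for real DOM nodes, and joining direct text children with no separator before collapsing whitespace is an assumption.

```typescript
// Shape of the `--as json` output as described in this PR.
interface HtmlTreeNode {
  tag: string;
  attrs: Record<string, string>;
  text: string;
  children: HtmlTreeNode[];
}

interface HtmlTreeEnvelope {
  selector: string;
  matched: number;
  tree: HtmlTreeNode | null;
}

// Hypothetical DOM stand-in so the sketch runs without a browser.
type StubText = { kind: "text"; value: string };
type StubElement = {
  kind: "element";
  tag: string;
  attrs: Record<string, string>;
  childNodes: StubNode[];
};
type StubNode = StubText | StubElement;

function serialize(el: StubElement): HtmlTreeNode {
  // Only direct text children contribute to `text`; whitespace is collapsed.
  const text = el.childNodes
    .filter((n): n is StubText => n.kind === "text")
    .map((n) => n.value)
    .join("")
    .replace(/\s+/g, " ")
    .trim();
  // Element children are serialized recursively and carry their own text.
  const children = el.childNodes
    .filter((n): n is StubElement => n.kind === "element")
    .map(serialize);
  return { tag: el.tag, attrs: { ...el.attrs }, text, children };
}

function buildEnvelope(selector: string, matches: StubElement[]): HtmlTreeEnvelope {
  // First match wins for `tree`; `matched` reports the full match count.
  const first = matches[0];
  return { selector, matched: matches.length, tree: first ? serialize(first) : null };
}

const heroExample: StubElement = {
  kind: "element",
  tag: "div",
  attrs: { class: "hero", id: "x" },
  childNodes: [
    { kind: "text", value: "  Hello\n " },
    { kind: "element", tag: "span", attrs: {}, childNodes: [{ kind: "text", value: "world" }] },
  ],
};

console.log(JSON.stringify(buildEnvelope(".hero", [heroExample]), null, 2));
```

Because text and element children are collected in separate passes, the sketch loses their relative order in the same way the PR describes.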

Code layout

  • New `src/browser/html-tree.ts`: the tree JS expression, unit-tested against a DOM stub
  • `src/cli.ts`: command rewrite, structured error paths shared with `browser network`

Test plan

  • `src/browser/html-tree.test.ts`: serializer on simple elements, whitespace collapse in direct text, recursion with attrs, multi-match first-wins with matched count, zero-match null tree (5 tests)
  • `src/cli.test.ts` `browser get html` suite: full default, `--max` with truncation marker, negative `--max`, `--as json` envelope, `selector_not_found` (raw and json), bad `--as` format (7 tests)
  • `npm test`: 221 files, 1661 tests, 2 skipped, all green
  • `npm run build`: clean TypeScript build + manifest

jackwener and others added 2 commits April 21, 2026 01:43
`browser get html` had two agent-hostile defaults:

1. A silent 50000-char cap on the returned HTML — agents that got a
   truncated page had no signal they were looking at half the DOM.
2. Only raw HTML string output, forcing agents to re-parse for
   structured extraction.

Changes:

- Default output is now the full outerHTML, no truncation
- `--max <n>` opts in to a character cap; when the cap actually
  trips, the HTML is prepended with
  `<!-- opencli: truncated N of M chars; re-run without --max ... -->`
  so agents always see the signal
- `--as json` returns `{selector, matched, tree}` where `tree` is
  `{tag, attrs, text, children}` recursively. `matched` is the full
  count of selector matches so agents know when more elements exist
  beyond the first. `text` is the node's own direct text children,
  whitespace-collapsed; child elements live in `children`.
- `--selector` not matching any element now emits structured
  `{error:{code:"selector_not_found", ...}}` with a non-zero exit
  code, in both raw and json modes (was `(empty)` stdout previously,
  indistinguishable from empty element)
- Invalid `--as` / negative `--max` emit structured
  `invalid_format` / `invalid_max` error codes

Extracted the tree serializer as `src/browser/html-tree.ts` so the
JS expression can be unit-tested against a DOM stub.
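
A minimal sketch of the opt-in cap, under stated assumptions: `capHtml` is a hypothetical helper, the marker text is shortened (the real marker carries more wording than is shown here), N is taken to be the number of characters kept and M the original length, and the marker itself is assumed not to count toward the cap.

```typescript
interface CapResult {
  html: string;
  truncated: boolean;
}

// Hypothetical helper: returns the HTML untouched unless `max` is set and exceeded.
function capHtml(html: string, max?: number): CapResult {
  if (max === undefined || html.length <= max) {
    return { html, truncated: false };
  }
  // Assumed, abbreviated marker wording; prepended so the signal is the first thing read.
  const marker = `<!-- opencli: truncated ${max} of ${html.length} chars -->`;
  return { html: `${marker}\n${html.slice(0, max)}`, truncated: true };
}

const capped = capHtml("<div>hello world</div>", 5);
console.log(capped.html);
```

Prepending rather than appending means an agent that reads only the start of the output still sees that the cap tripped.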
…t --max

Both edges previously bypassed the structured-error contract introduced in
#1102, which agents rely on for branching:

- Invalid CSS selector: querySelector(All) would throw SyntaxError through
  page.evaluate into the generic exception path. Wrap the lookup in try/catch
  inside page context for both raw and --as json paths; surface as
  {error:{code:"invalid_selector", message}} + non-zero exit.

- --max validation: parseInt silently accepted "1.5" -> 1 and "10abc" -> 10.
  Switch to a strict /^\d+$/ check so fractional, negative, and non-numeric
  values all return {error:{code:"invalid_max"}}; validation runs up front so
  bad values never reach the page.
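
The `--max` leniency and the strict check can be illustrated with a short sketch. `parseMax` and the `message` wording are hypothetical, not the code in `src/cli.ts`; whether `0` should be accepted is not stated in the commit, so the sketch simply lets the regex decide.

```typescript
type MaxResult =
  | { ok: true; value: number | undefined }
  | { ok: false; error: { code: "invalid_max"; message: string } };

// Hypothetical validator: digits only, checked before any page work starts.
function parseMax(raw: string | undefined): MaxResult {
  if (raw === undefined) {
    return { ok: true, value: undefined }; // no cap requested
  }
  if (!/^\d+$/.test(raw)) {
    return {
      ok: false,
      error: { code: "invalid_max", message: `--max must be a non-negative integer, got "${raw}"` },
    };
  }
  return { ok: true, value: Number(raw) };
}

// parseInt stops at the first non-digit instead of rejecting the input.
console.log(parseInt("1.5", 10), parseInt("10abc", 10)); // 1 10
console.log(parseMax("1.5").ok, parseMax("10abc").ok, parseMax("-5").ok); // false false false
console.log(parseMax("100"));
```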

Covered by new unit tests in cli.test.ts (fractional, non-numeric, invalid
selector on raw + json) and html-tree.test.ts (SyntaxError -> invalidSelector
envelope).
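
One way to picture the selector handling is a wrapper that maps a thrown `SyntaxError` and an empty match list to the two structured codes. This is a sketch only: `safeQueryAll`, the `query` callback, and the `message` strings are hypothetical, and the real change performs the try/catch inside the page context rather than in Node.

```typescript
type LookupResult<T> =
  | { matches: T[] }
  | { error: { code: "invalid_selector" | "selector_not_found"; message: string } };

// `query` stands in for something like (s) => Array.from(document.querySelectorAll(s)).
function safeQueryAll<T>(query: (selector: string) => T[], selector: string): LookupResult<T> {
  try {
    const matches = query(selector);
    if (matches.length === 0) {
      return { error: { code: "selector_not_found", message: `no element matches ${selector}` } };
    }
    return { matches };
  } catch (err) {
    // Browsers report bad selectors as an exception named "SyntaxError".
    if (err instanceof Error && err.name === "SyntaxError") {
      return { error: { code: "invalid_selector", message: err.message } };
    }
    throw err; // anything else is not a selector problem
  }
}

const broken = safeQueryAll<string>(() => {
  throw new SyntaxError("'[' is not a valid selector");
}, "[");
console.log(JSON.stringify(broken));
```

Returning a value instead of throwing lets the caller print the envelope and choose a non-zero exit code in one place for both raw and json modes.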

Co-authored-by: freemandealer <freeman.zhang1992@gmail.com>
@jackwener jackwener merged commit 6cf5cb2 into main Apr 20, 2026
13 checks passed
luxiaolei pushed a commit to luxiaolei/OpenCLI that referenced this pull request Apr 21, 2026
…r#1102)

(cherry picked from commit 6cf5cb2)