Commit dfa18d5

Overhaul HTML API fuzzer oracle, invariants, and throughput
Oracle:
- Use real fragment parsing (innerHTML on a context element) for the DOM oracle instead of wrapping input in a full document.
- Render template content through a self-contained serialization round-trip instead of consulting WP_HTML_Tag_Processor, which is the system under test.
- Render NUL and CR bytes raw and tolerate a differing line only when the spec scalar substitution exactly explains it, reported as oracle-tolerated with per-seed line accounting. Sort attributes by spec-scrubbed names so both sides order NUL-bearing names alike.
- Tolerate the probed PHP/Lexbor bug that leaks U+000C into body text from pre-body insertion modes; the probe disables the tolerance when PHP fixes it.
- Count oracle losses per class (parse errors, unsupported, tolerated).

Invariants:
- Normalize tree preservation: parsing normalize() output must yield the original tree, modulo the documented scalar substitutions.
- Mutation differential: a data-fuzz attribute set on the first tag must parse identically in both parsers and change the WordPress tree by exactly one line, with reconstruction and dropped-element cases handled.
- Bookmark/seek token-stream consistency on the Tag Processor.
- Breadcrumb consistency against the token-derived element stack.

Generation:
- New select profile; explicit adoption-agency shapes; foreign-content breakouts, annotation-xml encoding variants, CDATA in foreign content; duplicate attributes; quirks doctypes; frameset documents; plaintext/noscript; special closers; auto-closing chains; named reference prefix-ambiguity cases.
- Fragment-context dimension probing create_fragment() contexts, which picks up coverage automatically when non-body contexts land.
- Corpus stage: ~20% of seeds deterministically mutate html5lib-tests inputs (byte, chunk, tag-name, case, and splice mutations).

Pipeline:
- Batch seeds into shared worker processes with per-seed isolation fallback, cutting per-seed process and bootstrap overhead (~50x).
- Markup-aligned minimizer phase, no growing replacements, larger attempt budget; watcher retries failed minimizations.
- Stop hashing input-derived facts into normalize signatures and cluster families on masked line content so one bug cannot mint unbounded signatures.
1 parent c076626 commit dfa18d5

17 files changed

Lines changed: 1849 additions & 315 deletions

tools/html-api-fuzz/README.md

Lines changed: 150 additions & 29 deletions
@@ -2,9 +2,8 @@
22

33
Pure PHP fuzzer for the WordPress HTML API under `src/wp-includes/html-api`.
44
It compares `WP_HTML_Processor` against PHP's `Dom\HTMLDocument` oracle using an
5-
html5lib-style textual tree, and separately checks `WP_HTML_Tag_Processor`
6-
invariants, simple mutation behavior, and `WP_HTML_Processor::normalize()`
7-
idempotence.
5+
html5lib-style textual tree, and separately checks a set of API invariants
6+
described under “Invariants” below.
87

98
No browser, Playwright, Node, or `wp-env` is involved.
109

@@ -27,7 +26,8 @@ Run one generated seed with a specific terminal payload policy:
2726
php tools/html-api-fuzz/worker.php --seed 1 --payload-policy valid-utf8 --output-dir artifacts/html-api-fuzz/seed-1
2827
```
2928

30-
Run a batch in isolated worker subprocesses:
29+
Run a batch in worker subprocesses (seeds are batched into shared worker
30+
processes, 25 per process by default; see `--batch-size`):
3131

3232
```sh
3333
php tools/html-api-fuzz/runner.php --max-seeds 100 --duration-seconds 60
@@ -53,12 +53,6 @@ php tools/html-api-fuzz/launcher.php --lanes 4 --max-seeds 1000 --watcher
5353

5454
For continuous fuzzing, run the launcher with `--duration-seconds 0 --max-seeds 0`
5555
and run `watcher.php` in a second shell against the same output directory.
56-
The tmux helper, `tools/html-api-fuzz/start-continuous-run-tmux.sh`, starts this
57-
indefinite launcher/watcher/orchestrator setup. It does not currently support a
58-
graceful shutdown signal. A good follow-up feature would be a stop signal that
59-
lets each process finish its current unit of work (seed, watcher scan/minimize,
60-
or Codex turn) and then exit cleanly. For now, stop it by attaching to the tmux
61-
session and interrupting the panes, or by killing the tmux session.
6256

6357
Replay a failure:
6458

@@ -85,16 +79,50 @@ ceilings (`tag-token-limit-exceeded`, `mutation-token-limit-exceeded`,
8579
(`dom-node-limit-exceeded`). Process timeouts, PHP fatal errors, and memory
8680
failures are separate failures and are also in scope for triage.
8781

82+
## Execution Model
83+
84+
The runner batches consecutive seeds into one worker process
85+
(`worker.php --batch-count N`, default `--batch-size 25` on the runner) so the
86+
WordPress bootstrap and process spawn are paid once per batch rather than once
87+
per seed. Each seed still writes its own `seed-N/primary` artifacts. If a batch
88+
process dies or times out mid-way, seeds left without a `result.json` are
89+
re-run individually in isolation, so a crash on one input cannot take
90+
neighboring seeds' results with it.
91+
92+
## Input Stages
93+
94+
Seeds are deterministically split between two input stages:
95+
96+
- **Generated** (default ~80%): the structural grammar described under
97+
“Generator Profiles”.
98+
- **Corpus-mutated** (default ~20%, `--corpus-mutate-percent N` on
99+
`worker.php`/`runner.php`): a `#data` section from the html5lib-tests
100+
tree-construction corpus (`tests/phpunit/data/html5lib-tests`), passed
101+
through 1–4 deterministic mutations (byte insert/replace, chunk
102+
delete/duplicate, tag-name swap, case toggle, corpus splice). The stage,
103+
corpus file, entry index, and operations are recorded in result metadata,
104+
and the mutated input itself is in the replay manifest, so replays are
105+
standalone. Inputs report `inputSource: "corpus-mutated"` and
106+
`profile: "corpus-mutated"`.
107+
108+
Both stages derive entirely from the seed, so seed N always produces the same
109+
input for the same fuzzer version and corpus.
110+
88111
## Artifact Layout
89112

90113
The runner writes:
91114

92115
- `summary.ndjson`: one line per seed, suitable for tailing and watcher scans.
93-
- `events.ndjson`: runner lifecycle events.
116+
- `events.ndjson`: runner lifecycle events, including batch boundaries.
94117
- `state.json`: aggregate counters, stop reason, and compact Git metadata.
118+
Oracle losses are counted per class: `oracleParseErrors` (inputs the DOM
119+
oracle rejects receive no differential coverage), `oracleUnsupported`
120+
(template content the oracle cannot represent), and `oracleTolerated`
121+
(comparisons that passed only under the documented scalar tolerance).
95122
- `seed-N/primary/input.bin`: raw generated bytes.
96123
- `seed-N/primary/replay.json`: base64 replay manifest, including the commit
97-
hash and tracked-file dirty state needed to interpret a standalone replay.
124+
hash, tracked-file dirty state, and fragment context needed to interpret a
125+
standalone replay.
98126
- `seed-N/primary/result.json`: full worker result.
99127
- `seed-N/primary/wordpress-tree.txt` and `dom-tree.txt`: rendered trees when available.
100128

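The isolation fallback described under "Execution Model" in the hunk above can be illustrated with a minimal sketch. This is not the runner's actual code: the function names and the `$has_result` callback are hypothetical, and only the `seed-N/primary/result.json` path comes from the documented artifact layout.

```php
<?php
/**
 * Hypothetical helper: given the seeds assigned to one batch process,
 * return those that still need an isolated re-run because the batch
 * died or timed out before writing their result.
 *
 * @param int[]    $batch_seeds Seeds handed to the batch worker, in order.
 * @param callable $has_result  fn( int $seed ): bool, true when a result exists.
 * @return int[] Seeds left without a result, in their original order.
 */
function seeds_needing_isolated_rerun( array $batch_seeds, callable $has_result ): array {
	$missing = array();
	foreach ( $batch_seeds as $seed ) {
		if ( ! $has_result( $seed ) ) {
			$missing[] = $seed;
		}
	}
	return $missing;
}

/**
 * Hypothetical filesystem check following the documented artifact layout:
 * a seed counts as finished once seed-N/primary/result.json exists.
 */
function has_result_file( string $output_dir, int $seed ): bool {
	return is_file( "{$output_dir}/seed-{$seed}/primary/result.json" );
}
```

A caller would run the batch, then pass `fn( $seed ) => has_result_file( $output_dir, $seed )` as the callback and spawn one worker per returned seed. Keeping the check on the artifact rather than on the batch's exit status means seeds that finished before a crash are not repeated.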
@@ -110,36 +138,99 @@ and preserve discovery provenance in `sourceReplay`.
110138

111139
The watcher writes triage state under `.triage-watcher` by default, or under
112140
`--state-dir` when provided. Each signature gets a stable directory containing
113-
`failure.json`, minimizer logs, and minimized replay/result artifacts.
141+
`failure.json`, minimizer logs, and minimized replay/result artifacts. Failed
142+
minimizations are retried on later scans, up to `--max-minimize-retries`
143+
(default 3) attempts per signature.
114144

115-
## Modes
145+
## Modes and Fragment Contexts
116146

117-
- `fragment-body`: parse as body children. The DOM oracle wraps the input in a
118-
minimal document and renders only body children.
147+
- `fragment-body`: parse as a fragment. The DOM oracle uses real fragment
148+
parsing (the `innerHTML` setter on a context element of an empty document),
149+
not a document-wrapping approximation.
119150
- `full-document`: parse as a full HTML document.
120151
- `auto`: weighted choice.
121152

153+
In fragment mode a context element is selected per seed
154+
(`--fragment-context TAG` on `worker.php` for replays). `<body>` dominates;
155+
the other contexts (`div`, `p`, `td`, `tr`, `table`, `caption`, `colgroup`,
156+
`select`, `option`, `template`, `title`, `textarea`, `script`, `style`,
157+
`svg`, `math`) receive a small probe weight. `WP_HTML_Processor::create_fragment()`
158+
currently supports only `<body>`, so non-body contexts are recorded as
159+
`status: "unsupported"` today; when create_fragment() gains context support
160+
the fuzzer picks up the new coverage with no changes. The DOM oracle already
161+
parses every context correctly.
162+
122163
Unsupported `WP_HTML_Processor` cases are expected by default and are recorded
123164
as successful attempts with `status: "unsupported"`. Use `--fail-unsupported`
124165
when you want unsupported cases to become failures.
125166

167+
## Invariants
168+
169+
Each seed checks, in order, stopping at the first failing class:
170+
171+
1. **Tag Processor invariants** (`tag-invariant-failed`): token loop
172+
termination under the token ceiling; non-null token type/name/tag;
173+
attribute getters and `class_list()` iteration do not throw;
174+
`get_updated_html()` with no queued edits returns the input unchanged; a
175+
simple `set_attribute()` mutation is visible to a re-scan; and
176+
**seek consistency** — a bookmark set at a seed-chosen token, after
177+
scanning to the end and seeking back, must reproduce the identical token
178+
stream (`seek-token-stream-mismatch`).
179+
2. **Differential tree comparison** (`tree-mismatch` / `encoding-mismatch`):
180+
the WordPress tree must equal the DOM oracle tree (see “Tree Comparison”).
181+
3. **Breadcrumb consistency** (`breadcrumb-mismatch`): at every tag token,
182+
`get_breadcrumbs()` must agree with the element stack derived from token
183+
order and `expects_closer()`.
184+
4. **Mutation differential** (`mutation-tree-mismatch` /
185+
`mutation-delta-mismatch`), only on a clean baseline: after setting
186+
`data-fuzz="1"` on the first tag, the mutated document must parse
187+
identically in WordPress and the DOM oracle, and the WordPress tree must
188+
change by exactly the one attribute line (unless formatting-element
189+
reconstruction clones the attribute, or tree construction legitimately
190+
drops the mutated element, in which case the differential comparison alone
191+
applies).
192+
5. **Normalize tree preservation** (`normalize-tree-changed`), only on a
193+
clean baseline: parsing `normalize()` output must produce the same tree as
194+
the original input, modulo the documented scalar substitutions. This is
195+
stricter than idempotence, which a consistently wrong serializer can pass.
196+
6. **Normalize idempotence** (`normalize-invariant-failed`):
197+
`normalize()` / `serialize()` run twice must be a fixed point, with no
198+
PHP native errors or throwables. Full documents use
199+
`create_full_parser()->serialize()`; non-body fragment contexts use
200+
`create_fragment(<context>)->serialize()`.
201+
126202
## Generator Profiles
127203

128204
The generator uses a structural HTML grammar with weighted profiles:
129205

130206
- `balanced`
131-
- `full-document`
207+
- `full-document` (includes occasional frameset documents, quirks-mode
208+
doctypes, and content after `</html>`)
132209
- `body-fragment`
133210
- `tables`
134211
- `template`
135-
- `foreign-content`
136-
- `rawtext-rcdata`
137-
- `formatting-adoption`
212+
- `select` (option/optgroup nesting, select-ending elements such as `input`,
213+
`textarea` and `button`, nested selects, select-in-table)
214+
- `foreign-content` (MathML/SVG integration points, HTML breakout tags,
215+
`<font>` with and without breakout attributes, `annotation-xml` encoding
216+
variants, CDATA sections in foreign content, case-mangled `foreignObject`)
217+
- `rawtext-rcdata` (script/style/iframe/noembed/noframes/xmp/noscript,
218+
title/textarea, occasional `plaintext`)
219+
- `formatting-adoption` (random formatting elements plus explicit
220+
adoption-agency shapes: misnested closers, block-boundary formatting,
221+
reconstruction across siblings, nested anchors, Noah's Ark overflow,
222+
repeated closers)
138223
- `attributes-entities`
139224
- `comments-doctype-bogus`
140225
- `deep-nesting`
141226
- `resource-stress`
142-
- `incomplete-malformed`
227+
- `incomplete-malformed` (includes spec-special closers such as `</br>` and
228+
`</p>`, stray closers, and `<image>`)
229+
230+
All profiles can emit duplicate attribute names (first-wins coverage),
231+
auto-closing chains (`li`, `dd`/`dt`, headings, `p`), and named character
232+
references with longest-prefix-match ambiguity (`&notit;`, `&copyright;`,
233+
`&ngE`, ...).
143234

144235
Terminal payloads are selected by a separate policy:
145236

@@ -165,27 +256,57 @@ metadata for direct inputs and are not selectable for generated runs.
165256
Replayed and minimized manifests keep immediate `inputSource` metadata separate
166257
from `originalGenerator` metadata.
167258

168-
The `attributes-entities` profile biases toward tokenizer boundary coverage:
169-
valid and invalid decimal and hexadecimal character references, variable leading
170-
zeros, mixed casing, known named character references with and without trailing
171-
semicolons, unusual attribute spacing, malformed attribute names, and unusual or
172-
invalid tag-name syntax.
173-
174259
## Tree Comparison
175260

176261
The tree renderer follows the html5lib test style used by
177262
`tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php`:
178263

179-
- sorted attributes
264+
- attributes sorted by their spec-scrubbed names (so a raw-NUL name on the
265+
WordPress side and its U+FFFD substitution on the DOM side sort identically),
266+
rendered raw
180267
- boolean attributes rendered as `=""`
181268
- namespace-qualified element and attribute names
182269
- template `content` marker
183270
- only the narrow auto-generated `html/head/body` wrapper tolerance
184271

185-
Invalid bytes are not normalized away. If WordPress and `Dom\HTMLDocument`
272+
Template content is rendered through a self-contained serialization
273+
round-trip: PHP hides template child nodes, so the oracle re-parses the
274+
template's `innerHTML` serialization in a body context and accepts the result
275+
only when re-serializing reproduces the source byte-for-byte. Content that
276+
cannot round-trip (table parts, foreign fragments) is quarantined as
277+
`oracle-unsupported`. This check never consults the WordPress HTML API, which
278+
is the system under test.
279+
280+
Raw bytes are rendered without normalization. The WordPress HTML API
281+
deliberately preserves NUL and CR bytes where spec-following parsers
282+
substitute U+FFFD and normalize newlines during input preprocessing, so the
283+
comparison tolerates a differing line only when that exact substitution
284+
explains the entire difference. Tolerated lines are reported per seed
285+
(`comparison.scalarToleratedLines`) and per run (`oracleTolerated`), and the
286+
result is classified `oracle-tolerated` rather than silently passed. Any
287+
difference beyond the substitution fails as usual, and the first-difference
288+
record points at the first *unexplained* line.
289+
290+
One known oracle bug is tolerated with a runtime probe: PHP's Lexbor parser
291+
fails to treat U+000C FORM FEED as ignorable whitespace in the pre-body
292+
insertion modes. When a full-document comparison fails, the input contains a
293+
form feed, and re-parsing with form feeds substituted by spaces makes the DOM
294+
oracle reproduce the WordPress tree exactly, the case is classified
295+
`oracle-tolerated` with `comparison.formFeedQuirk: true`. The probe disables
296+
the tolerance automatically when PHP fixes the bug.
297+
298+
Invalid bytes are never normalized away. If WordPress and `Dom\HTMLDocument`
186299
surface different byte sequences, the first-difference record includes bounded
187300
line previews, byte lengths, line hashes, the first differing byte offset, and
188301
hex previews, including a diff-window hex preview around the differing byte, so
189302
the mismatch remains inspectable even when JSON display substitutes replacement
190303
characters. Full comparison lines are kept out of `result.json` to avoid large
191304
artifacts from stress inputs.
305+
306+
## Minimization
307+
308+
`minimize.php` reduces in three phases under a shared attempt budget
309+
(`--max-attempts`, default 600): markup-aligned segment deletion, binary
310+
byte-chunk deletion, then per-byte deletion and canonicalization (replacements
311+
never grow the input). Every accepted candidate re-runs the worker and must
312+
reproduce the original signature hash (or any failure with `--any-failure`).
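As an illustration of the middle phase only (binary byte-chunk deletion under an attempt budget), here is a minimal sketch. It is not the code in `minimize.php`: the function name and the `$still_fails` callback are hypothetical stand-ins for re-running the worker and comparing the signature hash, and the 600 default simply mirrors the documented `--max-attempts` value.

```php
<?php
/**
 * Hypothetical sketch of chunk deletion: try removing chunks of the input,
 * halving the chunk size each pass, and keep a deletion only when the
 * predicate reports that the failure still reproduces.
 *
 * @param string   $input        Failing input bytes.
 * @param callable $still_fails  fn( string $candidate ): bool.
 * @param int      $max_attempts Shared budget of predicate calls.
 */
function minimize_by_chunk_deletion( string $input, callable $still_fails, int $max_attempts = 600 ): string {
	$attempts = 0;
	$chunk    = max( 1, intdiv( strlen( $input ), 2 ) );

	while ( true ) {
		$offset = 0;
		while ( $offset < strlen( $input ) ) {
			if ( $attempts >= $max_attempts ) {
				return $input; // Budget exhausted: return the best input so far.
			}
			// Candidate is the input with $chunk bytes removed at $offset,
			// so a candidate is never longer than the current input.
			$candidate = substr( $input, 0, $offset ) . substr( $input, $offset + $chunk );
			$attempts++;
			if ( $still_fails( $candidate ) ) {
				$input = $candidate; // Accept and retry at the same offset.
			} else {
				$offset += $chunk;   // Reject and move to the next chunk.
			}
		}
		if ( 1 === $chunk ) {
			break;
		}
		$chunk = intdiv( $chunk, 2 );
	}

	return $input;
}
```

For example, with a predicate that only checks whether the candidate still contains `<b>`, the input `aaaa<b>zzzz` reduces to `<b>`. A real predicate would be far more expensive, which is why the budget is counted in predicate calls rather than in passes.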

tools/html-api-fuzz/launcher.php

Lines changed: 6 additions & 4 deletions
@@ -250,10 +250,12 @@ function html_api_fuzz_launcher_close_lane( array &$lane ): array {
250250
}
251251

252252
$aggregate = array(
253-
'successes' => 0,
254-
'failures' => 0,
255-
'unsupported' => 0,
256-
'oracleErrors' => 0,
253+
'successes' => 0,
254+
'failures' => 0,
255+
'unsupported' => 0,
256+
'oracleParseErrors' => 0,
257+
'oracleUnsupported' => 0,
258+
'oracleTolerated' => 0,
257259
);
258260
foreach ( $state['laneResults'] as $lane ) {
259261
$runner_state = $lane['runnerState'] ?? array();

tools/html-api-fuzz/lib/Corpus.php

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
1+
<?php
2+
namespace HtmlApiFuzz;
3+
4+
/**
5+
* Seed corpus drawn from the html5lib-tests tree-construction suite. These
6+
* inputs encode decades of parser edge cases; mutating them explores
7+
* neighborhoods that the structural generator's grammar never reaches.
8+
*/
9+
class Corpus {
10+
private static $entries = null;
11+
12+
public static function default_directory(): string {
13+
return repo_root() . '/tests/phpunit/data/html5lib-tests/tree-construction';
14+
}
15+
16+
/**
17+
* Returns the corpus entries: every #data section from every .dat file,
18+
* sorted deterministically. Cached per process.
19+
*/
20+
public static function entries( ?string $directory = null ): array {
21+
if ( null === $directory && null !== self::$entries ) {
22+
return self::$entries;
23+
}
24+
25+
$dir = $directory ?? self::default_directory();
26+
$entries = array();
27+
$files = is_dir( $dir ) ? glob( $dir . '/*.dat' ) : false;
28+
$files = false === $files ? array() : $files;
29+
sort( $files );
30+
foreach ( $files as $file ) {
31+
$contents = file_get_contents( $file );
32+
if ( false === $contents ) {
33+
continue;
34+
}
35+
foreach ( self::parse_dat_data_sections( $contents ) as $data ) {
36+
$entries[] = array(
37+
'file' => basename( $file ),
38+
'data' => $data,
39+
);
40+
}
41+
}
42+
43+
if ( null === $directory ) {
44+
self::$entries = $entries;
45+
}
46+
return $entries;
47+
}
48+
49+
/**
50+
* Extracts #data sections from html5lib .dat content. A section runs from
51+
* the line after `#data` to the line before the next `#` directive, with
52+
* the trailing newline removed.
53+
*/
54+
private static function parse_dat_data_sections( string $contents ): array {
55+
$sections = array();
56+
$lines = explode( "\n", $contents );
57+
$current = null;
58+
foreach ( $lines as $line ) {
59+
if ( '#data' === $line ) {
60+
$current = array();
61+
continue;
62+
}
63+
if ( null !== $current ) {
64+
if ( '' !== $line && '#' === $line[0] ) {
65+
$sections[] = implode( "\n", $current );
66+
$current = null;
67+
continue;
68+
}
69+
$current[] = $line;
70+
}
71+
}
72+
if ( null !== $current && array() !== $current ) {
73+
$sections[] = implode( "\n", $current );
74+
}
75+
76+
return $sections;
77+
}
78+
}
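To show how an entry returned by `Corpus::entries()` might feed the corpus-mutated stage, here is a minimal sketch of a seed-deterministic byte mutation. It is an assumption-laden illustration, not the fuzzer's mutator: the function name is hypothetical, it covers only the byte insert/replace operations, and the use of PHP's `Random\Engine\Mt19937` (PHP 8.2+) as the seeded source is a choice made for this example.

```php
<?php
/**
 * Hypothetical sketch: apply 1 to 4 byte-level mutations to a corpus input,
 * with every random choice drawn from an engine seeded by the fuzzer seed,
 * so the same seed and input always yield the same mutated bytes.
 */
function mutate_bytes( string $input, int $seed ): string {
	$rng   = new \Random\Randomizer( new \Random\Engine\Mt19937( $seed ) );
	$count = $rng->getInt( 1, 4 );

	for ( $i = 0; $i < $count; $i++ ) {
		$length = strlen( $input );
		$offset = $rng->getInt( 0, $length );   // $length means "append".
		$byte   = chr( $rng->getInt( 0, 255 ) );
		$insert = 0 === $rng->getInt( 0, 1 );

		if ( $insert || $offset === $length ) {
			// Byte insert: grows the input by one byte.
			$input = substr( $input, 0, $offset ) . $byte . substr( $input, $offset );
		} else {
			// Byte replace: overwrites the byte at $offset in place.
			$input[ $offset ] = $byte;
		}
	}

	return $input;
}
```

Because each call builds a fresh engine from the seed, two calls with the same arguments return identical bytes, which is the property that makes a replay from the seed alone possible.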
