22
33Pure PHP fuzzer for the WordPress HTML API under ` src/wp-includes/html-api ` .
44It compares ` WP_HTML_Processor ` against PHP's ` Dom\HTMLDocument ` oracle using an
5- html5lib-style textual tree, and separately checks ` WP_HTML_Tag_Processor `
6- invariants, simple mutation behavior, and ` WP_HTML_Processor::normalize() `
7- idempotence.
5+ html5lib-style textual tree, and separately checks a set of API invariants
6+ described under “Invariants” below.
87
98No browser, Playwright, Node, or ` wp-env ` is involved.
109
@@ -27,7 +26,8 @@ Run one generated seed with a specific terminal payload policy:
2726php tools/html-api-fuzz/worker.php --seed 1 --payload-policy valid-utf8 --output-dir artifacts/html-api-fuzz/seed-1
2827```
2928
30- Run a batch in isolated worker subprocesses:
29+ Run a batch in worker subprocesses (seeds are batched into shared worker
30+ processes, 25 per process by default; see ` --batch-size ` ):
3131
3232``` sh
3333php tools/html-api-fuzz/runner.php --max-seeds 100 --duration-seconds 60
@@ -53,12 +53,6 @@ php tools/html-api-fuzz/launcher.php --lanes 4 --max-seeds 1000 --watcher
5353
5454For continuous fuzzing, run the launcher with ` --duration-seconds 0 --max-seeds 0 `
5555and run ` watcher.php ` in a second shell against the same output directory.
56- The tmux helper, ` tools/html-api-fuzz/start-continuous-run-tmux.sh ` , starts this
57- indefinite launcher/watcher/orchestrator setup. It does not currently support a
58- graceful shutdown signal. A good follow-up feature would be a stop signal that
59- lets each process finish its current unit of work (seed, watcher scan/minimize,
60- or Codex turn) and then exit cleanly. For now, stop it by attaching to the tmux
61- session and interrupting the panes, or by killing the tmux session.
6256
6357Replay a failure:
6458
@@ -85,16 +79,50 @@ ceilings (`tag-token-limit-exceeded`, `mutation-token-limit-exceeded`,
8579(` dom-node-limit-exceeded ` ). Process timeouts, PHP fatal errors, and memory
8680failures are separate failures and are also in scope for triage.
8781
82+ ## Execution Model
83+
84+ The runner batches consecutive seeds into one worker process
85+ (` worker.php --batch-count N ` , default ` --batch-size 25 ` on the runner) so the
86+ WordPress bootstrap and process spawn are paid once per batch rather than once
87+ per seed. Each seed still writes its own ` seed-N/primary ` artifacts. If a batch
88+ process dies or times out mid-way, seeds left without a ` result.json ` are
89+ re-run individually in isolation, so a crash on one input cannot take
90+ neighboring seeds' results with it.
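
The batching and recovery rule above can be sketched as follows. This is a minimal illustration in Python, not the runner's PHP code: `plan_batches` and `seeds_needing_rerun` are hypothetical names, and the set of finished seeds stands in for checking which `seed-N/primary/result.json` files exist.

```python
from typing import Iterable


def plan_batches(first_seed: int, seed_count: int, batch_size: int = 25) -> list[list[int]]:
    """Group consecutive seeds into batches of at most batch_size seeds."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    seeds = list(range(first_seed, first_seed + seed_count))
    return [seeds[i:i + batch_size] for i in range(0, len(seeds), batch_size)]


def seeds_needing_rerun(batch: Iterable[int], seeds_with_result: set[int]) -> list[int]:
    """Seeds of a dead or timed-out batch that never wrote a result (assumed check)."""
    return [seed for seed in batch if seed not in seeds_with_result]


batches = plan_batches(first_seed=1, seed_count=60)
# Pretend the first batch process died after finishing seeds 1 through 9.
retry = seeds_needing_rerun(batches[0], seeds_with_result=set(range(1, 10)))
```

Each seed in `retry` would then be run alone, so a crashing input only costs its own result.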
91+
92+ ## Input Stages
93+
94+ Seeds are deterministically split between two input stages:
95+
96+ - **Generated** (default ~80%): the structural grammar described under
97+ “Generator Profiles”.
98+ - **Corpus-mutated** (default ~20%, `--corpus-mutate-percent N` on
99+ `worker.php`/`runner.php`): a `#data` section from the html5lib-tests
100+ tree-construction corpus (`tests/phpunit/data/html5lib-tests`), passed
101+ through 1–4 deterministic mutations (byte insert/replace, chunk
102+ delete/duplicate, tag-name swap, case toggle, corpus splice). The stage,
103+ corpus file, entry index, and operations are recorded in result metadata,
104+ and the mutated input itself is in the replay manifest, so replays are
105+ standalone. Inputs report `inputSource: "corpus-mutated"` and
106+ `profile: "corpus-mutated"`.
107+
108+ Both stages derive entirely from the seed, so seed N always produces the same
109+ input for the same fuzzer version and corpus.
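
One way to get a deterministic split like this is to hash the seed into a bucket from 0 to 99 and compare it with the percentage. The sketch below is an assumption about the technique, not the fuzzer's actual selection code: the function name `input_stage`, the use of SHA-256, and the `stage:` label are all illustrative.

```python
import hashlib


def input_stage(seed: int, corpus_mutate_percent: int = 20) -> str:
    """Map a seed to an input stage; the same seed always gets the same stage."""
    # Hash rather than use seed % 100 so neighboring seeds are not correlated.
    digest = hashlib.sha256(f"stage:{seed}".encode("ascii")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "corpus-mutated" if bucket < corpus_mutate_percent else "generated"


stages = [input_stage(seed) for seed in range(10_000)]
corpus_share = stages.count("corpus-mutated") / len(stages)
```

Because the stage depends only on the seed and the percentage, replaying a seed with the same settings selects the same stage again.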
110+
88111## Artifact Layout
89112
90113The runner writes:
91114
92115- ` summary.ndjson ` : one line per seed, suitable for tailing and watcher scans.
93- - ` events.ndjson ` : runner lifecycle events.
116+ - `events.ndjson`: runner lifecycle events, including batch boundaries.
94117- ` state.json ` : aggregate counters, stop reason, and compact Git metadata.
118+ Oracle losses are counted per class: ` oracleParseErrors ` (inputs the DOM
119+ oracle rejects receive no differential coverage), ` oracleUnsupported `
120+ (template content the oracle cannot represent), and ` oracleTolerated `
121+ (comparisons that passed only under the documented scalar tolerance).
95122- ` seed-N/primary/input.bin ` : raw generated bytes.
96123- ` seed-N/primary/replay.json ` : base64 replay manifest, including the commit
97- hash and tracked-file dirty state needed to interpret a standalone replay.
124+ hash, tracked-file dirty state, and fragment context needed to interpret a
125+ standalone replay.
98126- ` seed-N/primary/result.json ` : full worker result.
99127- ` seed-N/primary/wordpress-tree.txt ` and ` dom-tree.txt ` : rendered trees when available.
100128
@@ -110,36 +138,99 @@ and preserve discovery provenance in `sourceReplay`.
110138
111139The watcher writes triage state under ` .triage-watcher ` by default, or under
112140` --state-dir ` when provided. Each signature gets a stable directory containing
113- ` failure.json ` , minimizer logs, and minimized replay/result artifacts.
141+ ` failure.json ` , minimizer logs, and minimized replay/result artifacts. Failed
142+ minimizations are retried on later scans, up to ` --max-minimize-retries `
143+ (default 3) attempts per signature.
114144
115- ## Modes
145+ ## Modes and Fragment Contexts
116146
117- - ` fragment-body ` : parse as body children. The DOM oracle wraps the input in a
118- minimal document and renders only body children.
147+ - ` fragment-body ` : parse as a fragment. The DOM oracle uses real fragment
148+ parsing (the ` innerHTML ` setter on a context element of an empty document),
149+ not a document-wrapping approximation.
119150- ` full-document ` : parse as a full HTML document.
120151- ` auto ` : weighted choice.
121152
153+ In fragment mode a context element is selected per seed
154+ (`--fragment-context TAG` on `worker.php` for replays). `<body>` dominates;
155+ the other contexts (`div`, `p`, `td`, `tr`, `table`, `caption`, `colgroup`,
156+ `select`, `option`, `template`, `title`, `textarea`, `script`, `style`,
157+ `svg`, `math`) receive a small probe weight. `WP_HTML_Processor::create_fragment()`
158+ currently supports only `<body>`, so non-body contexts are recorded as
159+ `status: "unsupported"` today; when `create_fragment()` gains context support,
160+ the fuzzer picks up the new coverage with no changes. The DOM oracle already
161+ parses every context correctly.
162+
122163Unsupported ` WP_HTML_Processor ` cases are expected by default and are recorded
123164as successful attempts with ` status: "unsupported" ` . Use ` --fail-unsupported `
124165when you want unsupported cases to become failures.
125166
167+ ## Invariants
168+
169+ Each seed checks, in order, stopping at the first failing class:
170+
171+ 1. **Tag Processor invariants** (`tag-invariant-failed`): token loop
172+ termination under the token ceiling; non-null token type/name/tag;
173+ attribute getters and `class_list()` iteration do not throw;
174+ `get_updated_html()` with no queued edits returns the input unchanged; a
175+ simple `set_attribute()` mutation is visible to a re-scan; and
176+ **seek consistency**: a bookmark set at a seed-chosen token, after
177+ scanning to the end and seeking back, must reproduce the identical token
178+ stream (`seek-token-stream-mismatch`).
179+ 2. **Differential tree comparison** (`tree-mismatch` / `encoding-mismatch`):
180+ the WordPress tree must equal the DOM oracle tree (see “Tree Comparison”).
181+ 3. **Breadcrumb consistency** (`breadcrumb-mismatch`): at every tag token,
182+ `get_breadcrumbs()` must agree with the element stack derived from token
183+ order and `expects_closer()`.
184+ 4. **Mutation differential** (`mutation-tree-mismatch` /
185+ `mutation-delta-mismatch`), only on a clean baseline: after setting
186+ `data-fuzz="1"` on the first tag, the mutated document must parse
187+ identically in WordPress and the DOM oracle, and the WordPress tree must
188+ change by exactly the one attribute line (unless formatting-element
189+ reconstruction clones the attribute, or tree construction legitimately
190+ drops the mutated element, in which case the differential comparison alone
191+ applies).
192+ 5. **Normalize tree preservation** (`normalize-tree-changed`), only on a
193+ clean baseline: parsing `normalize()` output must produce the same tree as
194+ the original input, modulo the documented scalar substitutions. This is
195+ stricter than idempotence, which a consistently wrong serializer can pass.
196+ 6. **Normalize idempotence** (`normalize-invariant-failed`):
197+ `normalize()` / `serialize()` run twice must be a fixed point, with no
198+ PHP native errors or throwables. Full documents use
199+ `create_full_parser()->serialize()`; non-body fragment contexts use
200+ `create_fragment(<context>)->serialize()`.
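
To see why tree preservation (item 5) is stricter than idempotence (item 6), consider a toy model. Everything here is a stand-in invented for illustration: `toy_tree` is not an HTML parser, and `lossy_normalize` plays the role of a consistently wrong serializer.

```python
from typing import Callable


def toy_tree(html: str) -> list[str]:
    """Stand-in parser: one tree line per whitespace-separated token."""
    return html.split()


def good_normalize(html: str) -> str:
    """Collapses whitespace runs; keeps every token."""
    return " ".join(html.split())


def lossy_normalize(html: str) -> str:
    """Consistently wrong: drops every token after the first."""
    return " ".join(html.split()[:1])


def is_idempotent(normalize: Callable[[str], str], html: str) -> bool:
    once = normalize(html)
    return normalize(once) == once


def preserves_tree(normalize: Callable[[str], str], html: str) -> bool:
    return toy_tree(normalize(html)) == toy_tree(html)


sample = "<p>   hello   </p>"
lossy_is_fixed_point = is_idempotent(lossy_normalize, sample)
lossy_keeps_tree = preserves_tree(lossy_normalize, sample)
```

The lossy serializer reaches a fixed point after one pass, so an idempotence check alone accepts it. Only the comparison against the original input's tree exposes the lost content.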
201+
126202## Generator Profiles
127203
128204The generator uses a structural HTML grammar with weighted profiles:
129205
130206- ` balanced `
131- - ` full-document `
207+ - ` full-document ` (includes occasional frameset documents, quirks-mode
208+ doctypes, and content after ` </html> ` )
132209- ` body-fragment `
133210- ` tables `
134211- ` template `
135- - ` foreign-content `
136- - ` rawtext-rcdata `
137- - ` formatting-adoption `
212+ - ` select ` (option/optgroup nesting, select-ending elements such as ` input ` ,
213+ ` textarea ` and ` button ` , nested selects, select-in-table)
214+ - ` foreign-content ` (MathML/SVG integration points, HTML breakout tags,
215+ ` <font> ` with and without breakout attributes, ` annotation-xml ` encoding
216+ variants, CDATA sections in foreign content, case-mangled ` foreignObject ` )
217+ - ` rawtext-rcdata ` (script/style/iframe/noembed/noframes/xmp/noscript,
218+ title/textarea, occasional ` plaintext ` )
219+ - ` formatting-adoption ` (random formatting elements plus explicit
220+ adoption-agency shapes: misnested closers, block-boundary formatting,
221+ reconstruction across siblings, nested anchors, Noah's Ark overflow,
222+ repeated closers)
138223- ` attributes-entities `
139224- ` comments-doctype-bogus `
140225- ` deep-nesting `
141226- ` resource-stress `
142- - ` incomplete-malformed `
227+ - ` incomplete-malformed ` (includes spec-special closers such as ` </br> ` and
228+ ` </p> ` , stray closers, and ` <image> ` )
229+
230+ All profiles can emit duplicate attribute names (first-wins coverage),
231+ auto-closing chains (`li`, `dd`/`dt`, headings, `p`), and named character
232+ references with longest-prefix-match ambiguity (`&notit;`, `&copyright;`,
233+ `&ngE`, ...).
143234
144235Terminal payloads are selected by a separate policy:
145236
@@ -165,27 +256,57 @@ metadata for direct inputs and are not selectable for generated runs.
165256Replayed and minimized manifests keep immediate ` inputSource ` metadata separate
166257from ` originalGenerator ` metadata.
167258
168- The ` attributes-entities ` profile biases toward tokenizer boundary coverage:
169- valid and invalid decimal and hexadecimal character references, variable leading
170- zeros, mixed casing, known named character references with and without trailing
171- semicolons, unusual attribute spacing, malformed attribute names, and unusual or
172- invalid tag-name syntax.
173-
174259## Tree Comparison
175260
176261The tree renderer follows the html5lib test style used by
177262` tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php ` :
178263
179- - sorted attributes
264+ - attributes sorted by their spec-scrubbed names (so a raw-NUL name on the
265+ WordPress side and its U+FFFD substitution on the DOM side sort identically),
266+ rendered raw
180267- boolean attributes rendered as ` ="" `
181268- namespace-qualified element and attribute names
182269- template ` content ` marker
183270- only the narrow auto-generated ` html/head/body ` wrapper tolerance
184271
185- Invalid bytes are not normalized away. If WordPress and ` Dom\HTMLDocument `
272+ Template content is rendered through a self-contained serialization
273+ round-trip: PHP hides template child nodes, so the oracle re-parses the
274+ template's ` innerHTML ` serialization in a body context and accepts the result
275+ only when re-serializing reproduces the source byte-for-byte. Content that
276+ cannot round-trip (table parts, foreign fragments) is quarantined as
277+ ` oracle-unsupported ` . This check never consults the WordPress HTML API, which
278+ is the system under test.
279+
280+ Raw bytes are rendered without normalization. The WordPress HTML API
281+ deliberately preserves NUL and CR bytes where spec-following parsers
282+ substitute U+FFFD and normalize newlines during input preprocessing, so the
283+ comparison tolerates a differing line only when that exact substitution
284+ explains the entire difference. Tolerated lines are reported per seed
285+ (` comparison.scalarToleratedLines ` ) and per run (` oracleTolerated ` ), and the
286+ result is classified ` oracle-tolerated ` rather than silently passed. Any
287+ difference beyond the substitution fails as usual, and the first-difference
288+ record points at the first * unexplained* line.
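
A simplified sketch of the "substitution explains the entire difference" rule, assuming the tolerated substitutions are exactly NUL to U+FFFD and CR or CRLF to LF. The names `spec_scrub` and `classify_line` are hypothetical, and the real comparison may handle line splitting and parser states differently.

```python
REPLACEMENT_CHARACTER = "\ufffd".encode("utf-8")


def spec_scrub(raw: bytes) -> bytes:
    """Apply the substitutions a spec-following parser is assumed to make."""
    scrubbed = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return scrubbed.replace(b"\x00", REPLACEMENT_CHARACTER)


def classify_line(wordpress_line: bytes, dom_line: bytes) -> str:
    """Return 'equal', 'tolerated', or 'mismatch' for one pair of tree lines."""
    if wordpress_line == dom_line:
        return "equal"
    # Tolerate only when the substitution accounts for every differing byte.
    if spec_scrub(wordpress_line) == dom_line:
        return "tolerated"
    return "mismatch"


explained = classify_line(b'"a\x00b"', '"a\ufffdb"'.encode("utf-8"))
unexplained = classify_line(b'"a\x00b"', '"a\ufffdc"'.encode("utf-8"))
```

The second pair differs by more than the substitution (`b` versus `c`), so it stays a mismatch even though it also contains a NUL.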
289+
290+ One known oracle bug is tolerated with a runtime probe: PHP's Lexbor parser
291+ fails to treat U+000C FORM FEED as ignorable whitespace in the pre-body
292+ insertion modes. When a full-document comparison fails, the input contains a
293+ form feed, and re-parsing with form feeds substituted by spaces makes the DOM
294+ oracle reproduce the WordPress tree exactly, the case is classified
295+ ` oracle-tolerated ` with ` comparison.formFeedQuirk: true ` . The probe disables
296+ the tolerance automatically when PHP fixes the bug.
297+
298+ Invalid bytes are never normalized away. If WordPress and ` Dom\HTMLDocument `
186299surface different byte sequences, the first-difference record includes bounded
187300line previews, byte lengths, line hashes, the first differing byte offset, and
188301hex previews, including a diff-window hex preview around the differing byte, so
189302the mismatch remains inspectable even when JSON display substitutes replacement
190303characters. Full comparison lines are kept out of ` result.json ` to avoid large
191304artifacts from stress inputs.
305+
306+ ## Minimization
307+
308+ ` minimize.php ` reduces in three phases under a shared attempt budget
309+ (` --max-attempts ` , default 600): markup-aligned segment deletion, binary
310+ byte-chunk deletion, then per-byte deletion and canonicalization (replacements
311+ never grow the input). Every accepted candidate re-runs the worker and must
312+ reproduce the original signature hash (or any failure with ` --any-failure ` ).
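
The byte-chunk phase can be pictured as a shrinking-chunk deletion loop under an attempt budget. This is a sketch of the general technique, not `minimize.php` itself: `delete_chunks` is a hypothetical name, and the `still_fails` callback stands in for re-running the worker and comparing the signature hash.

```python
from typing import Callable


def delete_chunks(
    data: bytes,
    still_fails: Callable[[bytes], bool],
    max_attempts: int = 600,
) -> tuple[bytes, int]:
    """Delete ever-smaller chunks while the failure predicate keeps holding."""
    attempts = 0
    chunk = max(1, len(data) // 2)
    while chunk >= 1 and attempts < max_attempts:
        offset = 0
        while offset < len(data) and attempts < max_attempts:
            candidate = data[:offset] + data[offset + chunk:]
            attempts += 1
            if still_fails(candidate):
                data = candidate  # keep the smaller input; retry the same offset
            else:
                offset += chunk   # this chunk is needed; move past it
        chunk //= 2               # halve the chunk size, down to single bytes
    return data, attempts


original = b"<div><p>x</p><b>bug</b></div>"
minimized, used = delete_chunks(original, lambda candidate: b"<b>" in candidate)
```

Candidates are only ever produced by deletion, so the input never grows, and the budget bounds the number of predicate calls.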