Commit faf9cef

dmsnell, dlh01, and sirreal committed
HTML API: Avoid processing incomplete syntax elements.
The HTML Tag Processor is able to know if it starts parsing a syntax element and reaches the end of the document before it reaches the end of the element. In these cases, after this patch, the processor will indicate this condition.

For example, when processing `<div><input type="te` there is an incomplete INPUT element. The processor will fail to find the INPUT, it will pause right after the DIV, and `paused_at_incomplete_token()` will return `true`.

This patch doesn't change any existing behaviors, but it adds the new method to report on the final failure condition. It provides a mechanism for later use to add chunked parsing to the class, wherein it will be possible to process a document without having the entire document loaded in memory, for example when processing unbuffered output.

This is also a necessary change for adding the ability to scan every token in the document. Currently the Tag Processor only exposes tags as tokens, but it will need to process `#text` nodes, HTML comments, and other markup in order to enable behaviors in the HTML Processor and in refactors of existing HTML processing in Core.

Co-authored-by: David Herrera <mail@dlh01.info>
Co-authored-by: Jon Surrell <sirreal@users.noreply.github.com>
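
The pausing behavior described above can be illustrated with a small standalone sketch. This is not the Tag Processor's implementation: `scan_tags_sketch()` and its `tags`, `incomplete`, and `paused_at` result keys are hypothetical names invented for this example, and the scanner deliberately ignores details such as a `>` inside a quoted attribute value. It only shows the idea of stopping before an unfinished tag, reporting that condition, and resuming once more input arrives.

```php
<?php
/**
 * Toy scanner (an illustration only, not WordPress code): collects tag
 * names and reports whether the input ended in the middle of a tag.
 */
function scan_tags_sketch( string $html ): array {
	$tags   = array();
	$at     = 0;
	$length = strlen( $html );

	while ( $at < $length ) {
		$open = strpos( $html, '<', $at );
		if ( false === $open ) {
			break; // Only text remains; nothing is left unfinished.
		}

		$close = strpos( $html, '>', $open );
		if ( false === $close ) {
			// The input ended before the tag closed: pause at its start.
			return array(
				'tags'       => $tags,
				'incomplete' => true,
				'paused_at'  => $open,
			);
		}

		// Record the tag name, if this span looks like a tag at all.
		$span = substr( $html, $open, $close - $open );
		if ( preg_match( '~^</?([a-zA-Z][^\s/>]*)~', $span, $m ) ) {
			$tags[] = strtoupper( $m[1] );
		}
		$at = $close + 1;
	}

	return array(
		'tags'       => $tags,
		'incomplete' => false,
		'paused_at'  => $length,
	);
}

// Hypothetical chunked use: keep the unfinished tail and retry with more input.
$buffer = '';
$seen   = array();
foreach ( array( '<div><input type="te', 'xt"><p>' ) as $chunk ) {
	$buffer .= $chunk;
	$result  = scan_tags_sketch( $buffer );
	$seen    = array_merge( $seen, $result['tags'] );
	$buffer  = substr( $buffer, $result['paused_at'] );
}
```

With the commit message's example input, the sketch finds the DIV, stops at the start of the unfinished INPUT, and only reports the INPUT after the second chunk completes it. The real class exposes the same condition through `paused_at_incomplete_token()`; resuming with additional input is the later chunked-parsing work this patch prepares for.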
1 parent 6daf853 commit faf9cef

5 files changed

Lines changed: 1578 additions & 145 deletions

src/wp-includes/html-api/class-wp-html-processor.php

Lines changed: 27 additions & 12 deletions
@@ -149,17 +149,6 @@ class WP_HTML_Processor extends WP_HTML_Tag_Processor {
 	 */
 	const MAX_BOOKMARKS = 100;
 
-	/**
-	 * Static query for instructing the Tag Processor to visit every token.
-	 *
-	 * @access private
-	 *
-	 * @since 6.4.0
-	 *
-	 * @var array
-	 */
-	const VISIT_EVERYTHING = array( 'tag_closers' => 'visit' );
-
 	/**
 	 * Holds the working state of the parser, including the stack of
 	 * open elements and the stack of active formatting elements.
@@ -424,6 +413,30 @@ public function next_tag( $query = null ) {
 		return false;
 	}
 
+	/**
+	 * Ensures internal accounting is maintained for HTML semantic rules while
+	 * the underlying Tag Processor class is seeking to a bookmark.
+	 *
+	 * This doesn't currently have a way to represent non-tags and doesn't process
+	 * semantic rules for text nodes. For access to the raw tokens consider using
+	 * WP_HTML_Tag_Processor instead.
+	 *
+	 * @since 6.5.0 Added for internal support; do not use.
+	 *
+	 * @access private
+	 *
+	 * @return bool
+	 */
+	public function next_token() {
+		$found_a_token = parent::next_token();
+
+		if ( '#tag' === $this->get_token_type() ) {
+			$this->step( self::REPROCESS_CURRENT_NODE );
+		}
+
+		return $found_a_token;
+	}
+
 	/**
 	 * Indicates if the currently-matched tag matches the given breadcrumbs.
 	 *
@@ -520,7 +533,9 @@ public function step( $node_to_process = self::PROCESS_NEXT_NODE ) {
 				$this->state->stack_of_open_elements->pop();
 			}
 
-			parent::next_tag( self::VISIT_EVERYTHING );
+			while ( parent::next_token() && '#tag' !== $this->get_token_type() ) {
+				continue;
+			}
 		}
 
 		// Finish stepping when there are no more tokens in the document.
