313313 * - Comment nodes and nodes that became comments because of some syntax error. The
314314 * text for these nodes is the portion of the comment inside of the syntax. E.g. for
315315 * `<!-- comment -->` the text is `" comment "` (note that the spaces are part of it).
316- * For `<![CDATA[some content]]>` the text is `"[CDATA[some content]]"`.
316+ * - `CDATA` sections, whose text is the content inside of the section itself. E.g. for
317+ * `<![CDATA[some content]]>` the text is `"some content"`.
317318 * - "Funky comments," which are a special case of invalid closing tags whose name is
318319 * invalid. The text for these nodes is the text that a browser would transform into
319320 * an HTML comment when parsing. E.g. for `</%post_author>` the text is `%post_author`.
320321 *
321322 * And there are non-elements which are atomic in nature but have no modifiable text.
322323 * - `DOCTYPE` nodes like `<!DOCTYPE html>` which have no closing tag.
324+ * - XML Processing instruction nodes like `<?xml charset="utf8"?>` (with restrictions).
323325 * - The empty end tag `</>` which is ignored in the browser and DOM but exposed
324326 * to the HTML API.
325327 *
@@ -480,6 +482,8 @@ class WP_HTML_Tag_Processor {
480482 * | *Incomplete* | The HTML ended in the middle of a token; nothing more can be parsed. |
481483 * | *Matched tag* | Found an HTML tag; it's possible to modify its attributes. |
482484 * | *Text node* | Found a #text node; this is plaintext and modifiable. |
485+ * | *CDATA node* | Found a CDATA section; this is modifiable. |
486+ * | *PI node* | Found a Processing Instruction; this is modifiable. |
483487 * | *Comment* | Found a comment or bogus comment; this is modifiable. |
484488 * | *Presumptuous* | Found an empty tag closer: `</>`. |
485489 * | *Funky comment* | Found a tag closer with an invalid tag name; this is modifiable. |
@@ -491,6 +495,8 @@ class WP_HTML_Tag_Processor {
491495 * @see WP_HTML_Tag_Processor::STATE_INCOMPLETE
492496 * @see WP_HTML_Tag_Processor::STATE_MATCHED_TAG
493497 * @see WP_HTML_Tag_Processor::STATE_TEXT_NODE
498+ * @see WP_HTML_Tag_Processor::STATE_CDATA_NODE
499+ * @see WP_HTML_Tag_Processor::STATE_PI_NODE
494500 * @see WP_HTML_Tag_Processor::STATE_COMMENT
495501 * @see WP_HTML_Tag_Processor::STATE_DOCTYPE
496502 * @see WP_HTML_Tag_Processor::STATE_PRESUMPTUOUS_TAG
@@ -800,6 +806,7 @@ public function next_tag( $query = null ) {
800806 * - a text node - the plaintext inside tags.
801807 * - an HTML comment.
802808 * - a DOCTYPE declaration.
809+ * - a processing instruction, e.g. `<?xml version="1.0" ?>`.
803810 *
804811 * The Tag Processor currently only supports the tag token.
805812 *
@@ -1665,7 +1672,21 @@ private function parse_next_tag() {
16651672 $this->text_length = $closer_at - $this->text_starts_at;
16661673 $this->bytes_already_parsed = $closer_at + 1;
16671674
1668- // Identify nodes that would be CDATA if HTML had CDATA sections.
1675+ /*
1676+ * Identify nodes that would be CDATA if HTML had CDATA sections.
1677+ *
1678+ * This section must occur after identifying the bogus comment end
1679+ * because in an HTML parser it will span to the nearest `>`, even
1680+ * if there's no `]]>` as would be required in an XML document. It
1681+ * is therefore not possible to parse a CDATA section containing
1682+ * a `>` in the HTML syntax.
1683+ *
1684+ * Inside foreign elements there is a discrepancy between browsers
1685+ * and the specification on this.
1686+ *
1687+ * @todo Track whether the Tag Processor is inside a foreign element
1688+ * and require the proper closing `]]>` in those cases.
1689+ */
16691690 if (
16701691 $this->token_length >= 10 &&
16711692 '[' === $html[ $this->token_starts_at + 2 ] &&
@@ -1722,13 +1743,18 @@ private function parse_next_tag() {
17221743 /*
17231744 * Identify a Processing Instruction node were HTML to have them.
17241745 *
1725- * XML allows for more target names, but this code only identifies
1726- * a subset. This is more or less okay because ultimately these are
1727- * HTML comments in the DOM and this safely supports _some_ kinds
1728- * of PI Nodes without getting lost while parsing.
1746+ * This section must occur after identifying the bogus comment end
1747+ * because in an HTML parser it will span to the nearest `>`, even
1748+ * if there's no `?>` as would be required in an XML document. It
1749+ * is therefore not possible to parse a Processing Instruction node
1750+ * containing a `>` in the HTML syntax.
17291751 *
1730- * This code identifies processing instruction nodes whose target
1731- * name can be represented in single-byte UTF-8 / 7-bit ASCII.
1752+ * XML allows for more target names, but this code only identifies
1753+ * those with ASCII-representable target names. This means that it
1754+ * may identify some Processing Instruction nodes as bogus comments,
1755+ * but it will not misinterpret the HTML structure. By limiting the
1756+ * identification to these target names the Tag Processor can avoid
1757+ * the need to start parsing UTF-8 sequences.
17321758 *
17331759 * > NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] |
17341760 * [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] |
@@ -1743,13 +1769,13 @@ private function parse_next_tag() {
17431769 $pi_target_length = strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_' );
17441770
17451771 if ( 0 < $pi_target_length ) {
1746- $pi_target_length += strspn( $comment_text,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_-.', $pi_target_length );
1772+ $pi_target_length += strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_-.', $pi_target_length );
17471773
1748- $this->parser_state = self::STATE_PI_NODE;
1749- $this->tag_name_starts_at = $this->token_starts_at + 2;
1750- $this->tag_name_length = $pi_target_length;
1751- $this->text_starts_at += $pi_target_length;
1752- $this->text_length -= $pi_target_length + 1;
1774+ $this->parser_state       = self::STATE_PI_NODE;
1775+ $this->tag_name_starts_at = $this->token_starts_at + 2;
1776+ $this->tag_name_length    = $pi_target_length;
1777+ $this->text_starts_at    += $pi_target_length;
1778+ $this->text_length       -= $pi_target_length + 1;
17531779 }
17541780 }
17551781
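As a rough illustration of the target-name check in the hunk above, the following standalone sketch applies the same two `strspn()` passes to a single token. The function name `sketch_pi_target()` and its whole-token input are assumptions made for this example; the actual code works on byte offsets inside `parse_next_tag()` and may apply conditions that this hunk does not show.

```php
<?php
/**
 * Hypothetical helper for illustration only; it is not part of the HTML API.
 *
 * Returns the ASCII target name of a token shaped like `<?target ... ?>`,
 * or null when the token should stay a plain bogus comment.
 */
function sketch_pi_target( string $token ): ?string {
	$length = strlen( $token );

	/* The shortest candidate is `<?x?>`, which is five bytes long. */
	if ( $length < 5 || '<?' !== substr( $token, 0, 2 ) || '?>' !== substr( $token, -2 ) ) {
		return null;
	}

	/* Everything between the opening `<?` and the closing `?>`. */
	$comment_text = substr( $token, 2, $length - 4 );

	$name_start = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_';
	$name_rest  = $name_start . '0123456789-.';

	/* First pass: the leading bytes must be ASCII NameStartChar bytes. */
	$target_length = strspn( $comment_text, $name_start );
	if ( 0 === $target_length ) {
		return null;
	}

	/* Second pass: extend over the wider ASCII NameChar set. */
	$target_length += strspn( $comment_text, $name_rest, $target_length );

	return substr( $comment_text, 0, $target_length );
}
```

Because only ASCII bytes are accepted, a target that begins with a non-ASCII name character makes the function return null, which matches the comment's point that such nodes remain bogus comments without any UTF-8 decoding.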
@@ -2544,6 +2570,8 @@ public function is_tag_closer() {
25442570 * Possible values:
25452571 * - `#tag` when matched on a tag.
25462572 * - `#text` when matched on a text node.
2573+ * - `#cdata-section` when matched on a CDATA node.
2574+ * - `#processing-instruction` when matched on a processing instruction.
25472575 * - `#comment` when matched on a comment.
25482576 * - `#presumptuous-tag` when matched on an empty tag closer.
25492577 * - `#funky-comment` when matched on a funky comment.
@@ -2577,6 +2605,7 @@ public function get_token_type() {
25772605 *
25782606 * Dynamic names:
25792607 * - Uppercase tag name for tag matches.
2608+ * - Target name for processing instructions.
25802609 * - `html` for DOCTYPE declarations.
25812610 *
25822611 * Note that if the Tag Processor is not matched on a token
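The CDATA comment in `parse_next_tag()` says that, in HTML syntax, the token ends at the nearest `>` and only then is checked for CDATA-like delimiters. A minimal standalone sketch of that ordering follows. The function `sketch_cdata_text()`, its parameters, and its exact length check are assumptions for this example and do not reproduce the Tag Processor's internals.

```php
<?php
/**
 * Hypothetical helper for illustration only; it is not part of the HTML API.
 *
 * Returns the inner text of a CDATA-lookalike starting at byte offset $at,
 * or null when the token is some other kind of bogus comment.
 */
function sketch_cdata_text( string $html, int $at = 0 ): ?string {
	if ( '<![CDATA[' !== substr( $html, $at, 9 ) ) {
		return null;
	}

	// Step 1: as for any bogus comment, the token ends at the nearest `>`.
	$closer_at = strpos( $html, '>', $at + 2 );
	if ( false === $closer_at ) {
		return null;
	}

	$token_length = $closer_at - $at + 1;
	$token        = substr( $html, $at, $token_length );

	// Step 2: only now check whether that token also ends with `]]>`.
	if ( $token_length < 12 || ']]>' !== substr( $token, -3 ) ) {
		return null;
	}

	// Strip the 9-byte opener and the 3-byte closer.
	return substr( $token, 9, $token_length - 12 );
}
```

With this ordering, `<![CDATA[some content]]>` yields `some content`, while `<![CDATA[a > b]]>` yields null because the token already ended at the first `>`. This is the limitation described in the comment.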