Exclude symbols from auto-generated heading identifiers (#181)

dereuromark · web-flow · commit 6b90dfe41d16 · 2026-05-19T19:02:01.000+02:00
The djot spec forms a heading's auto identifier from its plain text
content "excluding non-textual elements such as footnote references
and symbols". djot-php already dropped footnote references but emitted
symbols as `:name:`, so `# Release notes :tada:` produced the ID
`Release-notes-tada` instead of `Release-notes`.

extractPlainText() gains a forId mode that skips Symbol nodes, and it
now skips FootnoteRef nodes explicitly in both modes. generateId() still
warms the plain-text cache so display consumers (TOC labels, permalinks)
keep the symbol text, but builds the identifier from the
symbol/footnote-excluded text. As a result, a heading whose only content
is a symbol now correctly falls back to a generated s-N identifier.

The deliberate CSS-validity deviations (apostrophe/quote/colon
replacement, leading-digit prefix) are unchanged and now documented
against the settled spec wording.
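
The split between display text and identifier text can be sketched in
isolation. The following is a minimal illustration, not the code in
HeadingIdTracker: the Sketch* classes and sketchPlainText() are
hypothetical stand-ins for the real node classes and extractPlainText(),
and normalization and deduplication are left out.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-ins for djot-php's node classes (illustration only).
abstract class SketchNode
{
    /** @param list<SketchNode> $children */
    public function __construct(public array $children = [])
    {
    }
}

final class SketchText extends SketchNode
{
    public function __construct(public string $content)
    {
        parent::__construct();
    }
}

final class SketchSymbol extends SketchNode
{
    public function __construct(public string $name)
    {
        parent::__construct();
    }
}

final class SketchFootnoteRef extends SketchNode
{
}

final class SketchHeading extends SketchNode
{
}

// Assumed shape of the two-mode extraction: footnote references never
// contribute text; symbols contribute `:name:` only when $forId is false.
function sketchPlainText(SketchNode $node, bool $forId = false): string
{
    $text = '';
    foreach ($node->children as $child) {
        if ($child instanceof SketchFootnoteRef) {
            continue;
        }
        if ($child instanceof SketchText) {
            $text .= $child->content;
        } elseif ($child instanceof SketchSymbol) {
            if (!$forId) {
                $text .= ':' . $child->name . ':';
            }
        } else {
            // Any other container node: recurse, keeping the same mode.
            $text .= sketchPlainText($child, $forId);
        }
    }

    return $text;
}

$heading = new SketchHeading([
    new SketchText('Release notes '),
    new SketchSymbol('tada'),
]);

echo sketchPlainText($heading), "\n";              // display text, symbol kept
echo sketchPlainText($heading, forId: true), "\n"; // identifier input, symbol dropped
```

In this sketch the identifier input still carries the trailing space from
the text node; the real normalization step is what turns it into
`Release-notes`. A heading containing only a symbol yields an empty
string in forId mode, which is the case that triggers the s-N fallback.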
diff --git a/docs/reference/enhancements.md b/docs/reference/enhancements.md
@@ -9,7 +9,7 @@ They are either on the way to get incorporated upstream - or may be incorporated
 
 - [Tab Indentation Support](#tab-indentation-support)
 - [Multiple Footnote References](#multiple-footnote-references)
-- [Section ID Excludes Footnote Markers](#section-id-excludes-footnote-markers)
+- [Section ID Excludes Footnote Markers and Symbols](#section-id-excludes-footnote-markers-and-symbols)
 - [CSS-Safe Heading IDs](#css-safe-heading-ids)
 - [Symbol Parsing in Time Formats](#symbol-parsing-in-time-formats)
 - [Em/En Dash with Unmatched Braces](#em-en-dash-with-unmatched-braces)
@@ -117,13 +117,15 @@ and third<a id="fnref1-3" href="#fn1" role="doc-noteref"><sup>1</sup></a>.</p>
 
 ---
 
-## Section ID Excludes Footnote Markers
+## Section ID Excludes Footnote Markers and Symbols
 
-**Related:** [jgm/djot#349](https://github.com/jgm/djot/issues/349)
+**Related:** [jgm/djot#349](https://github.com/jgm/djot/issues/349), [jgm/djot#393](https://github.com/jgm/djot/pull/393)
 
 **Status:** Implemented in djot-php
 
-Auto-generated section IDs correctly exclude footnote reference markers:
+Per the djot spec, an auto-generated identifier is formed from the plain text
+content of the heading *"excluding non-textual elements such as footnote
+references and symbols"*. djot-php excludes both:
 
 ```djot
 # Introduction[^1]
@@ -140,6 +142,16 @@ Auto-generated section IDs correctly exclude footnote reference markers:
 
 The ID is `Introduction`, not `Introduction1` or `Introduction[^1]`.
 
+Symbols are likewise dropped from the identifier (but kept in the
+human-readable plain text used for things like TOC labels):
+
+```djot
+# Release notes :tada:
+```
+
+The ID is `Release-notes`, not `Release-notes-tada`. A heading whose only
+content is a symbol falls back to a generated `s-N` ID.
+
 ---
 
 ## CSS-Safe Heading IDs
@@ -219,19 +231,23 @@ Explicit IDs are used as-is without normalization.
 
 ### Spec Alignment
 
-The djot spec's wording on auto-ID generation is being clarified in [jgm/djot#391](https://github.com/jgm/djot/issues/391). djot-php's normalization aligns with the proposed direction in most respects and deliberately deviates in two places — both motivated by producing valid CSS identifiers for `querySelector()` consumers.
+The remove-vs-replace question raised in [jgm/djot#391](https://github.com/jgm/djot/issues/391) was settled by [jgm/djot#393](https://github.com/jgm/djot/pull/393), which reworded the spec to: *"replacing each maximal run of non-alphanumeric ASCII characters with `-`, removing any leading or trailing `-`"*. Note that #393 only changes the spec **prose**; the djot.js reference implementation is unchanged and (per djot's own changelog policy) remains the authoritative behavior. The new prose is actually broader than djot.js itself: it would also replace `_` with `-`, whereas djot.js keeps `_`.
+
+djot-php replaces (does not remove) mid-word punctuation — the direction #393 settled on — and tracks the djot.js **implementation** where the prose and implementation disagree, deliberately deviating only where required to produce valid CSS identifiers for `querySelector()` consumers.
 
-| Aspect | djot.js / djoths (proposed spec) | djot-php |
-|--------|---------------------------------|----------|
-| Mid-word punctuation (`A+B=C`) | replace with `-` → `A-B-C` | replace with `-` → `A-B-C` |
-| Non-ASCII letters (`Über uns`) | preserve → `Über-uns` | preserve → `Über-uns` |
-| Consecutive punctuation (`foo...bar`) | collapse to single `-` → `foo-bar` | collapse to single `-` → `foo-bar` |
-| Apostrophe (`That's all`) | preserve → `That's-all` | replace with `-` → `That-s-all` |
-| Double quote / `;` / `:` | preserve | replace with `-` |
-| Leading digit (`2024 recap`) | unspecified | prefix with `h-` → `h-2024-recap` |
-| Empty result (`!!!`) | unspecified | fallback → `heading` |
+| Aspect | djot.js reference impl | #393 spec prose | djot-php |
+|--------|------------------------|-----------------|----------|
+| Mid-word punctuation (`A+B=C`) | `A-B-C` | `A-B-C` | `A-B-C` |
+| Non-ASCII letters (`Über uns`) | preserve → `Über-uns` | preserve → `Über-uns` | preserve → `Über-uns` |
+| Consecutive punctuation (`foo...bar`) | collapse → `foo-bar` | collapse → `foo-bar` | collapse → `foo-bar` |
+| Underscore (`foo_bar`) | keep → `foo_bar` | replace → `foo-bar` | keep → `foo_bar` (follows impl; CSS-valid) |
+| Apostrophe (`That's all`) | preserve → `That's-all` | replace → `That-s-all` | replace → `That-s-all` (CSS-safe) |
+| Double quote / `;` / `:` | preserve | replace | replace with `-` (CSS-safe) |
+| Leading digit (`2024 recap`) | `2024-recap` | `2024-recap` | prefix → `h-2024-recap` (CSS requires non-digit start) |
+| Empty result (`!!!`) | `s-N` family | unspecified | fallback → `heading` |
+| Symbols / footnote refs | excluded | excluded | excluded |
 
-The apostrophe / quote / semicolon / colon deviation is deliberate: these characters are not valid in unescaped CSS identifiers, so preserving them per the spec would force every JS consumer to round-trip through `CSS.escape()` before doing a selector lookup. The leading-digit and empty-result behaviors fill in spec gaps that other implementations handle inconsistently.
+The apostrophe / quote / semicolon / colon deviation is deliberate: these characters are not valid in unescaped CSS identifiers, so preserving them per djot.js would force every JS consumer to round-trip through `CSS.escape()` before doing a selector lookup. The leading-digit and empty-result behaviors fill in gaps that the spec and implementation handle inconsistently.
 
 ---
 
diff --git a/src/Renderer/HeadingIdTracker.php b/src/Renderer/HeadingIdTracker.php
@@ -6,6 +6,7 @@
 
 use Djot\Node\Block\Heading;
 use Djot\Node\Inline\Code;
+use Djot\Node\Inline\FootnoteRef;
 use Djot\Node\Inline\HardBreak;
 use Djot\Node\Inline\Math;
 use Djot\Node\Inline\SoftBreak;
@@ -146,21 +147,33 @@ public function getPlainText(Node $node): string
 
     /**
      * Recursively extract plain text from a node tree
+     *
+     * When $forId is true, symbols (`:name:`) are skipped, because the djot
+     * spec excludes them from auto-generated heading identifiers (see
+     * jgm/djot#393). Otherwise the full human-readable text is returned
+     * (e.g. for TOC labels), with symbols rendered as `:name:`. Footnote
+     * references (`[^label]`) are skipped in both modes.
      */
-    protected function extractPlainText(Node $node): string
+    protected function extractPlainText(Node $node, bool $forId = false): string
     {
         $text = '';
         foreach ($node->getChildren() as $child) {
+            if ($child instanceof FootnoteRef) {
+                continue;
+            }
+
             if ($child instanceof Text) {
                 $text .= $child->getContent();
             } elseif ($child instanceof SoftBreak || $child instanceof HardBreak) {
                 $text .= ' ';
             } elseif ($child instanceof Code || $child instanceof Math) {
                 $text .= $child->getContent();
             } elseif ($child instanceof Symbol) {
-                $text .= ':' . $child->getName() . ':';
+                if (!$forId) {
+                    $text .= ':' . $child->getName() . ':';
+                }
             } elseif ($child instanceof Node) {
-                $text .= $this->extractPlainText($child);
+                $text .= $this->extractPlainText($child, $forId);
             }
         }
 
@@ -195,17 +208,23 @@ protected function generateId(Heading $node): string
             return $id;
         }
 
-        // Generate from heading text
-        $headingText = $this->getPlainText($node);
+        // Warm the plain-text cache so display consumers (TOC, permalinks)
+        // still see the pre-mutation text including symbols.
+        $this->getPlainText($node);
+
+        // The identifier itself is formed from the plain text content
+        // excluding non-textual elements such as symbols and footnote
+        // references (jgm/djot#393).
+        $idText = $this->extractPlainText($node, forId: true);
 
-        if ($headingText === '') {
+        if ($idText === '') {
             // Generate fallback ID
             $this->sectionCounter++;
 
             return 's-' . $this->sectionCounter;
         }
 
-        $baseId = $this->normalizeId($headingText);
+        $baseId = $this->normalizeId($idText);
 
         // Track and deduplicate
         if (!isset($this->usedIds[$baseId])) {
diff --git a/tests/TestCase/Renderer/HeadingIdTrackerTest.php b/tests/TestCase/Renderer/HeadingIdTrackerTest.php
@@ -5,9 +5,11 @@
 namespace Djot\Test\TestCase\Renderer;
 
 use Djot\Node\Block\Heading;
+use Djot\Node\Inline\FootnoteRef;
 use Djot\Node\Inline\HardBreak;
 use Djot\Node\Inline\SoftBreak;
 use Djot\Node\Inline\Strong;
+use Djot\Node\Inline\Symbol;
 use Djot\Node\Inline\Text;
 use Djot\Renderer\HeadingIdTracker;
 use PHPUnit\Framework\TestCase;
@@ -278,4 +280,85 @@ public function testGetIdForHeadingAlsoCachesPlainText(): void
         $text = $this->tracker->getPlainText($heading);
         $this->assertSame('Title', $text);
     }
+
+    /**
+     * The djot spec (and jgm/djot#393) says auto-generated identifiers are formed
+     * from the plain text content "excluding non-textual elements such as footnote
+     * references and symbols". A symbol must not leak into the ID.
+     */
+    public function testSymbolsExcludedFromId(): void
+    {
+        $heading = new Heading(2);
+        $heading->appendChild(new Text('Introduction '));
+        $heading->appendChild(new Symbol('smile'));
+
+        $id = $this->tracker->getIdForHeading($heading);
+
+        $this->assertSame('Introduction', $id);
+    }
+
+    public function testHeadingWithOnlySymbolGetsFallbackId(): void
+    {
+        $heading = new Heading(2);
+        $heading->appendChild(new Symbol('tada'));
+
+        $id = $this->tracker->getIdForHeading($heading);
+
+        $this->assertSame('s-1', $id);
+    }
+
+    public function testSymbolBetweenWordsDoesNotProduceStrayDashes(): void
+    {
+        $heading = new Heading(2);
+        $heading->appendChild(new Text('Build'));
+        $heading->appendChild(new Symbol('rocket'));
+        $heading->appendChild(new Text('Status'));
+
+        $id = $this->tracker->getIdForHeading($heading);
+
+        $this->assertSame('BuildStatus', $id);
+    }
+
+    /**
+     * Footnote references are likewise excluded from the identifier:
+     * `# Introduction[^1]` generates `Introduction`, not `Introduction1`.
+     */
+    public function testFootnoteReferenceExcludedFromId(): void
+    {
+        $heading = new Heading(2);
+        $heading->appendChild(new Text('Introduction'));
+        $heading->appendChild(new FootnoteRef('1'));
+
+        $id = $this->tracker->getIdForHeading($heading);
+
+        $this->assertSame('Introduction', $id);
+    }
+
+    /**
+     * Symbols are still part of the human-readable plain text (e.g. for TOC
+     * labels); only the *identifier* excludes them. This pins that boundary.
+     */
+    public function testSymbolsRetainedInPlainText(): void
+    {
+        $heading = new Heading(2);
+        $heading->appendChild(new Text('Introduction '));
+        $heading->appendChild(new Symbol('smile'));
+
+        $this->assertSame('Introduction :smile:', $this->tracker->getPlainText($heading));
+    }
+
+    /**
+     * djot.js keeps `_` (it is not in its punctuation denylist) and it is a
+     * valid CSS identifier character, so djot-php keeps it too. This pins the
+     * deliberate divergence from the looser #393 spec prose.
+     */
+    public function testUnderscoreRetainedInId(): void
+    {
+        $heading = new Heading(2);
+        $heading->appendChild(new Text('foo_bar baz'));
+
+        $id = $this->tracker->getIdForHeading($heading);
+
+        $this->assertSame('foo_bar-baz', $id);
+    }
 }