Commit 6b90dfe

Exclude symbols from auto-generated heading identifiers (#181)
The djot spec forms a heading's auto identifier from its plain text content "excluding non-textual elements such as footnote references and symbols". djot-php already dropped footnote references but emitted symbols as `:name:`, so `# Release notes :tada:` produced the ID `Release-notes-tada` instead of `Release-notes`.

extractPlainText() gains a forId mode that skips Symbol and FootnoteRef nodes. generateId() still warms the plain-text cache so display consumers (TOC labels, permalinks) keep the symbol text, but builds the identifier from the symbol/footnote-excluded text. As a result a heading whose only content is a symbol now correctly falls back to a generated s-N identifier.

The deliberate CSS-validity deviations (apostrophe/quote/colon replacement, leading-digit prefix) are unchanged and now documented against the settled spec wording.
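The two extraction modes described above can be sketched as follows. This is a minimal illustration, not the code from this commit: the `Sketch*` classes and `sketchPlainText()` are hypothetical stand-ins for djot-php's node classes and `extractPlainText()`, and the sketch assumes that footnote references contribute nothing in either mode, as the diff further down suggests.

```php
<?php
// Hypothetical stand-ins for djot-php's node classes (assumption: the real
// classes under Djot\Node expose getChildren()/getContent()/getName()).
abstract class SketchNode
{
    /** @var list<SketchNode> */
    public array $children = [];
}

final class SketchText extends SketchNode
{
    public function __construct(public string $content) {}
}

final class SketchSymbol extends SketchNode
{
    public function __construct(public string $name) {}
}

final class SketchFootnoteRef extends SketchNode
{
    public function __construct(public string $label) {}
}

// Generic container, standing in for Heading, Strong, Emphasis, etc.
final class SketchContainer extends SketchNode
{
    /** @param list<SketchNode> $children */
    public function __construct(array $children = [])
    {
        $this->children = $children;
    }
}

function sketchPlainText(SketchNode $node, bool $forId = false): string
{
    $text = '';
    foreach ($node->children as $child) {
        if ($child instanceof SketchFootnoteRef) {
            continue; // contributes nothing in either mode
        }
        if ($child instanceof SketchText) {
            $text .= $child->content;
        } elseif ($child instanceof SketchSymbol) {
            if (!$forId) {
                $text .= ':' . $child->name . ':'; // display text keeps the symbol
            }
        } else {
            // The mode must be passed down, or nested symbols would leak into the ID.
            $text .= sketchPlainText($child, $forId);
        }
    }

    return $text;
}

$heading = new SketchContainer([
    new SketchText('Build '),
    new SketchContainer([new SketchText('status '), new SketchSymbol('rocket')]),
    new SketchFootnoteRef('1'),
]);

$displayText = sketchPlainText($heading);       // symbols kept
$idText = sketchPlainText($heading, true);      // symbols dropped
```

Here `$displayText` keeps `:rocket:` while `$idText` does not; the trailing space left in `$idText` is expected to be removed by the later normalization step.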
1 parent f5482be commit 6b90dfe

3 files changed

Lines changed: 140 additions & 22 deletions

docs/reference/enhancements.md

Lines changed: 31 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@ They are either on the way to get incorporated upstream - or may be incorporated
99

1010
- [Tab Indentation Support](#tab-indentation-support)
1111
- [Multiple Footnote References](#multiple-footnote-references)
12-
- [Section ID Excludes Footnote Markers](#section-id-excludes-footnote-markers)
12+
- [Section ID Excludes Footnote Markers and Symbols](#section-id-excludes-footnote-markers-and-symbols)
1313
- [CSS-Safe Heading IDs](#css-safe-heading-ids)
1414
- [Symbol Parsing in Time Formats](#symbol-parsing-in-time-formats)
1515
- [Em/En Dash with Unmatched Braces](#em-en-dash-with-unmatched-braces)
@@ -117,13 +117,15 @@ and third<a id="fnref1-3" href="#fn1" role="doc-noteref"><sup>1</sup></a>.</p>
117117

118118
---
119119

120-
## Section ID Excludes Footnote Markers
120+
## Section ID Excludes Footnote Markers and Symbols
121121

122-
**Related:** [jgm/djot#349](https://github.com/jgm/djot/issues/349)
122+
**Related:** [jgm/djot#349](https://github.com/jgm/djot/issues/349), [jgm/djot#393](https://github.com/jgm/djot/pull/393)
123123

124124
**Status:** Implemented in djot-php
125125

126-
Auto-generated section IDs correctly exclude footnote reference markers:
126+
Per the djot spec, an auto-generated identifier is formed from the plain text
127+
content of the heading *"excluding non-textual elements such as footnote
128+
references and symbols"*. djot-php excludes both:
127129

128130
```djot
129131
# Introduction[^1]
@@ -140,6 +142,16 @@ Auto-generated section IDs correctly exclude footnote reference markers:
140142

141143
The ID is `Introduction`, not `Introduction1` or `Introduction[^1]`.
142144

145+
Symbols are likewise dropped from the identifier (but kept in the
146+
human-readable plain text used for things like TOC labels):
147+
148+
```djot
149+
# Release notes :tada:
150+
```
151+
152+
The ID is `Release-notes`, not `Release-notes-tada`. A heading whose only
153+
content is a symbol falls back to a generated `s-N` ID.
154+
143155
---
144156

145157
## CSS-Safe Heading IDs
@@ -219,19 +231,23 @@ Explicit IDs are used as-is without normalization.
219231

220232
### Spec Alignment
221233

222-
The djot spec's wording on auto-ID generation is being clarified in [jgm/djot#391](https://github.com/jgm/djot/issues/391). djot-php's normalization aligns with the proposed direction in most respects and deliberately deviates in two places — both motivated by producing valid CSS identifiers for `querySelector()` consumers.
234+
The remove-vs-replace question raised in [jgm/djot#391](https://github.com/jgm/djot/issues/391) was settled by [jgm/djot#393](https://github.com/jgm/djot/pull/393), which reworded the spec to: *"replacing each maximal run of non-alphanumeric ASCII characters with `-`, removing any leading or trailing `-`"*. Note that #393 changes only the spec **prose**: the djot.js reference implementation is unchanged and (per djot's own changelog policy) remains the authoritative behavior. The new prose is broader than djot.js itself, because it would also replace `_` with `-`, whereas djot.js keeps `_`.
235+
236+
djot-php replaces (does not remove) mid-word punctuation — the direction #393 settled on — and tracks the djot.js **implementation** where the prose and implementation disagree, deliberately deviating only where required to produce valid CSS identifiers for `querySelector()` consumers.
223237

224-
| Aspect | djot.js / djoths (proposed spec) | djot-php |
225-
|--------|---------------------------------|----------|
226-
| Mid-word punctuation (`A+B=C`) | replace with `-` → `A-B-C` | replace with `-` → `A-B-C` |
227-
| Non-ASCII letters (`Über uns`) | preserve → `Über-uns` | preserve → `Über-uns` |
228-
| Consecutive punctuation (`foo...bar`) | collapse to single `-` → `foo-bar` | collapse to single `-` → `foo-bar` |
229-
| Apostrophe (`That's all`) | preserve → `That's-all` | replace with `-` → `That-s-all` |
230-
| Double quote / `;` / `:` | preserve | replace with `-` |
231-
| Leading digit (`2024 recap`) | unspecified | prefix with `h-` → `h-2024-recap` |
232-
| Empty result (`!!!`) | unspecified | fallback → `heading` |
238+
| Aspect | djot.js reference impl | #393 spec prose | djot-php |
239+
|--------|------------------------|-----------------|----------|
240+
| Mid-word punctuation (`A+B=C`) | `A-B-C` | `A-B-C` | `A-B-C` |
241+
| Non-ASCII letters (`Über uns`) | preserve → `Über-uns` | preserve → `Über-uns` | preserve → `Über-uns` |
242+
| Consecutive punctuation (`foo...bar`) | collapse → `foo-bar` | collapse → `foo-bar` | collapse → `foo-bar` |
243+
| Underscore (`foo_bar`) | keep → `foo_bar` | replace → `foo-bar` | keep → `foo_bar` (follows impl; CSS-valid) |
244+
| Apostrophe (`That's all`) | preserve → `That's-all` | replace → `That-s-all` | replace → `That-s-all` (CSS-safe) |
245+
| Double quote / `;` / `:` | preserve | replace | replace with `-` (CSS-safe) |
246+
| Leading digit (`2024 recap`) | `2024-recap` | `2024-recap` | prefix → `h-2024-recap` (CSS requires non-digit start) |
247+
| Empty result (`!!!`) | `s-N` family | unspecified | fallback → `heading` |
248+
| Symbols / footnote refs | excluded | excluded | excluded |
233249

234-
The apostrophe / quote / semicolon / colon deviation is deliberate: these characters are not valid in unescaped CSS identifiers, so preserving them per the spec would force every JS consumer to round-trip through `CSS.escape()` before doing a selector lookup. The leading-digit and empty-result behaviors fill in spec gaps that other implementations handle inconsistently.
250+
The apostrophe / quote / semicolon / colon deviation is deliberate: these characters are not valid in unescaped CSS identifiers, so preserving them per djot.js would force every JS consumer to round-trip through `CSS.escape()` before doing a selector lookup. The leading-digit and empty-result behaviors fill in gaps that the spec and implementation handle inconsistently.
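The djot-php column of the table can be reproduced with a short normalization sketch. `sketchNormalizeId()` is a hypothetical helper written for this illustration; djot-php's actual `normalizeId()` is not shown in this commit and may differ in detail. The sketch assumes UTF-8 input and treats every ASCII character other than letters, digits, and `_` as punctuation.

```php
<?php
// Hypothetical helper, NOT djot-php's normalizeId(); it only mirrors the
// djot-php column of the table above.
function sketchNormalizeId(string $text): string
{
    // Replace each maximal run of ASCII characters that are neither
    // alphanumeric nor "_" with a single "-". The pattern is byte-based,
    // so UTF-8 bytes (>= 0x80) of non-ASCII letters are left untouched.
    $id = preg_replace('/[\x00-\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\x7F]+/', '-', $text) ?? '';

    // Remove any leading or trailing "-".
    $id = trim($id, '-');

    if ($id === '') {
        return 'heading'; // nothing usable left, e.g. "!!!"
    }

    // An unescaped CSS identifier must not start with a digit.
    if (preg_match('/^[0-9]/', $id) === 1) {
        return 'h-' . $id;
    }

    return $id;
}

$examples = [
    'A+B=C',
    'Über uns',
    'foo...bar',
    'foo_bar',
    "That's all",
    '2024 recap',
    '!!!',
];

foreach ($examples as $example) {
    echo $example, ' => ', sketchNormalizeId($example), PHP_EOL;
}
```

Because apostrophes, quotes, `;`, and `:` all fall inside the replaced ranges, the resulting IDs can be passed to `querySelector()` as `#id` without going through `CSS.escape()` first.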
235251

236252
---
237253

src/Renderer/HeadingIdTracker.php

Lines changed: 26 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@
66

77
use Djot\Node\Block\Heading;
88
use Djot\Node\Inline\Code;
9+
use Djot\Node\Inline\FootnoteRef;
910
use Djot\Node\Inline\HardBreak;
1011
use Djot\Node\Inline\Math;
1112
use Djot\Node\Inline\SoftBreak;
@@ -146,21 +147,33 @@ public function getPlainText(Node $node): string
146147

147148
/**
148149
* Recursively extract plain text from a node tree
150+
*
151+
* When $forId is true, symbols (`:name:`) are skipped, because the djot
152+
* spec excludes non-textual elements from auto-generated heading
153+
* identifiers. See jgm/djot#393. Otherwise the full human-readable text
154+
* is returned (e.g. for TOC labels), with symbols rendered as `:name:`.
155+
* Footnote references (`[^label]`) are skipped in both modes.
149156
*/
150-
protected function extractPlainText(Node $node): string
157+
protected function extractPlainText(Node $node, bool $forId = false): string
151158
{
152159
$text = '';
153160
foreach ($node->getChildren() as $child) {
161+
if ($child instanceof FootnoteRef) {
162+
continue;
163+
}
164+
154165
if ($child instanceof Text) {
155166
$text .= $child->getContent();
156167
} elseif ($child instanceof SoftBreak || $child instanceof HardBreak) {
157168
$text .= ' ';
158169
} elseif ($child instanceof Code || $child instanceof Math) {
159170
$text .= $child->getContent();
160171
} elseif ($child instanceof Symbol) {
161-
$text .= ':' . $child->getName() . ':';
172+
if (!$forId) {
173+
$text .= ':' . $child->getName() . ':';
174+
}
162175
} elseif ($child instanceof Node) {
163-
$text .= $this->extractPlainText($child);
176+
$text .= $this->extractPlainText($child, $forId);
164177
}
165178
}
166179

@@ -195,17 +208,23 @@ protected function generateId(Heading $node): string
195208
return $id;
196209
}
197210

198-
// Generate from heading text
199-
$headingText = $this->getPlainText($node);
211+
// Warm the plain-text cache so display consumers (TOC, permalinks)
212+
// still see the pre-mutation text including symbols.
213+
$this->getPlainText($node);
214+
215+
// The identifier itself is formed from the plain text content
216+
// excluding non-textual elements such as symbols and footnote
217+
// references (jgm/djot#393).
218+
$idText = $this->extractPlainText($node, forId: true);
200219

201-
if ($headingText === '') {
220+
if ($idText === '') {
202221
// Generate fallback ID
203222
$this->sectionCounter++;
204223

205224
return 's-' . $this->sectionCounter;
206225
}
207226

208-
$baseId = $this->normalizeId($headingText);
227+
$baseId = $this->normalizeId($idText);
209228

210229
// Track and deduplicate
211230
if (!isset($this->usedIds[$baseId])) {
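The control flow of the changed `generateId()` (warm the display cache, build the ID text separately, fall back to `s-N` when nothing textual remains) can be sketched in isolation. Everything here is hypothetical: `SketchHeadingParts` models a heading as a flat list of `[kind, value]` pairs, `SketchIdTracker` is not the real `HeadingIdTracker`, and normalization is reduced to replacing spaces, with de-duplication left out entirely.

```php
<?php
// Hypothetical, simplified model: a heading is a flat list of [kind, value]
// pairs, where kind is 'text', 'symbol', or 'footnote'.
final class SketchHeadingParts
{
    /** @param list<array{0: string, 1: string}> $parts */
    public function __construct(public array $parts) {}
}

final class SketchIdTracker
{
    /** @var array<int, string> display text keyed by spl_object_id() */
    private array $plainTextCache = [];

    private int $sectionCounter = 0;

    public function getPlainText(SketchHeadingParts $heading): string
    {
        // Assumption: headings stay alive for the tracker's lifetime, so
        // object IDs are not reused.
        $key = spl_object_id($heading);

        return $this->plainTextCache[$key] ??= $this->extract($heading, false);
    }

    public function generateId(SketchHeadingParts $heading): string
    {
        // Warm the cache with the display text (symbols included).
        $this->getPlainText($heading);

        // Build the identifier from text without symbols or footnote refs.
        $idText = $this->extract($heading, true);

        if ($idText === '') {
            $this->sectionCounter++;

            return 's-' . $this->sectionCounter;
        }

        // Stand-in for normalization; de-duplication is omitted.
        return trim(str_replace(' ', '-', $idText), '-');
    }

    private function extract(SketchHeadingParts $heading, bool $forId): string
    {
        $text = '';
        foreach ($heading->parts as [$kind, $value]) {
            if ($kind === 'text') {
                $text .= $value;
            } elseif ($kind === 'symbol' && !$forId) {
                $text .= ':' . $value . ':';
            }
            // 'footnote' parts contribute nothing in either mode.
        }

        return $text;
    }
}

$tracker = new SketchIdTracker();
$release = new SketchHeadingParts([['text', 'Release notes '], ['symbol', 'tada']]);
$symbolOnly = new SketchHeadingParts([['symbol', 'tada']]);

$releaseId = $tracker->generateId($release);
$releaseLabel = $tracker->getPlainText($release);
$symbolOnlyId = $tracker->generateId($symbolOnly);
```

Keeping the display text and the ID text as two separate extractions is what lets a TOC label still show `:tada:` while the anchor stays `Release-notes`.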

tests/TestCase/Renderer/HeadingIdTrackerTest.php

Lines changed: 83 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,9 +5,11 @@
55
namespace Djot\Test\TestCase\Renderer;
66

77
use Djot\Node\Block\Heading;
8+
use Djot\Node\Inline\FootnoteRef;
89
use Djot\Node\Inline\HardBreak;
910
use Djot\Node\Inline\SoftBreak;
1011
use Djot\Node\Inline\Strong;
12+
use Djot\Node\Inline\Symbol;
1113
use Djot\Node\Inline\Text;
1214
use Djot\Renderer\HeadingIdTracker;
1315
use PHPUnit\Framework\TestCase;
@@ -278,4 +280,85 @@ public function testGetIdForHeadingAlsoCachesPlainText(): void
278280
$text = $this->tracker->getPlainText($heading);
279281
$this->assertSame('Title', $text);
280282
}
283+
284+
/**
285+
* The djot spec (and jgm/djot#393) says auto-generated identifiers are formed
286+
* from the plain text content "excluding non-textual elements such as footnote
287+
* references and symbols". A symbol must not leak into the ID.
288+
*/
289+
public function testSymbolsExcludedFromId(): void
290+
{
291+
$heading = new Heading(2);
292+
$heading->appendChild(new Text('Introduction '));
293+
$heading->appendChild(new Symbol('smile'));
294+
295+
$id = $this->tracker->getIdForHeading($heading);
296+
297+
$this->assertSame('Introduction', $id);
298+
}
299+
300+
public function testHeadingWithOnlySymbolGetsFallbackId(): void
301+
{
302+
$heading = new Heading(2);
303+
$heading->appendChild(new Symbol('tada'));
304+
305+
$id = $this->tracker->getIdForHeading($heading);
306+
307+
$this->assertSame('s-1', $id);
308+
}
309+
310+
public function testSymbolBetweenWordsDoesNotProduceStrayDashes(): void
311+
{
312+
$heading = new Heading(2);
313+
$heading->appendChild(new Text('Build'));
314+
$heading->appendChild(new Symbol('rocket'));
315+
$heading->appendChild(new Text('Status'));
316+
317+
$id = $this->tracker->getIdForHeading($heading);
318+
319+
$this->assertSame('BuildStatus', $id);
320+
}
321+
322+
/**
323+
* Footnote references are likewise excluded from the identifier:
324+
* `# Introduction[^1]` generates `Introduction`, not `Introduction1`.
325+
*/
326+
public function testFootnoteReferenceExcludedFromId(): void
327+
{
328+
$heading = new Heading(2);
329+
$heading->appendChild(new Text('Introduction'));
330+
$heading->appendChild(new FootnoteRef('1'));
331+
332+
$id = $this->tracker->getIdForHeading($heading);
333+
334+
$this->assertSame('Introduction', $id);
335+
}
336+
337+
/**
338+
* Symbols are still part of the human-readable plain text (e.g. for TOC
339+
* labels); only the *identifier* excludes them. This pins that boundary.
340+
*/
341+
public function testSymbolsRetainedInPlainText(): void
342+
{
343+
$heading = new Heading(2);
344+
$heading->appendChild(new Text('Introduction '));
345+
$heading->appendChild(new Symbol('smile'));
346+
347+
$this->assertSame('Introduction :smile:', $this->tracker->getPlainText($heading));
348+
}
349+
350+
/**
351+
* djot.js keeps `_` (it is not in its punctuation denylist) and it is a
352+
* valid CSS identifier character, so djot-php keeps it too. This pins the
353+
* deliberate divergence from the looser #393 spec prose.
354+
*/
355+
public function testUnderscoreRetainedInId(): void
356+
{
357+
$heading = new Heading(2);
358+
$heading->appendChild(new Text('foo_bar baz'));
359+
360+
$id = $this->tracker->getIdForHeading($heading);
361+
362+
$this->assertSame('foo_bar-baz', $id);
363+
}
281364
}
