docs: tighten skill for unfamiliar agents — tokenization nuances + fix article examples

dani-polani · claude · dani-polani · commit f0d50efcd47f · 2026-06-23T04:22:49.000+03:00
- Fix two remaining "article bug" examples (world → le + monde, not just monde):
  SKILL.md minimal request and references custom-style example
- Replace minimal request with an unambiguous 1:1 example (I sleep / Я сплю)
- Rewrite "Word index counting" to spell out the nuances an unfamiliar
  agent needs: split char is consumed/not rendered, tokenSplitChars is
  configurable and shifts indices, punctuation stays attached, merge char,
  RTL reading-order indexing
- Add Tokenization, Constraints & errors sections to references/api.md
- Clarify GET is lines-only (ignores alignments/settings/pairs)

All 10 examples and the punctuation/merge claims verified against the live API.
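
For reviewers, the tokenization rules documented in this commit can be approximated with the sketch below. It is an illustration of the documented behaviour, not the service's implementation: the function name `tokenize` and its parameters are hypothetical, and dropping empty tokens between consecutive split characters is an assumption the docs do not confirm.

```python
def tokenize(line, split_chars=".-|", merge_char="+"):
    """Approximate the documented tokenization rules (hypothetical helper)."""
    tokens, current = [], ""
    for ch in line:
        # Whitespace always splits; split characters also split and are dropped.
        if ch.isspace() or ch in split_chars:
            if current:  # assumption: empty tokens are discarded
                tokens.append(current)
            current = ""
        else:
            # Punctuation such as "," or "!" is not a split char, so it stays attached.
            current += ch
    if current:
        tokens.append(current)
    # The merge char keeps its parts in one token, displayed with a space.
    return [t.replace(merge_char, " ") for t in tokens]


if __name__ == "__main__":
    for text in ["I have been going", "go.PST.IPFV", "Hello, world!", "is+playing"]:
        print(text, "->", list(enumerate(tokenize(text))))
    # With the gloss-friendly split set, the dots survive and the line has 2 words.
    print(tokenize("1SG.NOM go.PST.IPFV", split_chars="-|"))
```

Word indices for `alignments` are the list positions of the returned tokens. If the result looks surprising, the GET endpoint remains the authoritative check.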

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/word-aligner-skill.zip b/word-aligner-skill.zip
diff --git a/word-aligner-skill/SKILL.md b/word-aligner-skill/SKILL.md
@@ -14,12 +14,12 @@ Word Aligner generates shareable interactive diagrams showing which words in one
 
 ```json
 {
-  "lines": ["Hello world", "Bonjour le monde"],
-  "alignments": [[0, 0, 1, 0], [0, 1, 1, 2]]
+  "lines": ["I sleep", "Я сплю"],
+  "alignments": [[0, 0, 1, 0], [0, 1, 1, 1]]
 }
 ```
 
-`alignments` entries are `[lineA, wordA, lineB, wordB]` — 0-based indices, lines must be adjacent.
+`alignments` entries are `[lineA, wordA, lineB, wordB]`: word `wordA` of line `lineA` links to word `wordB` of line `lineB`. All indices are 0-based, and the two lines must be vertically adjacent (`|lineA − lineB| = 1`).
 
 ## Workflow
 
@@ -28,11 +28,19 @@ Word Aligner generates shareable interactive diagrams showing which words in one
 3. Call the API.
 4. Return the `url` to the user with a brief explanation.
 
-## Word index counting
+## Word index counting (read carefully — this is the #1 source of mistakes)
+
+Word indices are token positions, so you must tokenize a line exactly the way the service does before assigning indices:
+
+1. **Whitespace always splits.** `"I have been going"` → `I`[0] `have`[1] `been`[2] `going`[3].
+2. **The `tokenSplitChars` characters also split, and are then removed from the output.** The default set is `.-|`. So `"go.PST.IPFV"` becomes three *separate* tokens `go` `PST` `IPFV` and **the dots disappear from the rendered diagram**. This is usually not what you want for Leipzig glosses — see the gloss pattern below, which sets `tokenSplitChars` to `"-|"` to keep the dots.
+3. **Punctuation stays attached by default.** `"Hello, world!"` → `Hello,`[0] `world!`[1] (the comma and exclamation mark are part of the tokens, not separate).
+4. **The merge char `+` joins parts into one token** rendered with a space: `"is+playing"` is a single token (index counts as one word) that displays as `is playing`.
+5. **RTL lines:** word 0 is the logically first word (the rightmost one on screen for Hebrew/Arabic). Index in reading order, not visual order.
 
-Count left to right from 0, splitting on whitespace. Characters `.` `-` `|` also split. For RTL lines, word 0 is the logically first word (rightmost on screen).
+Whenever you set `tokenSplitChars` in `settings`, recount every line's indices using that same split set — changing it shifts all the indices on every line.
 
-If uncertain about tokenization, call `GET https://aligner.tinygods.dev/api/align?lines=your+text` first and open the URL to count word boxes in the editor.
+If unsure, call `GET https://aligner.tinygods.dev/api/align?lines=your+text` first and open the URL to count the word boxes in the editor.
 
 ## Common patterns
 
diff --git a/word-aligner-skill/references/api.md b/word-aligner-skill/references/api.md
@@ -77,18 +77,49 @@ Controls for a specific adjacent line pair. `lower` must equal `upper + 1`.
 
 Each tuple is `[lineA, wordA, lineB, wordB]`:
 - All indices are **0-based**
-- `lineA` and `lineB` must be **adjacent**: `|lineA − lineB| = 1`
-- Multiple tuples sharing the same word form a **color group** automatically
+- `lineA` and `lineB` must be **adjacent**: `|lineA − lineB| = 1` (you cannot connect across a line; stack intermediate tiers instead)
+- Multiple tuples sharing the same word form a **color group** automatically (one-to-many, many-to-one, and many-to-many all work)
 
 ---
 
-## GET /api/align (simple, no alignments)
+## Constraints and errors
+
+- **1–8 lines.** Fewer than 1 or more than 8 is rejected.
+- **Adjacency:** alignment lines must differ by exactly 1; `pairs` require `lower = upper + 1`.
+- **Index ranges:** line and word indices must be in range for the (tokenized) text.
+- Numeric settings are clamped, not rejected: `lineThickness`→1–8, `lineOpacity`→0.2–1, `sizePx`→12–64, line `gapPx`→0–56, pair `gapPx`→12–156.
+
+On invalid input the API returns **HTTP 400** with a JSON body:
+
+```json
+{ "error": "alignments[0]: word 4 out of range for line 0 (\"1SG.NOM go.PST.IPFV\" has 2 word(s))" }
+```
+
+The error message names the offending field, index, and the tokenized word count — read it to fix indices.
+
+---
+
+## Tokenization and word indices
+
+Word indices in `alignments` and `pairs` refer to **token positions**, so tokenize each line the way the service does before counting:
+
+- **Whitespace always splits.**
+- **`tokenSplitChars` (default `.-|`) also splits, and the split character is removed from the rendered output.** `"go.PST.IPFV"` → three tokens `go` `PST` `IPFV` with the dots gone. Override `tokenSplitChars` (e.g. to `"-|"`) to keep characters you want displayed.
+- **Punctuation stays attached by default** (the API does not split punctuation). `"Hello, world!"` → `Hello,`[0] `world!`[1].
+- **The merge char `+` (default) joins parts into one token** displayed with a space: `"is+playing"` is one token rendered `is playing`.
+- **RTL lines are indexed in reading order** — word 0 is the logically first word (rightmost on screen).
+
+Changing `tokenSplitChars` shifts every line's indices — recount after setting it.
+
+---
+
+## GET /api/align (simple, lines only)
 
 ```
 GET /api/align?lines=Hello+world&lines=Bonjour+le+monde
 ```
 
-Returns the same `{ "url": "..." }` response. Useful for opening the editor pre-filled with text, without pre-drawn links. Helpful for verifying word tokenization.
+Returns the same `{ "url": "..." }` response. **Lines only** — this endpoint ignores `alignments`, `settings`, and `pairs`; use POST for those. Useful for opening the editor pre-filled with text (no pre-drawn links) and for verifying how a line tokenizes: open the URL and count the word boxes.
 
 ---
 
@@ -107,7 +138,7 @@ Returns the same `{ "url": "..." }` response. Useful for opening the editor pre-
 ```json
 {
   "lines": ["Hello world", "Bonjour le monde"],
-  "alignments": [[0,0,1,0], [0,1,1,2]],
+  "alignments": [[0,0,1,0], [0,1,1,1], [0,1,1,2]],
   "settings": {
     "palette": "vivid",
     "lineStyle": "straight",