Commit f0d50ef

dani-polani and claude committed
docs: tighten skill for unfamiliar agents — tokenization nuances + fix article examples
- Fix two remaining "article bug" examples (world → le + monde, not just monde): SKILL.md minimal request and references custom-style example
- Replace minimal request with an unambiguous 1:1 example (I sleep / Я сплю)
- Rewrite "Word index counting" to spell out the nuances an unfamiliar agent needs: split char is consumed/not rendered, tokenSplitChars is configurable and shifts indices, punctuation stays attached, merge char, RTL reading-order indexing
- Add Tokenization, Constraints & errors sections to references/api.md
- Clarify GET is lines-only (ignores alignments/settings/pairs)

All 10 examples and the punctuation/merge claims verified against the live API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent f37353f commit f0d50ef

3 files changed

Lines changed: 50 additions & 11 deletions

word-aligner-skill.zip

1.34 KB
Binary file not shown.

word-aligner-skill/SKILL.md

Lines changed: 14 additions & 6 deletions
@@ -14,12 +14,12 @@ Word Aligner generates shareable interactive diagrams showing which words in one

 ```json
 {
-  "lines": ["Hello world", "Bonjour le monde"],
-  "alignments": [[0, 0, 1, 0], [0, 1, 1, 2]]
+  "lines": ["I sleep", "Я сплю"],
+  "alignments": [[0, 0, 1, 0], [0, 1, 1, 1]]
 }
 ```

-`alignments` entries are `[lineA, wordA, lineB, wordB]`: 0-based indices, lines must be adjacent.
+`alignments` entries are `[lineA, wordA, lineB, wordB]`: word `wordA` of line `lineA` links to word `wordB` of line `lineB`. All indices are 0-based, and the two lines must be vertically adjacent (`|lineA − lineB| = 1`).

 ## Workflow

@@ -28,11 +28,19 @@ Word Aligner generates shareable interactive diagrams showing which words in one
 3. Call the API.
 4. Return the `url` to the user with a brief explanation.

-## Word index counting
+## Word index counting (read carefully — this is the #1 source of mistakes)
+
+Word indices are token positions, so you must tokenize a line exactly the way the service does before assigning indices:
+
+1. **Whitespace always splits.** `"I have been going"` → `I`[0] `have`[1] `been`[2] `going`[3].
+2. **The `tokenSplitChars` characters also split, and are then removed from the output.** The default set is `.-|`. So `"go.PST.IPFV"` becomes three *separate* tokens `go` `PST` `IPFV` and **the dots disappear from the rendered diagram**. This is usually not what you want for Leipzig glosses — see the gloss pattern below, which sets `tokenSplitChars` to `"-|"` to keep the dots.
+3. **Punctuation stays attached by default.** `"Hello, world!"` → `Hello,`[0] `world!`[1] (the comma and exclamation mark are part of the tokens, not separate).
+4. **The merge char `+` joins parts into one token** rendered with a space: `"is+playing"` is a single token (index counts as one word) that displays as `is playing`.
+5. **RTL lines:** word 0 is the logically first word (the rightmost one on screen for Hebrew/Arabic). Index in reading order, not visual order.

-Count left to right from 0, splitting on whitespace. Characters `.` `-` `|` also split. For RTL lines, word 0 is the logically first word (rightmost on screen).
+Whenever you set `tokenSplitChars` in `settings`, recount every line's indices using that same split set — changing it shifts all the indices on every line.

-If uncertain about tokenization, call `GET https://aligner.tinygods.dev/api/align?lines=your+text` first and open the URL to count word boxes in the editor.
+If unsure, call `GET https://aligner.tinygods.dev/api/align?lines=your+text` first and open the URL to count the word boxes in the editor.

 ## Common patterns

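The tokenization rules added to SKILL.md above can be approximated locally before assigning indices. The following is a minimal Python sketch, not the service's actual implementation: the function name `tokenize` and its parameters are hypothetical, and edge cases (consecutive split characters, a leading or trailing `+`) are handled by assumption. The live API remains the authority on how a line tokenizes.

```python
def tokenize(line, split_chars=".-|", merge_char="+"):
    """Approximate the described rules; hypothetical helper, not the service's code."""
    # Split characters are consumed: treat each one like whitespace.
    for ch in split_chars:
        line = line.replace(ch, " ")
    # Whitespace always splits; str.split() also drops empty pieces (an assumption).
    tokens = line.split()
    # The merge char joins parts into one token that is displayed with a space.
    if merge_char:
        tokens = [t.replace(merge_char, " ") for t in tokens]
    return tokens


if __name__ == "__main__":
    for index, token in enumerate(tokenize("I have been going")):
        print(index, token)
```

With the default split set, `tokenize("go.PST.IPFV")` yields three tokens, while `tokenize("go.PST.IPFV", split_chars="-|")` keeps the dots and yields one, which is why indices must be recounted whenever `tokenSplitChars` changes.
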
word-aligner-skill/references/api.md

Lines changed: 36 additions & 5 deletions
@@ -77,18 +77,49 @@ Controls for a specific adjacent line pair. `lower` must equal `upper + 1`.

 Each tuple is `[lineA, wordA, lineB, wordB]`:
 - All indices are **0-based**
-- `lineA` and `lineB` must be **adjacent**: `|lineA − lineB| = 1`
-- Multiple tuples sharing the same word form a **color group** automatically
+- `lineA` and `lineB` must be **adjacent**: `|lineA − lineB| = 1` (you cannot connect across a line; stack intermediate tiers instead)
+- Multiple tuples sharing the same word form a **color group** automatically (one-to-many, many-to-one, and many-to-many all work)

 ---

-## GET /api/align (simple, no alignments)
+## Constraints and errors
+
+- **1–8 lines.** Fewer than 1 or more than 8 is rejected.
+- **Adjacency:** alignment lines must differ by exactly 1; `pairs` require `lower = upper + 1`.
+- **Index ranges:** line and word indices must be in range for the (tokenized) text.
+- Numeric settings are clamped, not rejected: `lineThickness`→1–8, `lineOpacity`→0.2–1, `sizePx`→12–64, line `gapPx`→0–56, pair `gapPx`→12–156.
+
+On invalid input the API returns **HTTP 400** with a JSON body:
+
+```json
+{ "error": "alignments[0]: word 4 out of range for line 0 (\"1SG.NOM go.PST.IPFV\" has 2 word(s))" }
+```
+
+The error message names the offending field, index, and the tokenized word count — read it to fix indices.
+
+---
+
+## Tokenization and word indices
+
+Word indices in `alignments` and `pairs` refer to **token positions**, so tokenize each line the way the service does before counting:
+
+- **Whitespace always splits.**
+- **`tokenSplitChars` (default `.-|`) also splits, and the split character is removed from the rendered output.** `"go.PST.IPFV"` → three tokens `go` `PST` `IPFV` with the dots gone. Override `tokenSplitChars` (e.g. to `"-|"`) to keep characters you want displayed.
+- **Punctuation stays attached by default** (the API does not split punctuation). `"Hello, world!"` → `Hello,`[0] `world!`[1].
+- **The merge char `+` (default) joins parts into one token** displayed with a space: `"is+playing"` is one token rendered `is playing`.
+- **RTL lines are indexed in reading order** — word 0 is the logically first word (rightmost on screen).
+
+Changing `tokenSplitChars` shifts every line's indices — recount after setting it.
+
+---
+
+## GET /api/align (simple, lines only)

 ```
 GET /api/align?lines=Hello+world&lines=Bonjour+le+monde
 ```

-Returns the same `{ "url": "..." }` response. Useful for opening the editor pre-filled with text, without pre-drawn links. Helpful for verifying word tokenization.
+Returns the same `{ "url": "..." }` response. **Lines only** — this endpoint ignores `alignments`, `settings`, and `pairs`; use POST for those. Useful for opening the editor pre-filled with text (no pre-drawn links) and for verifying how a line tokenizes: open the URL and count the word boxes.

 ---

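The constraints listed in the api.md hunk above can be checked client-side before calling the API, which avoids a round trip for the most common index mistakes. The sketch below is an illustration in Python under stated assumptions: `validate_alignments`, `clamp_setting`, and `CLAMP_RANGES` are hypothetical names, the error strings are this sketch's own wording rather than the API's, and lines are expected to arrive already tokenized as lists of words.

```python
# Ranges taken from the "clamped, not rejected" list; gapPx is omitted because
# its range differs between lines (0-56) and pairs (12-156).
CLAMP_RANGES = {"lineThickness": (1, 8), "lineOpacity": (0.2, 1), "sizePx": (12, 64)}


def clamp_setting(name, value):
    """Mirror the documented clamping for a numeric setting (hypothetical helper)."""
    low, high = CLAMP_RANGES[name]
    return max(low, min(high, value))


def validate_alignments(token_lines, alignments):
    """Return a list of problems found; an empty list means none were found."""
    errors = []
    if not 1 <= len(token_lines) <= 8:
        errors.append(f"expected 1-8 lines, got {len(token_lines)}")
    for i, (line_a, word_a, line_b, word_b) in enumerate(alignments):
        # Alignment lines must differ by exactly 1.
        if abs(line_a - line_b) != 1:
            errors.append(f"alignments[{i}]: lines {line_a} and {line_b} are not adjacent")
        # Line and word indices must be in range for the tokenized text.
        for line, word in ((line_a, word_a), (line_b, word_b)):
            if not 0 <= line < len(token_lines):
                errors.append(f"alignments[{i}]: line {line} out of range")
            elif not 0 <= word < len(token_lines[line]):
                count = len(token_lines[line])
                errors.append(
                    f"alignments[{i}]: word {word} out of range for line {line} ({count} word(s))"
                )
    return errors
```

For example, `validate_alignments([["Hello", "world"], ["Bonjour", "le", "monde"]], [[0, 4, 1, 0]])` reports that word 4 is out of range for line 0, the same class of mistake the HTTP 400 example describes.
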
@@ -107,7 +138,7 @@ Returns the same `{ "url": "..." }` response. Useful for opening the editor pre-
 ```json
 {
   "lines": ["Hello world", "Bonjour le monde"],
-  "alignments": [[0,0,1,0], [0,1,1,2]],
+  "alignments": [[0,0,1,0], [0,1,1,1], [0,1,1,2]],
   "settings": {
     "palette": "vivid",
     "lineStyle": "straight",

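To show how the two request shapes in this diff differ, here is a small Python sketch using only the standard library. It assumes POST goes to the same `/api/align` path as the GET shown above and accepts a JSON body, which the diff implies but does not show. The helper names (`get_url`, `build_post_request`, `create_diagram`) are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Host taken from the GET example in SKILL.md; the POST path is an assumption.
API_URL = "https://aligner.tinygods.dev/api/align"


def get_url(lines):
    """Build the lines-only GET URL (repeated `lines` query parameters)."""
    query = urllib.parse.urlencode([("lines", line) for line in lines])
    return f"{API_URL}?{query}"


def build_post_request(lines, alignments, settings=None):
    """Build, but do not send, a POST request carrying alignments and settings."""
    payload = {"lines": lines, "alignments": alignments}
    if settings is not None:
        payload["settings"] = settings
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_diagram(request):
    """Send the request and return the `url` field of the JSON response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["url"]
```

Calling `create_diagram(build_post_request(["Hello world", "Bonjour le monde"], [[0, 0, 1, 0], [0, 1, 1, 1], [0, 1, 1, 2]]))` would link `world` to both `le` and `monde`, matching the corrected example in this hunk. An HTTP 400 surfaces as `urllib.error.HTTPError`, whose body should carry the `error` message.
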