Commit a5e4a55

feat(document-api): anchor tracked changes to text spans in extract (SD-2766) (#2973)
* feat(document-api): anchor tracked changes to text spans in extract (SD-2766)

Add `block.textSpans` so consumers can map each tracked change to the exact run of text it covers, instead of guessing from a free-form excerpt that's ambiguous when the same word repeats.

Also add `blockIds` and `wordRevisionIds` on each `trackedChanges[]` entry so a review queue or RAG citation flow can navigate back without scanning every block.

Suppress the aggregate excerpt for paired replacements (both insert and delete in one entity) where the concatenated value was misleading; spans carry the per-half text.

The new fields are all optional and the existing `text` field is unchanged, so non-tracked-change consumers see no diff.

* fix(document-api): detect paired tracked changes from observed mark types

Address PR review findings:

- Generalize paired-replacement detection to suppress the aggregate excerpt for any multi-type entity, keying off mark types observed during the span walk. The previous check looked only at imported `wordRevisionIds`, missing in-app paired edits where no `sourceId` is set.
- Rename `rawIdMap` → `canonicalIdByAlias` to match the existing convention in `tracked-change-refs.ts` (the value is the canonical entity id, not a raw mark id).
- Document `blockIds` order as document order in the public typedef.
- Inline the default value in the `type` JSDoc so readers don't chase the cross-reference.
- Gate the visual-inspection log behind `DEBUG_EXTRACT_SAMPLE` so CI doesn't print two pretty-printed JSON blobs per run.
- Add unit tests for the in-app paired case (no `sourceId`), span coalescing of identical adjacent marks, and non-tracked marks (bold) coexisting with tracked marks without affecting span boundaries.

* test(extract): add real Word-authored DOCX with paired replacements (SD-2766)

Adds a 22 KB Word-authored fixture (74 deletes + 104 inserts, all paired replacements with one author and one timestamp) reported by the customer who originally asked for tracked-change anchoring.

The test asserts that title-level "Report" -> "Captain's Log" and body-level "get started" -> "set sail" replacements come through as distinct delete/insert spans, that every tracked change reports blockIds, and that span text concatenates back to block.text.

A DEBUG_EXTRACT_SAMPLE-gated log prints the rendered <ins>/<del> output for the first five tracked blocks for visual verification.
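
Reviewer sketch, not part of the commit: the message mentions rendering tracked blocks as <ins>/<del> for visual checks. One way a consumer might do that from `block.textSpans` is shown below. The type names, `renderBlock`, `escapeHtml`, and the sample block are hypothetical; only the field names (`text`, `textSpans`, `trackedChanges`, `entityId`, `type`) come from the diff.

```typescript
// Local mirrors of the shapes added in this commit (not imported from the package).
type TrackChangeType = 'insert' | 'delete' | 'format';
interface SpanChange { entityId: string; type: TrackChangeType }
interface TextSpan { text: string; trackedChanges?: SpanChange[] }
interface Block { nodeId: string; text: string; textSpans?: TextSpan[] }

// Minimal escaping so run text cannot break the generated markup.
function escapeHtml(s: string): string {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Hypothetical helper: wrap each run in <del> or <ins> based on its marks.
// Format-only and untracked runs are emitted as plain text in this sketch.
function renderBlock(block: Block): string {
  // Blocks without tracked changes omit textSpans, so fall back to text.
  const spans: TextSpan[] = block.textSpans ?? [{ text: block.text }];
  return spans
    .map((span) => {
      const types = new Set((span.trackedChanges ?? []).map((c) => c.type));
      const text = escapeHtml(span.text);
      if (types.has('delete')) return `<del>${text}</del>`;
      if (types.has('insert')) return `<ins>${text}</ins>`;
      return text;
    })
    .join('');
}

// Made-up sample modeled on the "Report" -> "Captain's Log" replacement.
const titleBlock: Block = {
  nodeId: 'node-title',
  text: "ReportCaptain's Log",
  textSpans: [
    { text: 'Report', trackedChanges: [{ entityId: 'tc-1', type: 'delete' }] },
    { text: "Captain's Log", trackedChanges: [{ entityId: 'tc-1', type: 'insert' }] },
  ],
};

console.log(renderBlock(titleBlock));
```

For the sample above this prints `<del>Report</del><ins>Captain's Log</ins>`. Both halves share one `entityId`, which is why the per-run `type` has to be read from the span rather than from the aggregate entry.
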
1 parent d54519c commit a5e4a55

8 files changed

Lines changed: 1194 additions & 39 deletions

apps/docs/document-api/reference/_generated-manifest.json

Lines changed: 1 addition & 1 deletion
@@ -1031,5 +1031,5 @@
     }
   ],
   "marker": "{/* GENERATED FILE: DO NOT EDIT. Regenerate via `pnpm run docapi:sync`. */}",
-  "sourceHash": "0bb50c2977e652d32a4c3dd591c774e7d164c013b53f2c951de8573911ecdef6"
+  "sourceHash": "cdb0b02e84f6eb7f4db962c177d082e0f89ec48517abc775736a3d17e4da9ba8"
 }

apps/docs/document-api/reference/extract.mdx

Lines changed: 91 additions & 15 deletions
@@ -49,16 +49,18 @@ _No fields._
     {
       "headingLevel": 1,
       "nodeId": "node-def456",
-      "tableContext": {
-        "colspan": 1,
-        "columnIndex": 1,
-        "parentRowIndex": 1,
-        "parentTableOrdinal": 1,
-        "rowIndex": 1,
-        "rowspan": 1,
-        "tableOrdinal": 1
-      },
       "text": "Hello, world.",
+      "textSpans": [
+        {
+          "text": "Hello, world.",
+          "trackedChanges": [
+            {
+              "entityId": "entity-789",
+              "type": "insert"
+            }
+          ]
+        }
+      ],
       "type": "example"
     }
   ],
@@ -73,10 +75,15 @@ _No fields._
   "revision": "example",
   "trackedChanges": [
     {
-      "author": "Jane Doe",
+      "blockIds": [
+        "example"
+      ],
       "entityId": "entity-789",
-      "excerpt": "Sample excerpt...",
-      "type": "insert"
+      "type": "insert",
+      "wordRevisionIds": {
+        "delete": "example",
+        "insert": "example"
+      }
     }
   ]
 }
@@ -112,7 +119,7 @@ _No fields._
 "additionalProperties": false,
 "properties": {
   "headingLevel": {
-    "description": "Heading level (1–6). Only present for headings.",
+    "description": "Heading level (1-6). Only present for headings.",
     "type": "integer"
   },
   "nodeId": {
@@ -168,6 +175,49 @@ _No fields._
   "description": "Full plain text content of the block.",
   "type": "string"
 },
+"textSpans": {
+  "description": "Block text broken into runs with tracked-change marks preserved per run. Present only when the block contains at least one tracked change. Concatenating span text yields `text`.",
+  "items": {
+    "additionalProperties": false,
+    "properties": {
+      "text": {
+        "description": "Raw text of the run.",
+        "type": "string"
+      },
+      "trackedChanges": {
+        "description": "Tracked-change marks applied to this run.",
+        "items": {
+          "additionalProperties": false,
+          "properties": {
+            "entityId": {
+              "description": "Tracked change entity ID matching an entry in trackedChanges[].",
+              "type": "string"
+            },
+            "type": {
+              "enum": [
+                "insert",
+                "delete",
+                "format"
+              ],
+              "type": "string"
+            }
+          },
+          "required": [
+            "entityId",
+            "type"
+          ],
+          "type": "object"
+        },
+        "type": "array"
+      }
+    },
+    "required": [
+      "text"
+    ],
+    "type": "object"
+  },
+  "type": "array"
+},
 "type": {
   "description": "Block type: paragraph, heading, listItem, image, tableOfContents.",
   "type": "string"
@@ -234,25 +284,51 @@ _No fields._
     "description": "Change author name.",
     "type": "string"
   },
+  "blockIds": {
+    "description": "Block IDs whose textSpans carry this change.",
+    "items": {
+      "type": "string"
+    },
+    "type": "array"
+  },
   "date": {
     "description": "Change date (ISO string).",
     "type": "string"
   },
   "entityId": {
-    "description": "Tracked change entity ID — pass to scrollToElement() for navigation.",
+    "description": "Tracked change entity ID. Pass to scrollToElement() for navigation.",
     "type": "string"
   },
   "excerpt": {
-    "description": "Short text excerpt of the changed content.",
+    "description": "Short text excerpt of the changed content. Omitted for paired replacements; read block.textSpans for the per-half text.",
     "type": "string"
   },
   "type": {
+    "description": "Aggregate type at the entity level. In paired replacement mode, a delete+insert pair shares one entity and this collapses to 'insert'; per-half type lives on block.textSpans[].trackedChanges[].",
     "enum": [
       "insert",
       "delete",
       "format"
     ],
     "type": "string"
+  },
+  "wordRevisionIds": {
+    "additionalProperties": false,
+    "properties": {
+      "delete": {
+        "description": "Original OOXML w:id from a w:del mark.",
+        "type": "string"
+      },
+      "format": {
+        "description": "Original OOXML w:id from a w:rPrChange mark.",
+        "type": "string"
+      },
+      "insert": {
+        "description": "Original OOXML w:id from a w:ins mark.",
+        "type": "string"
+      }
+    },
+    "type": "object"
   }
 },
 "required": [
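
Reviewer sketch, not part of the commit: the `excerpt` description above says it is omitted for paired replacements, and the commit message says detection keys off the mark types observed during the span walk. A consumer could apply the same rule on its side to predict which entries have no excerpt. `observedTypes`, `isMultiType`, and the sample data are hypothetical names for illustration, and this is not the resolver's actual code.

```typescript
// Local mirrors of the documented shapes (not imported from the package).
type TrackChangeType = 'insert' | 'delete' | 'format';
interface SpanChange { entityId: string; type: TrackChangeType }
interface TextSpan { text: string; trackedChanges?: SpanChange[] }
interface Block { nodeId: string; text: string; textSpans?: TextSpan[] }

// Walk every span once and record which mark types each entity carries.
function observedTypes(blocks: Block[]): Map<string, Set<TrackChangeType>> {
  const seen = new Map<string, Set<TrackChangeType>>();
  for (const block of blocks) {
    for (const span of block.textSpans ?? []) {
      for (const change of span.trackedChanges ?? []) {
        let types = seen.get(change.entityId);
        if (!types) {
          types = new Set<TrackChangeType>();
          seen.set(change.entityId, types);
        }
        types.add(change.type);
      }
    }
  }
  return seen;
}

// Assumption: an entity with more than one observed type is the
// "multi-type entity" whose aggregate excerpt is suppressed.
function isMultiType(entityId: string, seen: Map<string, Set<TrackChangeType>>): boolean {
  return (seen.get(entityId)?.size ?? 0) > 1;
}

// Made-up data: tc-1 is a paired replacement, tc-2 is a plain insertion.
const sampleBlocks: Block[] = [
  {
    nodeId: 'node-a',
    text: 'get startedset sail now',
    textSpans: [
      { text: 'get started', trackedChanges: [{ entityId: 'tc-1', type: 'delete' }] },
      { text: 'set sail', trackedChanges: [{ entityId: 'tc-1', type: 'insert' }] },
      { text: ' now', trackedChanges: [{ entityId: 'tc-2', type: 'insert' }] },
    ],
  },
];

const seenTypes = observedTypes(sampleBlocks);
console.log(isMultiType('tc-1', seenTypes), isMultiType('tc-2', seenTypes));
```

Keying off observed types rather than `wordRevisionIds` matters because, per the commit message, in-app paired edits carry no imported revision IDs.
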

packages/document-api/src/contract/schemas.ts

Lines changed: 51 additions & 4 deletions
@@ -2968,7 +2968,32 @@ const operationSchemas: Record<OperationId, OperationSchemaSet> = {
   description: 'Block type: paragraph, heading, listItem, image, tableOfContents.',
 },
 text: { type: 'string', description: 'Full plain text content of the block.' },
-headingLevel: { type: 'integer', description: 'Heading level (1–6). Only present for headings.' },
+textSpans: {
+  type: 'array',
+  description:
+    'Block text broken into runs with tracked-change marks preserved per run. Present only when the block contains at least one tracked change. Concatenating span text yields `text`.',
+  items: objectSchema(
+    {
+      text: { type: 'string', description: 'Raw text of the run.' },
+      trackedChanges: {
+        type: 'array',
+        description: 'Tracked-change marks applied to this run.',
+        items: objectSchema(
+          {
+            entityId: {
+              type: 'string',
+              description: 'Tracked change entity ID matching an entry in trackedChanges[].',
+            },
+            type: { type: 'string', enum: ['insert', 'delete', 'format'] },
+          },
+          ['entityId', 'type'],
+        ),
+      },
+    },
+    ['text'],
+  ),
+},
+headingLevel: { type: 'integer', description: 'Heading level (1-6). Only present for headings.' },
 tableContext: objectSchema(
   {
     tableOrdinal: {
@@ -3024,10 +3049,32 @@ const operationSchemas: Record<OperationId, OperationSchemaSet> = {
 {
   entityId: {
     type: 'string',
-    description: 'Tracked change entity ID — pass to scrollToElement() for navigation.',
+    description: 'Tracked change entity ID. Pass to scrollToElement() for navigation.',
+  },
+  type: {
+    type: 'string',
+    enum: ['insert', 'delete', 'format'],
+    description:
+      "Aggregate type at the entity level. In paired replacement mode, a delete+insert pair shares one entity and this collapses to 'insert'; per-half type lives on block.textSpans[].trackedChanges[].",
+  },
+  blockIds: {
+    type: 'array',
+    description: 'Block IDs whose textSpans carry this change.',
+    items: { type: 'string' },
+  },
+  wordRevisionIds: objectSchema(
+    {
+      insert: { type: 'string', description: 'Original OOXML w:id from a w:ins mark.' },
+      delete: { type: 'string', description: 'Original OOXML w:id from a w:del mark.' },
+      format: { type: 'string', description: 'Original OOXML w:id from a w:rPrChange mark.' },
+    },
+    [],
+  ),
+  excerpt: {
+    type: 'string',
+    description:
+      'Short text excerpt of the changed content. Omitted for paired replacements; read block.textSpans for the per-half text.',
   },
-  type: { type: 'string', enum: ['insert', 'delete', 'format'] },
-  excerpt: { type: 'string', description: 'Short text excerpt of the changed content.' },
   author: { type: 'string', description: 'Change author name.' },
   date: { type: 'string', description: 'Change date (ISO string).' },
 },
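
Reviewer sketch, not part of the commit: `blockIds` is meant to let a consumer go from one `trackedChanges[]` entry straight to the blocks that carry it. The snippet below shows one way to recover the deleted and inserted halves of a paired replacement that way. It assumes `blockIds` values match each block's `nodeId`, which the diff does not state explicitly; `halvesFor` and the sample data are hypothetical.

```typescript
// Local mirrors of the schema fields (not imported from the package).
type TrackChangeType = 'insert' | 'delete' | 'format';
interface SpanChange { entityId: string; type: TrackChangeType }
interface TextSpan { text: string; trackedChanges?: SpanChange[] }
interface Block { nodeId: string; text: string; textSpans?: TextSpan[] }
interface TrackedChange {
  entityId: string;
  type: TrackChangeType;
  blockIds?: string[];
  excerpt?: string;
}

// Collect per-half text for one change by visiting only its listed blocks.
function halvesFor(change: TrackedChange, blocks: Block[]): { deleted: string; inserted: string } {
  // Assumption: blockIds refer to block.nodeId.
  const byId = new Map<string, Block>();
  for (const block of blocks) byId.set(block.nodeId, block);

  let deleted = '';
  let inserted = '';
  for (const id of change.blockIds ?? []) {
    for (const span of byId.get(id)?.textSpans ?? []) {
      for (const mark of span.trackedChanges ?? []) {
        if (mark.entityId !== change.entityId) continue;
        if (mark.type === 'delete') deleted += span.text;
        if (mark.type === 'insert') inserted += span.text;
      }
    }
  }
  return { deleted, inserted };
}

// Made-up data modeled on the "get started" -> "set sail" replacement.
const bodyBlocks: Block[] = [
  {
    nodeId: 'node-body',
    text: "Let's get startedset sail today.",
    textSpans: [
      { text: "Let's " },
      { text: 'get started', trackedChanges: [{ entityId: 'tc-9', type: 'delete' }] },
      { text: 'set sail', trackedChanges: [{ entityId: 'tc-9', type: 'insert' }] },
      { text: ' today.' },
    ],
  },
];
const pairedChange: TrackedChange = { entityId: 'tc-9', type: 'insert', blockIds: ['node-body'] };

console.log(halvesFor(pairedChange, bodyBlocks));
```

The aggregate `type` on `pairedChange` is `'insert'` even though half of the entity is a deletion, so the halves are read from the span-level marks.
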

packages/document-api/src/types/extract.types.ts

Lines changed: 75 additions & 4 deletions
@@ -1,4 +1,4 @@
-import type { CommentStatus, TrackChangeType } from './index.js';
+import type { CommentStatus, TrackChangeType, TrackChangeWordRevisionIds } from './index.js';
 
 // ---------------------------------------------------------------------------
 // extract
@@ -34,6 +34,40 @@ export interface ExtractTableContext {
   colspan: number;
 }
 
+/**
+ * Reference to a tracked change applied to one text span.
+ *
+ * The `entityId` matches an entry in `ExtractResult.trackedChanges`, so
+ * consumers can look up author/date or pass it to `scrollToElement()`.
+ */
+export interface ExtractTextSpanTrackedChange {
+  /** Tracked change entity ID. */
+  entityId: string;
+  /** The mark type carried on this run: insert, delete, or format. */
+  type: TrackChangeType;
+}
+
+/**
+ * A contiguous run of text within a block, optionally tagged with the
+ * tracked-change marks that apply to it.
+ *
+ * Spans tile the block's text exactly:
+ * `block.textSpans.map(s => s.text).join('') === block.text`.
+ *
+ * Adjacent runs are coalesced when their `trackedChanges` sets are identical
+ * (same `(entityId, type)` pairs, ignoring order). Plain text with no tracked
+ * marks is one or more spans with `trackedChanges` omitted.
+ *
+ * A single span can carry multiple entries when overlapping marks apply, for
+ * example a run that is both inserted and bold-tracked.
+ */
+export interface ExtractTextSpan {
+  /** Raw text of the run. Tiles `block.text` when concatenated in order. */
+  text: string;
+  /** Tracked-change marks applied to this run. Omitted when none apply. */
+  trackedChanges?: ExtractTextSpanTrackedChange[];
+}
+
 /**
  * One addressable unit of document content.
  *
@@ -53,6 +87,12 @@ export interface ExtractBlock {
   type: string;
   /** Full plain text content of the block. */
   text: string;
+  /**
+   * Structured reconstruction of the block's text with tracked-change marks
+   * preserved per run. Present only when the block contains at least one
+   * tracked change. When concatenated, span text equals `text`.
+   */
+  textSpans?: ExtractTextSpan[];
   /** Heading level (1-6). Only present for headings. */
   headingLevel?: number;
   /** Table coordinates. Only present for blocks inside a table cell. */
@@ -75,11 +115,42 @@ export interface ExtractComment {
 }
 
 export interface ExtractTrackedChange {
-  /** Tracked change entity ID — pass to `scrollToElement()` for navigation. */
+  /** Tracked change entity ID. Pass to `scrollToElement()` for navigation. */
   entityId: string;
-  /** Change type. */
+  /**
+   * Change type at the entity level.
+   *
+   * In paired replacement mode (the default — set
+   * `modules.trackChanges.replacements: 'independent'` for one entity per
+   * `<w:ins>` / `<w:del>` instead), a delete + insert pair shares one entity
+   * and the aggregate `type` collapses to `'insert'`. Per-half information
+   * lives on `block.textSpans[].trackedChanges[].type`, which is the source
+   * of truth for what each run actually represents.
+   *
+   * In independent mode every revision is its own entity and `type` is the
+   * entity's only type.
+   */
   type: TrackChangeType;
-  /** Short text excerpt of the changed content. */
+  /**
+   * Block IDs whose `textSpans` carry this change, in document order. Lets
+   * consumers iterate a single tracked change without scanning every block.
+   * Omitted when the resolver could not match the change to any block (e.g.
+   * orphan marks).
+   */
+  blockIds?: string[];
+  /**
+   * Original OOXML `w:id` values (per ECMA-376 §17.13.5) for the marks that
+   * make up this entity. In paired mode a replacement populates both
+   * `insert` and `delete`. In independent mode only one key is set. Useful
+   * for spec-aware consumers that need to map back to the source document.
+   */
+  wordRevisionIds?: TrackChangeWordRevisionIds;
+  /**
+   * Short text excerpt of the changed content. Omitted for paired
+   * replacements: the underlying text spans both halves and any single
+   * string would either concatenate them (misleading) or pick a side
+   * arbitrarily. Read `block.textSpans` for the per-half text instead.
+   */
   excerpt?: string;
   /** Change author name. */
   author?: string;
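
Reviewer sketch, not part of the commit: the `ExtractTextSpan` JSDoc says adjacent runs are coalesced when their `trackedChanges` sets hold the same `(entityId, type)` pairs, ignoring order. The snippet below illustrates that rule under stated assumptions. `markKey` and `coalesce` are hypothetical names, the key format assumes entity IDs contain neither `:` nor `|`, and the commit's actual implementation is not shown in this diff.

```typescript
// Local mirrors of the span types (not imported from the package).
type TrackChangeType = 'insert' | 'delete' | 'format';
interface SpanChange { entityId: string; type: TrackChangeType }
interface TextSpan { text: string; trackedChanges?: SpanChange[] }

// Order-insensitive key for a mark set: sort the "entityId:type" pairs.
function markKey(changes: SpanChange[] = []): string {
  return changes.map((c) => `${c.entityId}:${c.type}`).sort().join('|');
}

// Merge neighbouring runs whose mark sets are identical.
function coalesce(spans: TextSpan[]): TextSpan[] {
  const out: TextSpan[] = [];
  for (const span of spans) {
    const last = out[out.length - 1];
    if (last && markKey(last.trackedChanges) === markKey(span.trackedChanges)) {
      last.text += span.text; // same marks: extend the previous run
    } else {
      out.push({ ...span }); // copy so the input spans are not mutated
    }
  }
  return out;
}

// Made-up runs: the first two carry the same marks in a different order.
const rawRuns: TextSpan[] = [
  { text: 'set ', trackedChanges: [{ entityId: 'tc-1', type: 'insert' }, { entityId: 'tc-2', type: 'format' }] },
  { text: 'sail', trackedChanges: [{ entityId: 'tc-2', type: 'format' }, { entityId: 'tc-1', type: 'insert' }] },
  { text: ' today' },
];

const merged = coalesce(rawRuns);
console.log(merged.map((s) => s.text));
```

Coalescing only joins neighbours, so the tiling property is preserved: the merged span texts still concatenate to the same string as the raw runs.
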
