
Commit d0a36c2

fix(extract): return tables as paragraph-granular blocks (SD-2672) (#2925)
* fix(extract): return tables as paragraph-granular blocks (SD-2672)

doc.extract() was flattening tables into one joined string, which broke RAG chunking and made table citations unreachable via scrollToElement. Walk tables directly and emit one block per paragraph-like descendant of each origin cell, tagged with tableContext so consumers can group back to cell, row, or whole table.

- gridBefore/gridAfter placeholder cells are skipped via the __placeholder attr; they are layout artifacts with no user content.
- Block SDTs (structuredContentBlock) are transparent, so tables wrapped in content controls are not re-flattened through the wrapper's textContent.
- Cell paths use physical row-and-cell child indexes so deterministic fallback nodeIds agree with buildBlockIndex, keeping the scrollToElement round-trip stable for paragraphs that lack paraId and sdBlockId inside horizontally merged tables.

Tested: 13 behavior tests (7 existing SD-2525 + 6 new SD-2672), 5 new adapter unit tests, plus the full document-api-adapters suite (3105 tests) and document-api bun suite (1362 tests).

* fix(extract): recurse through unrecognized block wrappers (SD-2672)

The new table walker only emitted blocks for recognized types and silently dropped anything else, including their block children. That regressed coverage versus the old textContent walk for `documentSection`, `documentPartObject`, and `shapeContainer`, which all declare block-level content but aren't in EMITTABLE_BLOCK_TYPES.

Treat any unrecognized block with block-level children as transparent and recurse into it, so paragraphs nested inside these wrappers still surface with their enclosing tableContext. Adds a unit test covering a `documentSection` inside a table cell.

* test(extract): add DOCX-import-driven coverage for table edge cases (SD-2672)

The adapter unit tests hit the algorithm via schema-constructed PM docs, which skips the importer entirely. This adds a second layer of tests that load real Word-authored .docx files, run them through the full import pipeline, and assert extract output. Closes the gap the code review flagged for a customer-facing legal RAG contract.

Fixtures authored via Word COM + local OOXML patching:

- sd-2672-plain-3x3.docx: baseline table, no merges or placeholders
- sd-2672-merged-table.docx: colspan=2 and rowspan=2 anchors
- sd-2672-rtl-table.docx: bidiVisual RTL table
- sd-2672-gridbefore-vmerge.docx: w:gridBefore + w:vMerge=restart/continue
- sd-2672-sdt-table.docx: table wrapped in a w:sdt block (content control)
- sd-2672-nested-table.docx: 2x2 table inside cell (1,1) of outer table
- sd-2672-multipara-cell.docx: cell (0,0) with two paragraphs

The build-sd-2672-fixtures.mjs script regenerates the patched variants from the Word-authored base, using JSZip + regex/XmlDocument surgery.

Tests assert: per-cell content lands at correct logical grid coords, merged anchors carry rowspan/colspan, RTL tables still report columns 0..N-1, gridBefore placeholders don't emit phantom blocks, SDT wrappers are transparent, nested tables get a fresh tableOrdinal with parent coordinates, multi-paragraph cells emit one block per paragraph with shared tableContext, and scrollToElement round-trips a merged-cell paragraph nodeId.

* chore(tests): drop SD-2672 fixture build script

The script was added alongside the fixtures to regenerate the OOXML-patched variants from a Word-authored base. It isn't carrying its weight: fixtures are committed as static binaries, the regex-based XML patching is fragile to Word COM output changes, and the commit history already documents how each fixture was constructed. If we need a new edge-case fixture later, hand-authoring it once is simpler than maintaining a generator.

* chore(tests): drop stale script reference in extract-docx error
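The walk described in the first two bullets can be sketched roughly as below. This is an illustration under stated assumptions, not the adapter's code: `SketchNode`, `SketchBlock`, `EMITTABLE`, `walkBlocks`, and `extractTable` are hypothetical names, the node shape is a simplified stand-in for ProseMirror nodes, and the contents of the emittable set are assumed. The sketch does not give nested tables their own ordinal and does not compute logical grid columns; it only shows placeholder skipping and transparent wrappers.

```typescript
// Hypothetical, simplified node shape; real ProseMirror nodes differ.
interface SketchNode {
  type: string;
  text?: string;
  attrs?: { __placeholder?: boolean };
  children?: SketchNode[];
}

interface SketchBlock {
  type: string;
  text: string;
  rowIndex: number;
  cellIndex: number; // physical child index within the row, not the logical grid column
}

// Assumed set of block types that are emitted directly.
const EMITTABLE = new Set(["paragraph", "heading", "listItem", "image", "tableOfContents"]);

// Concatenate the text of a node and all of its descendants.
function textOf(node: SketchNode): string {
  if (node.text !== undefined) return node.text;
  return (node.children ?? []).map(textOf).join("");
}

// Emit recognized blocks. Any other node that has children (for example a
// block SDT or a documentSection) is treated as transparent and recursed into.
function walkBlocks(nodes: SketchNode[], rowIndex: number, cellIndex: number, out: SketchBlock[]): void {
  for (const node of nodes) {
    if (EMITTABLE.has(node.type)) {
      out.push({ type: node.type, text: textOf(node), rowIndex, cellIndex });
    } else if (node.children && node.children.length > 0) {
      walkBlocks(node.children, rowIndex, cellIndex, out);
    }
  }
}

// Walk rows and cells directly, skipping placeholder cells.
function extractTable(table: SketchNode): SketchBlock[] {
  const out: SketchBlock[] = [];
  (table.children ?? []).forEach((row, rowIndex) => {
    (row.children ?? []).forEach((cell, cellIndex) => {
      if (cell.attrs?.__placeholder) return; // layout artifact, no user content
      walkBlocks(cell.children ?? [], rowIndex, cellIndex, out);
    });
  });
  return out;
}
```

Because wrappers are recursed into rather than read through their joined text, a paragraph inside a wrapper inside a cell still comes out as its own block carrying the coordinates of the enclosing cell.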
1 parent 9c6ccb0 commit d0a36c2

15 files changed

Lines changed: 1125 additions & 26 deletions


apps/docs/document-api/reference/_generated-manifest.json

Lines changed: 1 addition & 1 deletion
@@ -1018,5 +1018,5 @@
     }
   ],
   "marker": "{/* GENERATED FILE: DO NOT EDIT. Regenerate via `pnpm run docapi:sync`. */}",
-  "sourceHash": "e74a36833ec8587b67447a79517de348cfc9b4bba1c564729c184f6d5464a018"
+  "sourceHash": "c8670fb494b56c19fbd09a7bada35974fbb3c22d938f6a5e01eee6e8467961c0"
 }

apps/docs/document-api/reference/extract.mdx

Lines changed: 55 additions & 1 deletion
@@ -49,6 +49,15 @@ _No fields._
     {
       "headingLevel": 1,
       "nodeId": "node-def456",
+      "tableContext": {
+        "colspan": 1,
+        "columnIndex": 1,
+        "parentRowIndex": 1,
+        "parentTableOrdinal": 1,
+        "rowIndex": 1,
+        "rowspan": 1,
+        "tableOrdinal": 1
+      },
       "text": "Hello, world.",
       "type": "example"
     }
@@ -110,12 +119,57 @@ _No fields._
           "description": "Stable block ID — pass to scrollToElement() for navigation.",
           "type": "string"
         },
+        "tableContext": {
+          "additionalProperties": false,
+          "properties": {
+            "colspan": {
+              "description": "Number of columns the cell spans.",
+              "type": "integer"
+            },
+            "columnIndex": {
+              "description": "0-based logical grid column, not the row child order.",
+              "type": "integer"
+            },
+            "parentColumnIndex": {
+              "description": "Column index in the parent table. Set with parentTableOrdinal.",
+              "type": "integer"
+            },
+            "parentRowIndex": {
+              "description": "Row index in the parent table. Set with parentTableOrdinal.",
+              "type": "integer"
+            },
+            "parentTableOrdinal": {
+              "description": "Ordinal of the parent table when the containing table is nested.",
+              "type": "integer"
+            },
+            "rowIndex": {
+              "description": "0-based row index of the containing cell.",
+              "type": "integer"
+            },
+            "rowspan": {
+              "description": "Number of rows the cell spans.",
+              "type": "integer"
+            },
+            "tableOrdinal": {
+              "description": "0-based table ordinal, unique within one extract() result.",
+              "type": "integer"
+            }
+          },
+          "required": [
+            "tableOrdinal",
+            "rowIndex",
+            "columnIndex",
+            "rowspan",
+            "colspan"
+          ],
+          "type": "object"
+        },
         "text": {
           "description": "Full plain text content of the block.",
           "type": "string"
         },
         "type": {
-          "description": "Block type: paragraph, heading, listItem, table, image, etc.",
+          "description": "Block type: paragraph, heading, listItem, image, tableOfContents.",
           "type": "string"
         }
       },

packages/document-api/src/contract/schemas.ts

Lines changed: 32 additions & 1 deletion
@@ -2963,9 +2963,40 @@ const operationSchemas: Record<OperationId, OperationSchemaSet> = {
           items: objectSchema(
             {
               nodeId: { type: 'string', description: 'Stable block ID — pass to scrollToElement() for navigation.' },
-              type: { type: 'string', description: 'Block type: paragraph, heading, listItem, table, image, etc.' },
+              type: {
+                type: 'string',
+                description: 'Block type: paragraph, heading, listItem, image, tableOfContents.',
+              },
               text: { type: 'string', description: 'Full plain text content of the block.' },
               headingLevel: { type: 'integer', description: 'Heading level (1–6). Only present for headings.' },
+              tableContext: objectSchema(
+                {
+                  tableOrdinal: {
+                    type: 'integer',
+                    description: '0-based table ordinal, unique within one extract() result.',
+                  },
+                  parentTableOrdinal: {
+                    type: 'integer',
+                    description: 'Ordinal of the parent table when the containing table is nested.',
+                  },
+                  parentRowIndex: {
+                    type: 'integer',
+                    description: 'Row index in the parent table. Set with parentTableOrdinal.',
+                  },
+                  parentColumnIndex: {
+                    type: 'integer',
+                    description: 'Column index in the parent table. Set with parentTableOrdinal.',
+                  },
+                  rowIndex: { type: 'integer', description: '0-based row index of the containing cell.' },
+                  columnIndex: {
+                    type: 'integer',
+                    description: '0-based logical grid column, not the row child order.',
+                  },
+                  rowspan: { type: 'integer', description: 'Number of rows the cell spans.' },
+                  colspan: { type: 'integer', description: 'Number of columns the cell spans.' },
+                },
+                ['tableOrdinal', 'rowIndex', 'columnIndex', 'rowspan', 'colspan'],
+              ),
             },
             ['nodeId', 'type', 'text'],
           ),
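A consumer that receives extract output over the wire may want to check a `tableContext` value against the shape this schema describes: five required integer fields, three optional integer fields, and no other properties (the generated reference shows `additionalProperties: false`). The guard below is a minimal sketch under that reading; `isTableContext`, `REQUIRED_FIELDS`, and `OPTIONAL_FIELDS` are hypothetical names and are not part of the document API.

```typescript
// Field names taken from the schema above; the helper itself is hypothetical.
const REQUIRED_FIELDS = ["tableOrdinal", "rowIndex", "columnIndex", "rowspan", "colspan"];
const OPTIONAL_FIELDS = ["parentTableOrdinal", "parentRowIndex", "parentColumnIndex"];

function isTableContext(value: unknown): boolean {
  if (typeof value !== "object" || value === null || Array.isArray(value)) return false;
  const record = value as Record<string, unknown>;
  const allowed = new Set<string>(REQUIRED_FIELDS.concat(OPTIONAL_FIELDS));
  for (const key of Object.keys(record)) {
    if (!allowed.has(key)) return false; // mirrors additionalProperties: false
    if (!Number.isInteger(record[key])) return false; // every field is typed 'integer'
  }
  // All required fields must be present.
  return REQUIRED_FIELDS.every((key) => key in record);
}
```

The guard checks only the structural rules visible in the schema. It does not check that the three parent fields appear together, which the field descriptions suggest but the schema does not enforce.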

packages/document-api/src/types/extract.types.ts

Lines changed: 47 additions & 3 deletions
@@ -4,15 +4,59 @@ import type { CommentStatus, TrackChangeType } from './index.js';
 // extract
 // ---------------------------------------------------------------------------
 
+/**
+ * Table coordinates for an {@link ExtractBlock} that lives inside a table cell.
+ *
+ * Blocks inside tables are extracted at paragraph granularity (one entry per
+ * paragraph/heading/listItem/image/sdt/tableOfContents in each cell). Group
+ * by these fields to reconstruct cells, rows, or whole tables:
+ *
+ * - cell: group by `tableOrdinal + rowIndex + columnIndex`
+ * - row: group by `tableOrdinal + rowIndex`
+ * - table: group by `tableOrdinal`
+ */
+export interface ExtractTableContext {
+  /** 0-based table ordinal, unique within one `extract()` result. */
+  tableOrdinal: number;
+  /** Ordinal of the parent table when this block is inside a nested table. */
+  parentTableOrdinal?: number;
+  /** Row index within the parent table. Only set with `parentTableOrdinal`. */
+  parentRowIndex?: number;
+  /** Column index within the parent table. Only set with `parentTableOrdinal`. */
+  parentColumnIndex?: number;
+  /** 0-based row index of the containing cell. */
+  rowIndex: number;
+  /** 0-based logical grid column of the containing cell, not the row's child order. */
+  columnIndex: number;
+  /** Number of rows the containing cell spans. 1 for unmerged cells. */
+  rowspan: number;
+  /** Number of columns the containing cell spans. 1 for unmerged cells. */
+  colspan: number;
+}
+
+/**
+ * One addressable unit of document content.
+ *
+ * Extraction is paragraph-granular: tables are NOT returned as a single block.
+ * Paragraph-like descendants of table cells are emitted individually with
+ * `tableContext` attached.
+ *
+ * Block SDTs (structured document tags / content controls) are transparent:
+ * their children emit individually as if they were direct children of the
+ * enclosing container. No wrapper `sdt` block is emitted. This prevents
+ * SDT-wrapped tables from re-flattening through the wrapper's textContent.
+ */
 export interface ExtractBlock {
-  /** Stable block ID — pass to `scrollToElement()` for navigation. */
+  /** Stable block ID. Pass to `scrollToElement()` for navigation. */
   nodeId: string;
-  /** Block type: paragraph, heading, listItem, table, image, etc. */
+  /** Block type: paragraph, heading, listItem, image, tableOfContents. */
   type: string;
   /** Full plain text content of the block. */
   text: string;
-  /** Heading level (1–6). Only present for headings. */
+  /** Heading level (1-6). Only present for headings. */
   headingLevel?: number;
+  /** Table coordinates. Only present for blocks inside a table cell. */
+  tableContext?: ExtractTableContext;
 }
 
 export interface ExtractComment {
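The grouping keys named in the `ExtractTableContext` doc comment can be applied on the consumer side roughly as follows. This is a sketch, not part of the commit: `TableContextLike` and `BlockLike` are local copies of the fields shown in the diff so the example is self-contained, and `groupByCell` and `cellText` are hypothetical helpers.

```typescript
// Local copies of the shapes from the diff above, trimmed to what the sketch needs.
interface TableContextLike {
  tableOrdinal: number;
  rowIndex: number;
  columnIndex: number;
  rowspan: number;
  colspan: number;
}

interface BlockLike {
  nodeId: string;
  type: string;
  text: string;
  tableContext?: TableContextLike;
}

// Group table blocks by cell using tableOrdinal + rowIndex + columnIndex.
// Blocks without tableContext (content outside any table) are skipped.
function groupByCell(blocks: BlockLike[]): Map<string, BlockLike[]> {
  const cells = new Map<string, BlockLike[]>();
  for (const block of blocks) {
    const ctx = block.tableContext;
    if (!ctx) continue;
    const key = `${ctx.tableOrdinal}:${ctx.rowIndex}:${ctx.columnIndex}`;
    const group = cells.get(key);
    if (group) {
      group.push(block);
    } else {
      cells.set(key, [block]);
    }
  }
  return cells;
}

// Rebuild one cell's text by joining its blocks in extraction order.
function cellText(cell: BlockLike[]): string {
  return cell.map((block) => block.text).join("\n");
}
```

Dropping `columnIndex` from the key groups by row, and keeping only `tableOrdinal` groups by table. Each block keeps its own `nodeId`, so a chunk built from a whole cell can still point back to the individual paragraph for `scrollToElement()`.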
