Commit d0a36c2
fix(extract): return tables as paragraph-granular blocks (SD-2672) (#2925)
* fix(extract): return tables as paragraph-granular blocks (SD-2672)
doc.extract() was flattening tables into one joined string, which broke
RAG chunking and made table citations unreachable via scrollToElement.
Walk tables directly and emit one block per paragraph-like descendant
of each origin cell, tagged with tableContext so consumers can group
back to cell, row, or whole table.
- gridBefore/gridAfter placeholder cells are skipped via the
__placeholder attr; they are layout artifacts with no user content.
- Block SDTs (structuredContentBlock) are transparent, so tables
wrapped in content controls are not re-flattened through the
wrapper's textContent.
- Cell paths use physical row-and-cell child indexes so deterministic
fallback nodeIds agree with buildBlockIndex, keeping the
scrollToElement round-trip stable for paragraphs that lack paraId
and sdBlockId inside horizontally merged tables.
Tested: 13 behavior tests (7 existing SD-2525 + 6 new SD-2672),
5 new adapter unit tests, plus the full document-api-adapters suite
(3105 tests) and document-api bun suite (1362 tests).
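The table walk described above can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: the node shape, the `extractTableBlocks` name, and the `tableContext` field names (`tableOrdinal`, `rowIndex`, `columnIndex`, `rowspan`, `colspan`) are assumptions made for the example. Only the `__placeholder` attr and the one-block-per-paragraph behavior come from the commit message. The sketch also assumes that a placeholder cell still occupies its grid column, and it does not handle nested tables or wrappers.

```typescript
// Hypothetical, simplified shapes: the real ProseMirror nodes and the exact
// tableContext fields are not spelled out in this commit message.
interface SketchNode {
  type: string;
  text?: string;
  attrs?: { colspan?: number; rowspan?: number; __placeholder?: boolean };
  children?: SketchNode[];
}

interface SketchTableContext {
  tableOrdinal: number;
  rowIndex: number; // logical grid row
  columnIndex: number; // logical grid column
  rowspan: number;
  colspan: number;
}

interface SketchBlock {
  text: string;
  tableContext: SketchTableContext;
}

function extractTableBlocks(table: SketchNode, tableOrdinal: number): SketchBlock[] {
  const blocks: SketchBlock[] = [];
  // Grid slots in later rows that are covered by a rowspan from an earlier row.
  const covered = new Set<string>();

  (table.children ?? []).forEach((row, rowIndex) => {
    let columnIndex = 0;
    for (const cell of row.children ?? []) {
      // Skip slots already taken by a vertically merged cell above.
      while (covered.has(`${rowIndex}:${columnIndex}`)) columnIndex++;
      const colspan = cell.attrs?.colspan ?? 1;
      const rowspan = cell.attrs?.rowspan ?? 1;
      for (let r = 1; r < rowspan; r++) {
        for (let c = 0; c < colspan; c++) covered.add(`${rowIndex + r}:${columnIndex + c}`);
      }
      // Placeholder cells are layout artifacts: they emit nothing.
      if (!cell.attrs?.__placeholder) {
        const tableContext = { tableOrdinal, rowIndex, columnIndex, rowspan, colspan };
        for (const child of cell.children ?? []) {
          // One block per paragraph; all paragraphs of a cell share its context.
          if (child.type === "paragraph") blocks.push({ text: child.text ?? "", tableContext });
        }
      }
      // Assumption: placeholders still advance the logical column.
      columnIndex += colspan;
    }
  });
  return blocks;
}
```

Emitting per paragraph rather than per cell keeps each block small enough to chunk and cite individually, while the shared context lets a consumer rebuild the cell when needed.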
* fix(extract): recurse through unrecognized block wrappers (SD-2672)
The new table walker only emitted blocks for recognized types and
silently dropped anything else, including their block children. That
regressed coverage versus the old textContent walk for `documentSection`,
`documentPartObject`, and `shapeContainer`, which all declare
block-level content but aren't in EMITTABLE_BLOCK_TYPES. Treat any
unrecognized block with block-level children as transparent and recurse
into it, so paragraphs nested inside these wrappers still surface with
their enclosing tableContext. Adds a unit test covering a
`documentSection` inside a table cell.
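The transparent-wrapper rule can be illustrated with a small sketch. The names here are hypothetical: `SKETCH_EMITTABLE_TYPES` stands in for `EMITTABLE_BLOCK_TYPES` (whose real contents are not listed in this message), and the `isBlock` flag and `collectBlocks` helper are assumptions for the example.

```typescript
// Hypothetical node shape; `isBlock` mimics a block-level schema flag.
interface WrapNode {
  type: string;
  text?: string;
  isBlock?: boolean;
  children?: WrapNode[];
}

interface WrapBlock<C> {
  text: string;
  tableContext: C;
}

// Stand-in for EMITTABLE_BLOCK_TYPES; the real set is not shown in the commit.
const SKETCH_EMITTABLE_TYPES = new Set(["paragraph", "heading"]);

function collectBlocks<C>(parent: WrapNode, tableContext: C, out: WrapBlock<C>[]): WrapBlock<C>[] {
  for (const child of parent.children ?? []) {
    if (SKETCH_EMITTABLE_TYPES.has(child.type)) {
      out.push({ text: child.text ?? "", tableContext });
      continue;
    }
    // Unrecognized block with block-level children: treat it as transparent
    // and recurse, so nested paragraphs keep the enclosing tableContext.
    const hasBlockChildren = (child.children ?? []).some((c) => c.isBlock === true);
    if (hasBlockChildren) collectBlocks(child, tableContext, out);
    // Anything else (for example a leaf node) emits nothing in this sketch.
  }
  return out;
}
```

With this rule a paragraph inside a `documentSection` inside a cell yields a block carrying the cell's context instead of disappearing.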
* test(extract): add DOCX-import-driven coverage for table edge cases (SD-2672)
The adapter unit tests hit the algorithm via schema-constructed PM docs,
which skips the importer entirely. This adds a second layer of tests
that load real Word-authored .docx files, run them through the full
import pipeline, and assert extract output. Closes the gap the code
review flagged for a customer-facing legal RAG contract.
Fixtures authored via Word COM + local OOXML patching:
- sd-2672-plain-3x3.docx: baseline table, no merges or placeholders
- sd-2672-merged-table.docx: colspan=2 and rowspan=2 anchors
- sd-2672-rtl-table.docx: bidiVisual RTL table
- sd-2672-gridbefore-vmerge.docx: w:gridBefore + w:vMerge=restart/continue
- sd-2672-sdt-table.docx: table wrapped in a w:sdt block (content control)
- sd-2672-nested-table.docx: 2x2 table inside cell (1,1) of outer table
- sd-2672-multipara-cell.docx: cell (0,0) with two paragraphs
The build-sd-2672-fixtures.mjs script regenerates the patched variants
from the Word-authored base, using JSZip + regex/XmlDocument surgery.
Tests assert: per-cell content lands at correct logical grid coords,
merged anchors carry rowspan/colspan, RTL tables still report columns
0..N-1, gridBefore placeholders don't emit phantom blocks, SDT wrappers
are transparent, nested tables get a fresh tableOrdinal with parent
coordinates, multi-paragraph cells emit one block per paragraph with
shared tableContext, and scrollToElement round-trips a merged-cell
paragraph nodeId.
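On the consumer side, regrouping paragraph-granular blocks into cells might look like the sketch below. It is illustrative only: the block shape, the `tableContext` field names, and the `groupByCell` helper are assumptions, not the published extract contract.

```typescript
// Hypothetical block shape; field names are assumed for illustration.
interface CtxBlock {
  nodeId: string;
  text: string;
  tableContext?: { tableOrdinal: number; rowIndex: number; columnIndex: number };
}

// Group table blocks by cell; blocks outside any table are ignored here.
function groupByCell(blocks: CtxBlock[]): Map<string, CtxBlock[]> {
  const cells = new Map<string, CtxBlock[]>();
  for (const block of blocks) {
    if (!block.tableContext) continue;
    const { tableOrdinal, rowIndex, columnIndex } = block.tableContext;
    const key = `${tableOrdinal}:${rowIndex}:${columnIndex}`;
    const bucket = cells.get(key);
    if (bucket) bucket.push(block);
    else cells.set(key, [block]);
  }
  return cells;
}

// Join a cell's paragraphs into one chunk while keeping per-paragraph ids
// available for citation targets.
function cellText(cell: CtxBlock[]): string {
  return cell.map((b) => b.text).join("\n");
}
```

A RAG pipeline could chunk on the grouped cell text and still cite an individual paragraph by its `nodeId`. Grouping by row or by whole table would use a shorter key built from the same fields.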
* chore(tests): drop SD-2672 fixture build script
The script was added alongside the fixtures to regenerate the OOXML-patched
variants from a Word-authored base. It isn't carrying its weight: fixtures
are committed as static binaries, the regex-based XML patching is fragile
to Word COM output changes, and the commit history already documents how
each fixture was constructed. If we need a new edge-case fixture later,
hand-authoring it once is simpler than maintaining a generator.
* chore(tests): drop stale script reference in extract-docx error
1 parent 9c6ccb0 commit d0a36c2
15 files changed
Lines changed: 1125 additions & 26 deletions
File tree
- apps/docs/document-api/reference
- packages
- document-api/src
- contract
- types
- super-editor/src/editors/v1/document-api-adapters