feat(document-api): implement doc.extract() for RAG content extraction (SD-2525) (#2774)
* feat(document-api): implement doc.extract() for RAG content extraction (SD-2525)
Single API method that returns all document content with stable IDs —
blocks with full text, comments with anchored block references, and
tracked changes with excerpts. Every ID works directly with
scrollToElement() for citation navigation.
* fix(document-api): review fixes — heading regex, schema required, tests
- Use canonical getHeadingLevel() instead of divergent local regex
- Reuse collectTopLevelBlocks() instead of duplicating block traversal
- Add required fields to extract output JSON schema
- Remove fixture-only unit tests that don't call executeExtract
- Add behavior tests: headings, comments, tracked changes, scrollToElement round-trip
* fix(tests): remove superdoc.click() — fixture uses type() for focus
* fix(cli): add extract operation hints for CLI/SDK wiring
No ID is guaranteed to survive all Microsoft Word round-trips. Re-extract addresses after major external edits or transformations, since Word (or other tools) may rewrite paragraph IDs and SuperDoc may rewrite duplicate IDs on import.
</Warning>
+
## Content extraction for RAG
+
+
`doc.extract()` returns all document content in one call — blocks with full text, comments, and tracked changes. Each item has a stable ID that works directly with [`scrollToElement`](/core/superdoc/methods#scrolltoelement).
All IDs from `doc.extract()` work directly with `scrollToElement()` — no conversion needed. For DOCX-imported content, block `nodeId` values are stable across sessions.
+
</Info>
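
A minimal sketch of how the extracted content could be turned into RAG chunks that keep their citation IDs. The interfaces below are assumptions inferred from the field names listed in this commit (`nodeId`, `text`, `headingLevel`, `entityId`, `blockId`); they are not the library's published types, and `toRagChunks` is a hypothetical helper.

```typescript
// Hypothetical shapes, inferred from the field lists in this commit.
interface ExtractedBlock {
  nodeId: string;
  type: string;
  text: string;
  headingLevel?: number | null;
}

interface ExtractedComment {
  entityId: string;
  text: string;
  blockId?: string | null;
}

interface RagChunk {
  id: string; // block nodeId, usable later as a scroll target
  heading: string; // text of the most recent heading block, "" if none seen yet
  text: string;
  commentIds: string[]; // entityIds of comments anchored to this block
}

// Hypothetical helper: one chunk per non-empty block, in document order.
function toRagChunks(
  blocks: ExtractedBlock[],
  comments: ExtractedComment[],
): RagChunk[] {
  // Group comment IDs by the block they are anchored to.
  const commentsByBlock = new Map<string, string[]>();
  for (const comment of comments) {
    if (!comment.blockId) continue;
    const ids = commentsByBlock.get(comment.blockId) ?? [];
    ids.push(comment.entityId);
    commentsByBlock.set(comment.blockId, ids);
  }

  const chunks: RagChunk[] = [];
  let heading = "";
  for (const block of blocks) {
    // A heading block updates the running section heading.
    if (block.headingLevel != null) heading = block.text;
    // Skip blocks with no visible text; they add nothing to retrieval.
    if (block.text.trim() === "") continue;
    chunks.push({
      id: block.nodeId,
      heading,
      text: block.text,
      commentIds: commentsByBlock.get(block.nodeId) ?? [],
    });
  }
  return chunks;
}
```

Each chunk's `id` is the block's `nodeId`, so a retrieval hit can be handed to `scrollToElement()` without a lookup table.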
+
## Read document counts
`doc.info()` returns a snapshot of current document statistics including word, character, paragraph, heading, table, image, comment, tracked-change, SDT-field, and list counts.
apps/docs/guides/general/stable-navigation.mdx
10 additions & 11 deletions
@@ -13,18 +13,17 @@ SuperDoc has two navigation approaches depending on your use case:
## Navigate by element ID
-
`scrollToElement` takes any element ID — paragraph, comment, or tracked change — and scrolls to it. The ID comes from the Document API.
+
`scrollToElement` takes any element ID — paragraph, comment, or tracked change — and scrolls to it. Use `doc.extract()` to get all IDs at once, or `query.match` for targeted lookups.
@@ -33,7 +32,7 @@ This is the approach to use for:
- **Search results** — scroll to the matching paragraph
- **Cross-session addressing** — IDs from DOCX-imported content survive reloads
-
For the full cross-session pattern, see [cross-session block addressing](/document-api/common-workflows#cross-session-block-addressing).
+
For the full extraction pattern, see [content extraction for RAG](/document-api/common-workflows#content-extraction-for-rag). For the cross-session pattern, see [cross-session block addressing](/document-api/common-workflows#cross-session-block-addressing).
## Track nodes during edits
@@ -62,7 +61,7 @@ function goToLink(link) {
## Best practices
-
- Use `scrollToElement` when you have an element ID from the Document API.
+
- Use `scrollToElement` when you have an element ID from `doc.extract()` or the Document API.
- Use `PositionTracker` when you need to follow nodes that move during edits.
- For cross-session use, store `nodeId` values (not `sdBlockId` — those regenerate on each open).
- Handle missing targets gracefully — both APIs return `false` if the element no longer exists.
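
The last point can be sketched as a small fallback helper. This is an illustration under assumptions: `ScrollFn` stands in for `scrollToElement` (whether the real method is synchronous or returns a promise is not stated here, so the sketch accepts both), and `scrollToFirstAvailable` is a hypothetical name.

```typescript
// Hypothetical signature standing in for scrollToElement; the guide only
// states that it reports `false` when the target no longer exists.
type ScrollFn = (id: string) => boolean | Promise<boolean>;

// Try each candidate ID in order and stop at the first one that resolves.
// Returns the ID that was scrolled to, or null if every target is gone.
async function scrollToFirstAvailable(
  scroll: ScrollFn,
  candidateIds: string[],
): Promise<string | null> {
  for (const id of candidateIds) {
    // `await` handles both a plain boolean and a Promise<boolean>.
    if (await scroll(id)) return id;
  }
  return null;
}
```

A caller might pass a comment's ID first and its anchoring block's ID second, so that a resolved or deleted comment still lands the reader on the right paragraph. A `null` result is the cue to re-run extraction rather than retry the same IDs.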
'Extract all document content with stable IDs for RAG pipelines. Returns blocks with full text, comments, and tracked changes — each with an ID compatible with scrollToElement().',
+
expectedResult:
+
'Returns an ExtractResult with blocks (nodeId, type, text, headingLevel), comments (entityId, text, anchoredText, blockId, status, author), tracked changes (entityId, type, excerpt, author, date), and revision.',