Commit c2f2577

feat(document-api): implement doc.extract() for RAG content extraction (SD-2525) (#2774)
* feat(document-api): implement doc.extract() for RAG content extraction (SD-2525)

  Single API method that returns all document content with stable IDs — blocks with full text, comments with anchored block references, and tracked changes with excerpts. Every ID works directly with scrollToElement() for citation navigation.

* fix(document-api): review fixes — heading regex, schema required, tests

  - Use canonical getHeadingLevel() instead of divergent local regex
  - Reuse collectTopLevelBlocks() instead of duplicating block traversal
  - Add required fields to extract output JSON schema
  - Remove fixture-only unit tests that don't call executeExtract
  - Add behavior tests: headings, comments, tracked changes, scrollToElement round-trip

* fix(tests): remove superdoc.click() — fixture uses type() for focus

* fix(cli): add extract operation hints for CLI/SDK wiring
1 parent 8f55848 commit c2f2577

16 files changed, +486 -12 lines changed

apps/cli/src/cli/operation-hints.ts

Lines changed: 4 additions & 0 deletions
@@ -80,6 +80,7 @@ export const SUCCESS_VERB: Record<CliExposedOperationId, string> = {
   getMarkdown: 'extracted markdown',
   getHtml: 'extracted html',
   info: 'retrieved info',
+  extract: 'extracted content',
   clearContent: 'cleared document content',
   insert: 'inserted text',
   replace: 'replaced text',
@@ -255,6 +256,7 @@ export const OUTPUT_FORMAT: Record<CliExposedOperationId, OutputFormat> = {
   getMarkdown: 'plain',
   getHtml: 'plain',
   info: 'documentInfo',
+  extract: 'plain',
   clearContent: 'receipt',
   insert: 'mutationReceipt',
   replace: 'mutationReceipt',
@@ -411,6 +413,7 @@ export const RESPONSE_ENVELOPE_KEY: Record<CliExposedOperationId, string | null>
   getMarkdown: 'markdown',
   getHtml: 'html',
   info: null,
+  extract: null,
   clearContent: 'receipt',
   insert: null,
   replace: null,
@@ -608,6 +611,7 @@ export const OPERATION_FAMILY: Record<CliExposedOperationId, OperationFamily> =
   getMarkdown: 'query',
   getHtml: 'query',
   info: 'general',
+  extract: 'general',
   clearContent: 'general',
   insert: 'textMutation',
   replace: 'textMutation',

apps/docs/document-api/common-workflows.mdx

Lines changed: 56 additions & 0 deletions
@@ -308,6 +308,62 @@ await superdoc.scrollToElement(trackedChangeEntityId);
 No ID is guaranteed to survive all Microsoft Word round-trips. Re-extract addresses after major external edits or transformations, since Word (or other tools) may rewrite paragraph IDs and SuperDoc may rewrite duplicate IDs on import.
 </Warning>
 
+## Content extraction for RAG
+
+`doc.extract()` returns all document content in one call — blocks with full text, comments, and tracked changes. Each item has a stable ID that works directly with [`scrollToElement`](/core/superdoc/methods#scrolltoelement).
+
+```ts
+const content = editor.doc.extract();
+
+// Every block in document order, with full text
+for (const block of content.blocks) {
+  console.log(block.nodeId, block.type, block.text);
+  // → '5AF80E61', 'heading', 'Chapter 1: Introduction'
+  // → '17FBFA43', 'paragraph', 'This is the opening paragraph...'
+}
+
+// Comments anchored to blocks
+for (const comment of content.comments) {
+  console.log(comment.entityId, comment.blockId, comment.text);
+}
+
+// Tracked changes
+for (const tc of content.trackedChanges) {
+  console.log(tc.entityId, tc.type, tc.excerpt);
+}
+```
+
+### RAG pipeline pattern
+
+Extract content, chunk it, store the IDs, and navigate back on click:
+
+```ts
+// 1. Extract all content
+const { blocks } = editor.doc.extract();
+
+// 2. Chunk and embed (your pipeline)
+const chunks = blocks
+  .filter((b) => b.text.length > 0)
+  .map((b) => ({
+    id: b.nodeId,
+    text: b.text,
+    type: b.type,
+    headingLevel: b.headingLevel,
+  }));
+const embeddings = await embedChunks(chunks);
+
+// 3. Store embeddings with nodeIds
+await vectorStore.upsert(embeddings);
+
+// 4. Later — user clicks a citation
+const citation = await vectorStore.query(userQuestion);
+await superdoc.scrollToElement(citation.id);
+```
+
+<Info>
+  All IDs from `doc.extract()` work directly with `scrollToElement()` — no conversion needed. For DOCX-imported content, block `nodeId` values are stable across sessions.
+</Info>
+
 ## Read document counts
 
 `doc.info()` returns a snapshot of current document statistics including word, character, paragraph, heading, table, image, comment, tracked-change, SDT-field, and list counts.
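
Note on the RAG pipeline pattern added above: its `chunks` mapping treats each block independently. One possible refinement, sketched here as an assumption and not something this commit ships, is to carry the nearest preceding heading along with each chunk so that retrieved passages keep their section context. `ExtractedBlock` mirrors the block fields named in the extract output; `Chunk` and `toChunks` are hypothetical names.

```ts
// Mirrors the block fields returned by doc.extract() (nodeId, type, text, headingLevel).
interface ExtractedBlock {
  nodeId: string;
  type: string;
  text: string;
  headingLevel?: number;
}

// Hypothetical chunk record for a vector store; not part of the SuperDoc API.
interface Chunk {
  id: string;
  text: string;
  section: string | null;
}

// Walk blocks in document order, remember the latest heading text,
// and skip blocks with no text (same filter as the docs example).
function toChunks(blocks: ExtractedBlock[]): Chunk[] {
  const chunks: Chunk[] = [];
  let section: string | null = null;
  for (const block of blocks) {
    if (block.type === 'heading' && block.text.length > 0) {
      section = block.text;
    }
    if (block.text.length === 0) continue;
    chunks.push({ id: block.nodeId, text: block.text, section });
  }
  return chunks;
}
```

Because `id` is the unchanged `nodeId`, each chunk can still be passed to `scrollToElement()` as described in the docs change.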

apps/docs/guides/general/stable-navigation.mdx

Lines changed: 10 additions & 11 deletions
@@ -13,18 +13,17 @@ SuperDoc has two navigation approaches depending on your use case:
 
 ## Navigate by element ID
 
-`scrollToElement` takes any element ID — paragraph, comment, or tracked change — and scrolls to it. The ID comes from the Document API.
+`scrollToElement` takes any element ID — paragraph, comment, or tracked change — and scrolls to it. Use `doc.extract()` to get all IDs at once, or `query.match` for targeted lookups.
 
 ```javascript
-// Get an element's ID
-const match = editor.doc.query.match({
-  select: { type: 'text', pattern: 'Introduction', mode: 'contains' },
-  require: 'first',
-});
-const nodeId = match.items[0].address.nodeId;
+// Extract all content with stable IDs
+const { blocks, comments } = editor.doc.extract();
+
+// Navigate to any block
+await superdoc.scrollToElement(blocks[0].nodeId);
 
-// Navigate to it — works for paragraphs, comments, tracked changes
-await superdoc.scrollToElement(nodeId);
+// Navigate to a comment
+await superdoc.scrollToElement(comments[0].entityId);
 ```
 
 This is the approach to use for:
@@ -33,7 +32,7 @@ This is the approach to use for:
 - **Search results** — scroll to the matching paragraph
 - **Cross-session addressing** — IDs from DOCX-imported content survive reloads
 
-For the full cross-session pattern, see [cross-session block addressing](/document-api/common-workflows#cross-session-block-addressing).
+For the full extraction pattern, see [content extraction for RAG](/document-api/common-workflows#content-extraction-for-rag). For the cross-session pattern, see [cross-session block addressing](/document-api/common-workflows#cross-session-block-addressing).
 
 ## Track nodes during edits
 
@@ -62,7 +61,7 @@ function goToLink(link) {
 
 ## Best practices
 
-- Use `scrollToElement` when you have an element ID from the Document API.
+- Use `scrollToElement` when you have an element ID from `doc.extract()` or the Document API.
 - Use `PositionTracker` when you need to follow nodes that move during edits.
 - For cross-session use, store `nodeId` values (not `sdBlockId` — those regenerate on each open).
 - Handle missing targets gracefully — both APIs return `false` if the element no longer exists.
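
Note on the last best-practice bullet: missing targets can also be detected ahead of navigation. After re-extracting, compare the IDs held in a store against the IDs the document currently exposes. A minimal sketch, assuming stored IDs are plain `nodeId` strings; `BlockRef` and `findStaleIds` are hypothetical names, not part of SuperDoc.

```ts
// Only the field needed for the comparison; blocks from doc.extract() carry more.
interface BlockRef {
  nodeId: string;
}

// Return the stored IDs that no longer appear in the latest extraction,
// preserving the order in which they were stored.
function findStaleIds(storedIds: string[], currentBlocks: BlockRef[]): string[] {
  const live = new Set(currentBlocks.map((block) => block.nodeId));
  return storedIds.filter((id) => !live.has(id));
}
```

Stale IDs found this way could be re-embedded or dropped before a user ever clicks a citation that would fail to scroll.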

packages/document-api/src/contract/operation-definitions.ts

Lines changed: 13 additions & 0 deletions
@@ -648,6 +648,19 @@ export const OPERATION_DEFINITIONS = {
     intentGroup: 'get_content',
     intentAction: 'info',
   },
+  extract: {
+    memberPath: 'extract',
+    description:
+      'Extract all document content with stable IDs for RAG pipelines. Returns blocks with full text, comments, and tracked changes — each with an ID compatible with scrollToElement().',
+    expectedResult:
+      'Returns an ExtractResult with blocks (nodeId, type, text, headingLevel), comments (entityId, text, anchoredText, blockId, status, author), tracked changes (entityId, type, excerpt, author, date), and revision.',
+    requiresDocumentContext: true,
+    metadata: readOperation(),
+    referenceDocPath: 'extract.mdx',
+    referenceGroup: 'core',
+    intentGroup: 'get_content',
+    intentAction: 'extract',
+  },
 
   clearContent: {
     memberPath: 'clearContent',

packages/document-api/src/contract/operation-registry.ts

Lines changed: 3 additions & 0 deletions
@@ -35,6 +35,8 @@ import type { GetMarkdownInput } from '../get-markdown/get-markdown.js';
 import type { GetHtmlInput } from '../get-html/get-html.js';
 import type { MarkdownToFragmentInput } from '../markdown-to-fragment/markdown-to-fragment.js';
 import type { InfoInput } from '../info/info.js';
+import type { ExtractInput } from '../extract/extract.js';
+import type { ExtractResult } from '../types/extract.types.js';
 import type { ClearContentInput } from '../clear-content/clear-content.js';
 import type { InsertInput } from '../insert/insert.js';
 import type { ReplaceInput } from '../replace/replace.js';
@@ -527,6 +529,7 @@ export interface OperationRegistry extends FormatInlineAliasOperationRegistry {
   getHtml: { input: GetHtmlInput; options: never; output: string };
   markdownToFragment: { input: MarkdownToFragmentInput; options: never; output: SDMarkdownToFragmentResult };
   info: { input: InfoInput; options: never; output: DocumentInfo };
+  extract: { input: ExtractInput; options: never; output: ExtractResult };
 
   // --- Singleton mutations ---
   clearContent: { input: ClearContentInput; options: RevisionGuardOptions; output: Receipt };

packages/document-api/src/contract/schemas.ts

Lines changed: 54 additions & 0 deletions
@@ -2952,6 +2952,60 @@ const operationSchemas: Record<OperationId, OperationSchemaSet> = {
     input: strictEmptyObjectSchema,
     output: documentInfoSchema,
   },
+  extract: {
+    input: strictEmptyObjectSchema,
+    output: objectSchema(
+      {
+        blocks: {
+          type: 'array',
+          items: objectSchema(
+            {
+              nodeId: { type: 'string', description: 'Stable block ID — pass to scrollToElement() for navigation.' },
+              type: { type: 'string', description: 'Block type: paragraph, heading, listItem, table, image, etc.' },
+              text: { type: 'string', description: 'Full plain text content of the block.' },
+              headingLevel: { type: 'integer', description: 'Heading level (1–6). Only present for headings.' },
+            },
+            ['nodeId', 'type', 'text'],
+          ),
+        },
+        comments: {
+          type: 'array',
+          items: objectSchema(
+            {
+              entityId: {
+                type: 'string',
+                description: 'Comment entity ID — pass to scrollToElement() for navigation.',
+              },
+              text: { type: 'string', description: 'Comment body text.' },
+              anchoredText: { type: 'string', description: 'The document text the comment is anchored to.' },
+              blockId: { type: 'string', description: 'Block ID the comment is anchored to.' },
+              status: { type: 'string', enum: ['open', 'resolved'] },
+              author: { type: 'string', description: 'Comment author name.' },
+            },
+            ['entityId', 'status'],
+          ),
+        },
+        trackedChanges: {
+          type: 'array',
+          items: objectSchema(
+            {
+              entityId: {
+                type: 'string',
+                description: 'Tracked change entity ID — pass to scrollToElement() for navigation.',
+              },
+              type: { type: 'string', enum: ['insert', 'delete', 'format'] },
+              excerpt: { type: 'string', description: 'Short text excerpt of the changed content.' },
+              author: { type: 'string', description: 'Change author name.' },
+              date: { type: 'string', description: 'Change date (ISO string).' },
+            },
+            ['entityId', 'type'],
+          ),
+        },
+        revision: { type: 'string', description: 'Document revision at the time of extraction.' },
+      },
+      ['blocks', 'comments', 'trackedChanges', 'revision'],
+    ),
+  },
   clearContent: {
     input: strictEmptyObjectSchema,
     output: receiptResultSchemaFor('clearContent'),
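
Note on the `required` arrays above: they fix which fields a consumer can rely on (`nodeId`, `type`, `text` for blocks; `entityId`, `status` for comments; `entityId`, `type` for tracked changes; all four top-level keys). A hand-written runtime check of just those required string fields might look like the sketch below. It is an illustration under those assumptions, not the project's validator, and `looksLikeExtractResult` is a hypothetical name.

```ts
type Json = Record<string, unknown>;

function isObject(value: unknown): value is Json {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// True when `item` is an object whose listed keys all hold strings.
function hasStrings(item: unknown, keys: string[]): boolean {
  if (!isObject(item)) return false;
  const record: Json = item;
  return keys.every((key) => typeof record[key] === 'string');
}

// True when `value` is an array and every element passes hasStrings.
function everyItem(value: unknown, keys: string[]): boolean {
  return Array.isArray(value) && value.every((item) => hasStrings(item, keys));
}

// Checks only the fields the schema marks as required; optional fields
// (headingLevel, anchoredText, author, ...) and enum values are not inspected.
function looksLikeExtractResult(value: unknown): boolean {
  if (!isObject(value)) return false;
  return (
    everyItem(value['blocks'], ['nodeId', 'type', 'text']) &&
    everyItem(value['comments'], ['entityId', 'status']) &&
    everyItem(value['trackedChanges'], ['entityId', 'type']) &&
    typeof value['revision'] === 'string'
  );
}
```

A check like this is mainly useful at a process boundary, for example when extract output has been serialized by the CLI and read back by another tool.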
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+import { describe, expect, it, mock } from 'bun:test';
+import type { ExtractResult } from '../types/extract.types.js';
+import { executeExtract } from './extract.js';
+import type { ExtractAdapter } from './extract.js';
+
+const DEFAULT_EXTRACT: ExtractResult = {
+  blocks: [
+    { nodeId: 'h1', type: 'heading', text: 'Introduction', headingLevel: 1 },
+    { nodeId: 'p1', type: 'paragraph', text: 'First paragraph content.' },
+    { nodeId: 'p2', type: 'paragraph', text: '' },
+  ],
+  comments: [
+    { entityId: 'c1', text: 'Fix this', anchoredText: 'content', blockId: 'p1', status: 'open', author: 'Alice' },
+  ],
+  trackedChanges: [{ entityId: 'tc1', type: 'insert', excerpt: 'new text', author: 'Bob', date: '2026-01-01' }],
+  revision: '5',
+};
+
+describe('executeExtract', () => {
+  it('delegates to adapter.extract with the input', () => {
+    const adapter: ExtractAdapter = {
+      extract: mock(() => DEFAULT_EXTRACT),
+    };
+
+    const result = executeExtract(adapter, {});
+
+    expect(result).toBe(DEFAULT_EXTRACT);
+    expect(adapter.extract).toHaveBeenCalledWith({});
+  });
+
+  it('passes through full text without truncation', () => {
+    const longText = 'A'.repeat(200);
+    const extractResult: ExtractResult = {
+      ...DEFAULT_EXTRACT,
+      blocks: [{ nodeId: 'p1', type: 'paragraph', text: longText }],
+    };
+    const adapter: ExtractAdapter = {
+      extract: mock(() => extractResult),
+    };
+
+    const result = executeExtract(adapter, {});
+
+    expect(result.blocks[0].text).toBe(longText);
+    expect(result.blocks[0].text.length).toBe(200);
+  });
+});
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+import type { ExtractResult } from '../types/extract.types.js';
+
+export type ExtractInput = Record<string, never>;
+
+/**
+ * Engine-specific adapter that provides document content extraction.
+ */
+export interface ExtractAdapter {
+  /**
+   * Extract all document content with stable IDs for RAG pipelines.
+   */
+  extract(input: ExtractInput): ExtractResult;
+}
+
+/**
+ * Execute an extract operation through the provided adapter.
+ */
+export function executeExtract(adapter: ExtractAdapter, input: ExtractInput): ExtractResult {
+  return adapter.extract(input);
+}
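
Note on the adapter contract above: `executeExtract` is a plain pass-through, so any object with a matching `extract` method satisfies it. The sketch below re-declares the types locally so it stands alone; `ExtractResultLite` is a trimmed assumption (the real `ExtractResult` also carries `comments` and `trackedChanges`), and `createStaticAdapter` is a hypothetical helper that does not depend on `bun:test` mocks.

```ts
// Trimmed, locally re-declared shapes for illustration only.
interface ExtractResultLite {
  blocks: { nodeId: string; type: string; text: string }[];
  revision: string;
}
type ExtractInput = Record<string, never>;
interface ExtractAdapter {
  extract(input: ExtractInput): ExtractResultLite;
}

// Same pass-through shape as executeExtract in the diff above.
function executeExtract(adapter: ExtractAdapter, input: ExtractInput): ExtractResultLite {
  return adapter.extract(input);
}

// Hypothetical helper: an adapter that always returns the same snapshot
// and counts how many times it was called.
function createStaticAdapter(snapshot: ExtractResultLite): ExtractAdapter & { calls: number } {
  const adapter: ExtractAdapter & { calls: number } = {
    calls: 0,
    extract(_input: ExtractInput): ExtractResultLite {
      adapter.calls += 1;
      return snapshot;
    },
  };
  return adapter;
}
```

Keeping the engine behind an adapter like this is what lets the contract package be tested without a running editor.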

packages/document-api/src/index.ts

Lines changed: 12 additions & 0 deletions
@@ -63,6 +63,7 @@ import type {
   SDMutationReceipt,
   TrackChangeInfo,
   TrackChangesListResult,
+  ExtractResult,
 } from './types/index.js';
 import type { CommentInfo, CommentsListQuery, CommentsListResult } from './comments/comments.types.js';
 import type {
@@ -115,6 +116,7 @@ import {
 } from './markdown-to-fragment/markdown-to-fragment.js';
 import type { SDMarkdownToFragmentResult } from './types/sd-contract.js';
 import { executeInfo, type InfoAdapter, type InfoInput } from './info/info.js';
+import { executeExtract, type ExtractAdapter, type ExtractInput } from './extract/extract.js';
 import {
   executeClearContent,
   type ClearContentAdapter,
@@ -889,6 +891,7 @@ export type { GetTextAdapter, GetTextInput } from './get-text/get-text.js';
 export type { GetMarkdownAdapter, GetMarkdownInput } from './get-markdown/get-markdown.js';
 export type { GetHtmlAdapter, GetHtmlInput } from './get-html/get-html.js';
 export type { InfoAdapter, InfoInput } from './info/info.js';
+export type { ExtractAdapter, ExtractInput } from './extract/extract.js';
 export type { WriteAdapter, WriteRequest } from './write/write.js';
 export type {
   FormatInlineAliasApi,
@@ -1531,6 +1534,11 @@ export interface DocumentApi {
    * Return document summary info including document counts and capabilities.
    */
   info(input: InfoInput): DocumentInfo;
+  /**
+   * Extract all document content with stable IDs for RAG pipelines.
+   * Returns blocks with full text, comments, and tracked changes.
+   */
+  extract(input: ExtractInput): ExtractResult;
   /**
    * Clear all document body content, leaving a single empty paragraph.
    */
@@ -1695,6 +1703,7 @@ export interface DocumentApiAdapters {
   getHtml: GetHtmlAdapter;
   markdownToFragment: MarkdownToFragmentAdapter;
   info: InfoAdapter;
+  extract: ExtractAdapter;
   clearContent: ClearContentAdapter;
   capabilities: CapabilitiesAdapter;
   comments: CommentsAdapter;
@@ -1894,6 +1903,9 @@ export function createDocumentApi(adapters: DocumentApiAdapters): DocumentApi {
     info(input: InfoInput): DocumentInfo {
       return executeInfo(adapters.info, input);
     },
+    extract(input: ExtractInput): ExtractResult {
+      return executeExtract(adapters.extract, input);
+    },
     clearContent(input: ClearContentInput, options?: RevisionGuardOptions): Receipt {
       return executeClearContent(adapters.clearContent, input, options);
     },

packages/document-api/src/invoke/invoke.ts

Lines changed: 1 addition & 0 deletions
@@ -67,6 +67,7 @@ export function buildDispatchTable(api: DocumentApi): TypedDispatchTable {
     getHtml: (input) => api.getHtml(input),
     markdownToFragment: (input) => api.markdownToFragment(input),
     info: (input) => api.info(input),
+    extract: (input) => api.extract(input),
 
     // --- Singleton mutations ---
     clearContent: (input, options) => api.clearContent(input, options),
