Commit ca48ab9

feat(compare-rendering): add Word↔SuperDoc paragraph-diff CLI (M1)
Diffs resolved paragraph state between Word (via word-mcp run_powershell on a Windows VM) and SuperDoc (via layout:export-one) for paragraph-only docx files. Emits typed Finding[] with category/severity/specRef/codeArea hints so an agent consumer can route fixes to the right SuperDoc module.

Unsupported features (tables, inline/floating shapes, tracked changes, comments) short-circuit with a skipped finding rather than producing a misleading diff. Word extraction is cached by sha256(docx) + sha256(extract-layout.ps1) so PS edits bust the cache automatically.

Scope: paragraph-only flow. Categories emitted: text, pagination, structure, unsupported. Style/indent/color/numbering deferred to M2.
1 parent 3e9b368 commit ca48ab9

17 files changed

Lines changed: 1102 additions & 0 deletions
Lines changed: 1 addition & 0 deletions

```
.cache/
```
Lines changed: 96 additions & 0 deletions
# compare-rendering

Diffs Word and SuperDoc rendering of the same `.docx` at the *resolved schema* level — text, page assignment, and (in later milestones) font/indent/color/numbering. Emits typed `Finding[]` so an agent can route fixes to specific SuperDoc modules.

This is a dev tool, not a pass/fail test. It surfaces concrete divergences so you don't have to compare screenshots by eye.

## Scope (M1)

- **Supported:** paragraph-only documents (text-heavy memos, letters, policies).
- **Short-circuited with a reason:** docs containing tables, inline/floating shapes, or tracked changes. The report emits an `unsupported` finding and skips the diff — an honest boundary rather than a misleading "everything looks fine."
- **Categories emitted in M1:** `text`, `pagination`, `structure`, `unsupported`. Style/indent/color/numbering come in M2, once the SuperDoc-side normalizer pulls resolved values out of `measures[]` and `runs[]`.
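For orientation, here is a sketch of the `Finding` shape, inferred from the fields the CLI populates when building a report (`category`, `severity`, `paragraphOrdinal`, `word`, `superdoc`, `message`) plus the `specRef`/`codeArea` hints mentioned in the commit message. The real definition lives in `src/types.ts`; treat this as illustrative, not authoritative:

```typescript
// Illustrative sketch of the Finding shape — src/types.ts is authoritative.
type Category = 'text' | 'pagination' | 'structure' | 'unsupported';
type Severity = 'blocking' | 'visible' | 'cosmetic';

interface Finding {
  category: Category;
  severity: Severity;
  paragraphOrdinal: number;  // paragraph index in the document
  word: string | null;       // what Word resolved for this paragraph
  superdoc: string | null;   // what SuperDoc resolved
  message: string;           // human-readable summary for the report
  specRef?: string;          // e.g. "ECMA-376 §17.3.1.16"
  codeArea?: string;         // SuperDoc module hint for routing a fix
}

const example: Finding = {
  category: 'pagination',
  severity: 'visible',
  paragraphOrdinal: 39,
  word: 'page 2',
  superdoc: 'page 1',
  message: 'Paragraph #39 landed on page 1 in SuperDoc but page 2 in Word',
};
```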

## Quick start

```bash
export WORD_MCP_URL="https://word-mcp.superdoc.workers.dev/mcp"
export WORD_MCP_TOKEN="<your-bearer-token>"

pnpm compare-rendering -- \
  --input evals/fixtures/docs/memorandum.docx \
  --format md
```

Run directly without the wrapper:

```bash
bun devtools/compare-rendering/src/cli.ts --input <path> --format md
```

Example output (truncated):

```markdown
# compare-rendering: memorandum.docx

- Word pages: 3, SuperDoc pages: 3
- Word paragraphs: 94, SuperDoc paragraphs: 94

## Findings (2)

### pagination (2)
- **[visible]** Paragraph #39 landed on page 1 in SuperDoc but page 2 in Word (empty line)
  - spec: ECMA-376 §17.3.1.16 (keepNext/keepLines/pageBreakBefore)
  - code: `layout-engine/layout-engine/src/pagination`
- **[visible]** Paragraph #80 landed on page 2 in SuperDoc but page 3 in Word (" - Any press releases…")
  - spec: ECMA-376 §17.3.1.16 (keepNext/keepLines/pageBreakBefore)
  - code: `layout-engine/layout-engine/src/pagination`
```

## How it works

```
docx
 ├── word adapter (POST run_powershell to word-mcp worker) ─► word.json (cached)
 └── superdoc adapter (spawn pnpm layout:export-one)       ─► sd.layout.json

normalize both sides

NormalizedParagraph[] × 2

differ + taxonomy

Finding[] report
```

- Word extraction is **cached** by `sha256(docx) + sha256(extract-layout.ps1)`. Editing SuperDoc code and re-running the tool only re-runs the SuperDoc side — no re-hit to the VM (~25s saved per iteration). Editing the PowerShell script busts the cache automatically.
- Bypass the cache for a single run with `--no-cache`.
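To make the normalize-then-diff step concrete, here is an illustrative sketch of how paragraph pairs could be compared. This is *not* the actual `src/differ.ts`; the `NormalizedParagraph` fields (`ordinal`, `text`, `page`) are assumptions based on the categories this tool emits:

```typescript
// Sketch only — the real differ lives in src/differ.ts and carries
// spec/codeArea routing metadata. Field names here are assumed.
interface NormalizedParagraph { ordinal: number; text: string; page: number; }
interface Finding {
  category: 'text' | 'pagination' | 'structure';
  severity: 'blocking' | 'visible' | 'cosmetic';
  paragraphOrdinal: number;
  message: string;
}

function diffParagraphsSketch(word: NormalizedParagraph[], sd: NormalizedParagraph[]): Finding[] {
  const findings: Finding[] = [];
  // Paragraph-count mismatch is a structural divergence.
  if (word.length !== sd.length) {
    findings.push({
      category: 'structure',
      severity: 'blocking',
      paragraphOrdinal: Math.min(word.length, sd.length),
      message: `paragraph count differs: Word ${word.length} vs SuperDoc ${sd.length}`,
    });
  }
  // Walk the common prefix: text mismatches first, then page assignment.
  const n = Math.min(word.length, sd.length);
  for (let i = 0; i < n; i++) {
    if (word[i].text !== sd[i].text) {
      findings.push({
        category: 'text', severity: 'blocking', paragraphOrdinal: i,
        message: `text differs at paragraph #${i}`,
      });
    } else if (word[i].page !== sd[i].page) {
      findings.push({
        category: 'pagination', severity: 'visible', paragraphOrdinal: i,
        message: `paragraph #${i} on page ${sd[i].page} in SuperDoc but page ${word[i].page} in Word`,
      });
    }
  }
  return findings;
}
```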

## Env

| Variable         | Purpose                                             |
|------------------|-----------------------------------------------------|
| `WORD_MCP_URL`   | HTTP endpoint of the word-mcp MCP worker            |
| `WORD_MCP_TOKEN` | Bearer token (same one you use in your `.mcp.json`) |

## Exit codes

- `0` — ran successfully; findings are at most `visible`/`cosmetic` (or no findings at all)
- `1` — tool error (network, missing input, bad args)
- `2` — ran successfully but emitted at least one `blocking` finding

This split makes the tool CI-usable later without rework.
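A sketch of how a CI step could consume those exit codes. The `classify_exit` helper is hypothetical, not part of the tool:

```shell
# Maps compare-rendering exit codes to a CI verdict (hypothetical helper).
classify_exit() {
  case "$1" in
    0) echo "pass"  ;;  # clean, or only visible/cosmetic findings
    2) echo "fail"  ;;  # at least one blocking finding
    *) echo "error" ;;  # tool error: network, missing input, bad args
  esac
}

# Hypothetical CI usage:
#   pnpm compare-rendering -- --input "$DOC" --format json --output report.json
#   verdict=$(classify_exit $?)
#   [ "$verdict" = "fail" ] && exit 1
```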

## Non-goals

- Pixel diffing (see `tests/visual/`).
- Tables, images, shapes, track changes, headers/footers, comments, TOC — deferred past M5.
- Auto-fix generation.
- Publishing as a package.

## Milestones

- **M1** (this): CLI works end-to-end on paragraph-only docs. Four categories (`text`, `pagination`, `structure`, `unsupported`). JSON + markdown output. Caching.
- **M2**: Pull resolved style fields out of SuperDoc's block schema. Taxonomy extends to `style`, `indent`, `font`, `color`, `alignment`, `spacing`, `numbering`.
- **M3**: Batch mode (`--input-dir`), nightly run against the paragraph-only subset of the corpus, per-category dashboard.
- **M4**: MCP wrapper `compare_rendering(docx_path)`. Agent dogfood with the ECMA-spec MCP in context.
- **M5**: Table support. Non-trivial — needs parallel table walks on both sides.
Lines changed: 10 additions & 0 deletions

```json
{
  "private": true,
  "type": "module",
  "name": "compare-rendering",
  "scripts": {
    "start": "bun src/cli.ts",
    "typecheck": "tsc --noEmit",
    "test": "vitest run"
  }
}
```
Lines changed: 36 additions & 0 deletions

```ts
import { createHash } from 'node:crypto';
import { mkdir, readFile, stat, writeFile } from 'node:fs/promises';
import { fileURLToPath } from 'node:url';
import { dirname, join } from 'node:path';

const CACHE_DIR = fileURLToPath(new URL('../.cache/word', import.meta.url));

export function sha256(bytes: Uint8Array | string): string {
  const h = createHash('sha256');
  h.update(bytes);
  return h.digest('hex');
}

export async function hashFile(path: string): Promise<string> {
  return sha256(await readFile(path));
}

function cachePath(sha: string, keySuffix: string): string {
  return join(CACHE_DIR, `${sha}-${keySuffix}.json`);
}

export async function readCache<T>(sha: string, keySuffix: string): Promise<T | null> {
  const p = cachePath(sha, keySuffix);
  try {
    await stat(p);
  } catch {
    return null;
  }
  return JSON.parse(await readFile(p, 'utf8')) as T;
}

export async function writeCache<T>(sha: string, keySuffix: string, value: T): Promise<void> {
  const p = cachePath(sha, keySuffix);
  await mkdir(dirname(p), { recursive: true });
  await writeFile(p, JSON.stringify(value), 'utf8');
}
```
Lines changed: 161 additions & 0 deletions

```ts
#!/usr/bin/env bun
import { parseArgs as nodeParseArgs } from 'node:util';
import { resolve } from 'node:path';
import { writeFile } from 'node:fs/promises';
import { extractWord } from './word.ts';
import { extractSuperDoc } from './superdoc.ts';
import { normalizeSuperDoc, normalizeWord } from './normalize.ts';
import { diffParagraphs } from './differ.ts';
import { formatJson, formatMarkdown } from './format.ts';
import type { CompareReport, Finding } from './types.ts';

type Args = {
  input: string;
  output?: string;
  format: 'json' | 'md';
  pipeline: 'presentation' | 'headless';
  cache: boolean;
};

const USAGE = `compare-rendering — diff Word vs SuperDoc rendering (paragraph-only scope)

Usage:
  pnpm compare-rendering -- --input <docx> [options]

Options:
  --input <path>                    Required. Path to a .docx file.
  --output <path>                   Write the report to a file (default: stdout).
  --format json|md                  Output format (default: json).
  --pipeline presentation|headless  SuperDoc layout pipeline (default: presentation).
  --no-cache                        Bypass the Word extraction cache.
  -h, --help                        Show this help.

Env:
  WORD_MCP_URL    HTTP endpoint of the word-mcp worker.
  WORD_MCP_TOKEN  Bearer token for the worker.

Exit codes:
  0 — ran; findings are at most visible/cosmetic.
  1 — tool error (network, missing file, bad args).
  2 — ran; emitted at least one blocking finding.`;

function parseArgs(argv: string[]): Args {
  const { values } = nodeParseArgs({
    args: argv,
    options: {
      input: { type: 'string' },
      output: { type: 'string' },
      format: { type: 'string', default: 'json' },
      pipeline: { type: 'string', default: 'presentation' },
      'no-cache': { type: 'boolean', default: false },
      help: { type: 'boolean', short: 'h', default: false },
    },
    strict: true,
    allowPositionals: false,
  });

  if (values.help) {
    console.log(USAGE);
    process.exit(0);
  }

  if (!values.input) throw new Error('--input <docx> is required');
  if (values.format !== 'json' && values.format !== 'md') {
    throw new Error(`--format must be json or md, got "${values.format}"`);
  }
  if (values.pipeline !== 'presentation' && values.pipeline !== 'headless') {
    throw new Error(`--pipeline must be presentation or headless, got "${values.pipeline}"`);
  }

  return {
    input: values.input,
    output: values.output,
    format: values.format,
    pipeline: values.pipeline,
    cache: !values['no-cache'],
  };
}

function hasBlocking(findings: Finding[]): boolean {
  return findings.some((f) => f.severity === 'blocking');
}

const log = (msg: string) => console.error(`[compare-rendering] ${msg}`);

async function main(): Promise<void> {
  const args = parseArgs(process.argv.slice(2));
  const docxPath = resolve(args.input);

  log(`word: extracting ${docxPath}`);
  const wordStart = Date.now();
  const { extraction: wordExtraction, sha, cached } = await extractWord(docxPath, { cache: args.cache });
  log(`word: ${cached ? 'cached' : 'fresh'} extraction in ${Date.now() - wordStart}ms (sha=${sha.slice(0, 12)})`);

  if (!wordExtraction.supported) {
    const report: CompareReport = {
      docxPath,
      docxSha: sha,
      wordSupported: false,
      unsupportedReason: wordExtraction.unsupportedReason,
      counts: {
        wordParagraphs: 0,
        superdocParagraphs: 0,
        wordPages: wordExtraction.pageCount,
        superdocPages: 0,
      },
      findings: [
        {
          category: 'unsupported',
          severity: 'cosmetic',
          paragraphOrdinal: 0,
          word: wordExtraction.unsupportedReason,
          superdoc: null,
          message: `Document skipped: ${wordExtraction.unsupportedReason ?? 'unsupported'}`,
        },
      ],
    };
    await emit(report, args);
    return;
  }

  log('superdoc: running layout:export-one');
  const sdStart = Date.now();
  const sdExtraction = await extractSuperDoc(docxPath, { pipeline: args.pipeline });
  log(`superdoc: extracted in ${Date.now() - sdStart}ms`);

  const wordParas = normalizeWord(wordExtraction);
  const sdParas = normalizeSuperDoc(sdExtraction);

  const findings = diffParagraphs(wordParas, sdParas);

  const report: CompareReport = {
    docxPath,
    docxSha: sha,
    wordSupported: true,
    counts: {
      wordParagraphs: wordParas.length,
      superdocParagraphs: sdParas.length,
      wordPages: wordExtraction.pageCount,
      superdocPages: sdExtraction.pageCount,
    },
    findings,
  };

  await emit(report, args);
  if (hasBlocking(findings)) process.exitCode = 2;
}

async function emit(report: CompareReport, args: Args): Promise<void> {
  const out = args.format === 'md' ? formatMarkdown(report) : formatJson(report);
  if (args.output) {
    await writeFile(resolve(args.output), out, 'utf8');
    log(`wrote ${resolve(args.output)}`);
  } else {
    process.stdout.write(out);
  }
}

main().catch((e) => {
  console.error(`[compare-rendering] error: ${(e as Error).message}`);
  process.exit(1);
});
```
