Merged
24 commits
bcc4d69
feat(db): add provenance foundation
caiopizzol Apr 27, 2026
a0516ad
feat(db): add empty XSD schema tables (Phase 2)
caiopizzol Apr 27, 2026
98348c1
fix(db): xsd_compositors XOR check on parent
caiopizzol Apr 27, 2026
305873a
feat(xsd): scaffold ECMA Transitional fetch (Phase 3a)
caiopizzol Apr 27, 2026
b8c36ac
chore(xsd): drop unused existsSync import
caiopizzol Apr 27, 2026
d6be885
chore(xsd): pin ECMA Transitional Part 4 zip hash
caiopizzol Apr 27, 2026
0f319c7
feat(xsd): parser scaffolding (Phase 3b)
caiopizzol Apr 27, 2026
a1d7cba
feat(xsd): symbol + inheritance ingest (Phase 3c)
caiopizzol Apr 27, 2026
6cb04ac
feat(xsd): content model ingest (Phase 3d)
caiopizzol Apr 27, 2026
280e76f
fix(xsd): make Phase 3d content-model ingest idempotent
caiopizzol Apr 27, 2026
33072f5
feat(xsd): attributes, attributeGroup refs, and enums (Phase 3e)
caiopizzol Apr 27, 2026
c742f43
fix(xsd): preserve element/attr type and group-ref compositor metadata
caiopizzol Apr 27, 2026
e1c5cb0
feat(mcp): add read-only OOXML structural tools (Phase 4)
caiopizzol Apr 27, 2026
7b0898c
chore(mcp): split ooxml dispatch + add local e2e harness
caiopizzol Apr 27, 2026
99d149f
fix(mcp): correct inheritance order, compositor flattening, and neste…
caiopizzol Apr 27, 2026
f1b3223
chore(mcp): biome line wrapping
caiopizzol Apr 27, 2026
cb3e16d
fix(xsd): scope local elements per-owner; link xsd-builtin symbols to…
caiopizzol Apr 27, 2026
5013eec
chore: clean up internal phase markers and reorganize scripts
caiopizzol Apr 27, 2026
99ec4eb
fix(mcp): address review correctness gaps + tighten test isolation
caiopizzol Apr 27, 2026
076f096
chore(mcp): remove ENABLE_OOXML_TOOLS feature flag
caiopizzol Apr 27, 2026
b473736
chore: drop ooxml-call dev harness; add data/README; surface structur…
caiopizzol Apr 27, 2026
b313a9e
feat(xsd): default fetch URL + sha256 to data/sources.json
caiopizzol Apr 27, 2026
aee4d87
feat(sources): pin all four ECMA-376 parts in the manifest
caiopizzol Apr 27, 2026
07f1086
chore(xsd): fix stale db:sync-sources reference in ingest error message
caiopizzol Apr 27, 2026
8 changes: 7 additions & 1 deletion .gitignore
@@ -5,4 +5,10 @@ dev/
.wrangler/
.env
.mcp.json
.vscode/
.vscode/

# Local-only planning doc (public repo)
PLAN.md

# XSD/spec artifacts: pulled by scripts/fetch-xsd.ts; never committed.
data/xsd-cache/
58 changes: 47 additions & 11 deletions CLAUDE.md
@@ -29,15 +29,22 @@ apps/
src/data/docs.ts ← All doc pages live here (single source of truth)
src/components/ UI components (Sidebar, SuperDocPreview, etc.)
src/pages/ Route pages (Home, Docs, SpecExplorer, Mcp)
mcp-server/ Cloudflare Worker MCP server for AI spec search
mcp-server/ Cloudflare Worker - MCP server (semantic + structural tools)
packages/
shared/ Database client, embedding client, types
scripts/
ingest/ PDF → chunks → embeddings → database pipeline
ingest-pdf/ ECMA PDF -> spec_content (semantic search corpus)
ingest-xsd/ ECMA XSDs -> schema graph (structural query corpus)
sources-sync.ts data/sources.json -> reference_sources
db-migrate.ts Apply db/migrations/*.sql in order
db/
schema.sql PostgreSQL + pgvector schema
schema.sql PostgreSQL + pgvector + XSD schema graph
migrations/ Numbered, idempotent SQL migrations
data/
sources.json Source manifest (artifact URLs, sha256, license notes)
xsd-cache/ Local-only XSD download cache (gitignored)
dev/
data/ Extracted/chunked/embedded spec content
data/ Extracted/chunked/embedded PDF content
```

## Commands
@@ -97,23 +104,52 @@ The XML you provide is wrapped in a minimal `w:document > w:body` structure auto

## MCP Server

Cloudflare Worker exposing three MCP tools for semantic spec search:
Cloudflare Worker exposing two flavors of MCP tools backed by the same database.

- `search_ecma_spec` — semantic vector search across 18,000+ spec chunks
- `get_section` — fetch a specific section by ID (e.g., "17.3.1.24")
- `list_parts` — browse the spec structure
Semantic search over the spec PDF (powered by `spec_content`):

- `search_ecma_spec` - semantic vector search across 18,000+ spec chunks
- `get_section` - fetch a specific section by ID (e.g., "17.3.1.24")
- `list_parts` - browse the spec structure

Structural queries over the XSD schema graph (powered by `xsd_*` tables):

- `ooxml_lookup_element` / `ooxml_lookup_type` - canonical symbol info
- `ooxml_children` - legal children of an element/type/group, in document order
- `ooxml_attributes` - attributes including those inherited and unfolded from attributeGroup refs
- `ooxml_enum` - simpleType enumeration values
- `ooxml_namespace_info` - vocabularies and per-profile symbol counts for a namespace URI

Uses PostgreSQL with pgvector (Neon serverless in production, Docker locally).

## Data Pipeline
## Data Pipelines

Two ingest paths feed the same database. Both are reproducible from `data/sources.json`.

Ingests ECMA-376 PDFs into the vector database:
**PDF (semantic corpus, into `spec_content`)**:

```
PDF → extract (Python) → chunk (6KB) → embed (Voyage) → upload (PostgreSQL)
```

Run the full pipeline: `bun scripts/ingest/pipeline.ts`
```bash
bun run pdf:ingest 1 ./pdfs/ECMA-376-Part1.pdf # full pipeline for one part
```

See `scripts/ingest-pdf/README.md`.

**XSD (structural corpus, into `xsd_*` tables)**:

```
ECMA Part 4 zip → fetch+verify (sha256) → parse → ingest (single transaction)
```

```bash
bun run xsd:fetch # URL + sha256 from data/sources.json
bun run xsd:ingest
```

See `scripts/ingest-xsd/README.md`.
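
The fetch+verify step in the XSD pipeline above can be sketched as follows. This is a minimal illustration, not the code in `scripts/`: the `SourceEntry` shape and the function names `sha256Hex` and `verifyArtifact` are assumptions made for the example, and the real layout of `data/sources.json` may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of one manifest entry; the real data/sources.json may differ.
interface SourceEntry {
  url: string;
  sha256: string; // expected hex digest of the downloaded artifact
}

// Compute the SHA-256 digest of a byte buffer as lowercase hex.
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Throw if the downloaded bytes do not match the pinned digest.
function verifyArtifact(bytes: Uint8Array, entry: SourceEntry): void {
  const actual = sha256Hex(bytes);
  if (actual !== entry.sha256.toLowerCase()) {
    throw new Error(
      `sha256 mismatch for ${entry.url}: expected ${entry.sha256}, got ${actual}`,
    );
  }
}
```

Failing hard on a mismatch, before any parsing starts, is what keeps an ingest reproducible: a changed upstream zip stops the run instead of silently producing a different schema graph.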

## Database

14 changes: 9 additions & 5 deletions README.md
@@ -10,9 +10,10 @@ The OOXML spec, explained by people who actually implemented it.

An interactive reference for ECMA-376 (Office Open XML) built by the [SuperDoc — DOCX editing and tooling](https://superdoc.dev) team. Every page combines XML structure, live rendered previews, and implementation notes that tell you what the spec doesn't.

- **Live previews** — Edit XML and see it render in real-time. Every example is a working document.
- **Implementation notes** — Where Word diverges from the spec, what will break your code, and what to do about it.
- **Semantic spec search** — 18,000+ spec chunks searchable by meaning via MCP server.
- **Live previews** - Edit XML and see it render in real-time. Every example is a working document.
- **Implementation notes** - Where Word diverges from the spec, what will break your code, and what to do about it.
- **Semantic spec search** - 18,000+ spec chunks searchable by meaning via MCP server.
- **Structural schema lookup** - Element children, attributes, types, enums, namespaces. Same MCP server, deterministic answers from the parsed XSDs.

## Why?

@@ -22,13 +23,16 @@ We faced this at SuperDoc — building a document engine on native OOXML with no

## MCP Server

Search the ECMA-376 spec with AI. Ask questions in natural language, get answers grounded in the actual specification.
Ask questions in natural language and get answers grounded in the spec, or query the schema graph for precise structural answers.

```bash
claude mcp add --transport http ecma-spec https://api.ooxml.dev/mcp
```

Works with Claude Code, Cursor, and any MCP-compatible client. Three tools: `search_ecma_spec` (semantic search), `get_section` (by ID), and `list_parts` (browse structure).
Works with Claude Code, Cursor, and any MCP-compatible client. Two flavors of tools share one server:

- **Semantic** (over the spec PDF): `search_ecma_spec`, `get_section`, `list_parts`
- **Structural** (over the parsed XSDs): `ooxml_lookup_element`, `ooxml_lookup_type`, `ooxml_children`, `ooxml_attributes`, `ooxml_enum`, `ooxml_namespace_info`
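
For clients that talk to the server directly, a structural query is an ordinary MCP `tools/call` request in a JSON-RPC 2.0 envelope. The sketch below only builds that envelope. The helper name `buildToolsCall` and the argument key `name` with value `"w:p"` are hypothetical; the actual input schema for each tool should be read from the server's `tools/list` response.

```typescript
// Minimal JSON-RPC 2.0 envelope for an MCP tools/call request.
interface ToolsCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: wrap a tool name and its arguments in the envelope.
function buildToolsCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolsCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// The argument key and value here are assumptions for illustration only.
const request = buildToolsCall(1, "ooxml_children", { name: "w:p" });

// Serialized body, ready to POST to the endpoint shown above.
const body = JSON.stringify(request);
```

Depending on the transport, a client may also need to complete the MCP initialization handshake before the server accepts a `tools/call`; MCP-compatible clients such as Claude Code handle that automatically.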

## Development

16 changes: 13 additions & 3 deletions apps/mcp-server/src/mcp.ts
@@ -7,6 +7,7 @@
import { createDb } from "./db";
import { embedQuery } from "./embeddings";
import type { Env } from "./index";
import { callOoxmlTool, isOoxmlTool, OOXML_TOOL_DEFS } from "./ooxml-tools";

// JSON-RPC types
interface JsonRpcRequest {
@@ -136,9 +137,7 @@ function handleToolsList(id: number | string | null): JsonRpcResponse {
return {
jsonrpc: "2.0",
id,
result: {
tools: TOOLS,
},
result: { tools: [...TOOLS, ...OOXML_TOOL_DEFS] },
};
}

@@ -162,6 +161,17 @@ async function handleToolsCall(
try {
let resultText: string;

// Structural OOXML tools share the dispatch with the existing semantic
// tools below.
if (isOoxmlTool(name)) {
resultText = await callOoxmlTool(name, args ?? {}, env);
return {
jsonrpc: "2.0",
id,
result: { content: [{ type: "text", text: resultText }] },
};
}

switch (name) {
case "search_ecma_spec": {
const query = args?.query as string;
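
The dispatch pattern in the `mcp.ts` hunks (one merged tool list, plus a name check that routes structural tools before the semantic `switch`) can be sketched in isolation. The contents of `ooxml-tools.ts` are not shown in this diff, so everything below is an assumption: the `ToolDef` shape, the identifiers `SEMANTIC_TOOLS`, `STRUCTURAL_TOOLS`, `isStructuralTool`, `listTools`, and `routeTool` are illustrative stand-ins, not the real `TOOLS`, `OOXML_TOOL_DEFS`, or `isOoxmlTool`.

```typescript
// Hypothetical tool definition shape; the real entries likely carry an input schema too.
interface ToolDef {
  name: string;
  description: string;
}

const SEMANTIC_TOOLS: ToolDef[] = [
  { name: "search_ecma_spec", description: "semantic vector search across spec chunks" },
  { name: "get_section", description: "fetch a specific section by ID" },
  { name: "list_parts", description: "browse the spec structure" },
];

const STRUCTURAL_TOOLS: ToolDef[] = [
  { name: "ooxml_lookup_element", description: "canonical symbol info for an element" },
  { name: "ooxml_lookup_type", description: "canonical symbol info for a type" },
  { name: "ooxml_children", description: "legal children, in document order" },
  { name: "ooxml_attributes", description: "attributes, including inherited ones" },
  { name: "ooxml_enum", description: "simpleType enumeration values" },
  { name: "ooxml_namespace_info", description: "vocabularies and symbol counts" },
];

// Precompute the name set once so each lookup is O(1).
const STRUCTURAL_NAMES = new Set(STRUCTURAL_TOOLS.map((tool) => tool.name));

function isStructuralTool(name: string): boolean {
  return STRUCTURAL_NAMES.has(name);
}

// tools/list returns both families as one flat array.
function listTools(): ToolDef[] {
  return [...SEMANTIC_TOOLS, ...STRUCTURAL_TOOLS];
}

type Route = "structural" | "semantic" | "unknown";

// Structural names are checked first; everything else falls through to the semantic tools.
function routeTool(name: string): Route {
  if (isStructuralTool(name)) return "structural";
  if (SEMANTIC_TOOLS.some((tool) => tool.name === name)) return "semantic";
  return "unknown";
}
```

Checking the structural names up front leaves the existing semantic `switch` untouched, which keeps a change like this one small and easy to review.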