
feat(mcp): OOXML reference engine#3

Merged
caio-pizzol merged 24 commits into main from caio/ooxml-reference-phase-4-mcp-tools on Apr 27, 2026
@caio-pizzol


Adds the OOXML reference engine: a profile-scoped schema graph for ECMA-376 Transitional plus six read-only MCP tools that query it. Lets implementers and agents ask "what children can w:tbl contain", "what attributes does w:pBdr accept", "what values can w:jc/@w:val take" and get an ordered, inheritance-aware answer instead of grep-ing the spec.

The existing search_ecma_spec / get_section / list_parts and the spec_content table stay untouched. New tools live behind ENABLE_OOXML_TOOLS and are filtered out of tools/list when the flag is off, so api.ooxml.dev/mcp is byte-identical until the flag flips.

  • Provenance and schema in 11 new tables; data/sources.json + reference_sources track every artifact with sha256. XSDs are pulled into a gitignored cache by bun run xsd:fetch; nothing binary lands in the repo.
  • XSD ingest is fully idempotent: same input, same row counts, every time. Local elements get profile membership and type_ref so element-to-type chains resolve from the DB.
  • Query layer walks XSD semantics correctly: complexContent/extension prepends base content, restriction replaces it; nested compositors flatten in document order via DFS; attributeGroup refs unfold recursively with cycle protection.
  • WML closure ingest result: 12 docs, 2737 symbols, 2098 child edges, 1114 attr edges, 389 inheritance edges, 0 unresolved.

Verified: bun run xsd:fetch ... -> 26 XSDs in data/xsd-cache/; bun run db:migrate && bun run db:sync-sources && bun run xsd:ingest -> stable counts on re-run; bun test -> 39 / 0 across db / ingest / mcp-server; bun run ooxml:call against a local Postgres mirror answers all five spec acceptance queries (w:tbl children, w:tblGrid lookup, w:jc/@w:val enum, w:pBdr attrs, unknown qname).

Review: structural correctness is the focus - inheritance ordering for extensions, document order across nested sequences/choices, recursive attributeGroup unfolding. The deployment side can be skipped: the Worker bundle still builds (263 KiB), the Phase 4 tools default to disabled, and idempotent migrations keep spec_content and the existing public surface unchanged.

Adds reference_sources and source_id FK on spec_content so every
chunk can be traced to a known source. data/sources.json is the
human-edited manifest; sync-sources upserts and backfills.

name is the stable identity; edition/version update in place when
verified, so re-tagging an existing source does not orphan its
references. Establishes db/migrations/ convention with a small runner.

Profile-scoped symbol graph for OOXML schemas: xsd_profiles,
xsd_namespaces, xsd_symbols, xsd_symbol_profiles, xsd_compositors
(with parent_compositor_id for nested sequences/choices), xsd_child_edges
(parent_symbol_id denormalized for fast 'children of X' queries),
xsd_attr_edges, xsd_group_edges, xsd_inheritance_edges, xsd_enums,
behavior_notes (claim_type enum locked now; Phase 5 populates).

All tables empty after this migration. Integration tests verify
constraint enforcement, CASCADE delete, and a realistic
'children of w:tbl in transitional' query path.

Top-level (parent_symbol_id) and nested (parent_compositor_id) compositors
are mutually exclusive in the model. The previous OR check let a single row
claim both, which would make traversal/children queries ambiguous. Tightened
to XOR in both schema.sql and the migration; test now also rejects the
both-set case.
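
A minimal sketch of the XOR rule in TypeScript, useful for seeing why the old OR check was too loose. The row shape is hypothetical; only the two parent column names come from the message above, and the real constraint lives in SQL.

```typescript
// Hypothetical in-memory view of an xsd_compositors row (illustration only).
interface CompositorRow {
  parentSymbolId: number | null;     // set for top-level compositors
  parentCompositorId: number | null; // set for nested compositors
}

// XOR: true only when exactly one of the two parent columns is populated.
function hasExactlyOneParent(row: CompositorRow): boolean {
  return (row.parentSymbolId !== null) !== (row.parentCompositorId !== null);
}

const topLevel: CompositorRow = { parentSymbolId: 1, parentCompositorId: null };
const nested: CompositorRow = { parentSymbolId: null, parentCompositorId: 7 };
const both: CompositorRow = { parentSymbolId: 1, parentCompositorId: 7 };
const neither: CompositorRow = { parentSymbolId: null, parentCompositorId: null };
```

An OR check accepts `both`; the XOR form rejects it along with `neither`.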
Adds scripts/fetch-xsd.ts which downloads the ECMA-376 Part 4 zip,
verifies sha256, extracts the inner OfficeOpenXML-XMLSchema-Transitional
zip, and lands the 26 XSDs under data/xsd-cache/ecma-376-transitional/.

Cache is gitignored; manifest tracks the source identity, the canonical
publications URL, and (after first fetch) the outer-zip sha256 for
reproducibility. The Part 4 URL is supplied at fetch time via --url
or XSD_PART4_URL.

Also softens the ECMA license_note to neutral wording.

Direct ECMA download URL plus outer-zip sha256 captured after a
successful fetch+extract; reproducible via `bun run xsd:fetch`.

parseSchemaSet({ schemaDir, entrypoints }) loads a working set of XSDs,
follows xsd:import schemaLocation references recursively, and indexes
every top-level declaration (element/complexType/simpleType/group/
attributeGroup/attribute) by canonical Clark-style qname.

fast-xml-parser configured with preserveOrder so sibling order across
different tag names is retained, and no value coercion that would
mutate XSD attribute strings.

Returns a typed schema set:
  - documents: per-file metadata + raw schemaNode
  - namespaceByPrefix: per-document prefix -> URI maps
  - importGraph: per-document outgoing imports with resolved targets
  - declarationsByQName: canonical qname -> declarations[]

QName resolution is conservative: declaration qnames use the document's
target namespace; attribute qnames (ref/type/base) resolve through the
document's prefix map and surface as { resolved: false } when the prefix
or namespace is unknown rather than guessing.
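
The conservative resolution described above might look roughly like the sketch below. The function name, the result field names other than `resolved`, and the choice to store the default namespace under the empty prefix are assumptions for illustration, not the parser's actual API.

```typescript
// Result type mirrors the { resolved: false } convention; other fields are assumed.
type QNameRef =
  | { resolved: true; clark: string }
  | { resolved: false; raw: string };

function resolveQNameRef(raw: string, prefixMap: Map<string, string>): QNameRef {
  const colon = raw.indexOf(":");
  // Assumption: the default namespace (xmlns="...") is stored under the empty prefix.
  const prefix = colon === -1 ? "" : raw.slice(0, colon);
  const localName = colon === -1 ? raw : raw.slice(colon + 1);
  const namespace = prefixMap.get(prefix);
  if (namespace === undefined || localName === "") {
    return { resolved: false, raw }; // unknown prefix: report it, do not guess
  }
  return { resolved: true, clark: `{${namespace}}${localName}` };
}

const prefixes = new Map<string, string>([
  ["w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main"],
  ["xsd", "http://www.w3.org/2001/XMLSchema"],
]);
const known = resolveQNameRef("w:CT_P", prefixes);
const unresolved = resolveQNameRef("zz:CT_Mystery", prefixes);
```

Callers can then count or log the unresolved entries instead of silently attaching them to the wrong namespace.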

No DB writes in this phase. Smoke command bun run xsd:smoke parses
wml.xsd from the cache and reports counts (820 complexTypes,
389 simpleTypes, 67 groups, 47 elements, etc).

Also tightens DB test isolation: an afterAll TRUNCATE leaves the dev
DB clean instead of carrying the last test's xsd_profiles row.

ingestSchemaSet wraps parseSchemaSet and writes:
  - xsd_profiles      (bootstrap target profile)
  - xsd_namespaces    (one per unique URI)
  - xsd_symbols       (canonical (vocabulary_id, local_name, kind), upsert)
  - xsd_symbol_profiles (membership for the target profile, with source_id)
  - xsd_inheritance_edges (extension/restriction from
    complexContent/simpleContent and simpleType/restriction)

The whole ingest runs in one transaction. Re-runs are no-ops via UNIQUE +
ON CONFLICT DO NOTHING; stale-row cleanup is deferred per PLAN.md's
edition-flip open item.

QName base resolution uses the document's prefix map. Built-in xsd:*
bases are auto-created on demand as kind=simpleType in vocabulary
xsd-builtin so the FK on xsd_inheritance_edges.base_symbol_id holds.

Phase 3c does not touch compositors, child edges, attributes, group refs,
or enums (those are 3d/3e).

Tests: fixture-driven happy path, idempotency check, plus an optional
real-cache smoke test against the WML closure (12 docs, ~1359 symbols,
~389 inheritance edges, all bases resolved).

Fixture main.xsd gains CT_Extended (extends CT_Empty) and CT_Restricted
(restricts CT_Para) so the inheritance walker is exercised on both
forms; existing parser test counts adjusted to match.

Pass 3 of ingestSchemaSet walks every complexType and group declaration
and writes xsd_compositors, xsd_child_edges, and xsd_group_edges.

Compositor handling:
  - sequence/choice/all under a complexType (or under
    complexContent/extension|restriction) become top-level compositors
    with parent_symbol_id set.
  - Nested compositors (sequence inside choice etc.) recurse with
    parent_compositor_id set; the XOR check guarantees exactly one
    parent dimension is populated.
  - simpleContent contributes attributes only and is skipped here.

Element handling inside compositors:
  - ref="..."  resolves through the document's prefix map to a top-level
    symbol; child_edge points at it.
  - name="..." (local) creates / reuses a symbol under the owner vocabulary
    (vocab, name, kind=element). Cross-CT name reuse collapses; that is a
    known imprecision until we need to disambiguate.

Group refs become xsd_group_edges with resolved=false; future passes can
expand them. attributeGroup refs are still Phase 3e (attributes).

WML closure ingest stats:
  - 2737 symbols (1345 declarations + 14 builtins + 1378 local elements)
  - 585 compositors
  - 2098 child edges (0 unresolved)
  - 161 group refs (0 unresolved)
  - 389 inheritance edges (0 unresolved)
  - elapsed ~2s

Fixture main.xsd gains CT_Body to exercise nested compositors,
ref-vs-name elements, and group refs in one test path.

Compositors / child_edges / group_edges have no natural unique key
(a complexType can hold sibling sequences/choices), so the prior pass
unconditionally inserted on every run, doubling rows on the second
ingest. CT_Tbl content lookups against a re-ingested DB returned 0
rows because the order_index ranges no longer matched what queries
expected.

Pass 3 now deletes and rewrites per profile at its start:
  DELETE FROM xsd_compositors  WHERE profile_id = ?
  DELETE FROM xsd_group_edges  WHERE profile_id = ?
xsd_child_edges cleans up automatically via FK CASCADE on
compositor_id. Inheritance / symbols / memberships stay upsert-only
since they have natural keys.

Idempotency test now also asserts compositor / child-edge / group-ref
counts in the DB match the first-run insert counts after a second run.

Verified: two consecutive `bun run xsd:ingest` against the WML closure
both produce 585 compositors / 2098 child edges / 161 group refs and
the DB ends at exactly those counts.
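
The difference between the two strategies can be shown with an in-memory stand-in for xsd_compositors. This is only a sketch: the row shape, the function names, and the CT_Example type are hypothetical, and the real ingest issues SQL inside a transaction.

```typescript
interface ParsedCompositor { ownerType: string; kind: "sequence" | "choice" | "all" }
interface CompositorRow extends ParsedCompositor { profileId: number }

// Old behavior: rows without a natural key are appended on every run.
function appendOnly(table: CompositorRow[], profileId: number, parsed: ParsedCompositor[]): CompositorRow[] {
  return [...table, ...parsed.map((p) => ({ ...p, profileId }))];
}

// New behavior: drop this profile's rows first, then rewrite them.
function deleteAndRewrite(table: CompositorRow[], profileId: number, parsed: ParsedCompositor[]): CompositorRow[] {
  const otherProfiles = table.filter((row) => row.profileId !== profileId);
  return [...otherProfiles, ...parsed.map((p) => ({ ...p, profileId }))];
}

// Two sibling compositors under one type: indistinguishable by value, so no upsert key.
const parsed: ParsedCompositor[] = [
  { ownerType: "CT_Example", kind: "sequence" },
  { ownerType: "CT_Example", kind: "sequence" },
];
const TRANSITIONAL = 1;
const doubled = appendOnly(appendOnly([], TRANSITIONAL, parsed), TRANSITIONAL, parsed);
const stable = deleteAndRewrite(deleteAndRewrite([], TRANSITIONAL, parsed), TRANSITIONAL, parsed);
```

After two runs `doubled` holds four rows while `stable` holds two, which is the property the idempotency test asserts.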
Pass 4 of ingestSchemaSet walks every complexType and attributeGroup
declaration and writes:
  - xsd_attr_edges      one row per direct or extension/restriction
                        attribute. attr_use enum locked to required /
                        optional / prohibited; default 'optional'. type_ref
                        stores the Clark-style {namespace}localName so
                        Phase 4 lookups can join across vocabularies, with
                        the raw qname as a fallback when unresolvable.
  - xsd_group_edges     additional rows with ref_kind='attributeGroup' for
                        every <xsd:attributeGroup ref="..."/> on a
                        complexType or another attributeGroup body.
  - xsd_enums           one row per <xsd:enumeration value="..."/>
                        beneath a simpleType restriction; order_index
                        preserved.

Idempotency: same delete-and-rewrite-per-profile pattern as Pass 3.
xsd_group_edges already gets cleared by Pass 3 so attributeGroup ref
inserts here run on a fresh slate.

Attribute parents handled:
  - complexType direct (no wrapper)
  - complexContent / extension|restriction
  - simpleContent  / extension|restriction
  - attributeGroup body (top-level)

WML closure ingest stats:
  - 1114 attr edges (2 unresolved: xml:space / xml:lang)
  - 17 attributeGroup refs (0 unresolved)
  - 2189 enum values
  - elapsed ~3s
  - all unresolved counters elsewhere still 0

Real-data sanity: top attribute-heavy types match expectation
(CT_ElemPropSet 28, CT_TextBodyProperties 19, ...). type_ref distribution
shows xsd:boolean, ST_OnOff, ST_DecimalNumber, etc resolved to the right
namespaces.

Fixture main.xsd gains CT_TableUser to exercise an attributeGroup ref +
a required attribute, alongside the existing direct, extension, and
attributeGroup-body attribute paths and the ST_Jc enum.

Three correctness gaps surfaced before Phase 4:

P1 - Local elements lost type and profile membership.
  WML uses <xsd:element name="p" type="CT_P"/> inside groups; before this
  change the local element symbol carried no @type and was never linked
  to xsd_symbol_profiles, so ooxml_lookup_element/ooxml_children would
  not find it in the transitional profile or follow it to CT_P.

P2 - Group refs in nested compositors lost context.
  <xsd:group ref> inside a nested sequence/choice was inserted with
  parent_symbol_id and order_index only. The compositor it lives inside
  and the ref's own minOccurs/maxOccurs were dropped, so later expansion
  could not preserve ordering or cardinality relative to siblings.

P2 - Referenced attributes lost type/default/fixed.
  <xsd:attribute ref="r:id"/> set attr_symbol_id only; the type and
  default declared on the top-level <xsd:attribute name="id"
  type="ST_RelationshipId"/> were not recovered into the edge.

Migration 0003_phase3_metadata adds:
  - xsd_symbols.type_ref TEXT (Clark-style {namespace}localName for
    elements and attributes that declare @type; NULL for the rest).
  - xsd_group_edges.compositor_id INT (FK with ON DELETE CASCADE),
    plus min_occurs / max_occurs.

ingest.ts:
  - upsertSymbol now accepts typeRef; ON CONFLICT preserves the existing
    value via COALESCE so a re-run never blanks it out.
  - Pass 1 captures @type for top-level element/attribute decls.
  - Pass 3 captures @type and links local elements to xsd_symbol_profiles.
  - Pass 3 group refs thread compositor_id and parse min/max occurs.
  - Pass 4 attribute refs copy type_ref / default / fixed from the
    top-level declaration; attr_use stays from the ref site (XSD lets
    refs override use only).
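
A rough sketch of the Pass 4 rule for attribute refs: metadata comes from the top-level declaration, while `use` stays with the ref site. All type and function names here are hypothetical; the sample values mirror the `space` fixture attribute described further down.

```typescript
type AttrUse = "required" | "optional" | "prohibited";

// What the top-level <xsd:attribute name="..."/> declaration carries.
interface TopLevelAttr { typeRef: string | null; defaultValue: string | null; fixedValue: string | null }
// What the <xsd:attribute ref="..."/> site carries.
interface AttrRefSite { ref: string; use?: AttrUse }
interface AttrEdge extends TopLevelAttr { attr: string; use: AttrUse }

function buildAttrEdgeFromRef(site: AttrRefSite, decl: TopLevelAttr): AttrEdge {
  return {
    attr: site.ref,
    use: site.use ?? "optional", // use is decided at the ref site; XSD default is optional
    typeRef: decl.typeRef,       // copied from the declaration
    defaultValue: decl.defaultValue,
    fixedValue: decl.fixedValue,
  };
}

const spaceDecl: TopLevelAttr = {
  typeRef: "{http://www.w3.org/2001/XMLSchema}string",
  defaultValue: "preserve",
  fixedValue: null,
};
const edge = buildAttrEdgeFromRef({ ref: "space", use: "required" }, spaceDecl);
const defaulted = buildAttrEdgeFromRef({ ref: "space" }, spaceDecl);
```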

Real WML ingest after fix:
  - profile memberships: 1345 -> 2723 (1345 top-level + 1378 local
    elements now visible to ooxml_lookup_element).
  - 148 / 161 group refs carry compositor_id (rest are top-level).
  - Sample r:id attribute refs now expose
    type_ref={...relationships}ST_RelationshipId.

Fixtures gain a top-level <xsd:attribute name="space"
type="xsd:string" default="preserve"/> in shared.xsd and a CT_RefTest
in main.xsd that refs it; the new test checks all three fixes.

Six new MCP tools, gated by the ENABLE_OOXML_TOOLS env var. tools/list
filters them out and tools/call returns method-not-found until the flag
is set, so api.ooxml.dev/mcp's existing surface (search_ecma_spec /
get_section / list_parts) is unaffected.

Tools:
  ooxml_lookup_element  qname (w:tbl, {ns}local, or bare) -> symbol info
  ooxml_lookup_type     qname -> complexType or simpleType symbol
  ooxml_children        element/type/group qname -> ordered child + group ref list
  ooxml_attributes      element/type qname -> attrs unfolded through inheritance
                        and attributeGroup refs
  ooxml_enum            simpleType qname -> enumeration values in declared order
  ooxml_namespace_info  uri -> profiles + symbol counts per profile

Query layer (apps/mcp-server/src/ooxml-queries.ts):
  - parseQName accepts known OOXML prefixes (w/r/s/m/a/wp/pic/c/dgm/xsd),
    Clark form, or bare local names (defaults to wml-main).
  - lookupElement / lookupType / lookupSymbolByTypeRef walk
    xsd_symbol_profiles for profile-scoped hits.
  - getChildren walks the xsd_inheritance_edges chain via a recursive CTE
    and unions self + base xsd_child_edges and xsd_group_edges (group refs)
    in document order. Each entry carries its compositor kind and the type
    that contributed it.
  - getAttributes does the same and additionally recurses through
    attributeGroup refs; each entry carries 'self' / 'inherited' /
    'attributeGroup' provenance with the owning name.
  - getEnums and getNamespaceInfo are direct profile-scoped lookups.
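
The three accepted qname forms can be illustrated with a small parser. This is a sketch, not the code in ooxml-queries.ts: the prefix table is deliberately partial (only `w` and `r`), the return shape is assumed, and mapping the wml-main default to the WordprocessingML main namespace URI is an assumption about how that vocabulary id resolves.

```typescript
// Partial prefix table for illustration; the real list is longer.
const KNOWN_PREFIXES = new Map<string, string>([
  ["w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main"],
  ["r", "http://schemas.openxmlformats.org/officeDocument/2006/relationships"],
]);
// Assumption: bare names fall back to the WordprocessingML main namespace.
const DEFAULT_NAMESPACE = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

interface ParsedQName { namespace: string; localName: string }

function parseQNameSketch(input: string): ParsedQName | null {
  // Clark form first: "{namespace}local". The URI itself contains ':'.
  const clark = /^\{([^}]+)\}(.+)$/.exec(input);
  if (clark) return { namespace: clark[1], localName: clark[2] };

  const colon = input.indexOf(":");
  if (colon === -1) return { namespace: DEFAULT_NAMESPACE, localName: input }; // bare name

  const namespace = KNOWN_PREFIXES.get(input.slice(0, colon));
  return namespace ? { namespace, localName: input.slice(colon + 1) } : null; // unknown prefix
}

const fromPrefix = parseQNameSketch("w:tbl");
const fromClark = parseQNameSketch("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tbl");
const fromBare = parseQNameSketch("tbl");
const unknownPrefix = parseQNameSketch("zz:tbl");
```

All three spellings of `tbl` land on the same namespace and local name, and an unknown prefix yields `null` so the caller can render a 'Not found' response.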

Tool dispatch (apps/mcp-server/src/ooxml-tools.ts):
  - For element qnames passed to ooxml_children / ooxml_attributes the
    handler looks up the element, follows type_ref to its complexType,
    then reads from there (per the Phase 4 caveat in PLAN.md).
  - ooxml_children also falls back to looking up groups by name so users
    can call it on EG_PContent etc.
  - Unknown qnames produce a 'Not found' card listing alternative formats
    and the searched profile.
  - Default profile is literal 'transitional' until Phase 6.

Response shape per PLAN.md: canonical symbol, namespace, type_ref where
relevant, source, and a behavior-notes placeholder hooked up to nothing
yet (Phase 5 fills it).

Tests: 15 query-layer tests against a fresh ingest of the existing
fixtures; they pass alongside 21 ingest tests for a 36 / 0 total.

Worker bundle dry-runs at 263 KiB (67 KiB gzip).

The deployed Worker uses @neondatabase/serverless (HTTP-only) which
can't talk to local Postgres, so callOoxmlTool is now a thin
Neon-creating wrapper around runOoxmlTool, which takes any
tagged-template sql function. Tests and the new CLI use postgres.js
against local Docker; the Worker keeps Neon.

scripts/ooxml-call.ts dispatches the same code path the Worker uses.
Five PLAN.md acceptance queries verified against the real WML closure:

  ooxml_children("w:tbl")
    -> EG_RangeMarkupElements (group, 0..unbounded), tblPr (1..1),
       tblGrid (1..1), EG_ContentRowContent (group, 0..unbounded)

  ooxml_lookup_element("w:tblGrid")
    -> type_ref={...wml-main}CT_TblGridBase; in CT_Tbl context min/max=1
       (required, per Q1)

  ooxml_attributes("w:jc")
    -> single attr 'val' (required), type_ref to ST_Jc

  ooxml_enum("w:ST_Jc")
    -> 12 values incl. start/end (Strict) and left/right (Transitional)

  ooxml_lookup_element("w:notARealElement")
    -> 'Not found' card with profile and recovery hints
…d attributeGroup walk

Three Phase 4 query bugs surfaced by review against real WML schemas.
Each one would have produced wrong structural answers from the new tools
before the dogfood window even opened.

P2 - Inheritance ordering for complexContent/extension.
  XSD says base content comes before the extension's own content
  (e.g. CT_PPr extends CT_PPrBase: pStyle and friends from base, then
  rPr/sectPr from the extension). The old code walked the chain self-
  first, so ooxml_children("w:pPr") would have surfaced the extension's
  rPr/sectPr ahead of inherited pStyle. complexContent/restriction is
  now also handled correctly: derived REPLACES base content, so the
  base is no longer included for restriction relations.
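
The ordering rule can be sketched over an in-memory type chain. The shapes and the function name are hypothetical, the child lists are abbreviated to the elements named above, and CT_Narrowed is an invented restriction example; the real query walks xsd_inheritance_edges with a recursive CTE.

```typescript
type Derivation = "extension" | "restriction";

interface TypeDef {
  name: string;
  ownChildren: string[];
  base?: { type: TypeDef; derivation: Derivation };
}

function effectiveChildren(t: TypeDef): string[] {
  if (!t.base) return [...t.ownChildren];
  if (t.base.derivation === "restriction") {
    return [...t.ownChildren]; // restriction: derived content replaces base content
  }
  // extension: base content first, then the extension's own content
  return [...effectiveChildren(t.base.type), ...t.ownChildren];
}

// Abbreviated: only the children mentioned in the message above.
const ctPPrBase: TypeDef = { name: "CT_PPrBase", ownChildren: ["pStyle"] };
const ctPPr: TypeDef = {
  name: "CT_PPr",
  ownChildren: ["rPr", "sectPr"],
  base: { type: ctPPrBase, derivation: "extension" },
};
// Hypothetical restriction of the same base.
const ctNarrowed: TypeDef = {
  name: "CT_Narrowed",
  ownChildren: ["pStyle"],
  base: { type: ctPPr, derivation: "restriction" },
};

const extended = effectiveChildren(ctPPr);
const restricted = effectiveChildren(ctNarrowed);
```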

P2 - Compositor flattening across nested particles.
  order_index is local to each compositor. The old query joined
  child_edges + group_edges across ALL compositors of a type and sorted
  by order_index alone, so a nested choice's children (which restart at
  order 0) sorted before later siblings of the outer sequence. WML's
  CT_Object would have reported the inner choice's first child before
  drawing. Fixed with a recursive walkCompositor that does DFS through
  parent_compositor_id, emitting children in true document order. Each
  ChildEdge now carries a compositorPath like
  ["sequence(1..1)", "choice(0..unbounded)"] for downstream rendering.

P2 - Recursive attributeGroup refs.
  The previous code only fetched direct xsd_attr_edges from a referenced
  group, not the group's own xsd_group_edges with ref_kind='attributeGroup'.
  VML's AG_AllCoreAttributes -> AG_CoreAttributes -> AG_Id/AG_Style chain
  would have lost most attributes. Now collectAttrsFromAttributeGroup
  recurses with a visited-set guard against cycles, so nested
  attributeGroup chains unfold completely.
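
The recursion with a visited-set guard can be sketched as follows. The group names reuse the AG_Outer / AG_Inner fixture names mentioned below, but the attribute names, the map shape, the emit order (direct attributes before nested refs), and the deliberate cycle are all assumptions made for the example.

```typescript
interface AttributeGroup { attrs: string[]; groupRefs: string[] }

function collectAttrsFromGroup(
  name: string,
  groups: Map<string, AttributeGroup>,
  visited: Set<string> = new Set(),
): string[] {
  if (visited.has(name)) return []; // cycle guard: each group is expanded once
  visited.add(name);
  const group = groups.get(name);
  if (!group) return []; // unresolved ref contributes nothing
  const out = [...group.attrs];
  for (const ref of group.groupRefs) {
    out.push(...collectAttrsFromGroup(ref, groups, visited));
  }
  return out;
}

const groups = new Map<string, AttributeGroup>([
  ["AG_Outer", { attrs: ["outerAttr"], groupRefs: ["AG_Inner"] }],
  // The back-reference is artificial; it exists only to exercise the guard.
  ["AG_Inner", { attrs: ["innerAttr"], groupRefs: ["AG_Outer"] }],
]);
const unfolded = collectAttrsFromGroup("AG_Outer", groups);
```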

Tests:
  - 3 new query-layer tests cover each fix path against fixtures:
    CT_DerivedExtended verifies extension order, CT_NestedOrder verifies
    nested compositor flatten, CT_NestedAttrUser verifies nested
    attributeGroup chain.
  - Fixture main.xsd grows: CT_BaseWithChildren / CT_DerivedExtended /
    CT_NestedOrder / AG_Inner / AG_Outer / CT_NestedAttrUser. Existing
    ingest counts updated to match.

Test infra: bun's default 5s timeout was tight for the WML smoke ingest
on a busy DB; bumped that test to 30s. The test runner now sequences the
three test directories so the WML smoke and the fixture-ingest tests do
not race for the same connection pool.

39 / 0 across db / ingest / mcp-server.
… profile; drop em dash

Three review-flagged correctness gaps before flipping ENABLE_OOXML_TOOLS.

P1 - Local element symbols collapsed across complexTypes.
  Inline <xsd:element name="X" type="..."/> declared inside two different
  complexTypes was deduped under (vocabulary, name, kind), keeping only the
  first-seen type_ref. Real WML hits this on tblGrid alone: declared as
  CT_TblGridBase inside CT_TblGridChange and as CT_TblGrid inside CT_Tbl.
  ooxml_children("w:tblGrid") would have followed CT_TblGridBase and missed
  the CT_TblGrid children.

  Migration 0004 adds xsd_symbols.parent_symbol_id (nullable) and replaces
  the 3-tuple unique with a 4-tuple UNIQUE NULLS NOT DISTINCT (vocab,
  local_name, kind, parent_symbol_id). Top-level decls keep parent NULL and
  still collide on name; local decls are scoped to their owner. ingest.ts
  passes the owning symbol id when upserting local element symbols.
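
The effect of the 4-tuple key can be modeled in memory. This sketch treats a NULL parent as an ordinary key component, which is what NULLS NOT DISTINCT provides; the class and its id assignment are hypothetical and stand in for the SQL upsert.

```typescript
class SymbolTable {
  private ids = new Map<string, number>();

  // Returns the existing id for the 4-tuple, or assigns a new one.
  upsert(vocab: string, localName: string, kind: string, parentSymbolId: number | null): number {
    const key = JSON.stringify([vocab, localName, kind, parentSymbolId]);
    const existing = this.ids.get(key);
    if (existing !== undefined) return existing;
    const id = this.ids.size + 1;
    this.ids.set(key, id);
    return id;
  }
}

const symbols = new SymbolTable();
const ctTbl = symbols.upsert("wml-main", "CT_Tbl", "complexType", null);
const ctTblGridChange = symbols.upsert("wml-main", "CT_TblGridChange", "complexType", null);

// Same local name, different owners: two distinct symbols.
const gridInTbl = symbols.upsert("wml-main", "tblGrid", "element", ctTbl);
const gridInChange = symbols.upsert("wml-main", "tblGrid", "element", ctTblGridChange);

// Top-level declarations (parent NULL) still collide on name.
const ctTblAgain = symbols.upsert("wml-main", "CT_Tbl", "complexType", null);
```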

P2 - xsd-builtin symbols had no profile membership.
  The on-demand inheritance pass created xsd:string / xsd:boolean / etc.
  via upsertSymbol but never called linkSymbolToProfile, so lookupSymbol
  (which JOINs xsd_symbol_profiles) returned null. Following an element's
  type_ref into a W3C built-in silently failed. Now also ensure the
  xs/xsd namespace exists and link the built-in symbol into the target
  profile.

P3 - Em dash in code comment.
  scripts/ingest-xsd/qname.ts line 12 used "—". Replaced with "-".

Tests:
  - New "local element symbols are scoped per-owner (no cross-CT collapse)"
    against a CT_OuterA / CT_OuterB fixture mirroring the WML tblGrid
    pattern: each `shared` element resolves to its own symbol with the
    correct per-owner type_ref.
  - New "xsd-builtin symbols have profile membership" verifies
    lookupSymbolByTypeRef succeeds for {...XMLSchema}string.
  - Existing fixture and WML smoke counts adjusted.

41 / 0 across db / ingest / mcp-server.

The repo had grown two ingestion pipelines (PDF prose corpus and XSD
schema graph) without making the duality obvious. Header comments
referenced internal "Phase N" planning vocabulary that doesn't help
public readers, and a tool-output line shipped a forward reference
to a future phase to every MCP caller.

Reorganization:
  scripts/ingest/         -> scripts/ingest-pdf/   (was ambiguous)
  scripts/ingest-xsd/     stays
  scripts/fetch-xsd.ts    -> scripts/ingest-xsd/fetch.ts (sibling layout)
  scripts/sync-sources.ts -> scripts/sources-sync.ts (verb-style name)
  scripts/ingest-pdf/extract-pdf.py -> extract.py
  db/migrations/0003_phase3_metadata.sql -> 0003_xsd_metadata.sql
  scripts/ingest-xsd/smoke.ts            removed (debug-only, low value)

Renamed npm scripts to match the new directory layout:
  ingest          -> pdf:ingest
  ingest:chunk    -> pdf:chunk
  ingest:embed    -> pdf:embed
  ingest:upload   -> pdf:upload
  ingest:setup    -> pdf:setup
  db:sync-sources -> sources:sync
  xsd:smoke       removed

Strip "Phase N" markers from migration headers, source-file headers,
test-file headers, and inline comments. None of those references were
load-bearing; they were artifacts of the planning doc.

Drop the user-facing "_behavior notes: none yet (Phase 5)._" line that
shipped in every children/attributes/enum tool response. The line gave
no information when notes are absent and exposed an internal phase
label to the public.

Replace the lone PLAN.md reference in scripts/ingest-xsd/ingest.ts
with self-contained context. PLAN.md is gitignored; pointing at it
was a broken link for anyone reading the public repo.

Add scripts/ingest-pdf/README.md and scripts/ingest-xsd/README.md so
each pipeline is documented at the level that contributors land at,
and refresh CLAUDE.md to make the two corpora explicit and surface
both flavors of MCP tools.

41 / 0 across db / ingest / mcp-server.

Five real issues from review against the WML schema graph; one cheap
DX win folded in.

Issue 1 - lookupSymbol returned local-only symbols by qname.
  After per-owner scoping landed, lookupElement("w:tblGrid") could
  return either CT_TblGridBase or CT_TblGrid depending on which row
  postgres picked first. Fixed: lookupSymbol now filters
  parent_symbol_id IS NULL, so it only returns top-level symbols
  addressable by qname. Local elements are reachable through
  getChildren on their owning type.

Issues 3 + 4 - getAttributes mishandled inheritance.
  complexContent/restriction inherits attribute uses per XSD
  §3.4.2.2; only `use="prohibited"` drops them. The previous code
  walked the base only for `extension`, so every WML *Change type
  built on a restriction reported zero base attrs (id, author, date
  silently missing). The walk order also had base attrs emitted
  first, making the older docstring claim about "derived wins" wrong
  in practice. Fixed: derived attrs (and their attributeGroup refs)
  emit first, then the base is walked for both extension AND
  restriction; seenAttrs dedup makes derived redeclarations win.
  Two new tests pin both behaviors with a CT_TrackedBase /
  CT_TrackedRestricted / CT_OverrideDerived fixture.
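
The corrected walk can be sketched as below: derived attributes first, then the base chain for both derivation kinds, with a seen-set so the nearest declaration wins and `prohibited` removes a name. The type names echo the fixture, but the attribute lists, type refs, and function shape are invented for illustration, and attributeGroup refs are left out for brevity.

```typescript
type AttrUse = "required" | "optional" | "prohibited";
interface AttrDecl { name: string; use: AttrUse; typeRef: string }
// `base` is followed for extension and restriction alike.
interface TypeNode { name: string; attrs: AttrDecl[]; base?: TypeNode }
interface AttrEntry extends AttrDecl { source: "self" | "inherited"; owner: string }

function collectAttributes(type: TypeNode): AttrEntry[] {
  const seen = new Set<string>();
  const out: AttrEntry[] = [];
  let current: TypeNode | undefined = type;
  while (current) {
    for (const attr of current.attrs) {
      if (seen.has(attr.name)) continue;       // a nearer declaration already decided this name
      seen.add(attr.name);
      if (attr.use === "prohibited") continue; // prohibited drops the inherited attribute
      out.push({ ...attr, source: current === type ? "self" : "inherited", owner: current.name });
    }
    current = current.base;
  }
  return out;
}

const trackedBase: TypeNode = {
  name: "CT_TrackedBase",
  attrs: [
    { name: "id", use: "required", typeRef: "ST_Id" },
    { name: "author", use: "required", typeRef: "ST_String" },
    { name: "date", use: "optional", typeRef: "ST_DateTime" },
  ],
};
const trackedRestricted: TypeNode = {
  name: "CT_TrackedRestricted",
  attrs: [{ name: "date", use: "prohibited", typeRef: "ST_DateTime" }],
  base: trackedBase,
};
const overrideDerived: TypeNode = {
  name: "CT_OverrideDerived",
  attrs: [{ name: "author", use: "optional", typeRef: "ST_String" }],
  base: trackedBase,
};

const restrictedAttrs = collectAttributes(trackedRestricted);
const overrideAttrs = collectAttributes(overrideDerived);
```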

Issue 5 - stale unscoped local symbols on dev DBs.
  The migration that introduced parent_symbol_id never purged the
  pre-migration parent-NULL collapsed rows, so re-ingest left them
  alongside the new per-owner symbols. Fixed: ingest now purges
  everything this source previously wrote at the start of the
  transaction, then rewrites. Non-cascading FKs
  (xsd_inheritance_edges.base_symbol_id and friends) are cleaned
  explicitly first. Idempotency test updated to reflect the new
  semantics: every stat equals first.X across re-runs and DB row
  counts stay stable.

Issue 2 - tests TRUNCATE through DATABASE_URL.
  A developer with DATABASE_URL pointed at Neon could wipe their
  schema graph by running `bun test`. Fixed: tests now require
  TEST_DATABASE_URL (no fallback) and refuse to run unless the
  hostname is local. Shared guard at tests/test-db.ts; package.json
  test script defaults TEST_DATABASE_URL to local Postgres.
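
A guard of this kind could be as small as the sketch below. It is not the contents of tests/test-db.ts: the function name, the list of hosts counted as local, and the error messages are assumptions.

```typescript
// Assumption: these three spellings are what "local" means for the guard.
const LOCAL_HOSTS = new Set(["localhost", "127.0.0.1", "[::1]"]);

// Takes the value of TEST_DATABASE_URL; there is no fallback to DATABASE_URL.
function requireLocalTestDatabase(testDatabaseUrl: string | undefined): string {
  if (!testDatabaseUrl) {
    throw new Error("TEST_DATABASE_URL is required; DATABASE_URL is never used for tests");
  }
  const host = new URL(testDatabaseUrl).hostname.toLowerCase();
  if (!LOCAL_HOSTS.has(host)) {
    throw new Error(`Refusing to run destructive tests against non-local host "${host}"`);
  }
  return testDatabaseUrl;
}

// Returns true when the guard rejects the given URL.
function isRejected(url: string | undefined): boolean {
  try {
    requireLocalTestDatabase(url);
    return false;
  } catch {
    return true;
  }
}
```

Calling the guard once before any TRUNCATE keeps a misconfigured environment from reaching a remote database.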

DX - ooxml_children's group fallback inlined a 28-line copy of
lookupSymbol. Replaced with a 4-line lookupSymbol("group", ...)
call, and dropped a dead-code branch in getChildrenRecursive that
re-set source="inherited" on entries the recursive call had
already labeled.

44 / 0 across db / ingest / mcp-server.

The flag existed so the new structural tools could land without
affecting api.ooxml.dev/mcp's surface until prod was ready. The
operational plan is to populate the prod schema graph before merging,
at which point the flag is just friction. Drops:

  - ENABLE_OOXML_TOOLS env var on Env / OoxmlEnv
  - ooxmlToolsEnabled() and the gating in tools/list and tools/call
  - The defensive 'method-not-found while flag is off' branch

Both tools/list and tools/call now expose the OOXML tools
unconditionally. Worker bundle builds clean; 44 / 0 tests still pass.

Note: prod populate must run before this merges. The current per-row
INSERT pattern is slow against Neon over public internet
(~10-20 minutes for the WML closure); batching is the next step
operationally.
…al tools in README

- Remove scripts/ooxml-call.ts: 23 query-layer tests cover the same
  dispatch path. The harness was load-bearing only while we were
  verifying e2e before tests existed.
- Add data/README.md describing what each subpath under data/ is for
  (sources.json committed manifest, xsd-cache gitignored cache,
  behavior-notes future curated content).
- Update README.md to list the structural tools alongside semantic
  search; both flavors share one MCP endpoint after the prod populate.
- Update CLAUDE.md and scripts/ingest-xsd/README.md to drop ooxml:call
  references; smoke testing now points at tests/mcp-server/.

44 / 0 still.

`bun run xsd:fetch` no longer requires --url. The script reads the
URL and expected sha256 from data/sources.json's ecma-376-transitional
entry by default; CLI flags and XSD_PART4_URL still override for
testing a new edition before pinning it.

Common case becomes a single command:

  bun run xsd:fetch

The manifest is the canonical pin (already used to upsert
reference_sources via sources:sync), so making it the default for the
fetch script keeps a single source of truth instead of asking
contributors to remember a long URL or paste it into a .env.

Docs (CLAUDE.md, scripts/ingest-xsd/README.md) updated to show the
short form and explain how to override.
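
The override order could be expressed as a single fallback chain. This is a sketch under one assumption the message does not state: that the CLI flag takes precedence over XSD_PART4_URL. The function name, the input shape, and the example.com URLs are hypothetical.

```typescript
interface FetchInputs {
  cliUrl?: string;      // value of --url, if passed
  envUrl?: string;      // value of XSD_PART4_URL, if set
  manifestUrl?: string; // URL pinned in data/sources.json
}

// Assumed precedence: CLI flag, then environment variable, then manifest.
function resolveFetchUrl({ cliUrl, envUrl, manifestUrl }: FetchInputs): string {
  const url = cliUrl ?? envUrl ?? manifestUrl;
  if (!url) {
    throw new Error("No Part 4 URL: pass --url, set XSD_PART4_URL, or pin one in data/sources.json");
  }
  return url;
}

const pinned = "https://example.com/pinned-part4.zip";
const commonCase = resolveFetchUrl({ manifestUrl: pinned });
const overridden = resolveFetchUrl({ cliUrl: "https://example.com/new-edition.zip", manifestUrl: pinned });
```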
Replace the single `ecma-376` placeholder with four part-specific
entries (`ecma-376-part1` through `ecma-376-part4`), each pinned with
URL, edition (5th), publication date, and sha256. Part 2 is the 2021
revision; Part 3 is from 2015; Parts 1 and 4 are 2016 - reflected in
each entry's version field.

The Part 4 zip URL is shared with the existing ecma-376-transitional
entry (the XSD zip is extracted from inside Part 4); both rows pin the
same outer-zip sha256.

scripts/sources-sync.ts now backfills spec_content.source_id by
part_number, mapping each row to the matching ecma-376-partN source.
The previous backfill targeted a single `ecma-376` source which no
longer exists.
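
The part-number mapping can be sketched in memory as follows. The row shape, the function names, and the choice to leave already-assigned rows untouched are assumptions; the real backfill runs as SQL in scripts/sources-sync.ts.

```typescript
interface SpecRow { partNumber: number; sourceId: number | null }

// Maps a part number to the manifest entry name, e.g. 1 -> "ecma-376-part1".
function sourceNameForPart(partNumber: number): string {
  if (!Number.isInteger(partNumber) || partNumber < 1 || partNumber > 4) {
    throw new Error(`Unexpected ECMA-376 part number: ${partNumber}`);
  }
  return `ecma-376-part${partNumber}`;
}

// Fills only rows that have no source yet (assumed backfill semantics).
function backfillSourceIds(rows: SpecRow[], sourceIdByName: Map<string, number>): SpecRow[] {
  return rows.map((row) =>
    row.sourceId !== null
      ? row
      : { ...row, sourceId: sourceIdByName.get(sourceNameForPart(row.partNumber)) ?? null },
  );
}

// Hypothetical ids for two of the four sources.
const sourceIds = new Map<string, number>([
  ["ecma-376-part1", 11],
  ["ecma-376-part4", 14],
]);
const backfilled = backfillSourceIds(
  [
    { partNumber: 1, sourceId: null },
    { partNumber: 4, sourceId: null },
    { partNumber: 2, sourceId: 99 },
  ],
  sourceIds,
);
```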

Migration 0005 cleans up the legacy `ecma-376` placeholder row from
reference_sources, but only if no spec_content row references it
(safe for a developer who had already backfilled to the placeholder
id; idempotent).

44 / 0 tests still passing.
@caio-pizzol caio-pizzol merged commit 38b6457 into main Apr 27, 2026
1 check passed