
Commit 8945b7f

feat(xsd): transcode extract_classes.py to Rust — byte-faithful, closes XSD↔TTL bijection, drops the Python oracle dep
The MARS XSD classification extractor (arago/MARS-Schema/tools/extract_classes.py, ~360 lines: ~140 extraction logic + ~150 table formatting) is now a faithful Rust transcode at crates/ogar-from-schema/src/xsd.rs, behind an optional `xsd` feature (pulls roxmltree, a pure-Rust read-only XML DOM; the default TTL path stays zero-parser-deps).

Not huge: the transcode is ~350 LOC Rust including tests. It also doubles as the seed of the broader XSD → Class front-end: the same walk that extracts classifications is the structural-arm lift for any XSD schema.

What it lands:

* BYTE-FOR-BYTE transcode proof. xsd::to_asciidoc() reproduces the Python `-F asciidoc` output exactly: 628 lines, including verbatim XSD-documentation whitespace and the printAsciiDocFooter trailing newline. Test xsd::tests::asciidoc_matches_python_oracle diffs against the cached _oracle/classifications.adoc.
* XSD↔TTL BIJECTION CLOSED (was "queued" in MARS-TRANSCODING.md §2). xsd::tests::xsd_classes_match_ttl_enum asserts full bidirectional set-equality between the XSD-extracted Application value set and the TTL validation-parameter enum, not just one-directional membership. Two independent encodings of one taxonomy, provably equal both ways.
* PYTHON DEPENDENCY REMOVED from the calibration path. `cargo test --features xsd` is the whole oracle now; no python3 interpreter needed. extract_classes.py stays vendored in _oracle/ as the provenance witness (what the transcode was proven against), not a runtime dep.

Transcode discipline (faithful to the Python semantics):

* getAttribute("xml:lang") returns "" for absent (not None); lang filter is "absent OR en". roxmltree resolves xml: to the xml namespace, so the attribute is matched on attribute.name() == "lang".
* getXMLText concatenates DIRECT text-node children only (not recursive); the documentation's internal whitespace is load-bearing for the byte-match.
* :revdate: is datetime.now() in Python (non-deterministic); the Rust to_asciidoc(c, revdate) takes it as a parameter so output is reproducible and testable.
* The two-level extension chain (master complexType carries NodeType, intermediate complexType carries Class, leaf element carries SubClass) plus the post-process phase that stitches base→element is reproduced exactly, including the master_types gate.

Tests: 20/20 with --features xsd (16 default + 4 new xsd); 16/16 on default (xsd code fully feature-gated). Clippy-clean (--no-deps), fmt-clean.

Docs:

* docs/MARS-TRANSCODING.md §2: bijection marked closed; Python-dep removal noted; the Rust extractor added to the oracle-direction table.
* .claude/board/EPIPHANIES.md: FINDING with the transcode-discipline notes for the next source→Rust port.
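The two-level extension chain and its post-process stitch can be pictured with a small sketch. This is not the code in xsd.rs: the `Collected` and `Classification` types, their field names, and the assumption that an intermediate complexType's own name serves as the Class are all hypothetical, chosen only to illustrate a collect-then-stitch pass with a master-type gate.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One resolved row: NodeType / Class / SubClass.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Classification {
    pub node_type: String,
    pub class: String,
    pub sub_class: String,
}

/// Hypothetical result of the collection phase (illustrative names only).
#[derive(Debug, Default)]
pub struct Collected {
    /// Names of master complexTypes; each one stands for a NodeType.
    pub master_types: BTreeSet<String>,
    /// Intermediate complexType name -> name of the type it extends.
    pub intermediate_base: BTreeMap<String, String>,
    /// Leaf elements as (SubClass name, name of the type the element extends).
    pub leaves: Vec<(String, String)>,
}

/// Post-process phase: stitch each leaf through its intermediate type
/// up to a master type, dropping chains that fail either lookup.
pub fn stitch(c: &Collected) -> Vec<Classification> {
    let mut rows = Vec::new();
    for (sub_class, base) in &c.leaves {
        // Level 1: the leaf's base must be a known intermediate type.
        if let Some(master) = c.intermediate_base.get(base) {
            // Level 2 (the gate): that type's base must be a master type.
            if c.master_types.contains(master) {
                rows.push(Classification {
                    node_type: master.clone(),
                    class: base.clone(),
                    sub_class: sub_class.clone(),
                });
            }
        }
    }
    rows
}
```

Splitting collection from stitching means the order in which types and elements appear in the schema does not matter: a leaf may be seen before the type it extends, and the chain is only resolved once everything has been collected.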
1 parent 7d68042 commit 8945b7f

5 files changed

Lines changed: 552 additions & 11 deletions


.claude/board/EPIPHANIES.md

Lines changed: 52 additions & 0 deletions
@@ -15,6 +15,58 @@
 
 ## Entries (newest first)
 
+## 2026-06-22 — extract_classes.py transcoded to Rust byte-faithfully; XSD↔TTL bijection closed; Python dependency removed from the oracle
+**Status:** FINDING
+**Scope:** XSD front-end × calibration self-containment × the queued bijection
+
+The MARS XSD classification extractor (`arago/MARS-Schema/tools/extract_classes.py`,
+~360 lines, ~140 logic + ~150 table formatting) is now a faithful Rust
+transcode at `crates/ogar-from-schema/src/xsd.rs`, behind the optional
+`xsd` feature (pulls `roxmltree`, a pure-Rust read-only XML DOM; the
+default TTL path stays zero-parser-deps).
+
+Three things this lands:
+
+1. **Byte-for-byte transcode proof.** `xsd::to_asciidoc()` reproduces
+the Python `-F asciidoc` output exactly — 628 lines, including the
+verbatim XSD-documentation whitespace and the `printAsciiDocFooter`
+trailing newline. Test: `xsd::tests::asciidoc_matches_python_oracle`
+diffs against the cached `_oracle/classifications.adoc`.
+
+2. **The XSD↔TTL bijection is closed (was "queued" in
+`MARS-TRANSCODING.md §2`).** `xsd::tests::xsd_classes_match_ttl_enum`
+asserts FULL bidirectional set-equality between the XSD-extracted
+Application value set and the TTL `validation-parameter` enum — not
+just one-directional membership. The XSD and the TTL are two
+independent encodings of one taxonomy and they now provably agree
+in both directions.
+
+3. **The Python dependency is removed from the calibration path.**
+`cargo test --features xsd` is the whole oracle now; no `python3`
+interpreter needed. `extract_classes.py` stays vendored in
+`_oracle/` as the provenance witness (what the transcode was proven
+against), not a runtime dep.
+
+Transcode discipline notes (for the next source→Rust port):
+- The Python `getAttribute("xml:lang")` returns `""` for absent (not
+`None`); the lang-filter is "absent OR en". roxmltree resolves `xml:`
+to the xml namespace — match on `attribute.name() == "lang"`.
+- `getXMLText` concatenates DIRECT text-node children only (not
+recursive); the documentation's internal whitespace is load-bearing
+for the byte-match.
+- The `:revdate:` is `datetime.now()` in Python (non-deterministic);
+the Rust `to_asciidoc(c, revdate)` takes it as a parameter so the
+output is reproducible and testable.
+
+Answer to "is it huge": no — ~360 lines, half output formatting; the
+transcode is ~350 LOC Rust including tests. And it doubles as the seed
+of the broader XSD→`Class` front-end (the same walk that extracts
+classifications is the structural-arm lift for any XSD).
+
+Evidence: `crates/ogar-from-schema/src/xsd.rs` (20/20 tests pass with
+`--features xsd`; 16/16 on default). `docs/MARS-TRANSCODING.md §2`
+updated to mark the bijection closed.
+
 ## 2026-06-22 — OGIT is the canonical template store; Odoo (and any source-AST producer) digests INTO it; consumers relive agnostically via askama
 **Status:** FRAMING
 **Scope:** Foundry-parity collapse × cross-consumer architecture × digest-once-relive-N
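The three transcode-discipline notes in this entry can be sketched in plain Rust. This is an illustration under stated assumptions, not the contents of `xsd.rs`: the `Node` enum is a hypothetical stand-in for the XML DOM the real extractor walks, and `keep_documentation`, `direct_text`, `header`, and the header layout are made-up names and shapes.

```rust
/// Hypothetical stand-in for an XML child node (not the roxmltree API).
pub enum Node {
    Text(String),
    Element { children: Vec<Node> },
}

/// Lang filter "absent OR en". Python's getAttribute returns "" for a
/// missing attribute, so a missing value and an empty one are both kept.
pub fn keep_documentation(lang: Option<&str>) -> bool {
    matches!(lang, None | Some("") | Some("en"))
}

/// Concatenate DIRECT text children only. Nested elements are skipped
/// rather than descended into, and whitespace is copied verbatim.
pub fn direct_text(children: &[Node]) -> String {
    let mut out = String::new();
    for child in children {
        if let Node::Text(t) = child {
            out.push_str(t);
        }
    }
    out
}

/// The revision date is an argument and is never read from the clock,
/// so the same inputs always give the same bytes.
/// The header layout here is illustrative, not the oracle's layout.
pub fn header(title: &str, revdate: &str) -> String {
    format!("= {}\n:revdate: {}\n", title, revdate)
}
```

Passing the date in rather than reading the clock is what lets a byte-for-byte comparison against a cached oracle file be a stable test: the test supplies whatever date the cached file was generated with.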

crates/ogar-from-schema/Cargo.toml

Lines changed: 12 additions & 5 deletions
@@ -11,14 +11,21 @@ description = "Schema-as-input producer family for OGAR IR. Lifts the STRUCTURAL
 [features]
 default = []
 serde = ["dep:serde", "ogar-vocab/serde"]
+# The `xsd` feature pulls in a read-only XML parser for the XSD
+# front-end (the faithful Rust transcode of arago/MARS-Schema's
+# extract_classes.py). Kept optional so the default TTL path stays
+# zero-parser-deps.
+xsd = ["dep:roxmltree"]
 
 [dependencies]
 ogar-vocab = { path = "../ogar-vocab" }
 serde = { workspace = true, optional = true }
+roxmltree = { version = "0.20", optional = true }
 
-# No external TTL/XSD parser pulled in yet: the v0 reader uses a
-# narrow line-oriented walker that handles the shapes OGIT's TTL
-# actually emits (a tiny, machine-stable subset of full Turtle).
-# When the surface grows (Wikidata-shaped TTL, full RDF/XML, OWL
-# imports), swap in `oxttl` / `oxrdf` here without touching the
+# The default TTL reader uses a narrow line-oriented walker that
+# handles the shapes OGIT's TTL actually emits (a tiny, machine-stable
+# subset of full Turtle) — no external parser. The `xsd` feature adds
+# roxmltree (pure-Rust, read-only DOM) ONLY for the XSD front-end.
+# When the TTL surface grows (Wikidata-shaped TTL, full RDF/XML, OWL
+# imports), swap in `oxttl` / `oxrdf` without touching the
 # Class/Attribute target shape this crate produces.

crates/ogar-from-schema/src/lib.rs

Lines changed: 2 additions & 0 deletions
@@ -60,6 +60,8 @@ use ogar_vocab::{Attribute, Class, EnumDecl, EnumSource, Language};
 pub mod sgo;
 pub mod ttl;
 pub mod ttl_emit;
+#[cfg(feature = "xsd")]
+pub mod xsd;
 
 /// What a single TTL file describes — exactly one of: an entity (`Class`),
 /// a datatype attribute (`Attribute`), or a verb (`Association` shape).
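The commit describes the test `xsd::tests::xsd_classes_match_ttl_enum`, which lives in the gated `xsd` module, as checking set-equality in both directions rather than one-way membership. A minimal sketch of such a check follows, using only the standard library. The `SetDiff` type and the `diff_sets` and `sets_agree` functions are hypothetical names, not the crate's API.

```rust
use std::collections::BTreeSet;

/// Values found on one side only. Both lists empty means the sets are equal.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct SetDiff {
    pub only_in_xsd: Vec<String>,
    pub only_in_ttl: Vec<String>,
}

/// Compute both one-sided differences so that a failure names the culprits.
pub fn diff_sets(xsd: &BTreeSet<String>, ttl: &BTreeSet<String>) -> SetDiff {
    SetDiff {
        only_in_xsd: xsd.difference(ttl).cloned().collect(),
        only_in_ttl: ttl.difference(xsd).cloned().collect(),
    }
}

/// True only when every XSD value is in the TTL set AND vice versa.
pub fn sets_agree(xsd: &BTreeSet<String>, ttl: &BTreeSet<String>) -> bool {
    diff_sets(xsd, ttl) == SetDiff::default()
}
```

A one-directional check (every XSD value appears in the TTL enum) would still pass if the TTL enum carried extra values; reporting the two differences separately catches that case and shows which side drifted.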
