Commit ad4e53e

ako and claude committed
docs: add schema extract proposal — empirical metamodel extraction via MCP/PED
Proposes a `mxcli schema extract` command that connects to Studio Pro via MCP, creates minimal example documents for every reachable element type, decodes the resulting mxunit BSON files, and writes a structured `{version}-extracted.json` schema file. Covers four extraction paths (microflow/nanoflow actions, built-in widgets, pluggable widget .mpk XML, and domain model and other document types), the reverse-engineering algorithm for reference kind and list encoding type, and explicit positioning relative to the BSON Schema Registry proposal (PR #335) and retran's .mxcore suggestion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent baa8253 commit ad4e53e

1 file changed

Lines changed: 291 additions & 0 deletions

@@ -0,0 +1,291 @@
1+
# Proposal: `mxcli schema extract` — Empirical Metamodel Schema Extraction
2+
3+
## Problem
4+
5+
The BSON Schema Registry proposal (see `BSON_SCHEMA_REGISTRY_PROPOSAL.md`) requires accurate
6+
per-version type metadata to drive serialization, validation, and default-value filling. That
7+
metadata currently comes from two sources, both with reliability problems:
8+
9+
1. **Reflection data** (`reference/mendixmodellib/reflection-data/`) — extracted from the
10+
TypeScript `mendixmodelsdk` npm package. Stale at Mendix 11.6. Requires manual update each
11+
release. Uses TypeScript SDK names that differ from BSON storage names (e.g. SDK says
12+
`CreateObjectAction`, BSON stores `CreateChangeAction`).
13+
14+
2. **`supplements.json`** (PR #335) — hand-maintained bridge between SDK type names and BSON
15+
reality. Grows with every release. Gaps are discovered at runtime when Studio Pro rejects
16+
written output.
17+
18+
Neither source is self-updating, and neither is verifiable against actual Studio Pro behavior
19+
without opening a project and checking for errors.
20+
21+
## Insight
22+
23+
When Studio Pro opens a Mendix project, its MCP server exposes the full metamodel through the
24+
PED (Progressive Element Discovery) API. Three observations make this usable as a schema source:
25+
26+
1. `ped_get_schema` returns the complete property structure and valid enum values for any
27+
element type — including the exhaustive `allowedTypes` list for every abstract extension
28+
point (e.g. all 40+ microflow action types).
29+
30+
2. Documents created via PED are serialised to `.mxunit` BSON files on disk immediately.
31+
Decoding those files reveals the exact `$Type` storage name and BSON field encoding that
32+
Studio Pro actually writes — no inference needed.
33+
34+
3. The BSON value type is deterministic: a `{"Subtype": N, "Data": "..."}` binary value is a
35+
BY_ID reference; a plain string is BY_NAME; a leading integer in an array is the list
36+
encoding type (1 = compact, 2 = key-value, 3 = object array).
37+
38+
These three observations together produce a complete, verifiable schema for every element type
39+
reachable from the Mendix model — sourced directly from Studio Pro rather than from a generated
40+
TypeScript artifact.
41+
42+
## Proposed Command
43+
44+
```
45+
mxcli schema extract [--output dir] [--version label] [--domains domain1,domain2,...]
46+
```
47+
48+
Connects to a running Studio Pro instance via MCP, creates minimal example documents for every
49+
reachable element type, decodes the resulting `.mxunit` files, and writes a structured schema
50+
JSON file.
51+
52+
```bash
53+
# Extract schema for whatever Studio Pro version is currently open
54+
mxcli schema extract --output reference/mendixmodellib/reflection-data/
55+
56+
# Extract only microflow and domain model domains
57+
mxcli schema extract --domains microflows,domainmodels --output ./schemas/
58+
59+
# Dry run: print coverage report without writing files
60+
mxcli schema extract --dry-run
61+
```
62+
63+
## Extraction Paths
64+
65+
The command uses four distinct extraction paths depending on the element category.
66+
67+
### Path 1: Microflow and nanoflow activities
68+
69+
**Enumeration**: `ped_get_schema(["Microflows$ActionActivity"])` returns the full
70+
`allowedTypes` map for the `action` property — the authoritative list of all action types
71+
(currently 40+).
72+
73+
**Schema extraction**: For each action type:
74+
1. Create a minimal `Microflows$Microflow` in a scratch module with one `ActionActivity`
75+
containing that action, plus required Start/End events.
76+
2. Decode the resulting `.mxunit` file.
77+
3. Walk every field in the decoded BSON and classify:
78+
- `{"Subtype": N, "Data": "..."}` → `referenceKind: BY_ID`
79+
- plain string → `referenceKind: BY_NAME`
80+
- dict with `$Type` → `referenceKind: PART`
81+
- `[1, ...]` → `listEncoding: compact`
82+
- `[2, ...]` → `listEncoding: keyValue`
83+
- `[3, ...]` → `listEncoding: objectArray`
84+
4. Record the `$Type` storage name (which may differ from the PED type name, e.g.
85+
`Microflows$CreateObjectAction` → `Microflows$CreateChangeAction`).
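
The classification in step 3 can be sketched in Python, assuming the `.mxunit` file has already been decoded into plain dicts, lists, and scalars. The function names `classify_value` and `classify_document` are illustrative and not part of `mxcli`. The sketch applies the listed rules literally, so every plain string is reported as `BY_NAME`; a real extractor would presumably need the PED schema to tell a by-name reference from an ordinary string or enum value.

```python
from __future__ import annotations

from typing import Any, Optional

# Leading integer of an encoded list -> list encoding name (from the rules above).
LIST_ENCODINGS = {1: "compact", 2: "keyValue", 3: "objectArray"}


def classify_value(value: Any) -> dict[str, Optional[str]]:
    """Classify one decoded BSON value by its shape alone."""
    result: dict[str, Optional[str]] = {"referenceKind": None, "listEncoding": None}
    if isinstance(value, dict):
        if "Subtype" in value and "Data" in value:
            result["referenceKind"] = "BY_ID"   # binary value
        elif "$Type" in value:
            result["referenceKind"] = "PART"    # embedded element
    elif isinstance(value, str):
        result["referenceKind"] = "BY_NAME"     # literal reading of the rule
    elif isinstance(value, list) and value:
        marker = value[0]
        if isinstance(marker, int) and not isinstance(marker, bool):
            result["listEncoding"] = LIST_ENCODINGS.get(marker)
            if len(value) > 1:
                # The first real item decides the reference kind of the list.
                result["referenceKind"] = classify_value(value[1])["referenceKind"]
    return result


def classify_document(node: Any, types: Optional[dict] = None) -> dict:
    """Walk a decoded document and collect field classifications per $Type."""
    if types is None:
        types = {}
    if isinstance(node, dict):
        type_name = node.get("$Type")
        if type_name is not None:
            props = types.setdefault(type_name, {})
            for key, value in node.items():
                if not key.startswith("$"):     # skip $Type and similar metadata
                    props[key] = classify_value(value)
        for value in node.values():
            classify_document(value, types)
    elif isinstance(node, list):
        for item in node:
            classify_document(item, types)
    return types
```

The result is keyed by the `$Type` storage name, which matches how the output format below keys its `types` map. An empty list such as `[3]` yields a list encoding but no reference kind, since there is no item to inspect.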
86+
87+
**Nanoflow subset**: Repeat with `Microflows$Nanoflow`. Actions that are disallowed in
88+
nanoflows fail `ped_check_errors` with a recognisable error; all others produce a valid schema.
89+
The disallowed set is recorded as metadata on each action type.
90+
91+
**Preconditions**: Some action types require an existing entity, microflow, or page. The
92+
extractor creates stubs automatically (a `_SchemaExtract` scratch module with a bare entity, a
93+
trivial nanoflow, etc.) and tears them down after extraction.
94+
95+
### Path 2: Built-in widgets (`Forms$` types)
96+
97+
**Enumeration**: The 78 `Forms$` widget types are known from the existing codebase. PED's page
98+
schema does not enumerate them; instead the extractor uses the known list as its input.
99+
100+
**Schema extraction**: For each widget type:
101+
1. Use the pagegen tools to create a minimal page containing one instance of the widget.
102+
2. Decode the resulting `.mxunit` file.
103+
3. Apply the same field classification as Path 1.
104+
105+
Because the pagegen API accepts widget types by name, this is fully automatable without manual
106+
page construction.
107+
108+
### Path 3: Pluggable widgets (`.mpk` packages)
109+
110+
**Enumeration**: Scan the project's `widgets/` directory for `.mpk` files.
111+
112+
**Schema extraction**: Each `.mpk` is a ZIP archive. Extract `{WidgetName}.xml` — a structured
113+
property definition that is the *source* from which Studio Pro generates the BSON
114+
`CustomWidgetType` PropertyTypes blob. Parse it directly:
115+
116+
```xml
117+
<property key="source" type="enumeration" defaultValue="context" required="true">
118+
<property key="attributeEnumeration" type="attribute" required="true">
119+
<property key="optionsSourceDatabaseDataSource" type="datasource" isList="true">
120+
```
121+
122+
This is preferable to the mxunit roundtrip for widgets because the XML is the canonical
123+
definition — decoding BSON would only recover a derivative representation of it.
124+
125+
The extractor records the widget ID, version, and the full property tree. This replaces the
126+
current `sdk/widgets/templates/` embedded templates with a live, project-accurate schema.
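
A minimal sketch of the `.mpk` parse in Python, using only the standard library. It assumes the widget definition is an XML file inside the archive whose root element is `widget` with an `id` attribute, and it matches elements by local name so that an XML namespace on the root does not matter. The function name `read_widget_definitions` is hypothetical, and the sketch flattens the property tree into a list rather than preserving grouping.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
import zipfile


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def read_widget_definitions(source) -> list[dict]:
    """Read widget property definitions from an .mpk (path or binary file object)."""
    widgets = []
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue
            root = ET.fromstring(archive.read(name))
            if local_name(root.tag) != "widget":
                continue  # assumed: other XML files in the archive are not definitions
            properties = [
                {
                    "key": el.get("key"),
                    "type": el.get("type"),
                    "required": el.get("required") == "true",
                    "isList": el.get("isList") == "true",
                    "defaultValue": el.get("defaultValue"),
                }
                for el in root.iter()
                if local_name(el.tag) == "property"
            ]
            widgets.append({"id": root.get("id"), "file": name, "properties": properties})
    return widgets
```

Where the widget version is stored is not covered here; the proposal only names `{WidgetName}.xml`, so the sketch leaves version lookup out.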
127+
128+
### Path 4: Domain model and other document types
129+
130+
**Enumeration**: `ped_get_schema` for all top-level document types and their nested element
131+
types, traversing the schema graph from each document root.
132+
133+
**Schema extraction**: Create a minimal instance of each type (entity, enumeration,
134+
association, etc.), decode the mxunit, and classify fields using the same algorithm.
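
The schema graph traversal can be sketched as a breadth-first walk. The proposal does not specify the shape of a `ped_get_schema` response, so the sketch takes a hypothetical callable `referenced_types` that returns the type names a given type refers to; how that callable is built on top of PED is left open.

```python
from __future__ import annotations

from collections import deque
from typing import Callable, Iterable


def reachable_types(
    roots: Iterable[str],
    referenced_types: Callable[[str], Iterable[str]],
) -> list[str]:
    """Return every type reachable from the roots, in breadth-first order."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(roots)
    while queue:
        type_name = queue.popleft()
        if type_name in seen:
            continue  # the schema graph may contain cycles
        seen.add(type_name)
        order.append(type_name)
        queue.extend(t for t in referenced_types(type_name) if t not in seen)
    return order
```

Each name in the returned list is a candidate for the "create a minimal instance, decode, classify" step; types that cannot be instantiated would still need separate handling.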
135+
136+
## Output Format
137+
138+
The extractor writes one JSON file per Mendix version:
139+
140+
```
141+
reference/mendixmodellib/reflection-data/
142+
11.9.0-extracted.json ← new format, one file replaces -structures + -storageNames pair
143+
11.8.0-extracted.json
144+
...
145+
```
146+
147+
Schema JSON structure (one entry per element type):
148+
149+
```json
150+
{
151+
"version": "11.9.0",
152+
"extractedAt": "2026-05-02T14:00:00Z",
153+
"types": {
154+
"Microflows$CreateChangeAction": {
155+
"pedName": "Microflows$CreateObjectAction",
156+
"storageName": "Microflows$CreateChangeAction",
157+
"properties": {
158+
"Entity": {
159+
"referenceKind": "BY_NAME",
160+
"referredType": "DomainModels$Entity",
161+
"listEncoding": null,
162+
"default": ""
163+
},
164+
"Items": {
165+
"referenceKind": "PART",
166+
"referredType": "Microflows$ChangeActionItem",
167+
"listEncoding": "keyValue",
168+
"default": []
169+
},
170+
"Commit": {
171+
"referenceKind": null,
172+
"referredType": null,
173+
"listEncoding": null,
174+
"enumValues": ["Yes", "YesWithoutEvents", "No"],
175+
"default": "No"
176+
}
177+
},
178+
"allowedInNanoflow": false,
179+
"introducedIn": null,
180+
"removedIn": null
181+
}
182+
},
183+
"widgets": {
184+
"com.mendix.widget.web.combobox.Combobox": {
185+
"version": "2.5.0",
186+
"source": "mpk",
187+
"properties": { ... }
188+
}
189+
}
190+
}
191+
```
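
As an illustration of how a consumer might read this structure, the following Python sketch builds a PED-name to storage-name index and lists the actions not allowed in nanoflows. Both helper names are hypothetical; the field names are taken from the example above.

```python
from __future__ import annotations


def storage_name_index(schema: dict) -> dict[str, str]:
    """Map each PED type name to the $Type name Studio Pro writes to disk."""
    index = {}
    for key, entry in schema.get("types", {}).items():
        # The map key is the storage name; fall back to it if fields are absent.
        storage = entry.get("storageName", key)
        index[entry.get("pedName", storage)] = storage
    return index


def disallowed_in_nanoflow(schema: dict) -> list[str]:
    """List storage names whose entry explicitly sets allowedInNanoflow to false."""
    return sorted(
        key
        for key, entry in schema.get("types", {}).items()
        if entry.get("allowedInNanoflow") is False
    )
```

A serializer could then resolve `Microflows$CreateObjectAction` to `Microflows$CreateChangeAction` with a single dictionary lookup instead of a hand-maintained table.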
192+
193+
## Version Lifecycle
194+
195+
A single extraction run produces the schema for the version currently open in Studio Pro. To
196+
populate version lifecycle fields (`introducedIn`, `removedIn`), run extraction against
197+
multiple Studio Pro versions and diff the outputs:
198+
199+
```bash
200+
# Run against Studio Pro 10.24, then 11.9
201+
mxcli schema extract --version 10.24.0 --output schemas/
202+
mxcli schema extract --version 11.9.0 --output schemas/
203+
204+
# Diff to compute property lifecycle
205+
mxcli schema diff schemas/10.24.0-extracted.json schemas/11.9.0-extracted.json
206+
```
207+
208+
This is the same information the `.version.mxcore` files encode in the internal appdev
209+
monorepo — but derived empirically rather than from a canonical source. For most practical
210+
purposes (knowing whether a property exists for a given project version) this is sufficient.
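
The diff step could work roughly as follows. This Python sketch compares the `properties` maps of two extracted schemas and reports which properties appear or disappear between them; the output shape and the function name `property_lifecycle` are assumptions, since the proposal does not define what `mxcli schema diff` prints. With only two versions as input, "introduced in" can only mean "absent from the older file, present in the newer one".

```python
from __future__ import annotations


def property_lifecycle(older: dict, newer: dict) -> dict[str, list[tuple[str, str, str]]]:
    """Compare two extracted schemas; `older` must be the earlier version."""
    changes: dict[str, list[tuple[str, str, str]]] = {"introduced": [], "removed": []}
    old_types = older.get("types", {})
    new_types = newer.get("types", {})
    for type_name in sorted(set(old_types) | set(new_types)):
        old_props = set(old_types.get(type_name, {}).get("properties", {}))
        new_props = set(new_types.get(type_name, {}).get("properties", {}))
        for prop in sorted(new_props - old_props):
            changes["introduced"].append((type_name, prop, newer["version"]))
        for prop in sorted(old_props - new_props):
            changes["removed"].append((type_name, prop, newer["version"]))
    return changes
```

A type that exists in only one of the two files shows up here as all of its properties being introduced or removed; a fuller implementation would probably track type-level lifecycle separately.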
211+
212+
## Relationship to Existing Proposals
213+
214+
### BSON Schema Registry (PR #335)
215+
216+
The schema registry proposal requires accurate per-version type metadata but doesn't address
217+
how to keep that metadata current. `mxcli schema extract` is the data pipeline that feeds the
218+
registry. The registry's `LoadRegistry(version)` function would load `{version}-extracted.json`
219+
instead of the current `-structures.json` + `-storageNames.json` pair.
220+
221+
### PR #335 `supplements.json`
222+
223+
Most entries in `supplements.json` exist because the JS-parsed SDK doesn't capture BSON storage
224+
names, list encodings, or reference kinds. The extracted schema contains all of these directly.
225+
Once extraction covers the relevant domains, `supplements.json` shrinks to only the genuinely
226+
exotic cases.
227+
228+
### retran's `.mxcore` suggestion (PR #335 review)
229+
230+
The `.mxcore` DSL in the internal appdev monorepo is the canonical source for the same
231+
metadata. `mxcli schema extract` is the external-contributor-accessible equivalent: it derives
232+
the same structural facts empirically. If Mendix publishes a minimal derived artifact from
233+
`.mxcore`, the extraction pipeline could be simplified or replaced; the output format is
234+
designed to be compatible with either source.
235+
236+
## Coverage
237+
238+
| Domain | Types | Extraction path | Completeness |
239+
|---|---|---|---|
240+
| Microflow actions | 40+ | PED allowedTypes + mxunit | Full |
241+
| Nanoflow actions | subset of above | Try + ped_check_errors | Full |
242+
| Domain model | ~15 | PED schema traversal + mxunit | Full |
243+
| Built-in widgets | 78 | Known list + pagegen + mxunit | Full |
244+
| Pluggable widgets | project-dependent | `.mpk` XML parse | Full (per project) |
245+
| Other document types | ~40 domains | PED schema traversal + mxunit | Partial (reachable types only) |
246+
247+
## Implementation Plan
248+
249+
### Phase 1: Core extractor (microflow actions)
250+
- Connect to Studio Pro MCP, read PED type catalog for `ActionActivity`
251+
- Create scratch module, create one microflow per action type
252+
- Decode mxunit, classify fields, write `{version}-extracted.json`
253+
- Validate: compare extracted storage names against current CLAUDE.md table
254+
255+
### Phase 2: Domain model and nanoflow subset
256+
- Extend to domain model types (entity, attribute, association, enumeration)
257+
- Extend to nanoflow: run same list, record disallowed set
258+
- Add precondition scaffold (stub entity, stub nanoflow, etc.)
259+
260+
### Phase 3: Built-in widgets
261+
- Implement pagegen → mxunit decode path for `Forms$` widget types
262+
- Add widget fields to extracted schema
263+
264+
### Phase 4: Pluggable widget `.mpk` parsing
265+
- Implement ZIP + XML parser for widget property definitions
266+
- Record widget ID, version, property tree
267+
- Emit as `widgets` section of extracted JSON
268+
269+
### Phase 5: Integration with schema registry
270+
- Update `LoadRegistry` to accept the new extracted format
271+
- Remove or shrink `supplements.json` based on extraction coverage
272+
- Add `mxcli schema diff` for version lifecycle computation
273+
274+
## Open Questions
275+
276+
1. **MCP availability**: The extractor requires a running Studio Pro instance. Should there be
277+
a fallback that uses the existing reflection data when no MCP connection is available, or
278+
should extraction always be an explicit developer operation?
279+
280+
2. **Scratch module cleanup**: The extractor creates a `_SchemaExtract` module and deletes it
281+
after. What happens if extraction is interrupted mid-run? Should it be idempotent (detect
282+
and reuse an existing scratch module)?
283+
284+
3. **Frequency**: Schema extraction only needs to run when a new Studio Pro version is
285+
targeted. Should this be a manual developer command, a CI step, or triggered automatically
286+
on `mxcli setup mxbuild`?
287+
288+
4. **Non-extractable types**: Some types can't be instantiated without complex preconditions
289+
(published REST services, business events, OData contracts). How should the extractor handle
290+
types it cannot reach? One option: flag them as `"source": "manual"` and fall back to existing reflection
291+
data for those domains.
