| 1 | +# Proposal: `mxcli schema extract` — Empirical Metamodel Schema Extraction
| 2 | + |
| 3 | +## Problem |
| 4 | + |
| 5 | +The BSON Schema Registry proposal (see `BSON_SCHEMA_REGISTRY_PROPOSAL.md`) requires accurate |
| 6 | +per-version type metadata to drive serialization, validation, and default-value filling. That |
| 7 | +metadata currently comes from two sources, both with reliability problems: |
| 8 | + |
| 9 | +1. **Reflection data** (`reference/mendixmodellib/reflection-data/`) — extracted from the |
| 10 | + TypeScript `mendixmodelsdk` npm package. Stale at Mendix 11.6. Requires manual update each |
| 11 | + release. Uses TypeScript SDK names that differ from BSON storage names (e.g. SDK says |
| 12 | + `CreateObjectAction`, BSON stores `CreateChangeAction`). |
| 13 | + |
| 14 | +2. **`supplements.json`** (PR #335) — hand-maintained bridge between SDK type names and BSON |
| 15 | + reality. Grows with every release. Gaps are discovered at runtime when Studio Pro rejects |
| 16 | + written output. |
| 17 | + |
| 18 | +Neither source is self-updating, and neither is verifiable against actual Studio Pro behavior |
| 19 | +without opening a project and checking for errors. |
| 20 | + |
| 21 | +## Insight |
| 22 | + |
| 23 | +When Studio Pro opens a Mendix project, its MCP server exposes the full metamodel through the
| 24 | +PED (Progressive Element Discovery) API. Three observations make that API usable as a schema source:
| 25 | + |
| 26 | +1. `ped_get_schema` returns the complete property structure and valid enum values for any |
| 27 | + element type — including the exhaustive `allowedTypes` list for every abstract extension |
| 28 | + point (e.g. all 40+ microflow action types). |
| 29 | + |
| 30 | +2. Documents created via PED are serialised to `.mxunit` BSON files on disk immediately. |
| 31 | + Decoding those files reveals the exact `$Type` storage name and BSON field encoding that |
| 32 | + Studio Pro actually writes — no inference needed. |
| 33 | + |
| 34 | +3. The BSON value encoding is deterministic: a `{"Subtype": N, "Data": "..."}` binary value is a
| 35 | + BY_ID reference; a plain string in a reference-typed property is BY_NAME; a leading integer
| 36 | + in an array is the list encoding type (1 = compact, 2 = key-value, 3 = object array).
| 37 | + |
| 38 | +These three observations together produce a complete, verifiable schema for every element type |
| 39 | +reachable from the Mendix model — sourced directly from Studio Pro rather than from a generated |
| 40 | +TypeScript artifact. |
| 41 | + |
| 42 | +## Proposed Command |
| 43 | + |
| 44 | +``` |
| 45 | +mxcli schema extract [--output dir] [--version label] [--domains domain1,domain2,...] [--dry-run]
| 46 | +``` |
| 47 | + |
| 48 | +Connects to a running Studio Pro instance via MCP, creates minimal example documents for every |
| 49 | +reachable element type, decodes the resulting `.mxunit` files, and writes a structured schema |
| 50 | +JSON file. |
| 51 | + |
| 52 | +```bash |
| 53 | +# Extract schema for whatever Studio Pro version is currently open |
| 54 | +mxcli schema extract --output reference/mendixmodellib/reflection-data/ |
| 55 | + |
| 56 | +# Extract only microflow and domain model domains |
| 57 | +mxcli schema extract --domains microflows,domainmodels --output ./schemas/ |
| 58 | + |
| 59 | +# Dry run: print coverage report without writing files |
| 60 | +mxcli schema extract --dry-run |
| 61 | +``` |
| 62 | + |
| 63 | +## Extraction Paths |
| 64 | + |
| 65 | +The command uses four distinct extraction paths depending on the element category. |
| 66 | + |
| 67 | +### Path 1: Microflow and nanoflow activities |
| 68 | + |
| 69 | +**Enumeration**: `ped_get_schema(["Microflows$ActionActivity"])` returns the full |
| 70 | +`allowedTypes` map for the `action` property — the authoritative list of all action types |
| 71 | +(currently 40+). |
| 72 | + |
| 73 | +**Schema extraction**: For each action type: |
| 74 | +1. Create a minimal `Microflows$Microflow` in a scratch module with one `ActionActivity` |
| 75 | + containing that action, plus required Start/End events. |
| 76 | +2. Decode the resulting `.mxunit` file. |
| 77 | +3. Walk every field in the decoded BSON and classify: |
| 78 | + - `{"Subtype": N, "Data": "..."}` → `referenceKind: BY_ID` |
| 79 | + - plain string in a reference-typed property → `referenceKind: BY_NAME`
| 80 | + - dict with `$Type` → `referenceKind: PART` |
| 81 | + - `[1, ...]` → `listEncoding: compact` |
| 82 | + - `[2, ...]` → `listEncoding: keyValue` |
| 83 | + - `[3, ...]` → `listEncoding: objectArray` |
| 84 | +4. Record the `$Type` storage name (which may differ from the PED type name, e.g. |
| 85 | + `Microflows$CreateObjectAction` → `Microflows$CreateChangeAction`). |
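
The classification rules in step 3 can be sketched as a small function. This is an illustrative sketch, not the extractor's implementation: it assumes the `.mxunit` file has already been decoded into plain Python dicts, lists, and strings, and the names `classify_field` and `LIST_ENCODINGS` are hypothetical.

```python
from typing import Any

# List-encoding markers as described above (leading integer of an array).
LIST_ENCODINGS = {1: "compact", 2: "keyValue", 3: "objectArray"}


def classify_field(value: Any) -> dict:
    """Classify one decoded BSON field value by its shape alone."""
    result = {"referenceKind": None, "listEncoding": None}

    if isinstance(value, dict):
        if "Subtype" in value and "Data" in value:
            result["referenceKind"] = "BY_ID"
        elif "$Type" in value:
            result["referenceKind"] = "PART"
    elif isinstance(value, str):
        # Shape alone cannot tell a BY_NAME reference from an enum or other
        # string value; a caller would confirm this against the PED schema.
        result["referenceKind"] = "BY_NAME"
    elif (
        isinstance(value, list)
        and value
        and isinstance(value[0], int)
        and not isinstance(value[0], bool)
    ):
        result["listEncoding"] = LIST_ENCODINGS.get(value[0])
        # The first classifiable item after the marker decides the item kind.
        for item in value[1:]:
            kind = classify_field(item)["referenceKind"]
            if kind is not None:
                result["referenceKind"] = kind
                break
    return result
```

An empty list such as `[3]` yields a list encoding but no reference kind, so the minimal example document needs at least one item in each list for the item kind to be observable.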
| 86 | + |
| 87 | +**Nanoflow subset**: Repeat with `Microflows$Nanoflow`. Actions that are disallowed in |
| 88 | +nanoflows fail `ped_check_errors` with a recognisable error; all others produce a valid schema. |
| 89 | +The result is recorded on each action type as the `allowedInNanoflow` flag.
| 90 | + |
| 91 | +**Preconditions**: Some action types require an existing entity, microflow, or page. The |
| 92 | +extractor creates stubs automatically (a `_SchemaExtract` scratch module with a bare entity, a |
| 93 | +trivial nanoflow, etc.) and tears them down after extraction. |
| 94 | + |
| 95 | +### Path 2: Built-in widgets (`Forms$` types) |
| 96 | + |
| 97 | +**Enumeration**: The 78 `Forms$` widget types are known from the existing codebase. PED's page |
| 98 | +schema does not enumerate them; instead the extractor uses the known list as its input. |
| 99 | + |
| 100 | +**Schema extraction**: For each widget type: |
| 101 | +1. Use the pagegen tools to create a minimal page containing one instance of the widget. |
| 102 | +2. Decode the resulting `.mxunit` file. |
| 103 | +3. Apply the same field classification as Path 1. |
| 104 | + |
| 105 | +Because the pagegen API accepts widget types by name, this is fully automatable without manual |
| 106 | +page construction. |
| 107 | + |
| 108 | +### Path 3: Pluggable widgets (`.mpk` packages) |
| 109 | + |
| 110 | +**Enumeration**: Scan the project's `widgets/` directory for `.mpk` files. |
| 111 | + |
| 112 | +**Schema extraction**: Each `.mpk` is a ZIP archive. Extract `{WidgetName}.xml` — a structured |
| 113 | +property definition that is the *source* from which Studio Pro generates the BSON |
| 114 | +`CustomWidgetType` PropertyTypes blob. Parse it directly: |
| 115 | + |
| 116 | +```xml |
| 117 | +<property key="source" type="enumeration" defaultValue="context" required="true"> |
| 118 | +<property key="attributeEnumeration" type="attribute" required="true"> |
| 119 | +<property key="optionsSourceDatabaseDataSource" type="datasource" isList="true"> |
| 120 | +``` |
| 121 | + |
| 122 | +This is preferable to the mxunit roundtrip for widgets because the XML is the canonical |
| 123 | +definition — decoding BSON would only recover a derivative representation of it. |
| 124 | + |
| 125 | +The extractor records the widget ID, version, and the full property tree. This replaces the |
| 126 | +current `sdk/widgets/templates/` embedded templates with a live, project-accurate schema. |
| 127 | + |
| 128 | +### Path 4: Domain model and other document types |
| 129 | + |
| 130 | +**Enumeration**: `ped_get_schema` for all top-level document types and their nested element |
| 131 | +types, traversing the schema graph from each document root. |
| 132 | + |
| 133 | +**Schema extraction**: Create a minimal instance of each type (entity, enumeration, |
| 134 | +association, etc.), decode the mxunit, and classify fields using the same algorithm. |
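
The schema-graph traversal can be sketched as a breadth-first walk that visits each type once. This is an assumption-laden sketch: `reachable_types` is a hypothetical helper, and `nested_types` stands in for whatever wrapper around `ped_get_schema` returns the type names referenced by a given type (the actual response shape is not specified here).

```python
from collections import deque
from typing import Callable, Iterable


def reachable_types(
    roots: Iterable[str],
    nested_types: Callable[[str], Iterable[str]],
) -> list[str]:
    """Return every type reachable from the roots, in breadth-first order."""
    seen = set()
    order = []
    queue = deque(roots)
    while queue:
        name = queue.popleft()
        if name in seen:
            continue  # already visited; also guards against cycles
        seen.add(name)
        order.append(name)
        queue.extend(t for t in nested_types(name) if t not in seen)
    return order
```

The `seen` set matters because metamodel types can reference each other cyclically; without it the walk would not terminate.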
| 135 | + |
| 136 | +## Output Format |
| 137 | + |
| 138 | +The extractor writes one JSON file per Mendix version: |
| 139 | + |
| 140 | +``` |
| 141 | +reference/mendixmodellib/reflection-data/ |
| 142 | + 11.9.0-extracted.json ← new format, one file replaces -structures + -storageNames pair |
| 143 | + 11.8.0-extracted.json |
| 144 | + ... |
| 145 | +``` |
| 146 | + |
| 147 | +Schema JSON structure (one entry per element type): |
| 148 | + |
| 149 | +```json |
| 150 | +{ |
| 151 | + "version": "11.9.0", |
| 152 | + "extractedAt": "2026-05-02T14:00:00Z", |
| 153 | + "types": { |
| 154 | + "Microflows$CreateChangeAction": { |
| 155 | + "pedName": "Microflows$CreateObjectAction", |
| 156 | + "storageName": "Microflows$CreateChangeAction", |
| 157 | + "properties": { |
| 158 | + "Entity": { |
| 159 | + "referenceKind": "BY_NAME", |
| 160 | + "referredType": "DomainModels$Entity", |
| 161 | + "listEncoding": null, |
| 162 | + "default": "" |
| 163 | + }, |
| 164 | + "Items": { |
| 165 | + "referenceKind": "PART", |
| 166 | + "referredType": "Microflows$ChangeActionItem", |
| 167 | + "listEncoding": "keyValue", |
| 168 | + "default": [] |
| 169 | + }, |
| 170 | + "Commit": { |
| 171 | + "referenceKind": null, |
| 172 | + "referredType": null, |
| 173 | + "listEncoding": null, |
| 174 | + "enumValues": ["Yes", "YesWithoutEvents", "No"], |
| 175 | + "default": "No" |
| 176 | + } |
| 177 | + }, |
| 178 | + "allowedInNanoflow": false, |
| 179 | + "introducedIn": null, |
| 180 | + "removedIn": null |
| 181 | + } |
| 182 | + }, |
| 183 | + "widgets": { |
| 184 | + "com.mendix.widget.web.combobox.Combobox": { |
| 185 | + "version": "2.5.0", |
| 186 | + "source": "mpk", |
| 187 | + "properties": { ... } |
| 188 | + } |
| 189 | + } |
| 190 | +} |
| 191 | +``` |
| 192 | + |
| 193 | +## Version Lifecycle |
| 194 | + |
| 195 | +A single extraction run produces the schema for the version currently open in Studio Pro. To |
| 196 | +populate version lifecycle fields (`introducedIn`, `removedIn`), run extraction against |
| 197 | +multiple Studio Pro versions and diff the outputs: |
| 198 | + |
| 199 | +```bash |
| 200 | +# Run once with Studio Pro 10.24 open, then again with Studio Pro 11.9 open
| 201 | +mxcli schema extract --version 10.24.0 --output schemas/ |
| 202 | +mxcli schema extract --version 11.9.0 --output schemas/ |
| 203 | + |
| 204 | +# Diff to compute property lifecycle |
| 205 | +mxcli schema diff schemas/10.24.0-extracted.json schemas/11.9.0-extracted.json |
| 206 | +``` |
| 207 | + |
| 208 | +This is the same information the `.version.mxcore` files encode in the internal appdev |
| 209 | +monorepo — but derived empirically rather than from a canonical source. For most practical |
| 210 | +purposes (knowing whether a property exists for a given project version) this is sufficient. |
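
The diff step can be illustrated with a short sketch over two extracted files. It assumes the output format shown earlier (`version` and `types[...].properties`), and `property_lifecycle` is a hypothetical name rather than the behaviour of `mxcli schema diff`. With only two snapshots, the reported versions are bounds: a property first seen in the newer file was introduced somewhere after the older version and no later than the newer one.

```python
def property_lifecycle(older: dict, newer: dict) -> dict:
    """Compare two extracted schemas and report added and removed properties."""
    changes = {}
    old_types, new_types = older["types"], newer["types"]
    for type_name in sorted(set(old_types) | set(new_types)):
        old_props = set(old_types.get(type_name, {}).get("properties", {}))
        new_props = set(new_types.get(type_name, {}).get("properties", {}))
        added = sorted(new_props - old_props)
        removed = sorted(old_props - new_props)
        if added or removed:
            changes[type_name] = {
                # Present only in the newer snapshot.
                "introducedIn": {name: newer["version"] for name in added},
                # Present only in the older snapshot.
                "removedIn": {name: newer["version"] for name in removed},
            }
    return changes
```

Running extraction against more intermediate versions narrows those bounds, since each adjacent pair of snapshots pins a change to a smaller version range.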
| 211 | + |
| 212 | +## Relationship to Existing Proposals |
| 213 | + |
| 214 | +### BSON Schema Registry (PR #335) |
| 215 | + |
| 216 | +The schema registry proposal requires accurate per-version type metadata but doesn't address |
| 217 | +how to keep that metadata current. `mxcli schema extract` is the data pipeline that feeds the |
| 218 | +registry. The registry's `LoadRegistry(version)` function would load `{version}-extracted.json` |
| 219 | +instead of the current `-structures.json` + `-storageNames.json` pair. |
| 220 | + |
| 221 | +### PR #335 `supplements.json` |
| 222 | + |
| 223 | +Most entries in `supplements.json` exist because the JS-parsed SDK doesn't capture BSON storage |
| 224 | +names, list encodings, or reference kinds. The extracted schema contains all of these directly. |
| 225 | +Once extraction covers the relevant domains, `supplements.json` shrinks to only the genuinely |
| 226 | +exotic cases. |
| 227 | + |
| 228 | +### retran's `.mxcore` suggestion (PR #335 review) |
| 229 | + |
| 230 | +The `.mxcore` DSL in the internal appdev monorepo is the canonical source for the same |
| 231 | +metadata. `mxcli schema extract` is the external-contributor-accessible equivalent: it derives |
| 232 | +the same structural facts empirically. If Mendix publishes a minimal derived artifact from |
| 233 | +`.mxcore`, the extraction pipeline could be simplified or replaced; the output format is |
| 234 | +designed to be compatible with either source. |
| 235 | + |
| 236 | +## Coverage |
| 237 | + |
| 238 | +| Domain | Types | Extraction path | Completeness | |
| 239 | +|---|---|---|---| |
| 240 | +| Microflow actions | 40+ | PED allowedTypes + mxunit | Full | |
| 241 | +| Nanoflow actions | subset of above | Try + ped_check_errors | Full | |
| 242 | +| Domain model | ~15 | PED schema traversal + mxunit | Full | |
| 243 | +| Built-in widgets | 78 | Known list + pagegen + mxunit | Full | |
| 244 | +| Pluggable widgets | project-dependent | `.mpk` XML parse | Full (per project) | |
| 245 | +| Other document types | ~40 domains | PED schema traversal + mxunit | Partial (reachable types only) | |
| 246 | + |
| 247 | +## Implementation Plan |
| 248 | + |
| 249 | +### Phase 1: Core extractor (microflow actions) |
| 250 | +- Connect to Studio Pro MCP, read PED type catalog for `ActionActivity` |
| 251 | +- Create scratch module, create one microflow per action type |
| 252 | +- Decode mxunit, classify fields, write `{version}-extracted.json` |
| 253 | +- Validate: compare extracted storage names against current CLAUDE.md table |
| 254 | + |
| 255 | +### Phase 2: Domain model and nanoflow subset |
| 256 | +- Extend to domain model types (entity, attribute, association, enumeration) |
| 257 | +- Extend to nanoflow: run same list, record disallowed set |
| 258 | +- Add precondition scaffold (stub entity, stub nanoflow, etc.) |
| 259 | + |
| 260 | +### Phase 3: Built-in widgets |
| 261 | +- Implement pagegen → mxunit decode path for `Forms$` widget types |
| 262 | +- Add widget fields to extracted schema |
| 263 | + |
| 264 | +### Phase 4: Pluggable widget `.mpk` parsing |
| 265 | +- Implement ZIP + XML parser for widget property definitions |
| 266 | +- Record widget ID, version, property tree |
| 267 | +- Emit as `widgets` section of extracted JSON |
| 268 | + |
| 269 | +### Phase 5: Integration with schema registry |
| 270 | +- Update `LoadRegistry` to accept the new extracted format |
| 271 | +- Remove or shrink `supplements.json` based on extraction coverage |
| 272 | +- Add `mxcli schema diff` for version lifecycle computation |
| 273 | + |
| 274 | +## Open Questions |
| 275 | + |
| 276 | +1. **MCP availability**: The extractor requires a running Studio Pro instance. Should there be |
| 277 | + a fallback that uses the existing reflection data when no MCP connection is available, or |
| 278 | + should extraction always be an explicit developer operation? |
| 279 | + |
| 280 | +2. **Scratch module cleanup**: The extractor creates a `_SchemaExtract` module and deletes it |
| 281 | + afterwards. What happens if extraction is interrupted mid-run? Should it be idempotent (detect
| 282 | + and reuse an existing scratch module)? |
| 283 | + |
| 284 | +3. **Frequency**: Schema extraction only needs to run when a new Studio Pro version is |
| 285 | + targeted. Should this be a manual developer command, a CI step, or triggered automatically |
| 286 | + on `mxcli setup mxbuild`? |
| 287 | + |
| 288 | +4. **Non-extractable types**: Some types can't be instantiated without complex preconditions |
| 289 | + (published REST services, business events, OData contracts). How should the extractor handle |
| 290 | + types it cannot reach? One option is to flag them as `"source": "manual"` and fall back to
| 291 | + the existing reflection data for those domains.