Commit 1132e48

fix(codegen): treat dict-typed fields as leaf values in examples
flatten_example recursed into all dicts, splitting dict-typed fields like `tags: dict[str, str]` into dot-notation rows. Now collect_dict_paths walks the FieldSpec tree to identify dict-typed field paths, and _flatten_value checks membership before recursing. Indexed runtime paths (items[0].tags) are normalized to schema notation (items[].tags) for matching. The pipeline computes dict_paths from spec.fields and threads them through load_examples.

Also: clarify mutual exclusion in type visitor elif chains (reverse_references, type_collection) and rename _TypeIdentity to _TypeShape in union_extraction to avoid shadowing specs.TypeIdentity.
1 parent 531621b commit 1132e48
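
To make the behavior change in the commit message concrete, here is a minimal sketch of the before and after. It is a simplified stand-in, not the code in `example_loader.py`: the `flatten` helper is hypothetical, handles only nested dicts, and skips the list handling and path normalization that the real `_flatten_value` performs.

```python
from typing import Any


def flatten(
    prefix: str, value: Any, dict_paths: frozenset[str]
) -> list[tuple[str, Any]]:
    """Flatten nested dicts to dot-notation rows (hypothetical, simplified)."""
    if isinstance(value, dict):
        # A path listed in dict_paths is a map-typed field: keep it whole.
        if prefix in dict_paths:
            return [(prefix, value)]
        rows: list[tuple[str, Any]] = []
        for key, child in value.items():
            rows.extend(flatten(f"{prefix}.{key}", child, dict_paths))
        return rows
    return [(prefix, value)]


tags = {"color": "red", "size": "L"}

# Old behavior: every dict is recursed into, so the map is split into rows.
before = flatten("tags", tags, frozenset())

# New behavior: "tags" is known to be dict-typed, so it stays a single row.
after = flatten("tags", tags, frozenset({"tags"}))
```

With an empty `dict_paths` the map is split into `tags.color` and `tags.size`; with `"tags"` in the set it comes back as one row holding the whole dict.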

10 files changed

Lines changed: 288 additions & 31 deletions

packages/overture-schema-codegen/docs/design.md

Lines changed: 6 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -240,6 +240,12 @@ Loads example data from theme `pyproject.toml` files, validates against Pydantic
240240
and flattens to dot-notation rows for display in feature pages. Also provides a starting
241241
point for generated test data.
242242

243+
`collect_dict_paths` walks the `FieldSpec` tree to identify dict-typed fields (like
244+
`tags: dict[str, str]`), returning their dot-paths as a `frozenset`. `flatten_example`
245+
checks this set before recursing into dicts -- paths in the set are kept as leaf values
246+
rather than being split into dot-notation rows. The pipeline computes `dict_paths` from
247+
`spec.fields` and threads it through `load_examples`.
248+
243249
## Extension Points
244250

245251
**Adding a new output target** (Arrow schemas next, PySpark expressions after): Add a

packages/overture-schema-codegen/docs/walkthrough.md

Lines changed: 11 additions & 3 deletions
@@ -609,8 +609,15 @@ non-selected variant arms. `_strip_null_unknown_fields` removes null-valued fiel
609609
in the common base's field set, so the selected arm's validator accepts the data without
610610
choking on fields that belong to sibling variants.
611611

612+
`collect_dict_paths` walks the `FieldSpec` tree to identify dict-typed fields (like
613+
`tags: dict[str, str]`), returning their dot-paths as a `frozenset`. Schema-notation
614+
paths use empty brackets (`items[].tags`) while runtime paths carry indices
615+
(`items[0].tags`); `_normalize_path` strips indices before membership checks.
616+
612617
`flatten_example` converts nested dicts to dot-notation. Nested dicts become
613-
`parent.child`, lists of dicts become `parent[0].child`. `order_example_rows` sorts by
618+
`parent.child`, lists of dicts become `parent[0].child`. Dicts at paths in `dict_paths`
619+
are kept as leaf values -- a `tags` field typed as `dict[str, str]` renders as a whole
620+
map rather than being split into `tags.color`, `tags.size`. `order_example_rows` sorts by
614621
field position in the documentation's field order using a stable sort, so sub-fields
615622
maintain their original relative order.
616623

@@ -732,8 +739,9 @@ sources appear on the source NewType's page instead.
732739

733740
The example loader finds `pyproject.toml` in the transportation theme package, reads
734741
`[examples.Segment]`, validates each example against the union alias (injecting literal
735-
fields, stripping null fields from non-selected arms), flattens to dot-notation, and
736-
orders by field position.
742+
fields, stripping null fields from non-selected arms), computes `dict_paths` from
743+
`spec.fields` to identify dict-typed fields, flattens to dot-notation (keeping dict-typed
744+
fields as leaf values), and orders by field position.
737745

738746
The Jinja2 template assembles the field table, optional constraints section, examples,
739747
and "Used By" partial into markdown.
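
The path normalization described in the walkthrough can be tried in isolation. The sketch below reuses the regex from this commit (`\[\d+\]`); the public name `normalize_path` and the sample paths are illustrative assumptions rather than part of the package API.

```python
import re

# Same pattern as _INDEXED_BRACKET in this commit: matches "[0]", "[13]", etc.
_INDEXED_BRACKET = re.compile(r"\[\d+\]")


def normalize_path(path: str) -> str:
    """Turn a runtime path (items[0].tags) into schema notation (items[].tags)."""
    return _INDEXED_BRACKET.sub("[]", path)


# Schema-notation paths, as collect_dict_paths would produce them.
dict_paths = frozenset({"items[].tags"})

single = normalize_path("items[0].tags")
nested = normalize_path("grid[2][13].tags")
plain = normalize_path("names.primary")
is_leaf = single in dict_paths
```

Each indexed bracket is replaced independently, so lists of lists normalize to `grid[][].tags`, and paths without indices pass through unchanged.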

packages/overture-schema-codegen/src/overture/schema/codegen/example_loader.py

Lines changed: 61 additions & 6 deletions
@@ -1,6 +1,7 @@
11
"""Load and process example data from theme pyproject.toml files."""
22

33
import logging
4+
import re
45
import sys
56
from dataclasses import dataclass
67
from pathlib import Path
@@ -10,11 +11,12 @@
1011
from pydantic.fields import FieldInfo
1112

1213
from .model_extraction import resolve_field_alias
14+
from .specs import FieldSpec
1315
from .type_analyzer import single_literal_value
1416

1517
log = logging.getLogger(__name__)
1618

17-
__all__ = ["ExampleRecord", "load_examples", "validate_example"]
19+
__all__ = ["ExampleRecord", "collect_dict_paths", "load_examples", "validate_example"]
1820

1921
# tomllib is stdlib from 3.11+; tomli is the backport for 3.10.
2022
try:
@@ -140,19 +142,63 @@ def validate_example(
140142

141143

142144
_DEFAULT_SKIP_KEYS: frozenset[str] = frozenset()
145+
_DEFAULT_DICT_PATHS: frozenset[str] = frozenset()
143146

147+
_INDEXED_BRACKET = re.compile(r"\[\d+\]")
144148

145-
def _flatten_value(prefix: str, value: object) -> list[tuple[str, Any]]:
149+
150+
def _normalize_path(path: str) -> str:
151+
"""Replace indexed brackets with empty brackets for dict_paths matching.
152+
153+
``collect_dict_paths`` produces schema-notation paths like
154+
``items[].tags``, while ``_flatten_value`` builds runtime paths like
155+
``items[0].tags``. Normalizing before membership testing makes them
156+
comparable.
157+
"""
158+
return _INDEXED_BRACKET.sub("[]", path)
159+
160+
161+
def collect_dict_paths(fields: list[FieldSpec], prefix: str = "") -> frozenset[str]:
162+
"""Collect dot-paths of dict-typed fields from a FieldSpec tree.
163+
164+
Walks the ``FieldSpec.model`` tree (same structure the renderer walks
165+
for inline expansion) and returns paths where ``type_info.is_dict``
166+
is True. These paths tell ``flatten_example`` which dicts are maps
167+
(keep as leaf) vs. models (recurse into).
168+
169+
Parameters
170+
----------
171+
fields : list[FieldSpec]
172+
Fields to walk.
173+
prefix : str
174+
Dot-notation prefix accumulated from parent fields.
175+
"""
176+
paths: set[str] = set()
177+
for f in fields:
178+
path = f"{prefix}{f.name}" if prefix else f.name
179+
if f.type_info.is_dict:
180+
paths.add(path)
181+
elif f.model and not f.starts_cycle:
182+
suffix = "[]" * f.type_info.list_depth if f.type_info.is_list else ""
183+
paths |= collect_dict_paths(f.model.fields, f"{path}{suffix}.")
184+
return frozenset(paths)
185+
186+
187+
def _flatten_value(
188+
prefix: str, value: object, dict_paths: frozenset[str]
189+
) -> list[tuple[str, Any]]:
146190
"""Recursively flatten a value into dot/bracket-notation rows."""
147191
if isinstance(value, dict):
192+
if _normalize_path(prefix) in dict_paths:
193+
return [(prefix, value)]
148194
result: list[tuple[str, Any]] = []
149195
for k, v in value.items():
150-
result.extend(_flatten_value(f"{prefix}.{k}", v))
196+
result.extend(_flatten_value(f"{prefix}.{k}", v, dict_paths))
151197
return result
152198
if isinstance(value, list) and value and isinstance(value[0], (dict, list)):
153199
result = []
154200
for i, item in enumerate(value):
155-
result.extend(_flatten_value(f"{prefix}[{i}]", item))
201+
result.extend(_flatten_value(f"{prefix}[{i}]", item, dict_paths))
156202
return result
157203
return [(prefix, value)]
158204

@@ -161,19 +207,24 @@ def flatten_example(
161207
raw: dict[str, Any],
162208
*,
163209
skip_keys: frozenset[str] = _DEFAULT_SKIP_KEYS,
210+
dict_paths: frozenset[str] = _DEFAULT_DICT_PATHS,
164211
) -> list[tuple[str, Any]]:
165212
"""Flatten nested example dict to dot-notation key-value pairs.
166213
167214
Nested dicts become ``"parent.child"``; lists of dicts become
168215
``"parent[0].child"``; lists of lists of dicts use double-index
169216
notation ``"parent[0][1].child"``. Keys in *skip_keys* are dropped
170217
at the top level only. Plain lists are kept as values.
218+
219+
Dicts at paths in *dict_paths* are kept as leaf values instead of
220+
being recursed into. Use ``collect_dict_paths`` to compute this set
221+
from a FieldSpec tree.
171222
"""
172223
result: list[tuple[str, Any]] = []
173224
for key, value in raw.items():
174225
if key in skip_keys:
175226
continue
176-
result.extend(_flatten_value(key, value))
227+
result.extend(_flatten_value(key, value, dict_paths))
177228
return result
178229

179230

@@ -257,6 +308,7 @@ def load_examples(
257308
*,
258309
pyproject_source: type | None = None,
259310
model_fields: dict[str, FieldInfo] | None = None,
311+
dict_paths: frozenset[str] = _DEFAULT_DICT_PATHS,
260312
) -> list[ExampleRecord]:
261313
"""Load examples for a model, flattened and ordered by *field_names*.
262314
@@ -278,6 +330,9 @@ def load_examples(
278330
model_fields : dict[str, FieldInfo] or None
279331
Field info dict for Literal injection. If None, infers
280332
from validation_type if it's a BaseModel class.
333+
dict_paths : frozenset[str]
334+
Dot-paths of dict-typed fields to keep as leaf values.
335+
Use ``collect_dict_paths`` to compute from a FieldSpec tree.
281336
"""
282337
source_type = pyproject_source if pyproject_source is not None else validation_type
283338
if not isinstance(source_type, type):
@@ -308,7 +363,7 @@ def load_examples(
308363
e,
309364
)
310365
continue
311-
flat_rows = flatten_example(denulled)
366+
flat_rows = flatten_example(denulled, dict_paths=dict_paths)
312367
ordered_rows = order_example_rows(flat_rows, field_names)
313368
records.append(ExampleRecord(rows=ordered_rows))
314369
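
For reviewers who want to see which paths the tree walk produces, the sketch below runs the same logic as `collect_dict_paths` against stand-in dataclasses. `TypeInfoStub`, `FieldStub`, and `ModelStub` are hypothetical: they model only the attributes the function reads (`is_dict`, `is_list`, `list_depth`, `model`, `starts_cycle`) and are not the real `specs` classes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TypeInfoStub:  # hypothetical stand-in for the real type info
    is_dict: bool = False
    is_list: bool = False
    list_depth: int = 0


@dataclass
class FieldStub:  # hypothetical stand-in for FieldSpec
    name: str
    type_info: TypeInfoStub = field(default_factory=TypeInfoStub)
    model: Optional["ModelStub"] = None
    starts_cycle: bool = False


@dataclass
class ModelStub:  # hypothetical stand-in for the nested model spec
    fields: list[FieldStub] = field(default_factory=list)


def collect_dict_paths(fields: list[FieldStub], prefix: str = "") -> frozenset[str]:
    """Return schema-notation paths of dict-typed fields in the tree."""
    paths: set[str] = set()
    for f in fields:
        path = f"{prefix}{f.name}"
        if f.type_info.is_dict:
            paths.add(path)
        elif f.model is not None and not f.starts_cycle:
            # One "[]" per list level, so list[list[Model]] yields "name[][]".
            suffix = "[]" * f.type_info.list_depth if f.type_info.is_list else ""
            paths |= collect_dict_paths(f.model.fields, f"{path}{suffix}.")
    return frozenset(paths)


item_model = ModelStub(
    fields=[FieldStub("tags", TypeInfoStub(is_dict=True)), FieldStub("name")]
)
names_model = ModelStub(fields=[FieldStub("common", TypeInfoStub(is_dict=True))])

tree = [
    FieldStub("tags", TypeInfoStub(is_dict=True)),
    FieldStub("items", TypeInfoStub(is_list=True, list_depth=1), model=item_model),
    FieldStub("names", model=names_model),
    # A field that starts a cycle is not descended into.
    FieldStub("parent", model=item_model, starts_cycle=True),
]

paths = collect_dict_paths(tree)
```

The `parent` field shows the cycle guard: its model contains a dict-typed `tags`, but no `parent.tags` path is collected.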

packages/overture-schema-codegen/src/overture/schema/codegen/markdown_pipeline.py

Lines changed: 3 additions & 1 deletion
@@ -13,7 +13,7 @@
1313
import overture.schema.system.primitive as _system_primitive
1414
from overture.schema.system.primitive import GeometryType
1515

16-
from .example_loader import ExampleRecord, load_examples
16+
from .example_loader import ExampleRecord, collect_dict_paths, load_examples
1717
from .link_computation import LinkContext
1818
from .markdown_renderer import (
1919
render_enum,
@@ -74,12 +74,14 @@ def _load_model_examples(
7474
if not pyproject_source:
7575
return None
7676
field_names = [f.name for f in spec.fields]
77+
dict_paths = collect_dict_paths(spec.fields)
7778
examples = load_examples(
7879
validation_type,
7980
spec.name,
8081
field_names,
8182
pyproject_source=pyproject_source,
8283
model_fields=model_fields,
84+
dict_paths=dict_paths,
8385
)
8486
return examples or None
8587

packages/overture-schema-codegen/src/overture/schema/codegen/markdown_renderer.py

Lines changed: 3 additions & 2 deletions
@@ -1,6 +1,7 @@
11
"""Markdown renderer for Pydantic model documentation."""
22

33
import functools
4+
import json
45
import re
56
from collections.abc import Callable
67
from dataclasses import dataclass
@@ -178,11 +179,11 @@ def _format_example_value(value: object) -> str:
178179
return f"`{_truncate(value)}`"
179180

180181
if isinstance(value, list):
181-
items = ", ".join(repr(item) for item in value)
182+
items = ", ".join(json.dumps(item) for item in value)
182183
return f"`{_truncate(f'[{items}]')}`"
183184

184185
if isinstance(value, dict):
185-
pairs = ", ".join(f"{k}: {v}" for k, v in value.items())
186+
pairs = ", ".join(f"{json.dumps(k)}: {json.dumps(v)}" for k, v in value.items())
186187
return f"`{_truncate(f'{{{pairs}}}')}`"
187188

188189
return f"`{value}`"
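
The commit message does not call out this renderer change, so a short comparison may help. The snippet below contrasts the old string-building expressions with the new `json.dumps` ones on made-up sample values; it does not include `_truncate` or the surrounding backticks.

```python
import json

value = {"color": "red", "count": 2, "active": True, "note": None}

# Old dict formatting: bare keys and Python literals.
old_pairs = ", ".join(f"{k}: {v}" for k, v in value.items())

# New dict formatting: JSON-quoted keys and JSON literals.
new_pairs = ", ".join(
    f"{json.dumps(k)}: {json.dumps(v)}" for k, v in value.items()
)

items = ["a", "b"]
old_items = ", ".join(repr(item) for item in items)  # Python-style single quotes
new_items = ", ".join(json.dumps(item) for item in items)  # JSON double quotes
```

Note that `json.dumps` raises `TypeError` for values it cannot serialize, which `repr` and f-string formatting would not; whether that can occur here depends on what the validated examples contain.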

packages/overture-schema-codegen/src/overture/schema/codegen/reverse_references.py

Lines changed: 4 additions & 4 deletions
@@ -85,6 +85,8 @@ def _visit(node: TypeInfo) -> None:
8585
referrer_kind,
8686
)
8787

88+
# ENUM, MODEL, pydantic (PRIMITIVE), and UNION are mutually
89+
# exclusive by TypeKind.
8890
if (
8991
node.kind in (TypeKind.ENUM, TypeKind.MODEL)
9092
and node.source_type is not None
@@ -94,13 +96,11 @@ def _visit(node: TypeInfo) -> None:
9496
referrer,
9597
referrer_kind,
9698
)
97-
98-
if is_pydantic_type(node):
99+
elif is_pydantic_type(node):
99100
add_reference(
100101
TypeIdentity.of(node.source_type), referrer, referrer_kind
101102
)
102-
103-
if node.union_members is not None:
103+
elif node.union_members is not None:
104104
for member_cls in node.union_members:
105105
add_reference(
106106
TypeIdentity.of(member_cls),

packages/overture-schema-codegen/src/overture/schema/codegen/type_collection.py

Lines changed: 9 additions & 8 deletions
@@ -87,6 +87,9 @@ def _collect_from_type_info(ti: TypeInfo) -> None:
8787
"""
8888

8989
def _visit(node: TypeInfo) -> None:
90+
# UNION, ENUM, and pydantic (PRIMITIVE) are mutually exclusive
91+
# by TypeKind. NewType extraction is orthogonal -- a node can be
92+
# a NewType-wrapped ENUM, for instance.
9093
if node.kind == TypeKind.UNION and node.union_members:
9194
# Walk each member's fields for supplementary types.
9295
# Members that are also top-level feature specs are skipped
@@ -95,11 +98,15 @@ def _visit(node: TypeInfo) -> None:
9598
member_spec = extract_model(member_cls)
9699
expand_model_tree(member_spec)
97100
_collect_from_model(member_spec)
98-
99-
if node.kind == TypeKind.ENUM and node.source_type is not None:
101+
elif node.kind == TypeKind.ENUM and node.source_type is not None:
100102
enum_id = TypeIdentity.of(node.source_type)
101103
if enum_id not in all_specs:
102104
all_specs[enum_id] = extract_enum(node.source_type)
105+
elif is_pydantic_type(node):
106+
assert node.source_type is not None # guaranteed by is_pydantic_type
107+
pid = TypeIdentity.of(node.source_type)
108+
if pid not in all_specs:
109+
all_specs[pid] = extract_pydantic_type(node.source_type)
103110

104111
# Semantic NewTypes always get extracted, including intermediate
105112
# NewTypes in the wrapping chain (e.g., Id wraps NoWhitespaceString
@@ -115,12 +122,6 @@ def _visit(node: TypeInfo) -> None:
115122
if newly_registered:
116123
_collect_inner_newtypes(node.newtype_ref)
117124

118-
if is_pydantic_type(node):
119-
assert node.source_type is not None # guaranteed by is_pydantic_type
120-
pid = TypeIdentity.of(node.source_type)
121-
if pid not in all_specs:
122-
all_specs[pid] = extract_pydantic_type(node.source_type)
123-
124125
walk_type_info(ti, _visit)
125126

126127
def _collect_from_fields(fields: list[FieldSpec]) -> None:

packages/overture-schema-codegen/src/overture/schema/codegen/union_extraction.py

Lines changed: 5 additions & 5 deletions
@@ -76,12 +76,12 @@ def extract_discriminator(
7676
return disc_field_name, mapping or None
7777

7878

79-
_TypeIdentity = tuple[str, TypeKind, bool, int]
80-
_FieldKey = tuple[str, _TypeIdentity]
79+
_TypeShape = tuple[str, TypeKind, bool, int]
80+
_FieldKey = tuple[str, _TypeShape]
8181

8282

83-
def _type_identity(ti: TypeInfo) -> _TypeIdentity:
84-
"""Stable identity for dedup excludes source_type which can vary across members."""
83+
def _type_shape(ti: TypeInfo) -> _TypeShape:
84+
"""Structural shape for dedup -- excludes source_type which varies across members."""
8585
return (ti.base_type, ti.kind, ti.is_optional, ti.list_depth)
8686

8787
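
As a rough illustration of why the key is a structural shape rather than an identity, the sketch below deduplicates fields by `(name, shape)`. The `Kind` enum, the variant names, and the accumulation into plain tuples are assumptions made for the example; the real code uses `TypeKind` and builds `AnnotatedField` objects.

```python
from enum import Enum


class Kind(Enum):  # hypothetical stand-in for TypeKind
    PRIMITIVE = "primitive"
    MODEL = "model"


# Mirrors _TypeShape: (base_type, kind, is_optional, list_depth).
Shape = tuple[str, Kind, bool, int]

# (field name, declaring variant, shape); all sample values are made up.
declared: list[tuple[str, str, Shape]] = [
    ("speed", "RoadSegment", ("int", Kind.PRIMITIVE, True, 0)),
    ("speed", "RailSegment", ("int", Kind.PRIMITIVE, True, 0)),
    ("speed", "WaterSegment", ("float", Kind.PRIMITIVE, True, 0)),
]

# Fields with the same name and shape collapse into one entry that
# remembers every variant declaring them.
seen: dict[tuple[str, Shape], tuple[str, ...]] = {}
for name, variant, shape in declared:
    key = (name, shape)
    seen[key] = seen.get(key, ()) + (variant,)
```

Two variants declaring `speed` with the same shape share one entry, while the `float` variant gets its own, even though none of the entries say anything about which source class the type came from.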

@@ -117,7 +117,7 @@ def extract_union(
117117
for fs in member_spec.fields:
118118
if fs.name in shared_field_names:
119119
continue
120-
key = (fs.name, _type_identity(fs.type_info))
120+
key = (fs.name, _type_shape(fs.type_info))
121121
existing = seen.get(key)
122122
prior_sources = existing.variant_sources or () if existing else ()
123123
seen[key] = AnnotatedField(
