
Commit ab61642

soooojinlee, claude, and franciscojavierarceo authored
feat: Support nested collection types (Array/Set of Array/Set) (#5947) (#6132)
* feat: Support nested collection types (Array/Set of Array/Set) (#5947)

  Add support for 2-level nested collection types: Array(Array(T)), Array(Set(T)), Set(Array(T)), and Set(Set(T)).

  - Add 4 generic ValueType enums (LIST_LIST, LIST_SET, SET_LIST, SET_SET) backed by RepeatedValue proto messages
  - Persist inner type info in Field tags (feast:nested_inner_type), following the existing Struct schema tag pattern
  - Handle edge cases: empty inner collections, Set dedup at inner level, depth limit enforcement (2 levels max)
  - Add proto/JSON/remote transport serialization support
  - Add 25 unit tests covering all combinations and edge cases

  Signed-off-by: Soojin Lee <lsjin0602@gmail.com>
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Signed-off-by: soojin <soojin@dable.io>

* fix: Fix remote online read for nested collection types and add docs

  - Fix remote online store read path to use declared feature types from FeatureView instead of ValueType.UNKNOWN, which fails for nested collection types (LIST_LIST, LIST_SET, SET_LIST, SET_SET)
  - Add Nested Collection Types section to type-system.md with type table, usage examples, and empty-inner-collection→None limitation docs

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Signed-off-by: soojin <soojin@dable.io>

* fix: Fix JSON deserialization, schema inference, and silent fallback for nested collection types

  - Add nested list handling in proto_json from_json_object (list of lists was raising ParseError since no branch matched list-typed elements)
  - Fix pa_to_feast_value_type to recognize nested list PyArrow types (list<item: list<item: T>>) instead of crashing with KeyError
  - Replace silent String fallback in _str_to_feast_type with ValueError to surface corrupted tag values instead of silently losing type info
  - Strengthen test coverage: type str roundtrip, inner value verification, multi-value batch, proto JSON roundtrip, PyArrow nested type inference

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Signed-off-by: soojin <soojin@dable.io>

* fix: Fix mypy type error in nested collection proto construction

  Use getattr/CopyFrom instead of **dict unpacking for ProtoValue construction to satisfy mypy's strict type checking.

  Signed-off-by: soojin <soojin@dable.io>
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Signed-off-by: soojin <soojin@dable.io>

* fix: Fix equality comparison for nested types and JSON deserialization edge case

  - Add __eq__/__hash__ to Array and Set so inner element types are compared (previously Array(Array(String)) == Array(Array(Int32)) was True)
  - Fix nested collection detection in proto_json when first element is None by using any() fallback instead of only checking value[0]

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Signed-off-by: soojin <soojin@dable.io>

* feat: Remove depth limit for nested collection types and improve test coverage

  - Remove 2-level depth restriction from Array and Set constructors to support unbounded nesting per maintainer request
  - Make _convert_nested_collection_to_proto() recursive for 3+ levels
  - Update error message for nested type inference to guide users toward explicit Field dtype declaration
  - Add 3+ level tests for Field roundtrip, str roundtrip, and PyArrow conversion

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Signed-off-by: soojin <soojin@dable.io>

* refactor: Replace combinatorial nested collection enums with recursive VALUE_LIST/VALUE_SET

  Replace 4 combinatorial enum values (LIST_LIST=36, LIST_SET=37, SET_LIST=38, SET_SET=39) with 2 recursive enum values (VALUE_LIST=40, VALUE_SET=41) that use RepeatedValue to enable unlimited nesting depth. This is a breaking change for an unreleased feature, as suggested in PR #6132 review.

  Key changes:

  - Proto: Remove 4 enum/oneof fields, add VALUE_LIST/VALUE_SET with reserved 36-39
  - Python: Update ValueType enum, type system, serialization, field persistence
  - JSON: Update proto_json encode/decode for new field names
  - Tests: Rewrite all nested collection tests (204 tests passing)
  - Docs: Update type-system.md for recursive design

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Signed-off-by: soojin <soojin@dable.io>

* fix: Preserve inner element types in PyArrow schema inference and optimize JSON nested list detection

  - Add _parse_pa_type_str() to reconstruct PyArrow types from type strings for VALUE_LIST/VALUE_SET, avoiding lossy round-trip through placeholder
  - Optimize proto_json nested list detection: only scan with any() when first element is None, avoiding O(n) scan for flat lists
  - Add warning log for unrecognized PyArrow type strings

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Signed-off-by: soojin <soojin@dable.io>

* fix: Add np.ndarray support in nested collection proto conversion and clarify placeholder pyarrow type

  - Add np.ndarray to isinstance check in _convert_nested_collection_to_proto to fix KeyError for 3+ level nesting during materialization (PyArrow produces np.ndarray, not Python list)
  - Add comment clarifying VALUE_LIST/VALUE_SET placeholder in feast_value_type_to_pa

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Signed-off-by: soojin <soojin@dable.io>

---------

Signed-off-by: Soojin Lee <lsjin0602@gmail.com>
Signed-off-by: soojin <soojin@dable.io>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Francisco Javier Arceo <arceofrancisco@gmail.com>
1 parent b3dcde7 commit ab61642

14 files changed

Lines changed: 657 additions & 81 deletions


docs/reference/type-system.md

Lines changed: 47 additions & 0 deletions
@@ -86,6 +86,25 @@ All primitive types (except `Map` and `Json`) have corresponding set types for s
 - Set types are best suited for **online serving** use cases where feature values are written as Python sets and retrieved via `get_online_features`.
 {% endhint %}

+### Nested Collection Types
+
+Feast supports arbitrarily nested collections using a recursive `VALUE_LIST` / `VALUE_SET` design. The outer container determines the proto enum (`VALUE_LIST` for `Array(…)`, `VALUE_SET` for `Set(…)`), while the full inner type structure is persisted via a mandatory `feast:nested_inner_type` Field tag.
+
+| Feast Type | Python Type | ValueType | Description |
+|------------|-------------|-----------|-------------|
+| `Array(Array(T))` | `List[List[T]]` | `VALUE_LIST` | List of lists |
+| `Array(Set(T))` | `List[List[T]]` | `VALUE_LIST` | List of sets |
+| `Set(Array(T))` | `List[List[T]]` | `VALUE_SET` | Set of lists |
+| `Set(Set(T))` | `List[List[T]]` | `VALUE_SET` | Set of sets |
+| `Array(Array(Array(T)))` | `List[List[List[T]]]` | `VALUE_LIST` | 3-level nesting |
+
+Where `T` is any supported primitive type (Int32, Int64, Float32, Float64, String, Bytes, Bool, UnixTimestamp) or another nested collection type.
+
+**Notes:**
+- Nesting depth is **unlimited**. `Array(Array(Array(T)))`, `Set(Array(Set(T)))`, etc. are all supported.
+- Inner type information is preserved via Field tags (`feast:nested_inner_type`) and restored during deserialization. This tag is mandatory for nested collection types.
+- Empty inner collections (`[]`) are stored as empty proto values and round-trip as `None`. For example, `[[1, 2], [], [3]]` becomes `[[1, 2], None, [3]]` after a write-read cycle.
+
 ### Map Types

 Map types allow storing dictionary-like data structures:
@@ -233,6 +252,10 @@ user_features = FeatureView(
         Field(name="metadata", dtype=Map),
         Field(name="activity_log", dtype=Array(Map)),

+        # Nested collection types
+        Field(name="weekly_scores", dtype=Array(Array(Float64))),
+        Field(name="unique_tags_per_category", dtype=Array(Set(String))),
+
         # JSON type
         Field(name="raw_event", dtype=Json),

@@ -290,6 +313,30 @@ related_sessions = [uuid.uuid4(), uuid.uuid4(), uuid.uuid4()]
 unique_devices = {uuid.uuid4(), uuid.uuid4()}
 ```

+### Nested Collection Type Usage Examples
+
+Nested collections allow storing multi-dimensional data with unlimited depth:
+
+```python
+# List of lists — e.g., weekly score history per user
+weekly_scores = [[85.0, 90.5, 78.0], [92.0, 88.5], [95.0, 91.0, 87.5]]
+
+# List of sets — e.g., unique tags assigned per category
+unique_tags_per_category = [["python", "ml"], ["rust", "systems"], ["python", "web"]]
+
+# 3-level nesting — e.g., multi-dimensional matrices
+Field(name="tensor", dtype=Array(Array(Array(Float64))))
+
+# Mixed nesting
+Field(name="grouped_tags", dtype=Array(Set(Array(String))))
+```
+
+**Limitation:** Empty inner collections round-trip as `None`:
+```python
+# Input:  [[1, 2], [], [3]]
+# Output: [[1, 2], None, [3]]  (empty [] becomes None after write-read cycle)
+```
+
 ### Map Type Usage Examples

 Maps can store complex nested data structures:
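
The documented limitation means a reader cannot tell an inner collection that was written empty from one that was absent. The following is a minimal client-side sketch for callers who are willing to treat every inner `None` as an empty list. `restore_empty_inner` and its `depth` parameter are hypothetical names for illustration and are not part of Feast.

```python
from typing import Any, List, Optional


def restore_empty_inner(value: Optional[List[Any]], depth: int) -> Optional[List[Any]]:
    """Replace None with [] at inner nesting levels of a value read back from the store.

    Hypothetical helper, not part of Feast. `depth` is the number of collection
    levels declared for the field, e.g. 2 for Array(Array(Int64)).
    """
    if value is None or depth <= 1:
        # A top-level None (missing feature) and leaf-level lists are left as-is.
        return value
    restored = []
    for inner in value:
        if inner is None:
            # Assumption: an inner None was originally an empty collection.
            restored.append([])
        else:
            restored.append(restore_empty_inner(inner, depth - 1))
    return restored


read_back = [[1, 2], None, [3]]
print(restore_empty_inner(read_back, depth=2))  # [[1, 2], [], [3]]
```

The trade-off is that a genuinely missing inner value is also turned into `[]`, so this only fits data where that distinction does not matter.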

protos/feast/types/Value.proto

Lines changed: 4 additions & 0 deletions
@@ -63,6 +63,8 @@ message ValueType {
         TIME_UUID_LIST = 39;
         UUID_SET = 40;
         TIME_UUID_SET = 41;
+        VALUE_LIST = 42;
+        VALUE_SET = 43;
     }
 }

@@ -108,6 +110,8 @@ message Value {
         StringList time_uuid_list_val = 39;
         StringSet uuid_set_val = 40;
         StringSet time_uuid_set_val = 41;
+        RepeatedValue list_val = 42;
+        RepeatedValue set_val = 43;
     }
 }

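
Both new `oneof` fields are typed as `RepeatedValue`, whose elements are themselves `Value` messages (the `proto_json.py` hunk iterates `repeated.val` and calls `rv.val.add()`), and that is what lets the nesting recurse. Below is a rough sketch using plain dataclasses as stand-ins for the generated proto classes. The `to_value` and `depth` helpers are hypothetical, and wrapping every level in `list_val` down to scalar leaves is a simplification that may not match how the SDK encodes the innermost level.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class RepeatedValue:
    # Stand-in for the proto message: a repeated field of Value.
    val: List["Value"] = field(default_factory=list)


@dataclass
class Value:
    # Stand-in for the proto oneof: at most one of these is set.
    int64_val: Optional[int] = None
    list_val: Optional[RepeatedValue] = None


def to_value(py: Union[int, list, None]) -> Value:
    """Convert a nested Python list of ints into the stand-in Value tree."""
    if py is None:
        return Value()  # nothing set, analogous to a null value
    if isinstance(py, list):
        return Value(list_val=RepeatedValue([to_value(item) for item in py]))
    return Value(int64_val=py)


def depth(value: Value) -> int:
    """Count how many list_val levels wrap the first leaf."""
    if value.list_val is None:
        return 0
    if not value.list_val.val:
        return 1
    return 1 + depth(value.list_val.val[0])


tree = to_value([[1, 2], [3]])
print(depth(tree))  # 2
```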

sdk/python/feast/field.py

Lines changed: 31 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@
2323
from feast.value_type import ValueType
2424

2525
STRUCT_SCHEMA_TAG = "feast:struct_schema"
26+
NESTED_COLLECTION_INNER_TYPE_TAG = "feast:nested_inner_type"
2627

2728

2829
@typechecked
@@ -118,7 +119,7 @@ def __str__(self):
118119

119120
def to_proto(self) -> FieldProto:
120121
"""Converts a Field object to its protobuf representation."""
121-
from feast.types import Array
122+
from feast.types import Array, Set
122123

123124
value_type = self.dtype.to_value_type()
124125
vector_search_metric = self.vector_search_metric or ""
@@ -128,6 +129,11 @@ def to_proto(self) -> FieldProto:
128129
tags[STRUCT_SCHEMA_TAG] = _serialize_struct_schema(self.dtype)
129130
elif isinstance(self.dtype, Array) and isinstance(self.dtype.base_type, Struct):
130131
tags[STRUCT_SCHEMA_TAG] = _serialize_struct_schema(self.dtype.base_type)
132+
# Persist nested collection type info in tags
133+
if isinstance(self.dtype, (Array, Set)) and isinstance(
134+
self.dtype.base_type, (Array, Set)
135+
):
136+
tags[NESTED_COLLECTION_INNER_TYPE_TAG] = _feast_type_to_str(self.dtype)
131137
return FieldProto(
132138
name=self.name,
133139
value_type=value_type.value,
@@ -155,17 +161,24 @@ def from_proto(cls, field_proto: FieldProto):
155161
# Reconstruct Struct type from persisted schema in tags
156162
from feast.types import Array
157163

164+
internal_tags = {STRUCT_SCHEMA_TAG, NESTED_COLLECTION_INNER_TYPE_TAG}
158165
dtype: FeastType
159166
if value_type == ValueType.STRUCT and STRUCT_SCHEMA_TAG in tags:
160167
dtype = _deserialize_struct_schema(tags[STRUCT_SCHEMA_TAG])
161-
user_tags = {k: v for k, v in tags.items() if k != STRUCT_SCHEMA_TAG}
168+
user_tags = {k: v for k, v in tags.items() if k not in internal_tags}
162169
elif value_type == ValueType.STRUCT_LIST and STRUCT_SCHEMA_TAG in tags:
163170
inner_struct = _deserialize_struct_schema(tags[STRUCT_SCHEMA_TAG])
164171
dtype = Array(inner_struct)
165-
user_tags = {k: v for k, v in tags.items() if k != STRUCT_SCHEMA_TAG}
172+
user_tags = {k: v for k, v in tags.items() if k not in internal_tags}
173+
elif (
174+
value_type in (ValueType.VALUE_LIST, ValueType.VALUE_SET)
175+
and NESTED_COLLECTION_INNER_TYPE_TAG in tags
176+
):
177+
dtype = _str_to_feast_type(tags[NESTED_COLLECTION_INNER_TYPE_TAG])
178+
user_tags = {k: v for k, v in tags.items() if k not in internal_tags}
166179
else:
167180
dtype = from_value_type(value_type=value_type)
168-
user_tags = tags
181+
user_tags = {k: v for k, v in tags.items() if k not in internal_tags}
169182

170183
return cls(
171184
name=field_proto.name,
@@ -198,6 +211,7 @@ def _feast_type_to_str(feast_type: FeastType) -> str:
198211
from feast.types import (
199212
Array,
200213
PrimitiveFeastType,
214+
Set,
201215
)
202216

203217
if isinstance(feast_type, PrimitiveFeastType):
@@ -209,6 +223,8 @@ def _feast_type_to_str(feast_type: FeastType) -> str:
209223
return json.dumps({"__struct__": nested})
210224
elif isinstance(feast_type, Array):
211225
return f"Array({_feast_type_to_str(feast_type.base_type)})"
226+
elif isinstance(feast_type, Set):
227+
return f"Set({_feast_type_to_str(feast_type.base_type)})"
212228
else:
213229
return str(feast_type)
214230

@@ -218,6 +234,7 @@ def _str_to_feast_type(type_str: str) -> FeastType:
218234
from feast.types import (
219235
Array,
220236
PrimitiveFeastType,
237+
Set,
221238
)
222239

223240
# Check if it's an Array type
@@ -226,6 +243,12 @@ def _str_to_feast_type(type_str: str) -> FeastType:
226243
base_type = _str_to_feast_type(inner)
227244
return Array(base_type)
228245

246+
# Check if it's a Set type
247+
if type_str.startswith("Set(") and type_str.endswith(")"):
248+
inner = type_str[4:-1]
249+
base_type = _str_to_feast_type(inner)
250+
return Set(base_type)
251+
229252
# Check if it's a nested Struct (JSON encoded)
230253
if type_str.startswith("{"):
231254
try:
@@ -243,9 +266,10 @@ def _str_to_feast_type(type_str: str) -> FeastType:
243266
try:
244267
return PrimitiveFeastType[type_str]
245268
except KeyError:
246-
from feast.types import String
247-
248-
return String
269+
raise ValueError(
270+
f"Unknown FeastType: {type_str!r}. "
271+
f"Valid primitive types: {[t.name for t in PrimitiveFeastType]}"
272+
)
249273

250274

251275
def _serialize_struct_schema(struct_type: Struct) -> str:
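
The tag round trip in this file relies on two mirrored recursive helpers: one wraps the inner type string in `Array(...)` or `Set(...)`, the other strips one wrapper per call. Here is a self-contained sketch of that idea, using simplified stand-in classes rather than the real `feast.types` ones. `Primitive`, `type_to_str`, and `str_to_type` are illustrative names only, and the frozen dataclasses get structural `__eq__`/`__hash__` for free, which is the behavior the commit message describes adding to `Array` and `Set`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class Primitive(Enum):
    # Tiny stand-in for PrimitiveFeastType; only two members for brevity.
    STRING = 1
    INT64 = 2


@dataclass(frozen=True)
class Array:
    base_type: "FeastType"


@dataclass(frozen=True)
class Set:
    base_type: "FeastType"


FeastType = Union[Primitive, Array, Set]


def type_to_str(t: FeastType) -> str:
    if isinstance(t, Array):
        return f"Array({type_to_str(t.base_type)})"
    if isinstance(t, Set):
        return f"Set({type_to_str(t.base_type)})"
    return t.name


def str_to_type(s: str) -> FeastType:
    # Strip one wrapper per call; the outermost parentheses enclose everything else.
    if s.startswith("Array(") and s.endswith(")"):
        return Array(str_to_type(s[len("Array("):-1]))
    if s.startswith("Set(") and s.endswith(")"):
        return Set(str_to_type(s[len("Set("):-1]))
    try:
        return Primitive[s]
    except KeyError:
        # Fail loudly rather than silently falling back to a default type.
        raise ValueError(f"Unknown type string: {s!r}") from None


dtype = Array(Set(Array(Primitive.STRING)))
encoded = type_to_str(dtype)
print(encoded)  # Array(Set(Array(STRING)))
assert str_to_type(encoded) == dtype
```

Raising on an unknown leaf mirrors the last hunk above, where the silent `String` fallback is replaced by a `ValueError`.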

sdk/python/feast/infra/offline_stores/offline_utils.py

Lines changed: 39 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
1+
import logging
12
import uuid
23
from dataclasses import asdict, dataclass
34
from datetime import datetime, timedelta, timezone
@@ -21,6 +22,7 @@
2122
from feast.repo_config import RepoConfig
2223
from feast.type_map import feast_value_type_to_pa
2324
from feast.utils import _get_requested_feature_views_to_features_dict, to_naive_utc
25+
from feast.value_type import ValueType
2426

2527
DEFAULT_ENTITY_DF_EVENT_TIMESTAMP_COL = "event_timestamp"
2628

@@ -241,6 +243,37 @@ def get_offline_store_from_config(offline_store_config: Any) -> OfflineStore:
241243
return offline_store_class()
242244

243245

246+
_PA_BASIC_TYPES = {
247+
"int32": pa.int32(),
248+
"int64": pa.int64(),
249+
"double": pa.float64(),
250+
"float": pa.float32(),
251+
"string": pa.string(),
252+
"binary": pa.binary(),
253+
"bool": pa.bool_(),
254+
"large_string": pa.large_string(),
255+
"null": pa.null(),
256+
}
257+
258+
259+
def _parse_pa_type_str(pa_type_str: str) -> pa.DataType:
260+
"""Parse a PyArrow type string to preserve inner element types for nested lists."""
261+
pa_type_str = pa_type_str.strip()
262+
if pa_type_str.startswith("list<item: ") and pa_type_str.endswith(">"):
263+
inner = pa_type_str[len("list<item: ") : -1]
264+
return pa.list_(_parse_pa_type_str(inner))
265+
if pa_type_str in _PA_BASIC_TYPES:
266+
return _PA_BASIC_TYPES[pa_type_str]
267+
if pa_type_str.startswith("timestamp"):
268+
return pa.timestamp("us")
269+
logger = logging.getLogger(__name__)
270+
logger.warning(
271+
"Unrecognized PyArrow type string '%s', falling back to pa.string()",
272+
pa_type_str,
273+
)
274+
return pa.string()
275+
276+
244277
def get_pyarrow_schema_from_batch_source(
245278
config: RepoConfig, batch_source: DataSource, timestamp_unit: str = "us"
246279
) -> Tuple[pa.Schema, List[str]]:
@@ -250,15 +283,12 @@ def get_pyarrow_schema_from_batch_source(
250283
pa_schema = []
251284
column_names = []
252285
for column_name, column_type in column_names_and_types:
253-
pa_schema.append(
254-
(
255-
column_name,
256-
feast_value_type_to_pa(
257-
batch_source.source_datatype_to_feast_value_type()(column_type),
258-
timestamp_unit=timestamp_unit,
259-
),
260-
)
261-
)
286+
value_type = batch_source.source_datatype_to_feast_value_type()(column_type)
287+
if value_type in (ValueType.VALUE_LIST, ValueType.VALUE_SET):
288+
pa_type = _parse_pa_type_str(column_type)
289+
else:
290+
pa_type = feast_value_type_to_pa(value_type, timestamp_unit=timestamp_unit)
291+
pa_schema.append((column_name, pa_type))
262292
column_names.append(column_name)
263293

264294
return pa.schema(pa_schema), column_names
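
`_parse_pa_type_str` peels one `list<item: ...>` wrapper per recursive call until it reaches a leaf type name. The same peeling can be shown without PyArrow as a standard-library-only sketch that returns the nesting depth and the leaf name instead of a `pa.DataType`. `split_pa_list_type` is a hypothetical helper, and it assumes type strings in the `list<item: T>` form that the hunk above matches on.

```python
from typing import Tuple

LIST_PREFIX = "list<item: "


def split_pa_list_type(pa_type_str: str) -> Tuple[int, str]:
    """Return (nesting depth, leaf type name) for a PyArrow-style type string."""
    s = pa_type_str.strip()
    depth = 0
    # Each iteration removes exactly one outer list<item: ...> wrapper.
    while s.startswith(LIST_PREFIX) and s.endswith(">"):
        s = s[len(LIST_PREFIX):-1].strip()
        depth += 1
    return depth, s


print(split_pa_list_type("list<item: list<item: double>>"))  # (2, 'double')
```

Any leaf that is not a known basic type or a `timestamp...` string ends up on the `pa.string()` fallback path in the diff, which now logs a warning.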

sdk/python/feast/infra/online_stores/remote.py

Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -106,6 +106,11 @@ def _proto_value_to_transport_value(proto_value: ValueProto) -> Any:
106106
if val_attr == "json_list_val":
107107
return list(getattr(proto_value, val_attr).val)
108108

109+
# Nested collection types use feast_value_type_to_python_type
110+
# which handles recursive conversion of RepeatedValue protos.
111+
if val_attr in ("list_val", "set_val"):
112+
return feast_value_type_to_python_type(proto_value)
113+
109114
# Map/Struct types are converted to Python dicts by
110115
# feast_value_type_to_python_type. Serialise them to JSON strings
111116
# so the server-side DataFrame gets VARCHAR columns instead of
@@ -204,6 +209,12 @@ def online_read(
204209
logger.debug("Able to retrieve the online features from feature server.")
205210
response_json = json.loads(response.text)
206211
event_ts = self._get_event_ts(response_json)
212+
# Build feature name -> ValueType mapping so we can reconstruct
213+
# complex types (nested collections, sets, etc.) that cannot be
214+
# inferred from raw JSON values alone.
215+
feature_type_map: Dict[str, ValueType] = {
216+
f.name: f.dtype.to_value_type() for f in table.features
217+
}
207218
# Iterating over results and converting the API results in column format to row format.
208219
result_tuples: List[
209220
Tuple[Optional[datetime], Optional[Dict[str, ValueProto]]]
@@ -223,13 +234,16 @@ def online_read(
223234
]
224235
== "PRESENT"
225236
):
237+
feature_value_type = feature_type_map.get(
238+
feature_name, ValueType.UNKNOWN
239+
)
226240
message = python_values_to_proto_values(
227241
[
228242
response_json["results"][index]["values"][
229243
feature_value_index
230244
]
231245
],
232-
ValueType.UNKNOWN,
246+
feature_value_type,
233247
)
234248
feature_values_dict[feature_name] = message[0]
235249
else:

sdk/python/feast/proto_json.py

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -63,6 +63,12 @@ def to_json_object(printer: _Printer, message: ProtoMessage) -> JsonObject:
6363
# to JSON. The parse back result will be different from original message.
6464
if which is None or which == "null_val":
6565
return None
66+
elif which in ("list_val", "set_val"):
67+
# Nested collection: RepeatedValue containing Values
68+
repeated = getattr(message, which)
69+
value = [
70+
printer._MessageToJsonObject(inner_val) for inner_val in repeated.val
71+
]
6672
elif "_list_" in which:
6773
value = list(getattr(message, which).val)
6874
else:
@@ -86,6 +92,19 @@ def from_json_object(
8692
if len(value) == 0:
8793
# Clear will mark the struct as modified so it will be created even if there are no values
8894
message.int64_list_val.Clear()
95+
elif isinstance(value[0], list) or (
96+
value[0] is None and any(isinstance(v, list) for v in value)
97+
):
98+
# Nested collection (list of lists).
99+
# Check any() to handle cases where the first element is None
100+
# (empty inner collections round-trip through proto as None).
101+
# Default to list_val since JSON transport loses the
102+
# outer/inner set distinction.
103+
rv = RepeatedValue()
104+
for inner in value:
105+
inner_val = rv.val.add()
106+
from_json_object(parser, inner, inner_val)
107+
message.list_val.CopyFrom(rv)
89108
elif isinstance(value[0], bool):
90109
message.bool_list_val.val.extend(value)
91110
elif isinstance(value[0], str):
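
The new branch condition in `from_json_object` decides in constant time for the common case and only scans the whole list when the first element is `None`. Here is a small standalone sketch of that predicate. `is_nested_list` is a hypothetical name, and the empty-list case is assumed to be handled by the earlier `len(value) == 0` branch, as in the hunk above.

```python
from typing import Any, List


def is_nested_list(value: List[Any]) -> bool:
    """Mirror the branch condition: decide whether a JSON list holds inner lists."""
    if not value:
        return False  # empty lists are handled by an earlier branch
    first = value[0]
    if isinstance(first, list):
        return True  # common case, decided without scanning
    # Only when the first element is None do we pay for a full scan.
    return first is None and any(isinstance(v, list) for v in value)


print(is_nested_list([[1, 2], [3]]))  # True
print(is_nested_list([None, [3]]))    # True
print(is_nested_list([1, 2, 3]))      # False
print(is_nested_list([None, None]))   # False
```

A list whose first element is a scalar is treated as flat even if a later element is a list, which matches the condition as written in the diff.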
