Feature Request: Support OpenSearch nested field type for correct multi-condition array filtering
Is your feature request related to a problem? Please describe.
When a filter combines conditions on several sub-fields of the same array-of-objects field, it returns false positives. This is expected OpenSearch behavior, because such fields default to the object type. Here's a minimal example:
from haystack import Document
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

# Assumes a reachable OpenSearch instance; adjust hosts and auth as needed.
store = OpenSearchDocumentStore(hosts="http://localhost:9200")

store.write_documents([
    Document(content="Doc A", meta={
        "references": [
            {"law": "bgb", "section": "81"},  # "bgb" is here...
            {"law": "stgb", "section": "1"},  # ...but "1" is in a different entry
        ]
    }),
    Document(content="Doc B", meta={
        "references": [
            {"law": "bgb", "section": "1"},  # both in the same entry: correct match
        ]
    }),
])

filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.references.law", "operator": "==", "value": "bgb"},
        {"field": "meta.references.section", "operator": "==", "value": "1"},
    ],
}
results = store.filter_documents(filters=filters)
# Returns BOTH documents; Doc A is a false positive
# Expected: only Doc B
Because array-of-object fields default to OpenSearch's object type, the array is internally de-normalized into parallel per-sub-field arrays. The filter conditions are then evaluated independently across the full array, not within the same array element. Doc A matches because "bgb" appears somewhere in references.law and "1" appears somewhere in references.section — even though they are in different objects.
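To make the flattening concrete, here is a small pure-Python simulation of the two matching semantics. It does not call OpenSearch; the helper names (flatten_object_array, matches_flat, matches_nested) are made up for illustration and only mimic the behavior described above.

```python
from collections import defaultdict


def flatten_object_array(field, items):
    """Mimic object-type storage: one value list per sub-field path."""
    flat = defaultdict(list)
    for item in items:
        for key, value in item.items():
            flat[f"{field}.{key}"].append(value)
    return dict(flat)


def matches_flat(flat_doc, conditions):
    """AND of equality checks, each evaluated against the whole flattened list."""
    return all(value in flat_doc.get(path, []) for path, value in conditions.items())


def matches_nested(items, conditions):
    """AND of equality checks that must all hold inside one array element."""
    return any(all(item.get(k) == v for k, v in conditions.items()) for item in items)


doc_a = [{"law": "bgb", "section": "81"}, {"law": "stgb", "section": "1"}]
doc_b = [{"law": "bgb", "section": "1"}]

flat_a = flatten_object_array("references", doc_a)
# flat_a == {"references.law": ["bgb", "stgb"], "references.section": ["81", "1"]}

print(matches_flat(flat_a, {"references.law": "bgb", "references.section": "1"}))  # True (false positive)
print(matches_nested(doc_a, {"law": "bgb", "section": "1"}))  # False
print(matches_nested(doc_b, {"law": "bgb", "section": "1"}))  # True
```

The flat check succeeds for Doc A because each condition is satisfied by a different element, which is exactly the false positive shown in the example.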
OpenSearch's nested field type and nested query exist specifically to solve this. Haystack currently has no way to use them.
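For reference, this is roughly what the raw OpenSearch mapping and query would look like for the example, written as Python dicts. It assumes the metadata field is stored at the top level of the index as references (without the meta. prefix) and that the sub-fields are keyword-typed; both are assumptions and may differ in a real index.

```python
# Index body that maps `references` as nested (assumed field name and sub-field types).
nested_mapping = {
    "mappings": {
        "properties": {
            "references": {
                "type": "nested",
                "properties": {
                    "law": {"type": "keyword"},
                    "section": {"type": "keyword"},
                },
            }
        }
    }
}

# Query body: both term clauses must match within the same nested object.
nested_query = {
    "query": {
        "nested": {
            "path": "references",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"references.law": "bgb"}},
                        {"term": {"references.section": "1"}},
                    ]
                }
            },
        }
    }
}
```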
Describe the solution you'd like
Two things need to change:
1. OpenSearchDocumentStore needs to know which fields are nested
A new nested_fields init parameter should let users declare which metadata fields should be mapped as nested type when creating a new index. A "*" wildcard option should auto-detect list[dict] fields from the first document batch and declare them nested automatically.
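A minimal sketch of how the "*" auto-detection could work, assuming it only inspects top-level metadata keys of the first batch. The function names detect_nested_fields and build_nested_properties are hypothetical and not part of any existing Haystack API.

```python
from typing import Any


def detect_nested_fields(metas: list[dict[str, Any]]) -> set[str]:
    """Return top-level meta keys whose values are non-empty lists of dicts."""
    detected: set[str] = set()
    for meta in metas:
        for key, value in meta.items():
            is_list_of_dicts = (
                isinstance(value, list)
                and len(value) > 0
                and all(isinstance(v, dict) for v in value)
            )
            if is_list_of_dicts:
                detected.add(key)
    return detected


def build_nested_properties(fields: set[str]) -> dict[str, dict[str, str]]:
    """Build the `properties` fragment that declares each field as nested."""
    return {name: {"type": "nested"} for name in sorted(fields)}


first_batch = [
    {"references": [{"law": "bgb", "section": "81"}], "tags": ["a", "b"], "year": 2020},
    {"authors": [{"name": "x"}], "empty": []},
]
fields = detect_nested_fields(first_batch)
properties = build_nested_properties(fields)
```

Empty lists are skipped here because their element type cannot be inferred; an explicit nested_fields declaration would cover that case.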
For existing indexes, the store should read GET /<index>/_mapping on init and populate an internal _nested_fields set from the live mapping.
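Extracting nested paths from the live mapping could look something like the sketch below. It works on the dict shape returned by GET /<index>/_mapping (index name, then mappings, then properties); the function names are hypothetical, and how the store fetches the response is left out.

```python
def collect_nested_paths(properties: dict, prefix: str = "") -> set[str]:
    """Walk a mapping `properties` dict and collect dotted paths of nested fields."""
    paths: set[str] = set()
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("type") == "nested":
            paths.add(path)
        # Both object and nested fields keep their sub-fields under "properties".
        paths |= collect_nested_paths(spec.get("properties", {}), prefix=f"{path}.")
    return paths


def nested_fields_from_mapping(mapping_response: dict, index: str) -> set[str]:
    """Return the nested field paths for `index`, or an empty set if none are found."""
    properties = mapping_response.get(index, {}).get("mappings", {}).get("properties", {})
    return collect_nested_paths(properties)


sample_response = {
    "default": {
        "mappings": {
            "properties": {
                "content": {"type": "text"},
                "references": {
                    "type": "nested",
                    "properties": {"law": {"type": "keyword"}, "section": {"type": "keyword"}},
                },
                "source": {
                    "properties": {
                        "authors": {"type": "nested", "properties": {"name": {"type": "keyword"}}}
                    }
                },
            }
        }
    }
}
nested_fields = nested_fields_from_mapping(sample_response, "default")
```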
2. normalize_filters needs to emit nested query clauses
normalize_filters (in filters.py) should accept the set of known nested field paths. When conditions in a logical group target sub-fields of a nested-mapped field, the group should be wrapped in a nested query with the appropriate path, rather than a flat bool. Mixed filters (some conditions on nested fields, some on flat fields) should split cleanly: nested conditions get wrapped, the rest stay in the outer bool.
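The splitting logic could be sketched as follows. This is not the existing normalize_filters implementation: it handles only an AND group of == conditions, assumes the meta. prefix is stripped before querying, and uses hypothetical helper names.

```python
from typing import Optional


def strip_meta(field: str) -> str:
    """Drop a leading 'meta.' prefix (assumed to match how fields are indexed)."""
    return field[len("meta."):] if field.startswith("meta.") else field


def nested_path_for(field: str, nested_fields: set[str]) -> Optional[str]:
    """Return the longest nested path that contains `field`, or None if it is flat."""
    name = strip_meta(field)
    for path in sorted(nested_fields, key=len, reverse=True):
        if name.startswith(path + "."):
            return path
    return None


def normalize_and_group(conditions: list[dict], nested_fields: set[str]) -> dict:
    """Translate an AND group, wrapping conditions on the same nested path together."""
    flat_clauses: list[dict] = []
    by_path: dict[str, list[dict]] = {}
    for cond in conditions:
        if cond["operator"] != "==":
            raise NotImplementedError("this sketch only handles '=='")
        clause = {"term": {strip_meta(cond["field"]): cond["value"]}}
        path = nested_path_for(cond["field"], nested_fields)
        if path is None:
            flat_clauses.append(clause)
        else:
            by_path.setdefault(path, []).append(clause)
    nested_clauses = [
        {"nested": {"path": path, "query": {"bool": {"must": clauses}}}}
        for path, clauses in by_path.items()
    ]
    # Flat conditions stay in the outer bool; nested ones are wrapped per path.
    return {"bool": {"must": flat_clauses + nested_clauses}}


query = normalize_and_group(
    [
        {"field": "meta.references.law", "operator": "==", "value": "bgb"},
        {"field": "meta.references.section", "operator": "==", "value": "1"},
        {"field": "meta.year", "operator": "==", "value": 2020},
    ],
    nested_fields={"references"},
)
```

Grouping by path keeps both references conditions inside one nested clause, so they are evaluated against the same array element, while meta.year remains a plain clause in the outer bool.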
Describe alternatives you've considered
Pre-computed composite field at ingestion time — concatenate co-occurring sub-field values into a single derived field so one term filter replaces the two-condition query. Avoids re-indexing but requires ingestion pipeline changes, pollutes the schema, and doesn't generalize.
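For comparison, the composite-field workaround might look like this at ingestion time. The derived field name references_composite and the "|" separator are arbitrary choices made for this illustration.

```python
def add_composite_field(meta: dict, field: str = "references",
                        keys: tuple = ("law", "section"), sep: str = "|") -> dict:
    """Return a copy of `meta` with a derived list of joined sub-field values."""
    out = dict(meta)
    out[f"{field}_composite"] = [
        sep.join(str(item[k]) for k in keys) for item in meta.get(field, [])
    ]
    return out


meta_a = add_composite_field(
    {"references": [{"law": "bgb", "section": "81"}, {"law": "stgb", "section": "1"}]}
)
meta_b = add_composite_field({"references": [{"law": "bgb", "section": "1"}]})
# A single equality filter on "bgb|1" now matches only Doc B.
```

Every new combination of sub-fields would need its own derived field, which is why this approach does not generalize.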
Additional context
object limitation: https://docs.opensearch.org/latest/mappings/supported-field-types/nested/#flattened-form
nested type: https://docs.opensearch.org/latest/mappings/supported-field-types/nested/#nested-field-type-1
nested query: https://opensearch.org/docs/latest/query-dsl/joining/nested/
A field cannot be changed from object to nested in-place on an existing index; a full re-index is required. This should be clearly documented; the feature applies cleanly to new indexes only.
A flat bool query against a nested-mapped field produces no error but silently wrong results. If the store reads the live mapping on init, this mismatch becomes detectable, and a warning should be raised rather than silently returning incorrect data.
index.mapping.nested_objects.limit defaults to 10,000 nested objects per document, which is worth noting for high-cardinality array fields.