
OpenSearch: Support nested Field Type for Correct Multi-Condition Array Filtering #3040

@deep-rloebbert


Feature Request: Support OpenSearch nested field type for correct multi-condition array filtering

Is your feature request related to a problem? Please describe.

When filtering on multiple sub-fields of the same array-of-objects field, the filters produce false positives. This is expected OpenSearch behavior, because such fields default to the object type. Here's a minimal example:

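from haystack import Document
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

# Assumption: an OpenSearch instance is reachable at this address
store = OpenSearchDocumentStore(hosts="http://localhost:9200")

store.write_documents([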
    Document(content="Doc A", meta={
        "references": [
            {"law": "bgb", "section": "81"},   # bgb is here...
            {"law": "stgb", "section": "1"},  # ...1 is here — different entry
        ]
    }),
    Document(content="Doc B", meta={
        "references": [
            {"law": "bgb", "section": "1"},     # both in same entry — correct match
        ]
    }),
])

filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.references.law", "operator": "==", "value": "bgb"},
        {"field": "meta.references.section", "operator": "==", "value": "1"},
    ],
}

results = store.filter_documents(filters=filters)
# Returns BOTH documents — Doc A is a false positive
# Expected: only Doc B

Because array-of-object fields default to OpenSearch's object type, the array is flattened internally into one multi-valued field per sub-field, and the association between values that came from the same object is lost. The filter conditions are then evaluated independently across the whole array, not within a single array element. Doc A matches because "bgb" appears somewhere in references.law and "1" appears somewhere in references.section, even though they belong to different objects.
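The difference between the two evaluation strategies can be reproduced without OpenSearch. The following is a minimal pure-Python sketch, not OpenSearch's actual implementation; the helper names `flatten`, `matches_flat`, and `matches_nested` are made up for illustration.

```python
def flatten(objects):
    """Mimic object-type storage: one list of values per sub-field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, []).append(value)
    return flat


def matches_flat(objects, wanted):
    """Each condition is checked against the whole flattened array."""
    flat = flatten(objects)
    return all(value in flat.get(key, []) for key, value in wanted.items())


def matches_nested(objects, wanted):
    """All conditions must hold within one and the same array element."""
    return any(
        all(obj.get(key) == value for key, value in wanted.items())
        for obj in objects
    )


doc_a = [{"law": "bgb", "section": "81"}, {"law": "stgb", "section": "1"}]
doc_b = [{"law": "bgb", "section": "1"}]
wanted = {"law": "bgb", "section": "1"}

print(flatten(doc_a))                 # per-sub-field value lists
print(matches_flat(doc_a, wanted))    # flat semantics: false positive
print(matches_nested(doc_a, wanted))  # per-element semantics: no match
print(matches_nested(doc_b, wanted))  # Doc B matches either way
```

Under the flat semantics Doc A matches, because the link between "stgb" and "1" is gone once the values are stored as separate lists.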

OpenSearch's nested field type and nested query exist specifically to solve this. Haystack currently has no way to use them.
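For reference, this is roughly what the raw OpenSearch mapping and query would look like for the example above, written as Python dicts. It is a sketch under two assumptions: that the metadata field is stored under the top-level name `references` (without the `meta.` prefix), and that both sub-fields are mapped as `keyword`.

```python
# Index mapping that declares "references" as a nested field.
# Assumption: sub-fields are exact-match keywords.
nested_mapping = {
    "mappings": {
        "properties": {
            "references": {
                "type": "nested",
                "properties": {
                    "law": {"type": "keyword"},
                    "section": {"type": "keyword"},
                },
            }
        }
    }
}

# Nested query: both term clauses must match inside the same array element.
nested_query = {
    "query": {
        "nested": {
            "path": "references",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"references.law": "bgb"}},
                        {"term": {"references.section": "1"}},
                    ]
                }
            },
        }
    }
}
```

With this mapping and query, only Doc B would be expected to match, since Doc A has no single element with both values.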


Describe the solution you'd like

Two things need to change:

1. OpenSearchDocumentStore needs to know which fields are nested

A new nested_fields init parameter should let users declare which metadata fields should be mapped as nested type when creating a new index. A "*" wildcard option should auto-detect list[dict] fields from the first document batch and declare them nested automatically.
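The "*" auto-detection could work along these lines. This is only a sketch of one possible heuristic: `detect_nested_fields` is a hypothetical helper, not an existing Haystack function, and it treats a metadata value as nested when it is a non-empty list containing only dicts.

```python
def detect_nested_fields(metas):
    """Return the names of meta fields whose values look like list[dict].

    Hypothetical helper: `metas` is the list of `meta` dicts taken from
    the first batch of documents.
    """
    nested = set()
    for meta in metas:
        for key, value in meta.items():
            is_object_list = (
                isinstance(value, list)
                and len(value) > 0
                and all(isinstance(item, dict) for item in value)
            )
            if is_object_list:
                nested.add(key)
    return nested


sample_metas = [
    {"references": [{"law": "bgb", "section": "1"}], "tags": ["a", "b"], "year": 2020},
    {"authors": [{"name": "x"}], "empty": []},
]
detected = detect_nested_fields(sample_metas)

# Mapping fragment that could be merged into the index mapping on creation.
nested_properties = {name: {"type": "nested"} for name in sorted(detected)}
```

Empty lists and lists of scalars are skipped here, so a field that is empty in the first batch would not be detected. That limitation is inherent to sampling only one batch.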

For existing indexes, the store should fetch the live mapping via GET /<index>/_mapping on init and populate an internal _nested_fields set from it.
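Populating the set from an existing mapping amounts to a recursive walk over the mapping's `properties`. The sketch below assumes the usual response shape of the mapping API (index name, then `mappings`, then `properties`); `nested_paths_from_mapping` and the index name `my_index` are hypothetical. With opensearch-py, the response could presumably be obtained via `client.indices.get_mapping(index=...)`.

```python
def nested_paths_from_mapping(properties, prefix=""):
    """Collect dotted paths of all fields mapped with type "nested"."""
    paths = set()
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("type") == "nested":
            paths.add(path)
        # Object and nested fields both carry their children in "properties".
        paths |= nested_paths_from_mapping(spec.get("properties", {}), prefix=f"{path}.")
    return paths


# Hand-written stand-in for a GET /my_index/_mapping response.
mapping_response = {
    "my_index": {
        "mappings": {
            "properties": {
                "content": {"type": "text"},
                "references": {
                    "type": "nested",
                    "properties": {
                        "law": {"type": "keyword"},
                        "section": {"type": "keyword"},
                    },
                },
                "author": {"properties": {"name": {"type": "keyword"}}},
            }
        }
    }
}

nested_fields = nested_paths_from_mapping(
    mapping_response["my_index"]["mappings"]["properties"]
)
```

Here only `references` ends up in the set; `author` is a plain object field and is ignored.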

2. normalize_filters needs to emit nested query clauses

normalize_filters (in filters.py) should accept the set of known nested field paths. When conditions in a logical group target sub-fields of a nested-mapped field, the group should be wrapped in a nested query with the appropriate path, rather than a flat bool. Mixed filters (some conditions on nested fields, some on flat fields) should split cleanly: nested conditions get wrapped, the rest stay in the outer bool.
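A rough sketch of the splitting logic for a single AND group follows. It is not the actual normalize_filters code: `nested_path_for` and `and_group_to_query` are hypothetical helpers, only the "==" operator is handled, and the `meta.` prefix is assumed to be stripped from field names before querying.

```python
def nested_path_for(field, nested_fields):
    """Return (nested_path or None, field name without the "meta." prefix)."""
    name = field[len("meta."):] if field.startswith("meta.") else field
    # Prefer the longest matching path in case nested fields are themselves nested.
    for path in sorted(nested_fields, key=len, reverse=True):
        if name.startswith(path + "."):
            return path, name
    return None, name


def and_group_to_query(conditions, nested_fields):
    """Translate an AND group of "==" conditions into an OpenSearch bool query."""
    flat_clauses = []
    clauses_by_path = {}
    for condition in conditions:
        path, name = nested_path_for(condition["field"], nested_fields)
        clause = {"term": {name: condition["value"]}}
        if path is None:
            flat_clauses.append(clause)
        else:
            clauses_by_path.setdefault(path, []).append(clause)
    # One nested query per path, so its conditions share an array element.
    nested_clauses = [
        {"nested": {"path": path, "query": {"bool": {"must": clauses}}}}
        for path, clauses in clauses_by_path.items()
    ]
    return {"bool": {"must": flat_clauses + nested_clauses}}


conditions = [
    {"field": "meta.references.law", "operator": "==", "value": "bgb"},
    {"field": "meta.references.section", "operator": "==", "value": "1"},
    {"field": "meta.language", "operator": "==", "value": "de"},
]
query = and_group_to_query(conditions, nested_fields={"references"})
```

In this example the two `references` conditions end up inside one nested clause with path `references`, while the `language` condition stays as a plain term clause in the outer bool. OR and NOT groups, as well as other comparison operators, would need their own handling.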


Describe alternatives you've considered

Pre-computed composite field at ingestion time: concatenate co-occurring sub-field values into a single derived field, so that one term filter replaces the two-condition query. This avoids re-indexing but requires ingestion pipeline changes, pollutes the schema, and doesn't generalize.

