feat: adding count with filtering operations to OpenSearchDocumentStore#2653
Conversation
OpenSearchDocumentStore
|
Hey @tstadel I'd also appreciate your review on this since we want to make sure it will in platform as well. |
…res/opensearch/document_store.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
…res/opensearch/document_store.py Co-authored-by: Sebastian Husch Lee <10526848+sjrl@users.noreply.github.com>
| # Build aggregations | ||
| # Terms aggregation for paginated unique values | ||
| # Note: Terms aggregation doesn't support 'from' parameter directly, | ||
| # so we fetch from_ + size results and slice them | ||
| # Cardinality aggregation for total count | ||
| terms_size = from_ + size if from_ > 0 else size |
There was a problem hiding this comment.
Note that this only works till 10k. Overall the pagination is not ideal here. I'd suggest to switch over to composite aggregations which support proper pagination. However you'd need to change the signature to support the after param instead of from.
There was a problem hiding this comment.
I'd suggest to remove the from param altogether as it gives incorrect impressions here. That would be the quickest measure.
For full pagination support I'd recommend:
- support from and after on protocol level (as some document stores might support from-based and some after-based pagination)
- raise NotSupportedError for OpenSearch when from is set
- implement after-based pagination for OpenSearch using composite aggregations
There was a problem hiding this comment.
I went with option 3.
def get_metadata_field_unique_values(
self,
metadata_field: str,
search_term: str | None = None,
size: int | None = 10000,
after: dict[str, Any] | None = None,
)Regarding the Protocol - I support your suggestion of adding both from and after - but let's see how this applies to other DocumentStores, and then we can see if it's easier to add to Protocol.
I think there are common operations that can already be added to the Protocol. I need to review it in detail and make use of it to avoid having so much duplicated code for tests. There's an issue with it, and I will probably take it soon.
tstadel
left a comment
There was a problem hiding this comment.
@davidsbatista I left some comments. Biggest issue I see is concerning the pagination of unique_values. I'd probably just not support it here. We can always come back and implement a proper solution.
sjrl
left a comment
There was a problem hiding this comment.
One minor comment, otherwise looks good!
Related Issues
OpenSearchDocumentStore#2635Proposed Changes:
count_documents_by_filter()- count documents matching filter criteriacount_distinct_values_by_filter()- get distinct value counts for metadata fields with optional filteringget_fields_info()- retrieve field type information from index mappingget_field_min_max()- get min/max values for numeric metadata fieldsget_field_unique_values()- get unique values for a field with pagination and content-based filteringquery_sql()- execute SQL queries against OpenSearch with support for multiple response formats (JSON, CSV, JDBC, RAW)How did you test it?
Notes for the reviewer
httpx>=0.28.1dependencyChecklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:.