[DOC] Callout the aggregation result may be approximate (#4922) (#4953)

opensearch-trigger-bot[bot] · github-actions[bot] · web-flow · commit ce8b452b5c59 · 2025-12-12T13:55:49.000+08:00
* [DOC] Callout the aggregation result may be approximate
* add to limitation.rst
* revert
* add ignore format

---------

(cherry picked from commit 90ee47c)

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
diff --git a/docs/user/ppl/cmd/stats.md b/docs/user/ppl/cmd/stats.md
@@ -48,6 +48,36 @@ The stats command supports the following aggregation functions:
 * VALUES: Collect unique values into sorted array  
   
 For detailed documentation of each function, see [Aggregation Functions](../functions/aggregations.md).
+
+## Limitations
+
+### Bucket aggregation results may be approximate for large datasets
+
+In OpenSearch, `doc_count` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as `sum` and `avg`) on the terms bucket aggregation may also be approximate.
+For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of `URL` is high.
+
+```ppl ignore
+source=hits
+| stats bucket_nullable=false count() as c by URL
+| sort - c
+| head 10
+```
+
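+This query is pushed down to a terms bucket aggregation DSL query with `"order": { "_count": "desc" }`. In OpenSearch, each shard returns only its own top buckets to the coordinating node, so a bucket that is not among the top results on an individual shard may be dropped or undercounted in the merged result.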
+
+### Sorting by ascending doc_count may produce inaccurate results
+
+Similar to the preceding PPL query, the following query (find the 10 rarest URLs) often produces inaccurate results.
+
+```ppl ignore
+source=hits
+| stats bucket_nullable=false count() as c by URL
+| sort + c
+| head 10
+```
+
+A term that is globally infrequent might not appear as infrequent on every individual shard or might be entirely absent from the least frequent results returned by some shards. Conversely, a term that appears infrequently on one shard might be common on another. In both scenarios, rare terms can be missed during shard-level aggregation, resulting in incorrect overall results.
+
 ## Example 1: Calculate the count of events  
 
 This example shows calculating the count of events in the accounts.
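
Note on the first limitation added to `stats.md` above: the effect of shard-level truncation on a descending `_count` order can be reproduced with a small simulation. This is a minimal sketch, not OpenSearch code. The shard contents, the per-shard limit `k`, and the helper names `top_k_per_shard` and `merge_counts` are all hypothetical, and a real cluster normally requests more buckets per shard than the final result size, which reduces the error without removing it.

```python
from collections import Counter

def top_k_per_shard(shard_counts, k):
    """Keep only the k largest buckets of one shard (ties broken by term)."""
    ordered = sorted(shard_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ordered[:k])

def merge_counts(partials):
    """Sum per-term counts across shard responses, as a coordinating node would."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged

# Hypothetical per-shard document counts for the URL field.
shards = [
    {"/a": 10, "/b": 9, "/c": 8},
    {"/d": 10, "/e": 9, "/c": 8},
]

# Exact answer: merge every bucket from every shard.
exact = merge_counts(shards)
# Approximate answer: each shard reports only its top 2 buckets.
approx = merge_counts(top_k_per_shard(s, k=2) for s in shards)

exact_top = max(exact, key=exact.get)
approx_top = sorted(approx.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]

print("exact top URL:", exact_top, exact[exact_top])
print("approximate top URL:", approx_top, approx[approx_top])
```

In this toy data `/c` has the highest global count, but it ranks third on both shards, so neither shard reports it and it never reaches the merged result.
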
diff --git a/docs/user/ppl/functions/aggregations.md b/docs/user/ppl/functions/aggregations.md
@@ -2,7 +2,7 @@
 
 ## Description  
 
-Aggregation functions perform calculations across multiple rows to return a single result value. These functions are used with `stats` and `eventstats` commands to analyze and summarize data.
+Aggregation functions perform calculations across multiple rows to return a single result value. These functions are used with the `stats`, `eventstats`, and `streamstats` commands to analyze and summarize data.
 The following table shows how NULL/MISSING values are handled by aggregation functions:
   
 | Function | NULL | MISSING |
diff --git a/docs/user/ppl/limitations/limitations.md b/docs/user/ppl/limitations/limitations.md
@@ -87,3 +87,30 @@ If a document contains malformed field names inside an object field, PPL ignores
 When ``log`` is an object field with ``enabled: false``, subfields with malformed names are ignored.
 
 **Recommendation:** Avoid using field names that contain leading dots, trailing dots, consecutive dots, or consist only of dots. This aligns with OpenSearch's default field naming requirements.
+
+## Bucket aggregation results may be approximate for large datasets
+
+In OpenSearch, `doc_count` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as `sum` and `avg`) on the terms bucket aggregation may also be approximate.
+For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of `URL` is high.
+
+```ppl ignore
+source=hits
+| stats bucket_nullable=false count() as c by URL
+| sort - c
+| head 10
+```
+
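+This query is pushed down to a terms bucket aggregation DSL query with `"order": { "_count": "desc" }`. In OpenSearch, each shard returns only its own top buckets to the coordinating node, so a bucket that is not among the top results on an individual shard may be dropped or undercounted in the merged result.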
+
+## Sorting by ascending doc_count may produce inaccurate results
+
+Similar to the preceding PPL query, the following query (find the 10 rarest URLs) often produces inaccurate results.
+
+```ppl ignore
+source=hits
+| stats bucket_nullable=false count() as c by URL
+| sort + c
+| head 10
+```
+
+A term that is globally infrequent might not appear as infrequent on every individual shard or might be entirely absent from the least frequent results returned by some shards. Conversely, a term that appears infrequently on one shard might be common on another. In both scenarios, rare terms can be missed during shard-level aggregation, resulting in incorrect overall results.
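
Note on the ascending-order limitation: the two failure modes described in the last paragraph can be illustrated with a small simulation. This is a sketch under stated assumptions rather than OpenSearch's implementation. The shard contents, the per-shard limit `k`, and the helper name `bottom_k_per_shard` are hypothetical.

```python
from collections import Counter

def bottom_k_per_shard(shard_counts, k):
    """Keep only the k smallest buckets of one shard (ties broken by term)."""
    ordered = sorted(shard_counts.items(), key=lambda kv: (kv[1], kv[0]))
    return dict(ordered[:k])

# Hypothetical per-shard document counts for the URL field.
# "/x" is rare on the first shard but very common on the second.
shards = [
    {"/x": 1, "/y": 5, "/z": 6},
    {"/x": 50, "/w": 2, "/z": 1},
]

# Exact global counts: sum every bucket from every shard.
exact = Counter()
for shard in shards:
    exact.update(shard)

# Approximate counts: each shard reports only its single rarest bucket.
approx = Counter()
for shard in shards:
    approx.update(bottom_k_per_shard(shard, k=1))

exact_rarest = min(exact, key=exact.get)
approx_rarest = sorted(approx.items(), key=lambda kv: (kv[1], kv[0]))[0][0]

print("exact rarest URL:", exact_rarest, exact[exact_rarest])
print("approximate rarest URL:", approx_rarest, approx[approx_rarest])
print("true count of the approximate answer:", exact[approx_rarest])
```

Here the merged result names `/x` as the rarest URL with a count of 1, although `/x` is the most common URL overall, and the truly rarest URL `/w` is not reported by either shard.
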