Commit ce8b452

[DOC] Callout the aggregation result may be approximate (#4922) (#4953)
* [DOC] Callout the aggregation result may be approximate
* add to limitation.rst
* revert
* add ignore format

---------

(cherry picked from commit 90ee47c)

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent bb43e53 commit ce8b452

3 files changed

Lines changed: 58 additions & 1 deletion

docs/user/ppl/cmd/stats.md

Lines changed: 30 additions & 0 deletions
@@ -48,6 +48,36 @@ The stats command supports the following aggregation functions:
 * VALUES: Collect unique values into sorted array
 
 For detailed documentation of each function, see [Aggregation Functions](../functions/aggregations.md).
+
+## Limitations
+
+### Bucket aggregation results may be approximate in large datasets
+
+In OpenSearch, `doc_count` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as `sum` and `avg`) on the terms bucket aggregation may also be approximate.
+For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of `URL` is high.
+
+```ppl ignore
+source=hits
+| stats bucket_nullable=false count() as c by URL
+| sort - c
+| head 10
+```
+
+This query is pushed down to a terms bucket aggregation DSL query with `"order": { "_count": "desc" }`. In OpenSearch, this terms aggregation may discard some buckets.
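
The effect of discarded buckets can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the OpenSearch implementation: each "shard" is a plain Python list of terms, and the hypothetical `shard_size` parameter stands in for the number of buckets each shard returns before the results are merged.

```python
from collections import Counter

def shard_top_terms(docs, shard_size):
    """Count terms on one shard and keep only its most frequent buckets."""
    return Counter(docs).most_common(shard_size)

def merged_top_terms(shards, size, shard_size):
    """Merge the truncated per-shard buckets and return the overall top terms."""
    merged = Counter()
    for docs in shards:
        for term, count in shard_top_terms(docs, shard_size):
            merged[term] += count
    return merged.most_common(size)

# Illustrative data: "b" and "c" each rank third on one of the two shards.
shards = [
    ["a"] * 5 + ["b"] * 4 + ["c"] * 3,
    ["a"] * 5 + ["c"] * 4 + ["b"] * 3,
]

exact = Counter(term for docs in shards for term in docs)
approx = dict(merged_top_terms(shards, size=3, shard_size=2))

print("exact: ", dict(exact))
print("approx:", approx)
```

In this toy data, each shard drops its third-ranked bucket, so the merged counts for `b` and `c` are lower than their true totals even though both terms still appear in the result.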
+
+### Sorting by ascending doc_count may produce inaccurate results
+
+Similar to the PPL query above, the following query (find the 10 rarest URLs) often produces inaccurate results.
+
+```ppl ignore
+source=hits
+| stats bucket_nullable=false count() as c by URL
+| sort + c
+| head 10
+```
+
+A term that is globally infrequent might not appear as infrequent on every individual shard or might be entirely absent from the least frequent results returned by some shards. Conversely, a term that appears infrequently on one shard might be common on another. In both scenarios, rare terms can be missed during shard-level aggregation, resulting in incorrect overall results.
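
Both scenarios can be reproduced in a toy model. The following is a minimal sketch, assuming each shard reports only its `shard_size` least frequent terms before the results are merged; the function names, the `shard_size` parameter, and the data are illustrative and are not taken from OpenSearch.

```python
from collections import Counter

def rarest(counts, n):
    """Return the n (term, count) pairs with the smallest counts, ties broken by term."""
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:n]

def merged_rarest_terms(shards, size, shard_size):
    """Merge each shard's least frequent buckets and return the overall rarest terms."""
    merged = Counter()
    for docs in shards:
        for term, count in rarest(Counter(docs), shard_size):
            merged[term] += count
    return rarest(merged, size)

# Illustrative data: "rare" is globally rarest, "skewed" is rare on one shard only.
shards = [
    ["rare"] * 2 + ["skewed"] * 1 + ["common"] * 5,
    ["skewed"] * 9 + ["common"] * 3,
]

exact = rarest(Counter(term for docs in shards for term in docs), 1)
approx = merged_rarest_terms(shards, size=1, shard_size=1)

print("exact: ", exact)
print("approx:", approx)
```

Here the globally rarest term never reaches the merge step, while a term that is rare on only one shard is reported with a count far below its true total.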
+
 ## Example 1: Calculate the count of events
 
 This example shows calculating the count of events in the accounts.

docs/user/ppl/functions/aggregations.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 
 ## Description
 
-Aggregation functions perform calculations across multiple rows to return a single result value. These functions are used with `stats` and `eventstats` commands to analyze and summarize data.
+Aggregation functions perform calculations across multiple rows to return a single result value. These functions are used with `stats`, `eventstats` and `streamstats` commands to analyze and summarize data.
 The following table shows how NULL/MISSING values are handled by aggregation functions:
 
 | Function | NULL | MISSING |

docs/user/ppl/limitations/limitations.md

Lines changed: 27 additions & 0 deletions
@@ -87,3 +87,30 @@ If a document contains malformed field names inside an object field, PPL ignores
 When ``log`` is an object field with ``enabled: false``, subfields with malformed names are ignored.
 
 **Recommendation:** Avoid using field names that contain leading dots, trailing dots, consecutive dots, or consist only of dots. This aligns with OpenSearch's default field naming requirements.
+
+## Bucket aggregation results may be approximate in large datasets
+
+In OpenSearch, `doc_count` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as `sum` and `avg`) on the terms bucket aggregation may also be approximate.
+For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of `URL` is high.
+
+```ppl ignore
+source=hits
+| stats bucket_nullable=false count() as c by URL
+| sort - c
+| head 10
+```
+
+This query is pushed down to a terms bucket aggregation DSL query with `"order": { "_count": "desc" }`. In OpenSearch, this terms aggregation may discard some buckets.
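
For orientation, a terms aggregation request of this shape might look like the body built below. This is a hedged sketch, not the DSL that the plugin actually generates: the aggregation name `by_term` and both helper functions are hypothetical, and the sample response is made up for illustration. The `doc_count_error_upper_bound` field is part of the standard terms aggregation response and gives an upper bound on the per-bucket count error.

```python
import json

def terms_request_body(field, size, descending=True):
    """Build a search body with one terms aggregation ordered by document count."""
    return {
        "size": 0,  # return aggregation results only, no hits
        "aggs": {
            "by_term": {  # hypothetical aggregation name
                "terms": {
                    "field": field,
                    "size": size,
                    "order": {"_count": "desc" if descending else "asc"},
                }
            }
        },
    }

def count_error_bound(response, agg_name="by_term"):
    """Read the reported error bound from a terms aggregation response, if present."""
    return response["aggregations"][agg_name].get("doc_count_error_upper_bound")

body = terms_request_body("URL", size=10)
print(json.dumps(body, indent=2))

# Hypothetical response fragment, for illustration only.
sample_response = {
    "aggregations": {
        "by_term": {
            "doc_count_error_upper_bound": 3,
            "sum_other_doc_count": 120,
            "buckets": [],
        }
    }
}
print(count_error_bound(sample_response))
```

A non-zero error bound in a real response indicates that the returned counts may differ from the exact totals.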
+
+## Sorting by ascending doc_count may produce inaccurate results
+
+Similar to the PPL query above, the following query (find the 10 rarest URLs) often produces inaccurate results.
+
+```ppl ignore
+source=hits
+| stats bucket_nullable=false count() as c by URL
+| sort + c
+| head 10
+```
+
+A term that is globally infrequent might not appear as infrequent on every individual shard or might be entirely absent from the least frequent results returned by some shards. Conversely, a term that appears infrequently on one shard might be common on another. In both scenarios, rare terms can be missed during shard-level aggregation, resulting in incorrect overall results.
