30 changes: 30 additions & 0 deletions docs/user/ppl/cmd/stats.md
@@ -48,6 +48,36 @@ The stats command supports the following aggregation functions:
* VALUES: Collect unique values into sorted array

For detailed documentation of each function, see [Aggregation Functions](../functions/aggregations.md).

## Limitations

### Bucket aggregation results may be approximate for large datasets

In OpenSearch, `doc_count` values for a terms bucket aggregation may be approximate. As a result, any aggregation (such as `sum` or `avg`) computed on those buckets may also be approximate.
For example, the following PPL query, which finds the top 10 URLs, may return an approximate result if the cardinality of `URL` is high.

```ppl ignore
source=hits
| stats bucket_nullable=false count() as c by URL
| sort - c
| head 10
```

This query is pushed down to a terms bucket aggregation DSL query with `"order": { "_count": "desc" }`. In OpenSearch, each shard returns only its own top terms to the coordinating node, so some buckets may be discarded before the results are merged.
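The effect can be illustrated with a small simulation. This is only a sketch, not OpenSearch code: the function names, the two hand-built shards, and the `shard_size` value are assumptions chosen to make the discrepancy visible.

```python
from collections import Counter


def shard_top_terms(shard_docs, shard_size):
    """Return the shard's `shard_size` most frequent terms with their local counts."""
    return Counter(shard_docs).most_common(shard_size)


def merged_top_terms(shards, size, shard_size):
    """Merge the partial per-shard results and keep the top `size` terms."""
    merged = Counter()
    for docs in shards:
        for term, count in shard_top_terms(docs, shard_size):
            merged[term] += count
    return merged.most_common(size)


# Two hypothetical shards holding values of one field.
shards = [
    ["a"] * 5 + ["b"] * 4 + ["c"] * 3,
    ["a"] * 5 + ["c"] * 4 + ["b"] * 3,
]

# Exact counts computed over all documents.
exact = Counter(term for docs in shards for term in docs)
# Counts obtained when each shard reports only its top 2 terms.
approx = dict(merged_top_terms(shards, size=3, shard_size=2))

print(exact["b"], approx["b"])  # exact count versus merged count for "b"
```

In this sketch each shard drops its third-ranked term, so the merged counts for `b` and `c` are lower than the exact counts even though both terms still appear in the result.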

### Sorting by ascending doc_count may produce inaccurate results

Similar to the preceding PPL query, the following query, which finds the 10 rarest URLs, often produces inaccurate results.

```ppl ignore
source=hits
| stats bucket_nullable=false count() as c by URL
| sort + c
| head 10
```

A term that is globally infrequent might not appear as infrequent on every individual shard or might be entirely absent from the least frequent results returned by some shards. Conversely, a term that appears infrequently on one shard might be common on another. In both scenarios, rare terms can be missed during shard-level aggregation, resulting in incorrect overall results.
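The following sketch simulates this behavior under simplified assumptions. It is not OpenSearch code: the helper names and the shard contents are hypothetical, and each shard is assumed to return only its single least frequent term.

```python
from collections import Counter


def by_count_then_term(item):
    """Sort key: ascending count, with the term as a deterministic tie-breaker."""
    term, count = item
    return (count, term)


def shard_rarest_terms(shard_docs, shard_size):
    """Return the shard's `shard_size` least frequent terms with their local counts."""
    counts = Counter(shard_docs)
    return sorted(counts.items(), key=by_count_then_term)[:shard_size]


def merged_rarest_terms(shards, size, shard_size):
    """Merge the partial per-shard results and keep the `size` rarest terms."""
    merged = Counter()
    for docs in shards:
        for term, count in shard_rarest_terms(docs, shard_size):
            merged[term] += count
    return sorted(merged.items(), key=by_count_then_term)[:size]


# "x" is rare on the first shard but common on the second; "y" is the reverse.
shards = [
    ["x"] * 1 + ["y"] * 9 + ["z"] * 2,
    ["x"] * 9 + ["y"] * 1 + ["z"] * 2,
]

exact = Counter(term for docs in shards for term in docs)
exact_rarest = sorted(exact.items(), key=by_count_then_term)[:1]
approx_rarest = merged_rarest_terms(shards, size=1, shard_size=1)

print(exact_rarest, approx_rarest)
```

Here `z` is the globally rarest term, but it is never the rarest term on any single shard, so it is missing from the merged result. The term that is reported instead carries a count taken from only one shard.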

## Example 1: Calculate the count of events

This example shows calculating the count of events in the accounts.
2 changes: 1 addition & 1 deletion docs/user/ppl/functions/aggregations.md
@@ -2,7 +2,7 @@

## Description

Aggregation functions perform calculations across multiple rows to return a single result value. These functions are used with `stats` and `eventstats` commands to analyze and summarize data.
Aggregation functions perform calculations across multiple rows to return a single result value. These functions are used with the `stats`, `eventstats`, and `streamstats` commands to analyze and summarize data.
The following table shows how NULL/MISSING values are handled by aggregation functions:

| Function | NULL | MISSING |
27 changes: 27 additions & 0 deletions docs/user/ppl/limitations/limitations.md
@@ -87,3 +87,30 @@ If a document contains malformed field names inside an object field, PPL ignores
When ``log`` is an object field with ``enabled: false``, subfields with malformed names are ignored.

**Recommendation:** Avoid using field names that contain leading dots, trailing dots, consecutive dots, or consist only of dots. This aligns with OpenSearch's default field naming requirements.

## Bucket aggregation results may be approximate for large datasets

In OpenSearch, `doc_count` values for a terms bucket aggregation may be approximate. As a result, any aggregation (such as `sum` or `avg`) computed on those buckets may also be approximate.
For example, the following PPL query, which finds the top 10 URLs, may return an approximate result if the cardinality of `URL` is high.

```ppl ignore
source=hits
| stats bucket_nullable=false count() as c by URL
| sort - c
| head 10
```

This query is pushed down to a terms bucket aggregation DSL query with `"order": { "_count": "desc" }`. In OpenSearch, each shard returns only its own top terms to the coordinating node, so some buckets may be discarded before the results are merged.
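For orientation, a terms aggregation request of this kind has roughly the following shape. This is a hand-written sketch built as a Python dictionary: the helper name `terms_request_body` and the aggregation name are assumptions, and the request that the plugin actually generates may differ in naming and in additional parameters.

```python
import json


def terms_request_body(field, size, direction):
    """Build a terms aggregation request body ordered by bucket document count."""
    return {
        "size": 0,  # return no search hits, only aggregation results
        "aggs": {
            field: {  # aggregation name; reusing the field name is an arbitrary choice
                "terms": {
                    "field": field,
                    "size": size,
                    "order": {"_count": direction},
                }
            }
        },
    }


body = terms_request_body("URL", 10, "desc")
print(json.dumps(body, indent=2))
```

Passing `"asc"` as the direction gives the ordering used by the ascending `sort + c` variant of the query.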

## Sorting by ascending doc_count may produce inaccurate results

Similar to the preceding PPL query, the following query, which finds the 10 rarest URLs, often produces inaccurate results.

```ppl ignore
source=hits
| stats bucket_nullable=false count() as c by URL
| sort + c
| head 10
```

A term that is globally infrequent might not appear as infrequent on every individual shard or might be entirely absent from the least frequent results returned by some shards. Conversely, a term that appears infrequently on one shard might be common on another. In both scenarios, rare terms can be missed during shard-level aggregation, resulting in incorrect overall results.