Aggregation functions perform calculations across multiple rows to return a single result value. These functions are used with the stats, eventstats, and streamstats commands to analyze and summarize data.
The following table shows how NULL and missing values are handled by aggregation functions.
| Function | null | Missing |
|---|---|---|
COUNT |
Not counted | Not counted |
SUM |
Ignored | Ignored |
AVG |
Ignored | Ignored |
MAX |
Ignored | Ignored |
MIN |
Ignored | Ignored |
FIRST |
Ignored | Ignored |
LAST |
Ignored | Ignored |
LIST |
Ignored | Ignored |
VALUES |
Ignored | Ignored |
The following aggregation functions are available in PPL for data analysis and summarization.
Usage: COUNT(expr), C(expr), c(expr), count(expr)
Counts the number of expr values in the retrieved rows. C(), c(), and count() are available as abbreviations for COUNT(). For filtered counting, use an eval expression to specify the filtering condition.
Parameters:
expr(Optional): The expression whose values are to be counted.
Return type: LONG
source=accounts
| stats count(), c(), count, c
The query returns the following results:
fetched rows / total rows = 1/1
+---------+-----+-------+---+
| count() | c() | count | c |
|---------+-----+-------+---|
| 4 | 4 | 4 | 4 |
+---------+-----+-------+---+
The following example counts only records that match a specific condition:
source=accounts
| stats count(eval(age > 30)) as mature_users
The query returns the following results:
fetched rows / total rows = 1/1
+--------------+
| mature_users |
|--------------|
| 3 |
+--------------+
Usage: SUM(expr)
Returns the sum of expr values.
Parameters:
expr(Required): The expression whose values are to be summed.
Return type: Same as input type (INTEGER, LONG, FLOAT, or DOUBLE)
source=accounts
| stats sum(age) by gender
The query returns the following results:
fetched rows / total rows = 2/2
+----------+--------+
| sum(age) | gender |
|----------+--------|
| 28 | F |
| 101 | M |
+----------+--------+
Usage: AVG(expr)
Returns the average value of expr.
Parameters:
expr(Required): The expression whose values are to be averaged.
Return type: DOUBLE for numeric inputs; same as input type for DATE, TIME, or TIMESTAMP inputs
source=accounts
| stats avg(age) by gender
The query returns the following results:
fetched rows / total rows = 2/2
+--------------------+--------+
| avg(age) | gender |
|--------------------+--------|
| 28.0 | F |
| 33.666666666666664 | M |
+--------------------+--------+
Usage: MAX(expr)
Returns the maximum value of expr. For non-numeric fields, this function returns the value that comes last in alphabetical order.
Parameters:
expr(Required): The expression for which to find the maximum value.
Return type: Same as input type
source=accounts
| stats max(age)
The query returns the following results:
fetched rows / total rows = 1/1
+----------+
| max(age) |
|----------|
| 36 |
+----------+
The following example returns the value from the firstname text field that comes last in alphabetical order:
source=accounts
| stats max(firstname)
The query returns the following results:
fetched rows / total rows = 1/1
+----------------+
| max(firstname) |
|----------------|
| Nanette |
+----------------+
Usage: MIN(expr)
Returns the minimum value of expr. For non-numeric fields, this function returns the value that comes first in alphabetical order.
Parameters:
expr(Required): The expression for which to find the minimum value.
Return type: Same as input type
source=accounts
| stats min(age)
The query returns the following results:
fetched rows / total rows = 1/1
+----------+
| min(age) |
|----------|
| 28 |
+----------+
The following example returns the value from the firstname text field that comes first in alphabetical order:
source=accounts
| stats min(firstname)
The query returns the following results:
fetched rows / total rows = 1/1
+----------------+
| min(firstname) |
|----------------|
| Amber |
+----------------+
Usage: VAR_SAMP(expr)
Returns the sample variance of expr.
Parameters:
expr(Required): The expression for which to calculate the sample variance.
Return type: DOUBLE
source=accounts
| stats var_samp(age)
The query returns the following results:
fetched rows / total rows = 1/1
+--------------------+
| var_samp(age) |
|--------------------|
| 10.916666666666666 |
+--------------------+
Usage: VAR_POP(expr)
Returns the population variance of expr.
Parameters:
expr(Required): The expression for which to calculate the population variance.
Return type: DOUBLE
source=accounts
| stats var_pop(age)
The query returns the following results:
fetched rows / total rows = 1/1
+--------------+
| var_pop(age) |
|--------------|
| 8.1875 |
+--------------+
Usage: STDDEV_SAMP(expr)
Returns the sample standard deviation of expr.
Parameters:
expr(Required): The expression for which to calculate the sample standard deviation.
Return type: DOUBLE
source=accounts
| stats stddev_samp(age)
The query returns the following results:
fetched rows / total rows = 1/1
+-------------------+
| stddev_samp(age) |
|-------------------|
| 3.304037933599835 |
+-------------------+
Usage: STDDEV_POP(expr)
Returns the population standard deviation of expr.
Parameters:
expr(Required): The expression for which to calculate the population standard deviation.
Return type: DOUBLE
source=accounts
| stats stddev_pop(age)
The query returns the following results:
fetched rows / total rows = 1/1
+--------------------+
| stddev_pop(age) |
|--------------------|
| 2.8613807855648994 |
+--------------------+
Usage: DISTINCT_COUNT(expr), DC(expr)
Returns the approximate number of distinct values using the HyperLogLog++ algorithm. Both functions are equivalent. For more information about algorithm accuracy and precision control, see Controlling precision.
Parameters:
expr(Required): The expression for which to count distinct values.
Return type: LONG
source=accounts
| stats dc(state) as distinct_states, distinct_count(state) as dc_states_alt by gender
The query returns the following results:
fetched rows / total rows = 2/2
+-----------------+---------------+--------+
| distinct_states | dc_states_alt | gender |
|-----------------+---------------+--------|
| 1 | 1 | F |
| 3 | 3 | M |
+-----------------+---------------+--------+
Usage: DISTINCT_COUNT_APPROX(expr)
Returns the approximate count of distinct values in expr using the HyperLogLog++ algorithm.
Parameters:
expr(Required): The expression for which to count approximate distinct values.
Return type: LONG
source=accounts
| stats distinct_count_approx(gender)
The query returns the following results:
fetched rows / total rows = 1/1
+-------------------------------+
| distinct_count_approx(gender) |
|-------------------------------|
| 2 |
+-------------------------------+
Usage: EARLIEST(field [, time_field])
Returns the earliest value of a field based on timestamp ordering.
Parameters:
field(Required): The field for which to return the earliest value.time_field(Optional): The field to use for time-based ordering. Defaults to@timestampif not specified.
Return type: Same as input field type
source=events
| stats earliest(message) by host
| sort host
The query returns the following results:
fetched rows / total rows = 2/2
+-------------------+---------+
| earliest(message) | host |
|-------------------+---------|
| Starting up | server1 |
| Initializing | server2 |
+-------------------+---------+
The following example uses a custom time field instead of the default @timestamp field for ordering:
source=events
| stats earliest(status, event_time) by category
| sort category
The query returns the following results:
fetched rows / total rows = 2/2
+------------------------------+----------+
| earliest(status, event_time) | category |
|------------------------------+----------|
| pending | orders |
| active | users |
+------------------------------+----------+
Usage: LATEST(field [, time_field])
Returns the latest value of a field based on timestamp ordering.
Parameters:
field(Required): The field for which to return the latest value.time_field(Optional): The field to use for time-based ordering. Defaults to@timestampif not specified.
Return type: Same as input field type
source=events
| stats latest(message) by host
| sort host
The query returns the following results:
fetched rows / total rows = 2/2
+------------------+---------+
| latest(message) | host |
|------------------+---------|
| Shutting down | server1 |
| Maintenance mode | server2 |
+------------------+---------+
The following example uses a custom time field instead of the default @timestamp field for ordering:
source=events
| stats latest(status, event_time) by category
| sort category
The query returns the following results:
fetched rows / total rows = 2/2
+----------------------------+----------+
| latest(status, event_time) | category |
|----------------------------+----------|
| cancelled | orders |
| inactive | users |
+----------------------------+----------+
Usage: TAKE(field [, size])
Returns the original values from a field. This function does not guarantee the order of the returned values.
Parameters:
field(Required): A text field from which to extract values.size(Optional): The number of values to return. Defaults to10.
Return type: ARRAY
source=accounts
| stats take(firstname)
The query returns the following results:
fetched rows / total rows = 1/1
+-----------------------------+
| take(firstname) |
|-----------------------------|
| [Amber,Hattie,Nanette,Dale] |
+-----------------------------+
Usage: PERCENTILE(expr, percent), PERCENTILE_APPROX(expr, percent)
Returns the approximate percentile value of expr at the specified percentage.
Parameters:
expr(Required): The expression for which to calculate the percentile.percent(Required): A constant number between0and100.
Return type: Same as input type
Starting in version 3.1.0, the percentile implementation switched from AVLTreeDigest to MergingDigest. For more information, see the corresponding issue.
{: .note}
source=accounts
| stats percentile(age, 90) by gender
The query returns the following results:
fetched rows / total rows = 2/2
+---------------------+--------+
| percentile(age, 90) | gender |
|---------------------+--------|
| 28 | F |
| 36 | M |
+---------------------+--------+
For convenience, OpenSearch PPL provides shortcut functions for common percentiles:
PERC<percent>(expr)- Equivalent toPERCENTILE(expr, <percent>).P<percent>(expr)- Equivalent toPERCENTILE(expr, <percent>).
Both integer and decimal percentiles from 0 to 100 are supported (for example, PERC95, P99.5):
source=accounts
| stats perc99.5(age);
The query returns the following results:
fetched rows / total rows = 1/1
+---------------+
| perc99.5(age) |
|---------------|
| 36 |
+---------------+
source=accounts
| stats p50(age);
The query returns the following results:
fetched rows / total rows = 1/1
+----------+
| p50(age) |
|----------|
| 33 |
+----------+
Usage: MEDIAN(expr)
Returns the median (50th percentile) value of expr. This is equivalent to PERCENTILE(expr, 50).
Parameters:
expr(Required): The expression for which to calculate the median.
Return type: Same as input type
source=accounts
| stats median(age)
The query returns the following results:
fetched rows / total rows = 1/1
+-------------+
| median(age) |
|-------------|
| 33 |
+-------------+
Usage: FIRST(field)
Returns the first non-null value of a field based on natural document order. Returns NULL if no records exist or if all records have NULL values for the field.
Parameters:
field(Required): The field for which to return the first value.
Return type: Same as input field type
source=accounts
| stats first(firstname) by gender
The query returns the following results:
fetched rows / total rows = 2/2
+------------------+--------+
| first(firstname) | gender |
|------------------+--------|
| Nanette | F |
| Amber | M |
+------------------+--------+
Usage: LAST(field)
Returns the last non-null value of a field based on natural document order. Returns NULL if no records exist or if all records have NULL values for the field.
Parameters:
field(Required): The field for which to return the last value.
Return type: Same as input field type
source=accounts
| stats last(firstname) by gender
The query returns the following results:
fetched rows / total rows = 2/2
+-----------------+--------+
| last(firstname) | gender |
|-----------------+--------|
| Nanette | F |
| Dale | M |
+-----------------+--------+
Usage: LIST(expr)
Collects all values from the specified expression into an array. Values are converted to strings, NULL values are filtered out, and duplicates are preserved. This function returns up to 100 values without a guaranteed order.
Parameters:
expr(Required): The field expression from which to collect values.
Return type: ARRAY
This aggregation function does not support array, struct, or object field types. {: .note}
The following example collects all values from a string field into an array:
source=accounts
| stats list(firstname)
The query returns the following results:
fetched rows / total rows = 1/1
+-----------------------------+
| list(firstname) |
|-----------------------------|
| [Amber,Hattie,Nanette,Dale] |
+-----------------------------+
Usage: VALUES(expr)
Collects all unique values from the specified expression into a sorted array. Values are converted to strings, NULL values are filtered out, and duplicates are removed.
Parameters:
expr(Required): The expression from which to collect unique values.
Return type: ARRAY
The
plugins.ppl.values.max.limitsetting controls the maximum number of unique values returned:
- The default value is 0, which returns an unlimited number of values.
- Setting this to any positive integer limits the number of unique values.
- See the PPL Settings documentation for more details
The following example collects unique values from a string field into a sorted array:
source=accounts
| stats values(firstname)
The query returns the following results:
fetched rows / total rows = 1/1
+-----------------------------+
| values(firstname) |
|-----------------------------|
| [Amber,Dale,Hattie,Nanette] |
+-----------------------------+