From 5121ed2edaa0336ddb377a9c15abef2e00957c39 Mon Sep 17 00:00:00 2001 From: Lantao Jin Date: Tue, 9 Dec 2025 17:42:07 +0800 Subject: [PATCH 1/4] [DOC] Callout the aggregation result may be approximate Signed-off-by: Lantao Jin --- docs/user/ppl/cmd/stats.rst | 33 ++++++++++++++++++++++++ docs/user/ppl/functions/aggregations.rst | 2 +- docs/user/ppl/index.rst | 2 +- 3 files changed, 35 insertions(+), 2 deletions(-) diff --git a/docs/user/ppl/cmd/stats.rst b/docs/user/ppl/cmd/stats.rst index cae65c84c79..aace92427d0 100644 --- a/docs/user/ppl/cmd/stats.rst +++ b/docs/user/ppl/cmd/stats.rst @@ -68,6 +68,39 @@ The stats command supports the following aggregation functions: For detailed documentation of each function, see `Aggregation Functions <../functions/aggregations.rst>`_. +Limitations: +============ + +The bucket aggregation result may be approximate in large dataset +----------------------------------------------------------------- + +In OpenSearch, ``doc_count`` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as ``sum`` and ``avg``) on the terms bucket aggregation may also be approximate. +For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of ``URL`` is high. + +PPL query:: + + > source=hits + | stats bucket_nullable=false count() as c by URL + | sort - c + | head 10 + +This query is pushed down to a terms bucket aggregation DSL query with ``"order": { "_count": "desc" }``. In OpenSearch, this terms aggregation may throw away some buckets. + + +Sorting by ascending doc_count may produce inaccurate results +------------------------------------------------------------- + +Similar to the PPL query above, the following query (find the 10 rarest URLs) often produces inaccurate results. 
+ +PPL query:: + + > source=hits + | stats bucket_nullable=false count() as c by URL + | sort + c + | head 10 + +A term that is globally infrequent might not appear as infrequent on every individual shard or might be entirely absent from the least frequent results returned by some shards. Conversely, a term that appears infrequently on one shard might be common on another. In both scenarios, rare terms can be missed during shard-level aggregation, resulting in incorrect overall results. + Example 1: Calculate the count of events ======================================== diff --git a/docs/user/ppl/functions/aggregations.rst b/docs/user/ppl/functions/aggregations.rst index 6605bda0765..20a236bd9e5 100644 --- a/docs/user/ppl/functions/aggregations.rst +++ b/docs/user/ppl/functions/aggregations.rst @@ -11,7 +11,7 @@ Aggregation Functions Description ============ -| Aggregation functions perform calculations across multiple rows to return a single result value. These functions are used with ``stats`` and ``eventstats`` commands to analyze and summarize data. +| Aggregation functions perform calculations across multiple rows to return a single result value. These functions are used with ``stats``, ``eventstats`` and ``streamstats`` commands to analyze and summarize data. 
| The following table shows how NULL/MISSING values are handled by aggregation functions: diff --git a/docs/user/ppl/index.rst b/docs/user/ppl/index.rst index 981b2de3169..aeb661e7603 100644 --- a/docs/user/ppl/index.rst +++ b/docs/user/ppl/index.rst @@ -102,7 +102,7 @@ The query start with search command and then flowing a set of command delimited * **Functions** - - `Aggregation Functions `_ + - `Aggregation Functions `_ - `Collection Functions `_ From e2cc21b64b47b25a030ed12c394c68263b4d9f75 Mon Sep 17 00:00:00 2001 From: Lantao Jin Date: Tue, 9 Dec 2025 17:47:36 +0800 Subject: [PATCH 2/4] add to limitation.rst Signed-off-by: Lantao Jin --- docs/user/ppl/cmd/stats.rst | 5 ++-- docs/user/ppl/limitations/limitations.rst | 31 +++++++++++++++++++++++ 2 files changed, 33 insertions(+), 3 deletions(-) diff --git a/docs/user/ppl/cmd/stats.rst b/docs/user/ppl/cmd/stats.rst index aace92427d0..ba0eae0f3d1 100644 --- a/docs/user/ppl/cmd/stats.rst +++ b/docs/user/ppl/cmd/stats.rst @@ -71,8 +71,8 @@ For detailed documentation of each function, see `Aggregation Functions <../func Limitations: ============ -The bucket aggregation result may be approximate in large dataset ------------------------------------------------------------------ +Bucket aggregation result may be approximate in large dataset +------------------------------------------------------------- In OpenSearch, ``doc_count`` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as ``sum`` and ``avg``) on the terms bucket aggregation may also be approximate. For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of ``URL`` is high. @@ -86,7 +86,6 @@ PPL query:: This query is pushed down to a terms bucket aggregation DSL query with ``"order": { "_count": "desc" }``. In OpenSearch, this terms aggregation may throw away some buckets. 
- Sorting by ascending doc_count may produce inaccurate results ------------------------------------------------------------- diff --git a/docs/user/ppl/limitations/limitations.rst b/docs/user/ppl/limitations/limitations.rst index 41d3a007d23..936e3c07d61 100644 --- a/docs/user/ppl/limitations/limitations.rst +++ b/docs/user/ppl/limitations/limitations.rst @@ -130,3 +130,34 @@ If a document contains malformed field names inside an object field, PPL ignores When ``log`` is an object field with ``enabled: false``, subfields with malformed names are ignored. **Recommendation:** Avoid using field names that contain leading dots, trailing dots, consecutive dots, or consist only of dots. This aligns with OpenSearch's default field naming requirements. + + +Bucket Aggregation Result May Be Approximate In Large Dataset +============================================================= + +In OpenSearch, ``doc_count`` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as ``sum`` and ``avg``) computed on those terms buckets may also be approximate. +For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of ``URL`` is high. + +PPL query:: + + > source=hits + | stats bucket_nullable=false count() as c by URL + | sort - c + | head 10 + +This query is pushed down to a terms bucket aggregation DSL query with ``"order": { "_count": "desc" }``. In OpenSearch, each shard returns only its own top terms for this aggregation, so some buckets may be discarded before the shard results are merged. + + +Sorting By Ascending doc_count May Produce Inaccurate Results +============================================================= + +Similar to the PPL query above, the following query (find the 10 rarest URLs) often produces inaccurate results. 
+ +PPL query:: + + > source=hits + | stats bucket_nullable=false count() as c by URL + | sort + c + | head 10 + +A term that is globally infrequent might not appear as infrequent on every individual shard or might be entirely absent from the least frequent results returned by some shards. Conversely, a term that appears infrequently on one shard might be common on another. In both scenarios, rare terms can be missed during shard-level aggregation, resulting in incorrect overall results. From 4515592d9b339c64e0007afceed5154e84f5359c Mon Sep 17 00:00:00 2001 From: Lantao Jin Date: Wed, 10 Dec 2025 13:19:34 +0800 Subject: [PATCH 3/4] revert Signed-off-by: Lantao Jin --- docs/user/limitations/limitations.rst | 154 ++++++++++++++++++++++++++ 1 file changed, 154 insertions(+) create mode 100644 docs/user/limitations/limitations.rst diff --git a/docs/user/limitations/limitations.rst b/docs/user/limitations/limitations.rst new file mode 100644 index 00000000000..bc55b0045b8 --- /dev/null +++ b/docs/user/limitations/limitations.rst @@ -0,0 +1,154 @@ + +=========== +Limitations +=========== + +.. rubric:: Table of contents + +.. contents:: + :local: + :depth: 2 + + +Introduction +============ + +This document covers the restrictions and limitations of the SQL plugin. + +Limitations on Identifiers +========================== + +Using an OpenSearch cluster name as a data source name to qualify an index name, such as ``my_cluster.my_index``, is not supported for now. + +Limitations on Fields +===================== + +Using an `alias field type `_ as an identifier is not supported. It throws the exception ``can't resolve Symbol``. + + +Limitations on Aggregations +=========================== + +Aggregation over expressions is not supported for now. You can apply aggregations only to fields; an aggregation cannot accept an expression as a parameter. For example, ``avg(log(age))`` is not supported. 
+ +Here's a link to the GitHub issue - `Issue #288 <https://github.com/opendistro-for-elasticsearch/sql/issues/288>`_. + + +Limitations on JOINs +==================== + +JOIN does not support aggregations on the joined result. +For example, ``SELECT depo.name, avg(empo.age) FROM empo JOIN depo WHERE empo.id == depo.id GROUP BY depo.name`` is not supported. + +Here's a link to the GitHub issue - `Issue 110 `_. + + +Limitations on Window Functions +=============================== + +For now, only fields defined in the index are allowed; calculated fields (computed by scalar or aggregate functions) are not allowed. For example, neither ``avg_flight_time`` nor ``AVG(FlightTimeMin)`` is accessible to the rank window definition as follows:: + + SELECT OriginCountry, AVG(FlightTimeMin) AS avg_flight_time, + RANK() OVER (ORDER BY avg_flight_time) AS rnk + FROM opensearch_dashboards_sample_data_flights + GROUP BY OriginCountry + +Another limitation is that a window function currently cannot be nested in another expression, for example, ``CASE WHEN RANK() OVER(...) THEN ...``. + +The workaround for both limitations mentioned above is to use a subquery in the FROM clause:: + + SELECT + SUM(t.avg_flight_time) OVER(...) + FROM ( + SELECT OriginCountry, AVG(FlightTimeMin) AS avg_flight_time + FROM opensearch_dashboards_sample_data_flights + GROUP BY OriginCountry + ) AS t + +Limitations on Pagination +========================= + +The pagination query enables you to get back paginated responses. +Currently, pagination supports only basic queries. 
For example, the following query returns the data with cursor id:: + + POST _plugins/_sql/ + { + "fetch_size" : 5, + "query" : "SELECT OriginCountry, DestCountry FROM opensearch_dashboards_sample_data_flights ORDER BY OriginCountry ASC" + } + +The response in JDBC format with cursor id:: + + { + "schema": [ + { + "name": "OriginCountry", + "type": "keyword" + }, + { + "name": "DestCountry", + "type": "keyword" + } + ], + "cursor": "d:eyJhIjp7fSwicyI6IkRYRjFaWEo1UVc1a1JtVjBZMmdCQUFBQUFBQUFCSllXVTJKVU4yeExiWEJSUkhsNFVrdDVXVEZSYkVKSmR3PT0iLCJjIjpbeyJuYW1lIjoiT3JpZ2luQ291bnRyeSIsInR5cGUiOiJrZXl3b3JkIn0seyJuYW1lIjoiRGVzdENvdW50cnkiLCJ0eXBlIjoia2V5d29yZCJ9XSwiZiI6MSwiaSI6ImtpYmFuYV9zYW1wbGVfZGF0YV9mbGlnaHRzIiwibCI6MTMwNTh9", + "total": 13059, + "datarows": [[ + "AE", + "CN" + ]], + "size": 1, + "status": 200 + } + +The query with `aggregation` and `join` does not support pagination for now. + +Limitations on Using Multi-valued Fields +======================================== + +OpenSearch does not natively support the ARRAY data type but does allow multi-value fields implicitly. The +SQL/PPL plugin adheres strictly to the data type semantics defined in index mappings. When parsing OpenSearch +responses, it expects data to match the declared type and does not account for data in array format. If the +plugins.query.field_type_tolerance setting is enabled, the SQL/PPL plugin will handle array datasets by returning +scalar data types, allowing basic queries (e.g., SELECT * FROM tbl WHERE condition). However, using multi-value +fields in expressions or functions will result in exceptions. If this setting is disabled or absent, only the +first element of an array is returned, preserving the default behavior. 
+ +For example, the following query tries to calculate the absolute value of a field that contains arrays of +longs:: + + POST _plugins/_sql/ + { + "query": "SELECT id, ABS(long_array) FROM multi_value_long" + } +The response in JSON format is:: + + { + "error": { + "reason": "Invalid SQL query", + "details": "invalid to get longValue from value of type ARRAY", + "type": "ExpressionEvaluationException" + }, + "status": 400 + } + +Limitations on Calcite Engine +============================= + +Since 3.0.0, Apache Calcite has been introduced as an experimental query engine. Please see `introduce v3 engine <../../../dev/intro-v3-engine.md>`_. +For the following functionalities, the query is forwarded to the V2 query engine. This means that these functionalities cannot be combined with the new PPL commands/functions introduced in 3.0.0 and above. + +* All SQL queries + +* PPL queries against non-OpenSearch data sources + +* ``dedup`` with ``consecutive=true`` + +* Search relevant commands + + * AD + * ML + * Kmeans + +* ``show datasources`` command + +* Commands with ``fetch_size`` parameter From c36888c956a34a13fd244eba2db12289a9d957fc Mon Sep 17 00:00:00 2001 From: Lantao Jin Date: Thu, 11 Dec 2025 12:42:53 +0800 Subject: [PATCH 4/4] add ignore format Signed-off-by: Lantao Jin --- docs/user/ppl/cmd/stats.md | 4 ++-- docs/user/ppl/limitations/limitations.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/user/ppl/cmd/stats.md b/docs/user/ppl/cmd/stats.md index e801e0f3604..000d910d97b 100644 --- a/docs/user/ppl/cmd/stats.md +++ b/docs/user/ppl/cmd/stats.md @@ -56,7 +56,7 @@ For detailed documentation of each function, see [Aggregation Functions](../func In OpenSearch, `doc_count` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as `sum` and `avg`) on the terms bucket aggregation may also be approximate. 
For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of `URL` is high. -```ppl +```ppl ignore source=hits | stats bucket_nullable=false count() as c by URL | sort - c @@ -69,7 +69,7 @@ This query is pushed down to a terms bucket aggregation DSL query with `"order": Similar to above PPL query, the following query (find the rare 10 URLs) often produces inaccurate results. -```ppl +```ppl ignore source=hits | stats bucket_nullable=false count() as c by URL | sort + c diff --git a/docs/user/ppl/limitations/limitations.md b/docs/user/ppl/limitations/limitations.md index 910549adc8c..53adc1f072d 100644 --- a/docs/user/ppl/limitations/limitations.md +++ b/docs/user/ppl/limitations/limitations.md @@ -93,7 +93,7 @@ When ``log`` is an object field with ``enabled: false``, subfields with malforme In OpenSearch, `doc_count` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as `sum` and `avg`) on the terms bucket aggregation may also be approximate. For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of `URL` is high. -```ppl +```ppl ignore source=hits | stats bucket_nullable=false count() as c by URL | sort - c @@ -106,7 +106,7 @@ This query is pushed down to a terms bucket aggregation DSL query with `"order": Similar to above PPL query, the following query (find the rare 10 URLs) often produces inaccurate results. -```ppl +```ppl ignore source=hits | stats bucket_nullable=false count() as c by URL | sort + c