From 5121ed2edaa0336ddb377a9c15abef2e00957c39 Mon Sep 17 00:00:00 2001
From: Lantao Jin <ltjin@amazon.com>
Date: Tue, 9 Dec 2025 17:42:07 +0800
Subject: [PATCH 1/4] [DOC] Call out that aggregation results may be approximate

Signed-off-by: Lantao Jin <ltjin@amazon.com>
---
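
Note (not part of the patch): the limitation documented below can be
illustrated with a minimal Python sketch. It assumes a simplified model in
which each shard reports only its top ``shard_size`` buckets by local count
and the coordinator merges those partial lists. The shard data, the function
names, and ``shard_size=1`` are hypothetical and chosen only to make the
effect visible; real clusters use a larger shard size, which reduces but does
not remove the error.

```python
from collections import Counter


def shard_top(counts, shard_size, descending=True):
    """Return the buckets one shard would report, ordered by local doc count."""
    ordered = sorted(
        counts.items(),
        key=lambda kv: (-kv[1] if descending else kv[1], kv[0]),
    )
    return dict(ordered[:shard_size])


def merged_top(shards, size, shard_size, descending=True):
    """Merge the per-shard partial results, then keep `size` buckets."""
    merged = Counter()
    for counts in shards:
        merged.update(shard_top(counts, shard_size, descending))
    ordered = sorted(
        merged.items(),
        key=lambda kv: (-kv[1] if descending else kv[1], kv[0]),
    )
    return ordered[:size]


# Hypothetical data: "/c" is never the local leader but is the global leader.
shards = [
    {"/a": 5, "/c": 4, "/b": 1},
    {"/b": 5, "/c": 4, "/a": 1},
]

exact = Counter()
for counts in shards:
    exact.update(counts)

approx_top1 = merged_top(shards, size=1, shard_size=1)
exact_top1 = exact.most_common(1)
approx_rare1 = merged_top(shards, size=1, shard_size=1, descending=False)

print(approx_top1)   # [('/a', 5)]
print(exact_top1)    # [('/c', 8)]
print(approx_rare1)  # [('/a', 1)]
print(exact["/a"])   # 6
```

In this toy model the descending query misses the true top URL entirely, and
the ascending query reports a count of 1 for a URL that actually occurs 6
times.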
 docs/user/ppl/cmd/stats.rst              | 33 ++++++++++++++++++++++++
 docs/user/ppl/functions/aggregations.rst |  2 +-
 docs/user/ppl/index.rst                  |  2 +-
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/docs/user/ppl/cmd/stats.rst b/docs/user/ppl/cmd/stats.rst
index cae65c84c79..aace92427d0 100644
--- a/docs/user/ppl/cmd/stats.rst
+++ b/docs/user/ppl/cmd/stats.rst
@@ -68,6 +68,39 @@ The stats command supports the following aggregation functions:
 
 For detailed documentation of each function, see `Aggregation Functions <../functions/aggregations.rst>`_.
 
+Limitations:
+============
+
+The bucket aggregation result may be approximate in large dataset
+-----------------------------------------------------------------
+
+In OpenSearch, ``doc_count`` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as ``sum`` and ``avg``) on the terms bucket aggregation may also be approximate.
+For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of ``URL`` is high.
+
+PPL query::
+
+    > source=hits
+      | stats bucket_nullable=false count() as c by URL
+      | sort - c
+      | head 10
+
+This query is pushed down to a terms bucket aggregation DSL query with ``"order": { "_count": "desc" }``. In OpenSearch, this terms aggregation may throw away some buckets.
+
+
+Sorting by ascending doc_count may produce inaccurate results
+-------------------------------------------------------------
+
+Similar to the PPL query above, the following query (find the 10 rarest URLs) often produces inaccurate results.
+
+PPL query::
+
+    > source=hits
+      | stats bucket_nullable=false count() as c by URL
+      | sort + c
+      | head 10
+
+A term that is globally infrequent might not appear as infrequent on every individual shard or might be entirely absent from the least frequent results returned by some shards. Conversely, a term that appears infrequently on one shard might be common on another. In both scenarios, rare terms can be missed during shard-level aggregation, resulting in incorrect overall results.
+
 Example 1: Calculate the count of events
 ========================================
 
diff --git a/docs/user/ppl/functions/aggregations.rst b/docs/user/ppl/functions/aggregations.rst
index 6605bda0765..20a236bd9e5 100644
--- a/docs/user/ppl/functions/aggregations.rst
+++ b/docs/user/ppl/functions/aggregations.rst
@@ -11,7 +11,7 @@ Aggregation Functions
 
 Description
 ============
-| Aggregation functions perform calculations across multiple rows to return a single result value. These functions are used with ``stats`` and ``eventstats`` commands to analyze and summarize data.
+| Aggregation functions perform calculations across multiple rows to return a single result value. These functions are used with ``stats``, ``eventstats`` and ``streamstats`` commands to analyze and summarize data.
 
 | The following table shows how NULL/MISSING values are handled by aggregation functions:
 
diff --git a/docs/user/ppl/index.rst b/docs/user/ppl/index.rst
index 981b2de3169..aeb661e7603 100644
--- a/docs/user/ppl/index.rst
+++ b/docs/user/ppl/index.rst
@@ -102,7 +102,7 @@ The query start with search command and then flowing a set of command delimited
 
 * **Functions**
 
-  - `Aggregation Functions <functions/aggregation.rst>`_
+  - `Aggregation Functions <functions/aggregations.rst>`_
 
   - `Collection Functions <functions/collection.rst>`_
 

From e2cc21b64b47b25a030ed12c394c68263b4d9f75 Mon Sep 17 00:00:00 2001
From: Lantao Jin <ltjin@amazon.com>
Date: Tue, 9 Dec 2025 17:47:36 +0800
Subject: [PATCH 2/4] Add to limitations.rst

Signed-off-by: Lantao Jin <ltjin@amazon.com>
---
 docs/user/ppl/cmd/stats.rst               |  5 ++--
 docs/user/ppl/limitations/limitations.rst | 31 +++++++++++++++++++++++
 2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/docs/user/ppl/cmd/stats.rst b/docs/user/ppl/cmd/stats.rst
index aace92427d0..ba0eae0f3d1 100644
--- a/docs/user/ppl/cmd/stats.rst
+++ b/docs/user/ppl/cmd/stats.rst
@@ -71,8 +71,8 @@ For detailed documentation of each function, see `Aggregation Functions <../func
 Limitations:
 ============
 
-The bucket aggregation result may be approximate in large dataset
------------------------------------------------------------------
+Bucket aggregation results may be approximate in large datasets
+---------------------------------------------------------------
 
 In OpenSearch, ``doc_count`` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as ``sum`` and ``avg``) on the terms bucket aggregation may also be approximate.
 For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of ``URL`` is high.
@@ -86,7 +86,6 @@ PPL query::
 
 This query is pushed down to a terms bucket aggregation DSL query with ``"order": { "_count": "desc" }``. In OpenSearch, this terms aggregation may throw away some buckets.
 
-
 Sorting by ascending doc_count may produce inaccurate results
 -------------------------------------------------------------
 
diff --git a/docs/user/ppl/limitations/limitations.rst b/docs/user/ppl/limitations/limitations.rst
index 41d3a007d23..936e3c07d61 100644
--- a/docs/user/ppl/limitations/limitations.rst
+++ b/docs/user/ppl/limitations/limitations.rst
@@ -130,3 +130,34 @@ If a document contains malformed field names inside an object field, PPL ignores
 When ``log`` is an object field with ``enabled: false``, subfields with malformed names are ignored.
 
 **Recommendation:** Avoid using field names that contain leading dots, trailing dots, consecutive dots, or consist only of dots. This aligns with OpenSearch's default field naming requirements.
+
+
+Bucket Aggregation Results May Be Approximate in Large Datasets
+===============================================================
+
+In OpenSearch, ``doc_count`` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as ``sum`` and ``avg``) on the terms bucket aggregation may also be approximate.
+For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of ``URL`` is high.
+
+PPL query::
+
+    > source=hits
+      | stats bucket_nullable=false count() as c by URL
+      | sort - c
+      | head 10
+
+This query is pushed down to a terms bucket aggregation DSL query with ``"order": { "_count": "desc" }``. In OpenSearch, each shard returns only its top buckets for a terms aggregation, so some buckets may be discarded before the results are merged.
+
+
+Sorting by Ascending doc_count May Produce Inaccurate Results
+=============================================================
+
+Similar to the PPL query above, the following query (find the 10 rarest URLs) often produces inaccurate results.
+
+PPL query::
+
+    > source=hits
+      | stats bucket_nullable=false count() as c by URL
+      | sort + c
+      | head 10
+
+A term that is globally infrequent might not appear as infrequent on every individual shard or might be entirely absent from the least frequent results returned by some shards. Conversely, a term that appears infrequently on one shard might be common on another. In both scenarios, rare terms can be missed during shard-level aggregation, resulting in incorrect overall results.

From 4515592d9b339c64e0007afceed5154e84f5359c Mon Sep 17 00:00:00 2001
From: Lantao Jin <ltjin@amazon.com>
Date: Wed, 10 Dec 2025 13:19:34 +0800
Subject: [PATCH 3/4] Restore docs/user/limitations/limitations.rst

Signed-off-by: Lantao Jin <ltjin@amazon.com>
---
 docs/user/limitations/limitations.rst | 154 ++++++++++++++++++++++++++
 1 file changed, 154 insertions(+)
 create mode 100644 docs/user/limitations/limitations.rst

diff --git a/docs/user/limitations/limitations.rst b/docs/user/limitations/limitations.rst
new file mode 100644
index 00000000000..bc55b0045b8
--- /dev/null
+++ b/docs/user/limitations/limitations.rst
@@ -0,0 +1,154 @@
+
+===========
+Limitations
+===========
+
+.. rubric:: Table of contents
+
+.. contents::
+   :local:
+   :depth: 2
+
+
+Introduction
+============
+
+This document covers the restrictions and limitations of the SQL plugin.
+
+Limitations on Identifiers
+==========================
+
+Using an OpenSearch cluster name as the data source name to qualify an index name, such as ``my_cluster.my_index``, is not currently supported.
+
+Limitations on Fields
+=====================
+
+Using an `alias field type <https://www.elastic.co/guide/en/elasticsearch/reference/current/alias.html>`_ as an identifier is not supported. Doing so throws the exception ``can't resolve Symbol``.
+
+
+Limitations on Aggregations
+===========================
+
+Aggregation over an expression is not currently supported. Aggregations can be applied only to fields and cannot accept an expression as a parameter. For example, ``avg(log(age))`` is not supported.
+
+Here's a link to the GitHub issue - `Issue 288 <https://github.com/opendistro-for-elasticsearch/sql/issues/288>`_.
+
+
+Limitations on JOINs
+====================
+
+The ``JOIN`` query does not support aggregations on the joined result.
+For example, ``SELECT depo.name, avg(empo.age) FROM empo JOIN depo WHERE empo.id == depo.id GROUP BY depo.name`` is not supported.
+
+Here's a link to the GitHub issue - `Issue 110 <https://github.com/opendistro-for-elasticsearch/sql/issues/110>`_.
+
+
+Limitations on Window Functions
+===============================
+
+For now, only fields defined in the index are allowed; calculated fields (computed by scalar or aggregate functions) are not. For example, neither ``avg_flight_time`` nor ``AVG(FlightTimeMin)`` is accessible to the rank window definition in the following query::
+
+    SELECT OriginCountry, AVG(FlightTimeMin) AS avg_flight_time,
+           RANK() OVER (ORDER BY avg_flight_time) AS rnk
+    FROM opensearch_dashboards_sample_data_flights
+    GROUP BY OriginCountry
+
+Another limitation is that a window function currently cannot be nested in another expression, for example, ``CASE WHEN RANK() OVER(...) THEN ...``.
+
+The workaround for both limitations above is to use a subquery in the FROM clause::
+
+    SELECT
+      SUM(t.avg_flight_time) OVER(...)
+    FROM (
+        SELECT OriginCountry, AVG(FlightTimeMin) AS avg_flight_time
+        FROM opensearch_dashboards_sample_data_flights
+        GROUP BY OriginCountry
+    ) AS t
+
+Limitations on Pagination
+=========================
+
+The pagination query enables you to get back paginated responses. Pagination currently supports only basic queries.
+For example, the following query returns data with a cursor ID::
+
+    POST _plugins/_sql/
+    {
+      "fetch_size" : 5,
+      "query" : "SELECT OriginCountry, DestCountry FROM opensearch_dashboards_sample_data_flights ORDER BY OriginCountry ASC"
+    }
+
+The response in JDBC format with cursor id::
+
+    {
+      "schema": [
+        {
+          "name": "OriginCountry",
+          "type": "keyword"
+        },
+        {
+          "name": "DestCountry",
+          "type": "keyword"
+        }
+      ],
+      "cursor": "d:eyJhIjp7fSwicyI6IkRYRjFaWEo1UVc1a1JtVjBZMmdCQUFBQUFBQUFCSllXVTJKVU4yeExiWEJSUkhsNFVrdDVXVEZSYkVKSmR3PT0iLCJjIjpbeyJuYW1lIjoiT3JpZ2luQ291bnRyeSIsInR5cGUiOiJrZXl3b3JkIn0seyJuYW1lIjoiRGVzdENvdW50cnkiLCJ0eXBlIjoia2V5d29yZCJ9XSwiZiI6MSwiaSI6ImtpYmFuYV9zYW1wbGVfZGF0YV9mbGlnaHRzIiwibCI6MTMwNTh9",
+      "total": 13059,
+      "datarows": [[
+        "AE",
+        "CN"
+      ]],
+      "size": 1,
+      "status": 200
+    }
+
+Queries with aggregations or joins do not support pagination for now.
+
+Limitations on Using Multi-valued Fields
+========================================
+
+OpenSearch does not natively support the ARRAY data type but does allow multi-value fields implicitly. The
+SQL/PPL plugin adheres strictly to the data type semantics defined in index mappings. When parsing OpenSearch
+responses, it expects data to match the declared type and does not account for data in array format. If the
+``plugins.query.field_type_tolerance`` setting is enabled, the SQL/PPL plugin will handle array datasets by returning
+scalar data types, allowing basic queries (e.g., SELECT * FROM tbl WHERE condition). However, using multi-value
+fields in expressions or functions will result in exceptions. If this setting is disabled or absent, only the
+first element of an array is returned, preserving the default behavior.
+
+For example, the following query tries to calculate the absolute value of a field that contains arrays of
+longs::
+
+    POST _plugins/_sql/
+    {
+      "query": "SELECT id, ABS(long_array) FROM multi_value_long"
+    }
+The response in JSON format is::
+
+    {
+      "error": {
+        "reason": "Invalid SQL query",
+        "details": "invalid to get longValue from value of type ARRAY",
+        "type": "ExpressionEvaluationException"
+      },
+      "status": 400
+    }
+
+Limitations on Calcite Engine
+=============================
+
+Since 3.0.0, Apache Calcite has been introduced as an experimental query engine. See `introduce v3 engine <../../../dev/intro-v3-engine.md>`_.
+For the following functionalities, the query is forwarded to the V2 query engine, which means they cannot be used together with the new PPL commands and functions introduced in 3.0.0 and later.
+
+* All SQL queries
+
+* PPL queries against non-OpenSearch data sources
+
+* ``dedup`` with ``consecutive=true``
+
+* Search relevant commands
+
+    * AD
+    * ML
+    * Kmeans
+
+* ``show datasources`` command
+
+* Commands with ``fetch_size`` parameter

From c36888c956a34a13fd244eba2db12289a9d957fc Mon Sep 17 00:00:00 2001
From: Lantao Jin <ltjin@amazon.com>
Date: Thu, 11 Dec 2025 12:42:53 +0800
Subject: [PATCH 4/4] Add ignore flag to PPL code fences

Signed-off-by: Lantao Jin <ltjin@amazon.com>
---
 docs/user/ppl/cmd/stats.md               | 4 ++--
 docs/user/ppl/limitations/limitations.md | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/user/ppl/cmd/stats.md b/docs/user/ppl/cmd/stats.md
index e801e0f3604..000d910d97b 100644
--- a/docs/user/ppl/cmd/stats.md
+++ b/docs/user/ppl/cmd/stats.md
@@ -56,7 +56,7 @@ For detailed documentation of each function, see [Aggregation Functions](../func
 In OpenSearch, `doc_count` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as `sum` and `avg`) on the terms bucket aggregation may also be approximate.
 For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of `URL` is high.
 
-```ppl
+```ppl ignore
 source=hits
 | stats bucket_nullable=false count() as c by URL
 | sort - c
@@ -69,7 +69,7 @@ This query is pushed down to a terms bucket aggregation DSL query with `"order":
 
 Similar to above PPL query, the following query (find the rare 10 URLs) often produces inaccurate results.
 
-```ppl
+```ppl ignore
 source=hits
 | stats bucket_nullable=false count() as c by URL
 | sort + c
diff --git a/docs/user/ppl/limitations/limitations.md b/docs/user/ppl/limitations/limitations.md
index 910549adc8c..53adc1f072d 100644
--- a/docs/user/ppl/limitations/limitations.md
+++ b/docs/user/ppl/limitations/limitations.md
@@ -93,7 +93,7 @@ When ``log`` is an object field with ``enabled: false``, subfields with malforme
 In OpenSearch, `doc_count` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as `sum` and `avg`) on the terms bucket aggregation may also be approximate.
 For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of `URL` is high.
 
-```ppl
+```ppl ignore
 source=hits
 | stats bucket_nullable=false count() as c by URL
 | sort - c
@@ -106,7 +106,7 @@ This query is pushed down to a terms bucket aggregation DSL query with `"order":
 
 Similar to above PPL query, the following query (find the rare 10 URLs) often produces inaccurate results.
 
-```ppl
+```ppl ignore
 source=hits
 | stats bucket_nullable=false count() as c by URL
 | sort + c