apache
diff --git a/‎docs-next/key-features/analytic-functions.mdx‎
Lines changed: 19 additions & 14 deletions b/‎docs-next/key-features/analytic-functions.mdx‎
Lines changed: 19 additions & 14 deletions
diff --git a/‎docs-next/key-features/batch-load.mdx‎
Lines changed: 11 additions & 7 deletions b/‎docs-next/key-features/batch-load.mdx‎
Lines changed: 11 additions & 7 deletions
@@ -20,42 +20,44 @@ featureCard:
     - performance
 ---
 
-> **TL;DR** Analytic functions (window functions) compute a value for every row of a result set without collapsing rows the way `GROUP BY` does. Doris supports the standard set, `ROW_NUMBER`, `RANK`, `DENSE_RANK`, `NTILE`, `LAG`, `LEAD`, `FIRST_VALUE`, `LAST_VALUE`, `NTH_VALUE`, `PERCENT_RANK`, `CUME_DIST`, plus every aggregate (`SUM`, `AVG`, `COUNT`, `MIN`, `MAX`) used with `OVER`. The Nereids optimizer also rewrites `WHERE row_num <= K` into a partition-aware Top-N below the window, so a "top 10 per category" query never sorts the whole partition.
+> **TL;DR** Apache Doris analytic functions (window functions) compute a value for every row of a result set without collapsing rows the way `GROUP BY` does. Apache Doris supports the standard set, `ROW_NUMBER`, `RANK`, `DENSE_RANK`, `NTILE`, `LAG`, `LEAD`, `FIRST_VALUE`, `LAST_VALUE`, `NTH_VALUE`, `PERCENT_RANK`, `CUME_DIST`, plus every aggregate (`SUM`, `AVG`, `COUNT`, `MIN`, `MAX`) used with `OVER`. The Nereids optimizer also rewrites `WHERE row_num <= K` into a partition-aware Top-N below the window, so a "top 10 per category" query never sorts the whole partition.
 
 ![Apache Doris Analytic Functions: SQL window functions in Doris rank, accumulate, lag, and average across rows; the planner pushes Top-N filters down into the window operator.](/images/next/key-features/analytic-functions.jpg)
-## Why it matters {#why}
+## Why use analytic functions in Apache Doris? {#why}
 
-Three questions show up in almost every analytics workload, and `GROUP BY` answers none of them on its own.
+Apache Doris analytic functions answer three questions that show up in almost every analytics workload and that `GROUP BY` cannot answer on its own.
 
 - "Number every order per customer in time order, then keep the first 10." A self-join on `(customer_id, order_date)` works but reads the table twice.
 - "What was the running total of sales by day, and the 7-day moving average?" `GROUP BY` collapses the rows you still want to see.
 - "How does each row compare to the previous row?" Year-over-year, period-over-period, gap-between-events, this is the lag/lead pattern, and writing it without window functions means a self-join with an offset predicate that the planner cannot optimize.
 
-Analytic functions answer all three in one pass over the data, with no self-join and the original rows preserved.
+Apache Doris analytic functions answer all three in one pass over the data, with no self-join and the original rows preserved.
 
-## What it is {#what}
+## What are Apache Doris analytic functions? {#what}
 
-An analytic function is a SQL function that, for each input row, computes a value over a *window* of related rows defined by an `OVER` clause. The window can be the whole partition, a fixed range around the current row, or anything in between. The output has the same number of rows as the input.
+An Apache Doris analytic function is a SQL function that, for each input row, computes a value over a *window* of related rows defined by an `OVER` clause. The window can be the whole partition, a fixed range around the current row, or anything in between. The output has the same number of rows as the input.
 
 ```
 function(args) OVER ( [PARTITION BY ...] [ORDER BY ...] [<frame>] )
 ```
 
 **Key terms**
 
-- **`OVER` clause**: tells Doris this call is a window function rather than a regular aggregate. Required.
+- **`OVER` clause**: tells Apache Doris this call is a window function rather than a regular aggregate. Required.
 - **`PARTITION BY`**: splits the input into independent groups. Each partition is computed on its own. Different from table partitions, this is a runtime concept.
 - **`ORDER BY` (inside `OVER`)**: orders rows within each partition. `LAG`, `LEAD`, `ROW_NUMBER`, `RANK`, and any frame with `PRECEDING`/`FOLLOWING` need it.
-- **Window frame**: the slice of the partition the function reads for the current row. Doris supports `ROWS BETWEEN ... PRECEDING/FOLLOWING/CURRENT ROW/UNBOUNDED ...` and a restricted form of `RANGE`.
+- **Window frame**: the slice of the partition the function reads for the current row. Apache Doris supports `ROWS BETWEEN ... PRECEDING/FOLLOWING/CURRENT ROW/UNBOUNDED ...` and a restricted form of `RANGE`.
 - **PartitionTopN**: an internal operator the planner inserts when it can prove a `WHERE rank <= K` filter only needs the top K rows per partition. Source: `CreatePartitionTopNFromWindow.java`.
 
-## How it works {#how}
+## How do Apache Doris analytic functions work? {#how}
+
+Apache Doris analytic functions run through a five-step pipeline: plan, shuffle, sort, evaluate, and push down.
 
 1. **Plan.** The optimizer parses the `OVER` clause into a `WindowExpression`, normalizes the frame (`CheckAndStandardizeWindowFunctionAndFrame`), and groups window calls that share the same `PARTITION BY` plus `ORDER BY` into a single physical window operator. Calls that share a partition and order do not pay for an extra sort.
 2. **Shuffle by partition.** If `PARTITION BY` is present, the engine shuffles rows so all rows for the same partition land on the same backend. With no `PARTITION BY`, the whole window runs in a single pipeline (the parallelism upper bound).
 3. **Sort within partition.** Each backend sorts its partitions by the `ORDER BY` columns. Ties produce a non-deterministic row order unless the `ORDER BY` is unique, which is why the docs warn that `SUM() OVER (ORDER BY date_col)` can return different results on tied dates.
-4. **Evaluate per row.** Doris walks each partition once, maintaining the window frame as it goes. Ranking functions emit one integer per row; aggregate functions over `ROWS UNBOUNDED PRECEDING` keep a running total; sliding-window aggregates add the new row and drop the row that fell off the back.
-5. **Push down filters and Top-N.** `CreatePartitionTopNFromWindow` turns `WHERE row_number() OVER (PARTITION BY a ORDER BY b) <= K` into a `PartitionTopN(K)` operator below the window, so each partition only carries the top K rows into the window operator. `PushDownFilterThroughWindow` lifts filters on `PARTITION BY` columns past the window, so Doris filters before sorting instead of after.
+4. **Evaluate per row.** Apache Doris walks each partition once, maintaining the window frame as it goes. Ranking functions emit one integer per row; aggregate functions over `ROWS UNBOUNDED PRECEDING` keep a running total; sliding-window aggregates add the new row and drop the row that fell off the back.
+5. **Push down filters and Top-N.** `CreatePartitionTopNFromWindow` turns `WHERE row_number() OVER (PARTITION BY a ORDER BY b) <= K` into a `PartitionTopN(K)` operator below the window, so each partition only carries the top K rows into the window operator. `PushDownFilterThroughWindow` lifts filters on `PARTITION BY` columns past the window, so Apache Doris filters before sorting instead of after.
 
 ## Quick start {#quick-start}
 
@@ -92,11 +94,13 @@ FROM orders;
 
 One pass, three windowed columns, no self-joins. The original five rows are preserved. `LAG(..., 1, 0)` returns `0` instead of `NULL` for the first row of each partition.
 
-## When to use it {#when}
+## When should you use Apache Doris analytic functions? {#when}
+
+Apache Doris analytic functions fit any row-preserving computation that depends on neighboring rows, including Top-N per group, running totals, lag/lead comparisons, percentile bucketing, and share-of-total ratios.
 
 **Good fit**
 
-- Top-N per group: "top 10 orders per customer," "best-selling product per region." Add a `WHERE rn <= 10` filter and Doris pushes a partition Top-N below the window.
+- Top-N per group: "top 10 orders per customer," "best-selling product per region." Add a `WHERE rn <= 10` filter and Apache Doris pushes a partition Top-N below the window.
 - Running totals, moving averages, and centered moving averages with `ROWS BETWEEN n PRECEDING AND m FOLLOWING`.
 - Year-over-year, day-over-day, and gap-between-events analysis with `LAG` and `LEAD`. One pass over the table replaces a self-join.
 - Bucketing for percentile or quartile reports with `NTILE`.
@@ -105,7 +109,7 @@ One pass, three windowed columns, no self-joins. The original five rows are pres
 **Not a good fit**
 
 - A pure aggregation that collapses rows. `SELECT category, SUM(amount) FROM t GROUP BY category` does the same work without sorting and without keeping every row in memory. Reach for `GROUP BY` first, and only switch to `OVER` when you also need the unaggregated columns.
-- `RANGE` frames with numeric offsets, like `RANGE BETWEEN 5 PRECEDING AND CURRENT ROW`. Doris's `RANGE` frame is restricted to `UNBOUNDED` boundaries or `CURRENT ROW`; arbitrary `RANGE n PRECEDING/FOLLOWING` is not supported. Use `ROWS` if you need a numeric offset.
+- `RANGE` frames with numeric offsets, like `RANGE BETWEEN 5 PRECEDING AND CURRENT ROW`. The Apache Doris `RANGE` frame is restricted to `UNBOUNDED` boundaries or `CURRENT ROW`; arbitrary `RANGE n PRECEDING/FOLLOWING` is not supported. Use `ROWS` if you need a numeric offset.
 - A window with no `PARTITION BY`. The whole result set lands on one pipeline, and that pipeline becomes the bottleneck on large inputs. Add a partition key whenever the workload allows it.
 - `ORDER BY` on a non-unique column when you care about deterministic output. `SUM(x) OVER (ORDER BY day)` can return different cumulative totals across runs when several rows share the same day. Add a tie-breaker; see [Window Functions Overview](../sql-manual/sql-functions/window-functions/overview).
 - Recomputing the same window result on every refresh. If the same `ROW_NUMBER()` query runs every minute against a slow-moving table, an [async materialized view](../query-acceleration/materialized-view/async-materialized-view/overview) with a partial refresh is cheaper than re-windowing each time.
@@ -118,3 +122,4 @@ One pass, three windowed columns, no self-joins. The original five rows are pres
 - [Pipeline Execution Engine: how partitioned windows get parallelized across BEs](./pipeline-execution-engine)
 - [Vectorized Execution](./vectorized-execution): the engine that runs each window's batch through SIMD-accelerated operators.
 - [Async Materialized View: precompute window results that re-run every minute](../query-acceleration/materialized-view/async-materialized-view/overview)
+- [MPP Architecture](./mpp): how window/analytic operators are shuffled and partitioned across BEs.
@@ -20,22 +20,22 @@ featureCard:
     - bulk-ingestion
 ---
 
-> **TL;DR** Batch Load is how you move large files from S3, HDFS, or another warehouse into Doris in one shot. You submit a `LOAD LABEL` statement, the FE plans the work and fans it out to BEs, and you check progress with `SHOW LOAD`. The job is asynchronous, so the client can disconnect; the label is what lets you find the result and what dedupes a retry. For [external catalogs](./multi-catalog) and TVF reads, the synchronous `INSERT INTO ... SELECT` form covers the same ground with simpler ETL.
+> **TL;DR** Apache Doris Batch Load moves large files from S3, HDFS, or another warehouse into Doris in one shot. You submit a `LOAD LABEL` statement, the FE plans the work and fans it out to BEs, and you check progress with `SHOW LOAD`. The job is asynchronous, so the client can disconnect; the label is what lets you find the result and what dedupes a retry. For [external catalogs](./multi-catalog) and TVF reads, the synchronous `INSERT INTO ... SELECT` form covers the same ground with simpler ETL.
 
 ![Apache Doris Batch Load: Asynchronous bulk ingestion from S3, HDFS, and external catalogs into Doris via LOAD LABEL or INSERT INTO SELECT, with label-based dedup.](/images/next/key-features/batch-load.jpg)
-## Why it matters {#why}
+## Why use batch load in Apache Doris? {#why}
 
-Real-time loaders are the right tool for events trickling in, but they fall apart when you need to move a TB of historical data, or rebuild a fact table from a Hive snapshot, or backfill last quarter from Parquet on S3. A [Stream Load](./stream-load) over HTTP times out. A per-row `INSERT` would take days. The job has to run server-side, retry on its own, and survive the client losing its connection.
+Apache Doris Batch Load is the right tool when a single ingestion job runs for hours and the warehouse, not the client, has to own its lifecycle. Real-time loaders are the right tool for events trickling in, but they fall apart when you need to move a TB of historical data, or rebuild a fact table from a Hive snapshot, or backfill last quarter from Parquet on S3. A [Stream Load](./stream-load) over HTTP times out. A per-row `INSERT` would take days. The job has to run server-side, retry on its own, and survive the client losing its connection.
 
 - A nightly ETL has to land a few hundred GB of Parquet from S3 before the morning dashboards refresh.
 - A migration from Snowflake or Hive needs to copy whole partitions, with column-level filters and type casts.
 - A backfill of one missing day cannot block the connection that submitted it.
 
 Batch Load handles all three by treating large bulk ingestion as a long-running, label-tracked job that the warehouse owns end to end.
 
-## What it is {#what}
+## What is Apache Doris batch load? {#what}
 
-Batch Load is the family of asynchronous bulk-load methods in Doris. The main entry point is the `LOAD LABEL ... WITH S3|HDFS|BROKER` statement, historically called Broker Load, which today covers S3, HDFS, and any storage system reachable through a Broker process. The synchronous `INSERT INTO ... SELECT` variant covers the same ingestion needs when the source is an external catalog (Hive, Iceberg, JDBC) or a file behind a TVF such as `S3()` or `HDFS()`. Both share the same transaction model as the rest of Doris loading: one label per job, atomic commit, replica-level publish.
+Apache Doris Batch Load is the family of asynchronous bulk-load methods in Doris. The main entry point is the `LOAD LABEL ... WITH S3|HDFS|BROKER` statement, historically called Broker Load, which today covers S3, HDFS, and any storage system reachable through a Broker process. The synchronous `INSERT INTO ... SELECT` variant covers the same ingestion needs when the source is an external catalog (Hive, Iceberg, JDBC) or a file behind a TVF such as `S3()` or `HDFS()`. Both share the same transaction model as the rest of Doris loading: one label per job, atomic commit, replica-level publish.
 
 **Key terms**
 
@@ -45,7 +45,9 @@ Batch Load is the family of asynchronous bulk-load methods in Doris. The main en
 - **TVF (Table-Valued Function)**: `S3(...)`, `HDFS(...)`, `LOCAL(...)`, `iceberg_meta(...)`, and the like. They expose files or external metadata as a table you can `SELECT` from, then pipe into `INSERT INTO ... SELECT` for synchronous ingestion.
 - **`max_filter_ratio`**: the per-job tolerance for malformed rows. Default is zero, so a single bad row cancels the load.
 
-## How it works {#how}
+## How does Apache Doris batch load work? {#how}
+
+Apache Doris Batch Load registers a labeled job on the FE, parallelizes file scans across BEs, and atomically commits the result once every replica publishes the new version.
 
 1. **Submit and label.** A `LOAD LABEL my_db.daily_orders ...` request lands on the FE. The FE registers the label, opens a transaction, and the load enters `PENDING`. Resubmitting the same label inside the retention window short-circuits to the original job.
 2. **Plan and fan out.** The FE estimates file sizes (one BE handles between `min_bytes_per_broker_scanner` and `max_bytes_per_broker_scanner`, default 64 MB to 500 GB), splits the work, and dispatches scanner tasks to BEs. Job state moves through `LOADING`.
@@ -87,7 +89,9 @@ PROPERTIES ("timeout" = "3600", "max_filter_ratio" = "0.01");
 
 `SHOW LOAD ORDER BY CreateTime DESC LIMIT 1` returns when the job is `FINISHED`. Up to 1% of malformed rows are tolerated; anything more cancels the job and the URL field points at a sample of the offending rows.
 
-## When to use it {#when}
+## When should you use Apache Doris batch load? {#when}
+
+Use Apache Doris Batch Load when a single ingestion run handles tens of GB or more from object storage, an external warehouse, or a Hive-partitioned directory.
 
 **Good fit**