Commit 98cdaee

jiteshsoni authored and HeartSaVioR committed
[SPARK-55450][SS][PYTHON][DOCS] Document admission control in PySpark streaming data sources
### What changes were proposed in this pull request?

This PR adds documentation and an example for admission control in PySpark custom streaming data sources (SPARK-55304). Changes include:

1. **Updated tutorial documentation** (`python/docs/source/tutorial/sql/python_data_source.rst`):
   - Added an "Admission Control for Streaming Readers" section
   - Documents `getDefaultReadLimit()` returning `ReadMaxRows(n)` to limit batch size
   - Shows how `latestOffset(start, limit)` respects the `ReadLimit` parameter

2. **Example file** (`examples/src/main/python/sql/streaming/structured_blockchain_admission_control.py`):
   - Demonstrates admission control via `getDefaultReadLimit()` and `latestOffset()`
   - Simulates a blockchain data source with controlled batch sizes (20 blocks per batch)
   - A simple, focused example showing backpressure management

### Why are the changes needed?

Users need documentation and practical examples to implement admission control in custom streaming sources (introduced in SPARK-55304).

### Does this PR introduce _any_ user-facing change?

No. Documentation and examples only.

### How was this patch tested?

**Testing approach:**
- Ran the example on Databricks Dogfood Staging (DBR 17.3 / Spark 4.0)
- Used the Spark Streaming UI to verify admission control works correctly

**Test notebook:** [pr_54807_admission_control_notebook](https://dogfood.staging.databricks.com/editor/notebooks/1113954931051543?o=6051921418418893#command/7790625346196924)

**What was verified:**
1. **Batch sizes:** Each micro-batch processed exactly 20 blocks (admission control working)
2. **Consistent behavior:** 79 batches completed in ~28 seconds, all with 20 rows
3. **Stream reader:** `PythonMicroBatchStreamWithAdmissionControl` active in the Streaming UI

**Sample batch output:**
```json
{
  "batchId": 78,
  "numInputRows": 20,
  "sources": [{
    "description": "PythonMicroBatchStreamWithAdmissionControl",
    "startOffset": {"block_number": 1560},
    "endOffset": {"block_number": 1580},
    "numInputRows": 20
  }]
}
```

### Was this patch authored or co-authored using generative AI tooling?

Yes (Claude Opus 4.5)

🤖 Generated with [Claude Code](https://claude.ai/code)

Closes #54807 from jiteshsoni/SPARK-55450-admission-control-docs.

Lead-authored-by: Jitesh Soni <get2jitesh@gmail.com>
Co-authored-by: Canadian Data Guy <get2jitesh@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
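The verified numbers above hang together: under `ReadMaxRows(20)`, `numInputRows` should equal the difference of the `block_number` offsets, and batch 78 starting at block 1560 matches 78 completed 20-block batches before it. A quick pure-Python sanity check (the `rows_in_batch` helper is hypothetical, written only for this check):

```python
def rows_in_batch(start_offset: dict, end_offset: dict) -> int:
    # Rows in a micro-batch are the gap between its block_number offsets.
    return end_offset["block_number"] - start_offset["block_number"]

# Sample batch 78 from the progress report above.
start, end = {"block_number": 1560}, {"block_number": 1580}
assert rows_in_batch(start, end) == 20   # matches numInputRows
assert 78 * 20 == start["block_number"]  # 78 prior 20-block batches
```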
1 parent 9dbe381 commit 98cdaee

File tree

2 files changed: +252 −0 lines changed

examples/src/main/python/sql/streaming/structured_blockchain_admission_control.py

Lines changed: 202 additions & 0 deletions
```python
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""
Demonstrates admission control using a simulated blockchain data source.

This example shows how to build a custom streaming data source that
simulates reading blockchain blocks while respecting admission control
limits using getDefaultReadLimit() and ReadMaxRows(20).

Key concepts demonstrated:
- getDefaultReadLimit() returning ReadMaxRows(20) to limit blocks per micro-batch
- latestOffset(start, limit) respecting the ReadLimit parameter
- Controlled data ingestion rate for backpressure management

Usage:
    bin/spark-submit examples/src/main/python/sql/streaming/\
structured_blockchain_admission_control.py

Expected output:
    Each micro-batch processes up to 20 blocks (controlled by admission control):
    Batch 0: blocks 0-19
    Batch 1: blocks 20-39
    Batch 2: blocks 40-59
    ...
    The final batch may contain fewer than 20 blocks when the chain is exhausted.
"""

import hashlib
import time
from typing import Iterator, Sequence, Tuple

from pyspark.sql import SparkSession
from pyspark.sql.datasource import DataSource, DataSourceStreamReader, InputPartition
from pyspark.sql.streaming.datasource import ReadAllAvailable, ReadLimit, ReadMaxRows
from pyspark.sql.types import StructType


class BlockPartition(InputPartition):
    """Partition representing a range of blockchain blocks to read."""

    def __init__(self, start_block: int, end_block: int):
        self.start_block = start_block
        self.end_block = end_block


class BlockchainStreamReader(DataSourceStreamReader):
    """
    A streaming reader that simulates reading blockchain blocks.

    Demonstrates admission control via getDefaultReadLimit() which limits
    the number of blocks processed per micro-batch.
    """

    CHAIN_HEIGHT = 10000  # Total blocks available

    def initialOffset(self) -> dict:
        """Return the starting block number for new queries."""
        return {"block_number": 0}

    def getDefaultReadLimit(self) -> ReadLimit:
        """
        Limit each micro-batch to 20 blocks.

        This controls the data ingestion rate, useful for:
        - Preventing memory issues with large batches
        - Rate limiting when reading from external APIs
        - Backpressure management
        """
        return ReadMaxRows(20)

    def latestOffset(self, start: dict, limit: ReadLimit) -> dict:
        """
        Compute the ending block number respecting admission control.

        Parameters
        ----------
        start : dict
            Current offset with 'block_number' key
        limit : ReadLimit
            Engine-provided limit on data consumption

        Returns
        -------
        dict
            Ending offset for this micro-batch
        """
        start_block = start["block_number"]

        if start_block >= self.CHAIN_HEIGHT:
            return start  # No more data

        if isinstance(limit, ReadMaxRows):
            end_block = min(start_block + limit.max_rows, self.CHAIN_HEIGHT)
        elif isinstance(limit, ReadAllAvailable):
            end_block = self.CHAIN_HEIGHT
        else:
            raise ValueError(f"Unexpected ReadLimit type: {type(limit)}")

        return {"block_number": end_block}

    def partitions(self, start: dict, end: dict) -> Sequence[InputPartition]:
        """Create a single partition for the block range."""
        start_block = start["block_number"]
        end_block = end["block_number"]

        if start_block >= end_block:
            return []

        return [BlockPartition(start_block, end_block)]

    def read(self, partition: InputPartition) -> Iterator[Tuple]:
        """
        Generate simulated blockchain block data.

        Each block contains:
        - block_number: Sequential block identifier
        - block_hash: Simulated hash based on block number
        - timestamp: Simulated timestamp
        - transaction_count: Simulated transaction count
        """
        assert isinstance(partition, BlockPartition)

        for block_num in range(partition.start_block, partition.end_block):
            block_hash = hashlib.sha256(str(block_num).encode()).hexdigest()[:16]
            timestamp = 1700000000 + (block_num * 12)
            tx_count = (block_num % 100) + 1

            yield (block_num, block_hash, timestamp, tx_count)

    def commit(self, end: dict) -> None:
        """Cleanup after batch completion."""
        pass


class BlockchainDataSource(DataSource):
    """Data source that creates BlockchainStreamReader instances."""

    @classmethod
    def name(cls) -> str:
        return "blockchain_example"

    def schema(self) -> str:
        return "block_number INT, block_hash STRING, timestamp LONG, transaction_count INT"

    def streamReader(self, schema: StructType) -> DataSourceStreamReader:
        return BlockchainStreamReader()


def main() -> None:
    """Run blockchain streaming example demonstrating admission control."""
    spark = SparkSession.builder.appName("BlockchainAdmissionControl").getOrCreate()

    spark.dataSource.register(BlockchainDataSource)

    print("\n" + "=" * 70)
    print("BLOCKCHAIN STREAMING WITH ADMISSION CONTROL")
    print("=" * 70)
    print("\nData Source: Simulated blockchain with 10000 blocks")
    print("Admission Control: getDefaultReadLimit() returns ReadMaxRows(20)")
    print("Expected: Each batch processes up to 20 blocks")
    print()

    df = spark.readStream.format("blockchain_example").load()

    query = (
        df.writeStream
        .format("console")
        .queryName("admission_control_test")
        .start()
    )

    print("Streaming query started. Check the Streaming UI to verify:")
    print("- Each full batch should process 20 blocks")
    print("- The final batch may be smaller when fewer than 20 blocks remain")
    print()

    time.sleep(30)

    query.stop()
    print("\nQuery stopped - check Streaming UI for results")
    print("=" * 70)

    spark.stop()


if __name__ == "__main__":
    main()
```
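Because the simulated block fields are derived with the standard library only, they can be spot-checked outside Spark. A sketch mirroring the tuple layout `read()` yields (the `make_block` helper is hypothetical, not part of the example file):

```python
import hashlib

def make_block(block_num: int):
    # Same derivations as BlockchainStreamReader.read():
    # (block_number, block_hash, timestamp, transaction_count)
    block_hash = hashlib.sha256(str(block_num).encode()).hexdigest()[:16]
    timestamp = 1700000000 + block_num * 12  # one block every 12 seconds
    tx_count = (block_num % 100) + 1         # cycles through 1..100
    return (block_num, block_hash, timestamp, tx_count)

b = make_block(0)
assert b[2] == 1700000000 and b[3] == 1
assert len(b[1]) == 16  # truncated sha256 hex digest
```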

python/docs/source/tutorial/sql/python_data_source.rst

Lines changed: 50 additions & 0 deletions
```rst
        """
        pass

Admission Control for Streaming Readers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To limit the amount of data processed per micro-batch, implement
``getDefaultReadLimit()`` and have ``latestOffset(start, limit)`` honor the
engine-provided ``ReadLimit``:

.. code-block:: python

    from pyspark.sql.streaming.datasource import ReadAllAvailable, ReadLimit, ReadMaxRows

    class MyStreamReader(DataSourceStreamReader):

        def getDefaultReadLimit(self) -> ReadLimit:
            """
            Limit each micro-batch to at most 20 rows.

            This value is just an example; in practice, configure the limit
            based on source options (e.g., self.options.get("maxRowsPerBatch")).
            """
            return ReadMaxRows(20)

        def latestOffset(self, start: dict, limit: ReadLimit) -> dict:
            """
            Return the latest offset, respecting the provided limit.
            """
            current = start["offset"]

            if isinstance(limit, ReadMaxRows):
                end = min(current + limit.max_rows, self.max_available)
            elif isinstance(limit, ReadAllAvailable):
                end = self.max_available
            else:
                raise ValueError(f"Unexpected ReadLimit type: {type(limit)}")

            return {"offset": end}

When Spark uses the default ``ReadMaxRows(20)`` limit, each micro-batch
reads at most 20 rows, depending on the available rows. If Spark passes
``ReadAllAvailable``, the reader should return all remaining rows instead.

This is useful for:

- **Controlling data ingestion rate**: Prevent overwhelming downstream systems
- **Memory management**: Limit batch sizes to avoid out-of-memory errors
- **Backpressure handling**: Process data at a sustainable rate

For a complete working example, see:
``examples/src/main/python/sql/streaming/structured_blockchain_admission_control.py``

Implement a Streaming Writer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
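The micro-batch progression this tutorial section describes can be simulated end to end in plain Python. In this sketch, `ReadMaxRows` and `ReadAllAvailable` are local stand-ins for the pyspark classes (assuming the real `ReadMaxRows` exposes a `max_rows` attribute, as the snippet above does), and the loop plays the role of the engine calling `latestOffset` until the source drains:

```python
# Local stand-ins for the pyspark.sql.streaming.datasource limit classes.
class ReadMaxRows:
    def __init__(self, max_rows: int):
        self.max_rows = max_rows

class ReadAllAvailable:
    pass

CHAIN_HEIGHT = 50  # small chain so the partial tail batch is visible

def latest_offset(start: int, limit) -> int:
    # Same branching as latestOffset() in the tutorial snippet.
    if start >= CHAIN_HEIGHT:
        return start
    if isinstance(limit, ReadMaxRows):
        return min(start + limit.max_rows, CHAIN_HEIGHT)
    if isinstance(limit, ReadAllAvailable):
        return CHAIN_HEIGHT
    raise ValueError(f"Unexpected ReadLimit type: {type(limit)}")

# Play the engine: keep asking for the next batch until the source drains.
limit = ReadMaxRows(20)
batches, start = [], 0
while (end := latest_offset(start, limit)) != start:
    batches.append((start, end))
    start = end

print(batches)  # [(0, 20), (20, 40), (40, 50)]
```

Note how the final batch is partial, matching the tutorial's point that the reader returns only what remains, and that `ReadAllAvailable` would instead jump straight to `CHAIN_HEIGHT` in one batch.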
