
[SPARK-56351][PYTHON][DOCS] Add dedicated documentation page for Arrow Python UDFs#55215

Closed
Yicong-Huang wants to merge 6 commits into apache:master from Yicong-Huang:SPARK-56351

Conversation

@Yicong-Huang (Contributor)

What changes were proposed in this pull request?

Add a dedicated documentation page (arrow_python_udf.rst) for Arrow Python UDFs, covering:

  • Native Arrow UDFs (@arrow_udf): All 6 evaluation types — Arrays to Array, Iterator of Arrays to Iterator of Arrays, Iterator of Multiple Arrays to Iterator of Arrays, Arrays to Scalar, Iterator of Arrays to Scalar, Iterator of Multiple Arrays to Scalar.
  • Arrow Function APIs: mapInArrow, groupBy().applyInArrow(), and cogroup().applyInArrow().

All examples are taken from existing source code docstrings.

Why are the changes needed?

Currently, the Arrow Python UDF documentation is a small section (~30 lines) embedded within the Pandas UDF documentation page (arrow_pandas.rst), covering only the useArrow=True optimization. The native Arrow UDFs (@arrow_udf) and Arrow Function APIs (mapInArrow, applyInArrow) have no tutorial documentation at all — only API docstrings.

Given that Arrow Python UDFs are a distinct and growing feature set, they deserve a dedicated documentation page.

Does this PR introduce any user-facing change?

No. Documentation only.

How was this patch tested?

Documentation-only change. Verified that the new page is added to the toctree in index.rst.

Was this patch authored or co-authored using generative AI tooling?

No.

@devin-petersohn (Contributor) left a comment:

Looks great.

Comment thread on python/docs/source/tutorial/sql/arrow_python_udf.rst (outdated)
Co-authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
@HyukjinKwon (Member):

@zhengruifeng and @gaogaotiantian would you mind taking a look whenever you find some time?


.. currentmodule:: pyspark.sql.functions

.. versionadded:: 4.2.0
Contributor:

Is this true?

Contributor (Author):

@HyukjinKwon can you verify if we should use 4.1.0 or 4.2.0 here?

Contributor (Author):

Other tutorial pages do not have this. I will remove the versionadded annotation.

Arrays to Array
~~~~~~~~~~~~~~~

The type hint is ``pyarrow.Array``, ... -> ``pyarrow.Array``.
Contributor:

We probably should either make this authentic Python or just use the description and example below.

Contributor (Author):

This is to align with the Pandas UDF docs; the example below is real Python code.

# | JOHN DOE|
# +--------------+

Arrow UDFs can return struct types:
Contributor:

Does "struct type" mean the same thing as "struct array" in pyarrow? I don't want to confuse our users here. Or it's talking about struct type in "returnType"?

Contributor (Author):

Changed to: "When the returnType is a struct type, the function returns a pa.StructArray."

.. note::

Native Arrow UDFs can also be defined via :func:`udf` with ``pyarrow.Array`` type hints.
Python type hints are used to detect the function type automatically.
Contributor:

We probably can be a little more clear about this: we are talking about using the Python type hints to decide which kind of Arrow UDF it is, right?

Contributor (Author):

Made it more explicit.

Arrays to Scalar
~~~~~~~~~~~~~~~~

The type hint is ``pyarrow.Array``, ... -> ``Any``.
Contributor:

Is Any a good example here? What happens if I really put Any? And if Any means anything here, what if I put pa.Array?

Contributor (Author):

It can be anything except the type hints used by the other cases above.
Added a description of the negative cases (pa.Array, Iterator, Tuple).


The data type of returned ``pyarrow.Array`` from the user-defined functions should match the
defined ``returnType``. When there is a mismatch, Spark might do conversion on the returned data.
The conversion is not guaranteed to be correct and results should be checked for accuracy.
Contributor:

Is the flag for safe conversion public? If so, we should mention it here.

Contributor (Author):

After double-checking: it's not controlled by a config for Arrow UDFs; safe casting is always enabled (hardcoded in worker.py). Updated the doc.


The user-defined functions do not support conditional expressions or short circuiting
in boolean expressions. If the functions can fail on special rows, incorporate the
condition into the functions.
Contributor:

I did not really understand this paragraph.

Contributor (Author):

Rewrote. The point is that the UDF is not short-circuited during execution: all input rows are passed to the UDF, so if the function can fail on certain input values (e.g., division by zero), users need to handle those cases inside the function itself.


.. currentmodule:: pyspark.sql.functions

* :func:`arrow_udf` -- Create a native Arrow UDF
Contributor:

Isn't this already covered in this tutorial? Shall we remove it?

Contributor (Author):

removed!

@Yicong-Huang (Contributor, Author):

@zhengruifeng @gaogaotiantian could you please check again? I want to get this in ASAP; we can address further comments later as a follow-up.

@zhengruifeng (Contributor):

merged to master

6 participants