[SPARK-56351][PYTHON][DOCS] Add dedicated documentation page for Arrow Python UDFs #55215
Yicong-Huang wants to merge 6 commits into apache:master from
Conversation
Co-authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
@zhengruifeng and @gaogaotiantian would you mind taking a look whenever you find some time?
.. currentmodule:: pyspark.sql.functions

.. versionadded:: 4.2.0
@HyukjinKwon can you verify if we should use 4.1.0 or 4.2.0 here?
Other tutorial pages do not have this. I will remove the ``versionadded`` annotation.
Arrays to Array
~~~~~~~~~~~~~~~

The type hint is ``pyarrow.Array``, ... -> ``pyarrow.Array``.
We probably should either make this authentic Python or just use the description and example below.
This is to align with Pandas UDFs; the example below is real Python code.
# |      JOHN DOE|
# +--------------+

Arrow UDFs can return struct types:
Does "struct type" mean the same thing as "struct array" in pyarrow? I don't want to confuse our users here. Or it's talking about struct type in "returnType"?
Changed it to: "When the ``returnType`` is a struct type, the function returns a ``pa.StructArray``."
.. note::

    Native Arrow UDFs can also be defined via :func:`udf` with ``pyarrow.Array`` type hints.
    Python type hints are used to detect the function type automatically.
We could be a bit clearer about this: we are talking about using the Python type hints to decide what kind of Arrow UDF is created, right?
Made it more explicit.
Arrays to Scalar
~~~~~~~~~~~~~~~~

The type hint is ``pyarrow.Array``, ... -> ``Any``.
Is Any a good example here? What if I really put Any? If Any means anything here, what if I put pa.Array?
It can be anything except the other cases above.
Added a description of the negative cases (``pa.Array``, ``Iterator``, ``Tuple``).
The data type of the ``pyarrow.Array`` returned from the user-defined function should match the
defined ``returnType``. When there is a mismatch, Spark might convert the returned data.
The conversion is not guaranteed to be correct, and results should be checked for accuracy.
Is the flag for safe conversion public? If so, we should mention it here.
After double-checking, it's not controlled by a config for Arrow UDFs; safe casting is always enabled (hardcoded in worker.py). Updated the doc.
The user-defined functions do not support conditional expressions or short-circuiting
in boolean expressions. If the functions can fail on special rows, incorporate the
condition into the functions.
I did not really understand this paragraph.
Rewrote it. The point is that the UDF will not be short-circuited during execution: all input rows are passed into the UDF, so if the function can fail on certain input values (e.g., division by zero), users need to handle those cases inside the function itself.
.. currentmodule:: pyspark.sql.functions

* :func:`arrow_udf` -- Create a native Arrow UDF
Isn't this already covered in this tutorial? Shall we remove it?
@zhengruifeng @gaogaotiantian could you please check again? I want to get this in ASAP; we can address further comments later as a follow-up.
merged to master |
What changes were proposed in this pull request?
Add a dedicated documentation page (`arrow_python_udf.rst`) for Arrow Python UDFs, covering:

- Native Arrow UDFs (`@arrow_udf`): all 6 evaluation types: Arrays to Array, Iterator of Arrays to Iterator of Arrays, Iterator of Multiple Arrays to Iterator of Arrays, Arrays to Scalar, Iterator of Arrays to Scalar, Iterator of Multiple Arrays to Scalar.
- Arrow Function APIs: `mapInArrow`, `groupBy().applyInArrow()`, and `cogroup().applyInArrow()`.

All examples are taken from existing source code docstrings.
Why are the changes needed?
Currently, the Arrow Python UDF documentation is a small section (~30 lines) embedded within the Pandas UDF documentation page (`arrow_pandas.rst`), covering only the `useArrow=True` optimization. The native Arrow UDFs (`@arrow_udf`) and Arrow Function APIs (`mapInArrow`, `applyInArrow`) have no tutorial documentation at all, only API docstrings. Given that Arrow Python UDFs are a distinct and growing feature set, they deserve a dedicated documentation page.
Does this PR introduce any user-facing change?
No. Documentation only.
How was this patch tested?
Documentation-only change. Verified that the new page is added to the toctree in `index.rst`.

Was this patch authored or co-authored using generative AI tooling?
No.