
[SPARK-56351][PYTHON][DOCS] Add dedicated documentation page for Arrow Python UDFs#55215

Closed
Yicong-Huang wants to merge 6 commits into apache:master from Yicong-Huang:SPARK-56351

Conversation

@Yicong-Huang (Contributor)

What changes were proposed in this pull request?

Add a dedicated documentation page (arrow_python_udf.rst) for Arrow Python UDFs, covering:

  • Native Arrow UDFs (@arrow_udf): All 6 evaluation types — Arrays to Array, Iterator of Arrays to Iterator of Arrays, Iterator of Multiple Arrays to Iterator of Arrays, Arrays to Scalar, Iterator of Arrays to Scalar, Iterator of Multiple Arrays to Scalar.
  • Arrow Function APIs: mapInArrow, groupBy().applyInArrow(), and cogroup().applyInArrow().

All examples are taken from existing source code docstrings.

Why are the changes needed?

Currently, the Arrow Python UDF documentation is a small section (~30 lines) embedded within the Pandas UDF documentation page (arrow_pandas.rst), covering only the useArrow=True optimization. The native Arrow UDFs (@arrow_udf) and Arrow Function APIs (mapInArrow, applyInArrow) have no tutorial documentation at all — only API docstrings.

Given that Arrow Python UDFs are a distinct and growing feature set, they deserve a dedicated documentation page.

Does this PR introduce any user-facing change?

No. Documentation only.

How was this patch tested?

Documentation-only change. Verified that the new page is added to the toctree in index.rst.

Was this patch authored or co-authored using generative AI tooling?

No.

@devin-petersohn (Contributor) left a comment:

Looks great.

Comment thread on python/docs/source/tutorial/sql/arrow_python_udf.rst (outdated)
Co-authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
@HyukjinKwon (Member):

@zhengruifeng and @gaogaotiantian would you mind taking a look whenever you find some time?


.. currentmodule:: pyspark.sql.functions

.. versionadded:: 4.2.0
Contributor:

Is this true?

Contributor (Author):

@HyukjinKwon can you verify if we should use 4.1.0 or 4.2.0 here?

Contributor (Author):

Other tutorial pages do not have this. I will remove the versionadded annotation.

Arrays to Array
~~~~~~~~~~~~~~~

The type hint is ``pyarrow.Array``, ... -> ``pyarrow.Array``.
Contributor:

We probably should either make this authentic Python or just use the description and example below.

Contributor (Author):

This is to align with the Pandas UDF docs; the example below is real Python code.

# | JOHN DOE|
# +--------------+

Arrow UDFs can return struct types:
Contributor:

Does "struct type" mean the same thing as "struct array" in pyarrow? I don't want to confuse our users here. Or it's talking about struct type in "returnType"?

Contributor (Author):

Changed to: "When the returnType is a struct type, the function returns a pa.StructArray."

.. note::

Native Arrow UDFs can also be defined via :func:`udf` with ``pyarrow.Array`` type hints.
Python type hints are used to detect the function type automatically.
Contributor:

We probably can be a little more clear about this: we are talking about using the Python type hints to decide which kind of Arrow UDF it is, right?

Contributor (Author):

Made it more explicit.

Arrays to Scalar
~~~~~~~~~~~~~~~~

The type hint is ``pyarrow.Array``, ... -> ``Any``.
Contributor:

Is Any a good example here? What happens if I really put Any? And if Any means anything here, what if I put pa.Array?

Contributor (Author):

It can be anything except the type hints used by the other cases above.
Added a description of the negative cases (pa.Array, Iterator, Tuple).


The data type of returned ``pyarrow.Array`` from the user-defined functions should match the
defined ``returnType``. When there is a mismatch, Spark might do conversion on the returned data.
The conversion is not guaranteed to be correct and results should be checked for accuracy.
Contributor:

Is the flag for safe conversion public? If so, we should mention it here.

Contributor (Author):

After double-checking: it's not controlled by a config for Arrow UDFs; safe casting is always enabled (hardcoded in worker.py). Updated the doc.


The user-defined functions do not support conditional expressions or short circuiting
in boolean expressions. If the functions can fail on special rows, incorporate the
condition into the functions.
Contributor:

I did not really understand this paragraph.

Contributor (Author):

Rewrote. The point is that the UDF is not short-circuited during execution: all input rows are passed to the UDF, so if the function can fail on certain input values (e.g., division by zero), users need to handle those cases inside the function itself.


.. currentmodule:: pyspark.sql.functions

* :func:`arrow_udf` -- Create a native Arrow UDF
Contributor:

Isn't this already covered in this tutorial? Shall we remove it?

Contributor (Author):

removed!

@Yicong-Huang (Contributor, Author):

@zhengruifeng @gaogaotiantian could you please check again? I want to get this in ASAP; we can address further comments later as a follow-up.

@zhengruifeng (Contributor):

merged to master

6 participants