[SPARK-53053][PYTHON][PANDAS] Support Pandas Extension Properties by DinhLiu · Pull Request #56066 · apache/spark

DinhLiu · 2026-05-22T14:41:31Z

What changes were proposed in this pull request?

This PR implements support for Pandas Extension Properties (_constructor, _constructor_sliced, and _constructor_expanddim) in the Pandas API on Spark for both DataFrame and Series.

Specifically, it replaces the hardcoded DataFrame(...) and Series(...) instantiations inside standard operations (such as head(), _apply_series_op, to_frame, etc.) with self._constructor(...) or its dimensionality-aware counterparts.
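As a rough illustration of that replacement pattern (not the actual pyspark.pandas code), the sketch below uses hypothetical toy classes `ToySeries` and `ToyFrame`. Every name other than the three `_constructor*` properties is an assumption made for the example. Each method builds its result through a constructor hook instead of naming a class directly, so a subclass only has to override the hooks.

```python
class ToySeries:
    """Hypothetical 1-D container standing in for a Series."""

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name

    @property
    def _constructor(self):
        # Type used for results with the same dimensionality.
        return ToySeries

    @property
    def _constructor_expanddim(self):
        # Type used when a result gains a dimension (Series -> DataFrame).
        return ToyFrame

    def head(self, n=5):
        # Goes through the hook rather than a hardcoded ToySeries(...).
        return self._constructor(self.values[:n], name=self.name)

    def to_frame(self):
        return self._constructor_expanddim({self.name: self.values})


class ToyFrame:
    """Hypothetical 2-D container standing in for a DataFrame."""

    def __init__(self, columns):
        self.columns = {key: list(vals) for key, vals in columns.items()}

    @property
    def _constructor(self):
        return ToyFrame

    @property
    def _constructor_sliced(self):
        # Type used when a result loses a dimension (DataFrame -> Series).
        return ToySeries

    def head(self, n=5):
        return self._constructor({k: v[:n] for k, v in self.columns.items()})

    def __getitem__(self, key):
        return self._constructor_sliced(self.columns[key], name=key)


class MySeries(ToySeries):
    @property
    def _constructor(self):
        return MySeries

    @property
    def _constructor_expanddim(self):
        return MyFrame


class MyFrame(ToyFrame):
    @property
    def _constructor(self):
        return MyFrame

    @property
    def _constructor_sliced(self):
        return MySeries


frame = MyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
head = frame.head(2)    # same dimensionality: stays MyFrame
col = frame["a"]        # one dimension fewer: MySeries
back = col.to_frame()   # one dimension more: MyFrame again
```

The subclasses never reimplement `head`, `__getitem__`, or `to_frame`. Overriding the hooks is enough for the inherited methods to return the custom types.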

Why are the changes needed?

pandas supports extension properties such as _constructor, which let downstream libraries that inherit from pandas classes control the type that operations return. Before this PR, standard operations on a subclass of a pandas-on-Spark DataFrame or Series returned the base pandas-on-Spark classes rather than the subclass, so the custom type was lost after the first transformation. This change improves parity with pandas and lets developers extend pandas-on-Spark objects safely.
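For reference, this is the behavior in pandas itself that the PR aims to mirror. The sketch below assumes pandas is installed and follows the subclassing pattern described in the pandas extending guide. The class names are illustrative and match those used in the new test only by convention.

```python
import pandas as pd


class SubclassedSeries(pd.Series):
    @property
    def _constructor(self):
        return SubclassedSeries

    @property
    def _constructor_expanddim(self):
        return SubclassedDataFrame


class SubclassedDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return SubclassedDataFrame

    @property
    def _constructor_sliced(self):
        return SubclassedSeries


pdf = SubclassedDataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

head_df = pdf.head(2)              # built via _constructor
column = pdf["a"]                  # built via _constructor_sliced
round_trip = column.to_frame()     # built via _constructor_expanddim
```

Without the overridden properties, the same calls fall back to plain `pd.DataFrame` and `pd.Series`, which is the loss of type that the description above refers to.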

Does this PR introduce any user-facing change?

No for standard end-users.
Yes for developers extending the API: Developers can now subclass pyspark.pandas.DataFrame and pyspark.pandas.Series and retain their custom types after applying transformations.

How was this patch tested?

Added a new unit test test_extension_properties in python/pyspark/pandas/tests/test_extension.py to verify that operations on subclassed DataFrame and Series correctly return instances of the subclasses.

Tested locally via:
python/run-tests --testnames 'pyspark.pandas.tests.test_extension'

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Google Gemini 3.1 Pro

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds pandas-style subclassing hooks to pandas-on-Spark DataFrame/Series so that common operations preserve subclasses (via _constructor, _constructor_sliced, and _constructor_expanddim), and introduces a regression test covering the behavior.

Changes:

Add _constructor / _constructor_sliced to DataFrame, and _constructor / _constructor_expanddim to Series.
Update multiple Series/DataFrame methods to construct results via these constructors instead of hard-coding DataFrame(...).
Add a unit test that verifies head() preserves subclasses for both DataFrame and Series.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

File	Description
python/pyspark/pandas/tests/test_extension.py	Adds tests for subclass-preserving behavior via constructor properties.
python/pyspark/pandas/series.py	Introduces `Series` constructor hooks and uses them in several methods to preserve subclasses.
python/pyspark/pandas/frame.py	Introduces `DataFrame` constructor hooks and updates many return paths to use them.


        from pyspark.pandas.groupby import DataFrameGroupBy

-        return DataFrameGroupBy._build(self, by, as_index=as_index, dropna=dropna)
+        return self._constructorGroupBy._build(self, by, as_index=as_index, dropna=dropna)


            raise ValueError("No available aggregation columns!")

-        return DataFrameResampler(
+        return self._constructorResampler(


+        res = first_series(self.to_frame().head(n)).rename(self.name)
+        return self._constructor(data=res)


            with self.assertRaises(AttributeError):
                ps.Series([1, 2], dtype=object).bad

+    def test_extension_properties(self):


+        # Test DataFrame extension properties
+        sub_psdf = SubclassedDataFrame(self.psdf._internal)
+        result_df = sub_psdf.head(2)
+


+
+        self.assertIsInstance(result_df, SubclassedDataFrame)
+        self.assertEqual(result_df.shape, (2, 2))
+
+        # Test Series extension properties
+        # Pass the PySpark Series directly instead of its _internal frame
+        sub_psser = SubclassedSeries(self.psdf["a"])
+        result_ser = sub_psser.head(2)
+
+        self.assertIsInstance(result_ser, SubclassedSeries)
+        self.assertEqual(len(result_ser), 2)
+






[SPARK-53053][PYTHON][PANDAS] Support Pandas Extension Properties

0c00d0e

Copilot AI review requested due to automatic review settings May 22, 2026 14:41

Copilot AI reviewed May 22, 2026


DinhLiu force-pushed the SPARK-53053-pandas-extension branch 3 times, most recently from 8bd0968 to cbe10c6 Compare May 22, 2026 16:35

Trigger GitHub Actions

92d5a22

DinhLiu force-pushed the SPARK-53053-pandas-extension branch from cbe10c6 to 92d5a22 Compare May 22, 2026 17:22

DinhLiu wants to merge 2 commits into apache:master from DinhLiu:SPARK-53053-pandas-extension
