[SPARK-53053][PYTHON][PANDAS] Support Pandas Extension Properties #56066

Open
DinhLiu wants to merge 2 commits into apache:master from DinhLiu:SPARK-53053-pandas-extension

Conversation

DinhLiu commented May 22, 2026

What changes were proposed in this pull request?

This PR implements support for Pandas Extension Properties (_constructor, _constructor_sliced, and _constructor_expanddim) in the Pandas API on Spark for both DataFrame and Series.

Specifically, it replaces the hardcoded DataFrame(...) and Series(...) instantiations inside standard operations (such as head(), _apply_series_op, and to_frame) with self._constructor(...) or its dimensionality-aware counterparts.
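The dimensionality-aware hooks can be illustrated with a minimal, pure-Python sketch. The classes below (MiniSeries, MiniFrame, TaggedSeries, TaggedFrame) are hypothetical stand-ins invented for illustration; they are not the pyspark.pandas classes and do not reproduce this PR's implementation. They only show how same-dimension, sliced (2-D to 1-D), and expanded (1-D to 2-D) results could each be routed through their own constructor property.

```python
class MiniSeries:
    """Toy 1-D container (hypothetical stand-in, not pyspark.pandas.Series)."""

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name

    @property
    def _constructor(self):
        # Hook for results that stay 1-D.
        return MiniSeries

    @property
    def _constructor_expanddim(self):
        # Hook for results that grow from 1-D to 2-D.
        return MiniFrame

    def head(self, n=5):
        return self._constructor(self.values[:n], name=self.name)

    def to_frame(self):
        return self._constructor_expanddim({self.name: self.values})


class MiniFrame:
    """Toy 2-D container (hypothetical stand-in, not pyspark.pandas.DataFrame)."""

    def __init__(self, columns):
        self.columns = {key: list(vals) for key, vals in columns.items()}

    @property
    def _constructor(self):
        # Hook for results that stay 2-D.
        return MiniFrame

    @property
    def _constructor_sliced(self):
        # Hook for results that shrink from 2-D to 1-D.
        return MiniSeries

    def head(self, n=5):
        return self._constructor({k: v[:n] for k, v in self.columns.items()})

    def __getitem__(self, key):
        return self._constructor_sliced(self.columns[key], name=key)


class TaggedSeries(MiniSeries):
    @property
    def _constructor(self):
        return TaggedSeries

    @property
    def _constructor_expanddim(self):
        return TaggedFrame


class TaggedFrame(MiniFrame):
    @property
    def _constructor(self):
        return TaggedFrame

    @property
    def _constructor_sliced(self):
        return TaggedSeries


frame = TaggedFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
column = frame["a"]            # 2-D -> 1-D goes through _constructor_sliced
round_trip = column.to_frame() # 1-D -> 2-D goes through _constructor_expanddim
```

In this sketch every result type is decided by a property lookup on the instance, so a downstream subclass only needs to override the properties rather than each method.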

Why are the changes needed?

pandas exposes extension properties such as _constructor that let downstream libraries override the type returned by default when they subclass pandas classes. Before this PR, subclassing a pandas-on-Spark DataFrame or Series broke the inheritance chain during standard operations, because the methods returned the base pandas-on-Spark classes instead of the subclasses. This change improves parity with pandas and lets developers extend pandas-on-Spark objects safely.
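As a rough illustration of the problem, the following pure-Python sketch contrasts a method that hardcodes its return class with one that goes through a _constructor property. All class names here (HardcodedFrame, HookedFrame, and their children) are hypothetical and exist only for this example; they are not taken from pyspark.pandas.

```python
class HardcodedFrame:
    """Toy container whose head() names its own class directly."""

    def __init__(self, rows):
        self.rows = list(rows)

    def head(self, n=5):
        # Hardcoded class name: a subclass instance is downgraded to the base.
        return HardcodedFrame(self.rows[:n])


class HardcodedChild(HardcodedFrame):
    pass


class HookedFrame:
    """Toy container whose head() asks _constructor which class to build."""

    def __init__(self, rows):
        self.rows = list(rows)

    @property
    def _constructor(self):
        return HookedFrame

    def head(self, n=5):
        # The subclass decides the result type by overriding _constructor.
        return self._constructor(self.rows[:n])


class HookedChild(HookedFrame):
    @property
    def _constructor(self):
        return HookedChild


lost = HardcodedChild([1, 2, 3]).head(2)
kept = HookedChild([1, 2, 3]).head(2)

print(type(lost).__name__)  # HardcodedFrame
print(type(kept).__name__)  # HookedChild
```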

Does this PR introduce any user-facing change?

No, for standard end users.
Yes, for developers extending the API: they can now subclass pyspark.pandas.DataFrame and pyspark.pandas.Series and keep their custom types after applying transformations.

How was this patch tested?

Added a new unit test test_extension_properties in python/pyspark/pandas/tests/test_extension.py to verify that operations on subclassed DataFrame and Series correctly return instances of the subclasses.

Tested locally via:
python/run-tests --testnames 'pyspark.pandas.tests.test_extension'

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Google Gemini 3.1 Pro

Copilot AI review requested due to automatic review settings May 22, 2026 14:41
Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds pandas-style subclassing hooks to pandas-on-Spark DataFrame/Series so that common operations preserve subclasses (via _constructor, _constructor_sliced, and _constructor_expanddim), and introduces a regression test covering the behavior.

Changes:

  • Add _constructor / _constructor_sliced to DataFrame, and _constructor / _constructor_expanddim to Series.
  • Update multiple Series/DataFrame methods to construct results via these constructors instead of hard-coding DataFrame(...).
  • Add a unit test that verifies head() preserves subclasses for both DataFrame and Series.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| python/pyspark/pandas/tests/test_extension.py | Adds tests for subclass-preserving behavior via constructor properties. |
| python/pyspark/pandas/series.py | Introduces Series constructor hooks and uses them in several methods to preserve subclasses. |
| python/pyspark/pandas/frame.py | Introduces DataFrame constructor hooks and updates many return paths to use them. |

Comment thread python/pyspark/pandas/frame.py Outdated
from pyspark.pandas.groupby import DataFrameGroupBy

return DataFrameGroupBy._build(self, by, as_index=as_index, dropna=dropna)
return self._constructorGroupBy._build(self, by, as_index=as_index, dropna=dropna)
Comment thread python/pyspark/pandas/frame.py Outdated
raise ValueError("No available aggregation columns!")

return DataFrameResampler(
return self._constructorResampler(
Comment on lines +2807 to +2808
res = first_series(self.to_frame().head(n)).rename(self.name)
return self._constructor(data=res)
with self.assertRaises(AttributeError):
ps.Series([1, 2], dtype=object).bad

def test_extension_properties(self):
# Test DataFrame extension properties
sub_psdf = SubclassedDataFrame(self.psdf._internal)
result_df = sub_psdf.head(2)

# Pass the PySpark Series directly instead of its _internal frame
sub_psser = SubclassedSeries(self.psdf["a"])
result_ser = sub_psser.head(2)

with self.assertRaises(AttributeError):
ps.Series([1, 2], dtype=object).bad

def test_extension_properties(self):
Comment on lines 164 to 176

self.assertIsInstance(result_df, SubclassedDataFrame)
self.assertEqual(result_df.shape, (2, 2))

# Test Series extension properties
# Pass the PySpark Series directly instead of its _internal frame
sub_psser = SubclassedSeries(self.psdf["a"])
result_ser = sub_psser.head(2)

self.assertIsInstance(result_ser, SubclassedSeries)
self.assertEqual(len(result_ser), 2)


DinhLiu force-pushed the SPARK-53053-pandas-extension branch 3 times, most recently from 8bd0968 to cbe10c6, May 22, 2026 16:35
DinhLiu force-pushed the SPARK-53053-pandas-extension branch from cbe10c6 to 92d5a22, May 22, 2026 17:22