[SPARK-53053][PYTHON][PANDAS] Support Pandas Extension Properties#56066
Open
DinhLiu wants to merge 2 commits into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds pandas-style subclassing hooks to pandas-on-Spark DataFrame/Series so that common operations preserve subclasses (via _constructor, _constructor_sliced, and _constructor_expanddim), and introduces a regression test covering the behavior.
Changes:
- Add _constructor/_constructor_sliced to DataFrame, and _constructor/_constructor_expanddim to Series.
- Update multiple Series/DataFrame methods to construct results via these constructors instead of hard-coding DataFrame(...).
- Add a unit test that verifies head() preserves subclasses for both DataFrame and Series.
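
The constructor-hook pattern summarized in the list above can be illustrated without Spark. The following is a minimal sketch only: `Frame`, `SubFrame`, and `head_hardcoded` are hypothetical toy names, not pyspark.pandas classes or code from this PR. It contrasts a method that hard-codes the base class with one that builds its result through `_constructor`.

```python
class Frame:
    """Toy stand-in for a DataFrame: a list of row dicts (hypothetical)."""

    def __init__(self, rows):
        self.rows = list(rows)

    @property
    def _constructor(self):
        # Hook: the class used to build results of the same dimensionality.
        return Frame

    def head_hardcoded(self, n=5):
        # Hard-codes the base class, so a subclass is lost here.
        return Frame(self.rows[:n])

    def head(self, n=5):
        # Goes through the hook, so a subclass can substitute itself.
        return self._constructor(self.rows[:n])


class SubFrame(Frame):
    @property
    def _constructor(self):
        return SubFrame


sub = SubFrame([{"a": 1}, {"a": 2}, {"a": 3}])
hardcoded = sub.head_hardcoded(2)
hooked = sub.head(2)

print(type(hardcoded).__name__)  # Frame
print(type(hooked).__name__)     # SubFrame
```

In this sketch the base class returns itself from `_constructor` and the subclass overrides the property, which mirrors the convention pandas documents for subclassing.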
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| python/pyspark/pandas/tests/test_extension.py | Adds tests for subclass-preserving behavior via constructor properties. |
| python/pyspark/pandas/series.py | Introduces Series constructor hooks and uses them in several methods to preserve subclasses. |
| python/pyspark/pandas/frame.py | Introduces DataFrame constructor hooks and updates many return paths to use them. |
| from pyspark.pandas.groupby import DataFrameGroupBy | ||
|
|
||
| return DataFrameGroupBy._build(self, by, as_index=as_index, dropna=dropna) | ||
| return self._constructorGroupBy._build(self, by, as_index=as_index, dropna=dropna) |
| raise ValueError("No available aggregation columns!") | ||
|
|
||
| return DataFrameResampler( | ||
| return self._constructorResampler( |
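
One point that may be worth checking in the two hunks above: Python resolves `self._constructorGroupBy` and `self._constructorResampler` as single attribute names, not as `_constructor` combined with a class name. The overview lists only `_constructor` and `_constructor_sliced` as added to DataFrame, so unless those longer names are defined somewhere not visible on this page, the lookup would raise AttributeError. The sketch below uses hypothetical toy classes (`Frame`, `FrameGroupBy`) purely to show that lookup behavior; it is not pyspark.pandas code.

```python
class Frame:
    """Toy class that defines only the _constructor hook (hypothetical)."""

    @property
    def _constructor(self):
        return Frame


class FrameGroupBy:
    """Toy stand-in for a group-by helper with a _build factory (hypothetical)."""

    @staticmethod
    def _build(frame, by):
        return ("grouped", type(frame).__name__, by)


frame = Frame()

# `_constructorGroupBy` is one attribute name; defining `_constructor`
# does not make it resolvable.
try:
    frame._constructorGroupBy
    lookup_error = None
except AttributeError as exc:
    lookup_error = exc

print(type(lookup_error).__name__)  # AttributeError

# Referring to the helper class directly still works.
grouped = FrameGroupBy._build(frame, "a")
print(grouped)
```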
Comment on lines
+2807
to
+2808
| res = first_series(self.to_frame().head(n)).rename(self.name) | ||
| return self._constructor(data=res) |
| with self.assertRaises(AttributeError): | ||
| ps.Series([1, 2], dtype=object).bad | ||
|
|
||
| def test_extension_properties(self): |
| # Test DataFrame extension properties | ||
| sub_psdf = SubclassedDataFrame(self.psdf._internal) | ||
| result_df = sub_psdf.head(2) | ||
|
|
| # Pass the PySpark Series directly instead of its _internal frame | ||
| sub_psser = SubclassedSeries(self.psdf["a"]) | ||
| result_ser = sub_psser.head(2) | ||
|
|
Comment on lines
164
to
176
|
|
||
| self.assertIsInstance(result_df, SubclassedDataFrame) | ||
| self.assertEqual(result_df.shape, (2, 2)) | ||
|
|
||
| # Test Series extension properties | ||
| # Pass the PySpark Series directly instead of its _internal frame | ||
| sub_psser = SubclassedSeries(self.psdf["a"]) | ||
| result_ser = sub_psser.head(2) | ||
|
|
||
| self.assertIsInstance(result_ser, SubclassedSeries) | ||
| self.assertEqual(len(result_ser), 2) | ||
|
|
||
|
|
8bd0968 to
cbe10c6
Compare
cbe10c6 to
92d5a22
Compare
What changes were proposed in this pull request?
This PR implements support for Pandas Extension Properties (_constructor, _constructor_sliced, and _constructor_expanddim) in the Pandas API on Spark for both DataFrame and Series.

Specifically, it replaces the hardcoded DataFrame(...) and Series(...) instantiations inside standard operations (such as head(), _apply_series_op, to_frame, etc.) with self._constructor(...) or its dimensionality-aware counterparts.

Why are the changes needed?
Original pandas supports extension properties like _constructor that can be used to easily override what datatype is returned by default when downstream libraries inherit from pandas classes. Prior to this PR, subclassing a PySpark Pandas DataFrame or Series would break the inheritance chain during standard operations, as the methods would return the base PySpark classes instead of the subclassed ones. This change achieves better parity with standard pandas and allows developers to safely extend PySpark Pandas objects.

Does this PR introduce any user-facing change?
No for standard end-users.
Yes for developers extending the API: Developers can now subclass pyspark.pandas.DataFrame and pyspark.pandas.Series and retain their custom types after applying transformations.

How was this patch tested?
Added a new unit test test_extension_properties in python/pyspark/pandas/tests/test_extension.py to verify that operations on subclassed DataFrame and Series correctly return instances of the subclasses.

Tested locally via:
python/run-tests --testnames 'pyspark.pandas.tests.test_extension'

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Google Gemini 3.1 Pro
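
The "dimensionality-aware counterparts" mentioned in the description can be sketched with plain Python. This is an illustration under stated assumptions: `Frame`, `Column`, `SubFrame`, and `SubColumn` are hypothetical toy classes, not the pyspark.pandas implementation. `_constructor_sliced` picks the class for a lower-dimensional result (selecting one column), and `_constructor_expanddim` picks the class for a higher-dimensional one (turning a column back into a frame).

```python
class Column:
    """Toy one-dimensional container (hypothetical)."""

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name

    @property
    def _constructor(self):
        return Column

    @property
    def _constructor_expanddim(self):
        # Class used when a result gains a dimension (Column -> Frame).
        return Frame

    def head(self, n=5):
        return self._constructor(self.values[:n], name=self.name)

    def to_frame(self):
        return self._constructor_expanddim({self.name: self.values})


class Frame:
    """Toy two-dimensional container: a dict of column lists (hypothetical)."""

    def __init__(self, data):
        self.data = {key: list(values) for key, values in data.items()}

    @property
    def _constructor(self):
        return Frame

    @property
    def _constructor_sliced(self):
        # Class used when a result loses a dimension (Frame -> Column).
        return Column

    def __getitem__(self, key):
        return self._constructor_sliced(self.data[key], name=key)


class SubColumn(Column):
    @property
    def _constructor(self):
        return SubColumn

    @property
    def _constructor_expanddim(self):
        return SubFrame


class SubFrame(Frame):
    @property
    def _constructor(self):
        return SubFrame

    @property
    def _constructor_sliced(self):
        return SubColumn


sub_frame = SubFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
col = sub_frame["a"]                  # built via _constructor_sliced
round_trip = col.head(2).to_frame()   # built via _constructor, then _constructor_expanddim

print(type(col).__name__)         # SubColumn
print(type(round_trip).__name__)  # SubFrame
```

Because every result is built through one of the three hooks, the subclass pair survives slicing, truncation, and expansion without any method naming the subclass explicitly.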