fix: Spark freqtable and missing values #1798
Closed
MCBoarder289 wants to merge 5 commits into Data-Centric-AI-Community:develop from
Conversation
In the pandas implementation, numeric stats like min/max/stddev by default ignore null values. This commit updates the Spark implementation to match that behavior more closely.
We need to add the isnan() check because pandas' isnull() check counts NaN as null, but Spark does not.
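The null/NaN distinction can be seen directly in pandas; a minimal sketch (the column name is hypothetical, and the PySpark filter shown in the comment is the general shape of the fix, not the exact patch):

```python
import numpy as np
import pandas as pd

# pandas treats NaN as null: isnull() flags it, and numeric stats
# (min/max/std/...) silently skip it by default.
s = pd.Series([1.0, 2.0, np.nan, 4.0])
assert s.isnull().sum() == 1   # NaN is counted as null
assert s.max() == 4.0          # NaN is ignored in summary stats

# Spark SQL distinguishes NULL from NaN, so a Spark implementation has to
# filter both explicitly to match pandas, along the lines of:
#   from pyspark.sql import functions as F
#   df.filter(F.col("value").isNotNull() & ~F.isnan(F.col("value")))
```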
The previous calculation of counts was actually counting an already summarized dataframe, so it wasn't capturing the correct counts for each instance of a value. This is updated by summing the count value instead of performing a row count operation.
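The difference between the two counting strategies can be reproduced in a few lines; a sketch using pandas for brevity, with an already-summarized value-counts frame standing in for the aggregated Spark DataFrame:

```python
import pandas as pd

# One row per distinct value, mirroring the already-summarized DataFrame
# the previous code was re-counting.
value_counts = pd.DataFrame({"value": ["a", "b"], "count": [3, 1]})

# A row count on the summarized frame counts distinct values, not instances.
wrong_total = len(value_counts)            # 2 distinct values
# Summing the count column recovers the true number of instances.
right_total = value_counts["count"].sum()  # 4 instances

assert wrong_total == 2
assert right_total == 4
```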
Discovered this edge case with real data, and still need to fix the rendering of an empty histogram.
Closing in favor of #1800, which fixes multiple issues at once.
This PR addresses issue #1429, as well as some of the histogram issues in #1602. When trying to resolve the missing category in the charts when Spark is used as a backend, I discovered a few bugs in the logic computing summary stats, as well as with the aggregations for value counts of not-null records.
I approached this by creating a toy dataset of both strings and numbers to profile in both Spark and Pandas and compare the reports afterwards. The toy dataset includes nulls on both fields, as well as duplicate records.
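A toy dataset along the lines described (strings and numbers, nulls on both fields, duplicate records) can be sketched as follows; the column names and values here are hypothetical, not the exact dataset used in the PR:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: duplicates plus nulls on both a string and a
# numeric field, to compare pandas vs Spark profiling behavior.
df = pd.DataFrame(
    {
        "name": ["a", "a", "b", None, "c"],
        "value": [1.0, 1.0, 2.0, np.nan, None],
    }
)

# pandas treats both None and NaN as null in the float column.
assert df["name"].isnull().sum() == 1
assert df["value"].isnull().sum() == 2

# The same frame can then be profiled on both backends, e.g. the pandas
# path directly and the Spark path via spark.createDataFrame(df), and the
# two reports compared side by side.
```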
Initial Pandas Profile Output
Initial Spark Profile Output
Summary Stats: (screenshot)

Common Values: (screenshot)
Issues and Root Causes
There are a couple of commits in here that address specific root causes of these discrepancies. Here are the summarized issues with their solutions:
Issue 1: pandas by default counts NaN values as null in summary stats, but Spark SQL does not.
Solution: the numeric_stats_spark() method now explicitly filters out nulls and NaNs to match pandas' default behavior.

Issue 2: Missing values were not being properly calculated, because NaN in Spark is not null, so NaN records weren't considered missing when they should be.
Solution: the describe_spark_counts() method now counts NaNs as missing.

Issue 3: Histogram counts and Common Values counts using the summary["value_counts_without_nan"] Series were not correctly summing counts.
Solution: summing the count values, instead of performing a row count on the already-summarized limit(200) DataFrame, makes everything line up to parity with the pandas output.

Fixed Spark Profile Output
Summary Stats: (screenshot)

Common Values: (screenshot)
Concluding Thoughts
While there is still some very slight variation in the computed stats because Spark handles nulls/NaNs differently than pandas, I think this new output is acceptably close to the pandas version, and any remaining differences are negligible, especially compared with the initial outputs, where the differences are misleading without these fixes.
@fabclmnt - I definitely welcome any and all feedback on this approach! I'm happy to discuss further, and hope this is helpful to anyone using the Spark backend. I think this would knock out a few bug tickets overall.