fix: Spark freqtable and missing values #1798
Closed
MCBoarder289 wants to merge 5 commits into Data-Centric-AI-Community:develop from
Conversation
In the pandas implementation, numeric stats like min/max/stddev by default ignore null values. This commit updates the Spark implementation to match that behavior more closely.
We need to add the isnan() check because pandas' isnull() check counts NaN as null, but Spark does not.
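The null/NaN distinction can be seen directly in pandas; a minimal sketch (the column name is hypothetical, and the PySpark filter shown in the comment is the general shape of the fix, not the exact patch):

```python
import numpy as np
import pandas as pd

# pandas treats NaN as null: isnull() flags it, and numeric stats
# (min/max/std/...) silently skip it by default.
s = pd.Series([1.0, 2.0, np.nan, 4.0])
assert s.isnull().sum() == 1   # NaN is counted as null
assert s.max() == 4.0          # NaN is ignored in summary stats

# Spark SQL distinguishes NULL from NaN, so a Spark implementation has to
# filter both explicitly to match pandas, along the lines of:
#   from pyspark.sql import functions as F
#   df.filter(F.col("value").isNotNull() & ~F.isnan(F.col("value")))
```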
The previous calculation of counts was actually counting an already summarized dataframe, so it wasn't capturing the correct counts for each instance of a value. This is updated by summing the count value instead of performing a row count operation.
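The difference between the two counting strategies can be reproduced in a few lines; a sketch using pandas for brevity, with an already-summarized value-counts frame standing in for the aggregated Spark DataFrame:

```python
import pandas as pd

# One row per distinct value, mirroring the already-summarized DataFrame
# the previous code was re-counting.
value_counts = pd.DataFrame({"value": ["a", "b"], "count": [3, 1]})

# A row count on the summarized frame counts distinct values, not instances.
wrong_total = len(value_counts)            # 2 distinct values
# Summing the count column recovers the true number of instances.
right_total = value_counts["count"].sum()  # 4 instances

assert wrong_total == 2
assert right_total == 4
```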
Discovered this edge case with real data, and still need to fix the rendering of an empty histogram.
Closing in favor of #1800, which fixes multiple issues at once.
This PR addresses issue #1429, as well as some of the histogram issues in #1602. When trying to resolve the missing category in the charts when Spark is used as a backend, I discovered a few bugs in the logic computing summary stats, as well as with the aggregations for value counts of not-null records.
I approached this by creating a toy dataset of both strings and numbers to profile in both Spark and Pandas and compare the reports afterwards. The toy dataset includes nulls on both fields, as well as duplicate records.
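A toy dataset along the lines described (strings and numbers, nulls on both fields, duplicate records) can be sketched as follows; the column names and values here are hypothetical, not the exact dataset used in the PR:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: duplicates plus nulls on both a string and a
# numeric field, to compare pandas vs Spark profiling behavior.
df = pd.DataFrame(
    {
        "name": ["a", "a", "b", None, "c"],
        "value": [1.0, 1.0, 2.0, np.nan, None],
    }
)

# pandas treats both None and NaN as null in the float column.
assert df["name"].isnull().sum() == 1
assert df["value"].isnull().sum() == 2

# The same frame can then be profiled on both backends, e.g. the pandas
# path directly and the Spark path via spark.createDataFrame(df), and the
# two reports compared side by side.
```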
Initial Pandas Profile Output
Initial Spark Profile Output
Summary Stats: (screenshot)

Common Values: (screenshot)
Issues and Root Causes
There are a couple of commits in here that address specific root causes of these discrepancies. Here are the summarized issues with their solutions:
Issue 1: pandas by default counts NaN values as null in summary stats, but Spark SQL does not.
Solution: the numeric_stats_spark() method now explicitly filters out nulls and NaNs to match pandas' default behavior.

Issue 2: Missing values were not being properly calculated, because NaN in Spark is not null, so NaN records weren't considered missing when they should be.
Solution: the describe_spark_counts() method now counts NaNs as missing.

Issue 3: Histogram counts and Common Values counts using the summary["value_counts_without_nan"] Series were not correctly summing counts.
Solution: summing the count values, instead of performing a row count on the already-summarized limit(200) DataFrame, makes everything line up to parity with the pandas output.

Fixed Spark Profile Output
Summary Stats: (screenshot)

Common Values: (screenshot)
Concluding Thoughts
While there is still some very slight variation in the computed stats because Spark handles nulls/NaNs differently than pandas, I think this new output is acceptably close to the pandas version, and any remaining differences are negligible, especially compared with the initial outputs, where the differences are misleading without these fixes.
@fabclmnt - I definitely welcome any and all feedback on this approach! I'm happy to discuss further, and hope this is helpful to anyone using the Spark backend. I think this would knock out a few bug tickets overall.