feat: Improved summary stats report for string datatype columns #1104
Merged
mwojtyczka merged 12 commits into databrickslabs:main on Apr 23, 2026
Signed-off-by: Roshan1299 <banisettirosh@gmail.com>
ghanse (Collaborator) requested changes on Apr 4, 2026
This looks good to me. Can you add a few small tests to validate that distinct counts are computed and that min and max are not computed for string type columns?
You should be able to reuse the existing patterns in integration/test_profiler.py.
Signed-off-by: Roshan1299 <banisettirosh@gmail.com>
Contributor
Author
@ghanse Added integration tests.
mwojtyczka (Contributor) requested changes on Apr 15, 2026
LGTM - please fix the performance issue as suggested
mwojtyczka approved these changes on Apr 15, 2026
Changes
- Skip min and max metrics for string columns in summary stats, since lexicographic min/max values are not meaningful for text data
- Add count_distinct metric for all column types
Linked issues
Resolves #670
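For readers skimming the thread, the two changes listed under Changes can be pictured with a small pure-Python sketch. This is an illustration only and not the code in this PR: the function names `metrics_for` and `summarize`, the metric lists, and the use of plain lists instead of Spark DataFrames are all assumptions made for the example.

```python
from typing import Any

# Hypothetical metric names, chosen for illustration only.
BASE_METRICS = ["count", "count_non_null", "count_null", "count_distinct"]
ORDERED_METRICS = ["min", "max"]


def metrics_for(dtype: str) -> list[str]:
    """Return the metrics to compute for a column of the given type name."""
    if dtype == "string":
        # Lexicographic min/max says little about text data, so skip them.
        return list(BASE_METRICS)
    return BASE_METRICS + ORDERED_METRICS


def summarize(values: list[Any], dtype: str) -> dict[str, Any]:
    """Compute the selected metrics over an in-memory column of values."""
    non_null = [v for v in values if v is not None]
    stats: dict[str, Any] = {
        "count": len(values),
        "count_non_null": len(non_null),
        "count_null": len(values) - len(non_null),
        # Distinct count is computed for every column type.
        "count_distinct": len(set(non_null)),
    }
    if "min" in metrics_for(dtype):
        stats["min"] = min(non_null, default=None)
        stats["max"] = max(non_null, default=None)
    return stats
```

Calling `summarize(["b", "a", None, "a"], "string")` yields counts and a distinct count but no min or max keys, while the same call with an integer column and `"int"` includes both.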
Tests
Both changes are in private methods (_process_metric, _profile) that are only testable through the public profile() API, which requires a live SparkSession. The existing integration tests in test_profiler.py exercise these code paths and will validate in CI. Happy to add a dedicated integration test if needed.
- make fmt (pylint 10/10, mypy clean)
- make test (939 passed)
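If a dedicated test is added, its assertions might look roughly like the sketch below. It assumes the summary stats are a mapping from column name to a mapping of metric name to value; that shape, the helper name `check_string_column_stats`, and the sample data are assumptions for illustration, and no SparkSession is involved here. Because it is not stated in this thread whether skipped metrics are absent or set to None, the helper accepts either.

```python
from typing import Any, Mapping


def check_string_column_stats(
    summary_stats: Mapping[str, Mapping[str, Any]],
    column: str,
    expected_distinct: int,
) -> None:
    """Assert that a string column has a distinct count and no min/max."""
    col_stats = summary_stats[column]
    assert col_stats.get("count_distinct") == expected_distinct
    # Treat both a missing key and a None value as "not computed".
    assert col_stats.get("min") is None
    assert col_stats.get("max") is None


# Stand-in for the stats a profiler run might return (hypothetical data).
example_stats = {"name": {"count": 4, "count_null": 1, "count_distinct": 2}}
check_string_column_stats(example_stats, "name", expected_distinct=2)
```

In a real integration test, the dict would come from the profile() call on a small DataFrame rather than being written by hand.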