feat: Improved summary stats report for string datatype columns #1104

Merged
mwojtyczka merged 12 commits into databrickslabs:main from Roshan1299:feat/670-improve-string-summary-stats on Apr 23, 2026

Conversation

Roshan1299 (Contributor) commented Apr 1, 2026

Changes

  • Nulled out min and max metrics for string columns in summary stats, since lexicographic min/max values are not meaningful for text data
  • Added count_distinct metric for all column types
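
To illustrate the intended behaviour, here is a minimal pure-Python sketch. It is not the DQX implementation, which runs on Spark inside the profiler. The function name `summarize_column`, the `is_string` flag, and the `count` and `count_null` keys are hypothetical; only `min`, `max`, and `count_distinct` are named in this PR. Ignoring nulls when counting distinct values is also an assumption of the sketch.

```python
from typing import Any


def summarize_column(values: list[Any], is_string: bool) -> dict[str, Any]:
    """Build a small summary-stats dict for one column (illustrative only)."""
    non_null = [v for v in values if v is not None]
    stats: dict[str, Any] = {
        "count": len(values),                       # hypothetical metric name
        "count_null": len(values) - len(non_null),  # hypothetical metric name
        # Reported for every column type; nulls are ignored (assumption).
        "count_distinct": len(set(non_null)),
    }
    if is_string or not non_null:
        # Lexicographic min/max say little about text data, so report null.
        stats["min"] = None
        stats["max"] = None
    else:
        stats["min"] = min(non_null)
        stats["max"] = max(non_null)
    return stats


if __name__ == "__main__":
    print(summarize_column(["b", "a", "a", None], is_string=True))
    print(summarize_column([3, 1, 1, None], is_string=False))
```

With this shape, a string column still reports how many distinct values it holds, while its `min` and `max` entries stay null rather than showing an alphabetical first and last value.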

Linked issues

Resolves #670

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

Both changes are in private methods (_process_metric, _profile) that are only testable through the public profile() API, which requires a live SparkSession. The existing integration tests in test_profiler.py exercise these code paths and will validate the changes in CI. Happy to add a dedicated integration test if needed.

  • Formatting passes (make fmt — pylint 10/10, mypy clean)
  • Unit tests pass (make test — 939 passed)

Signed-off-by: Roshan1299 <banisettirosh@gmail.com>
Roshan1299 requested a review from a team as a code owner April 1, 2026 20:20
Roshan1299 requested review from tombonfert and removed request for a team April 1, 2026 20:20
ghanse self-requested a review April 4, 2026 16:06

ghanse (Collaborator) left a comment

This looks good to me. Can you add a few small tests to validate that distinct counts are computed and that min and max are not computed for string type columns?

You should be able to reuse the existing patterns in integration/test_profiler.py.

ghanse added the under-review label ("This PR is currently being reviewed by one of DQX maintainers.") Apr 4, 2026
Signed-off-by: Roshan1299 <banisettirosh@gmail.com>

Roshan1299 (Contributor, Author) commented

@ghanse Added integration tests.

Roshan1299 requested a review from ghanse April 5, 2026 06:00

ghanse (Collaborator) left a comment

LGTM!

ghanse added the Approved to Merge label ("When PR is reviewed and approved. To be merged once all tests pass") and removed the under-review label ("This PR is currently being reviewed by one of DQX maintainers.") Apr 6, 2026
mwojtyczka changed the title from "feat: improve summary stats report for string datatype columns" to "feat: Improved summary stats report for string datatype columns" Apr 15, 2026

mwojtyczka (Contributor) left a comment

LGTM - please fix the performance issue as suggested

Comment thread src/databricks/labs/dqx/profiler/profiler.py Outdated

mwojtyczka (Contributor) left a comment

LGTM - I ran the integration and e2e tests locally

mwojtyczka merged commit ed9a28b into databrickslabs:main Apr 23, 2026
37 checks passed

Labels

Approved to Merge When PR is reviewed and approved. To be merged once all tests pass

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Improve summary stats report for string datatype columns

3 participants