You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
feat: Improved summary stats report for string datatype columns (#1104)
## Changes
<!-- Summary of your changes that are easy to understand. Add
screenshots when necessary -->
- Nulled out `min` and `max` metrics for string columns in summary
stats, since lexicographic min/max values are not meaningful for text
data
- Added `count_distinct` metric for all column types
### Linked issues
<!-- DOC: Link issue with a keyword: close, closes, closed, fix, fixes,
fixed, resolve, resolves, resolved. See
https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword
-->
Resolves#670
### Tests
<!-- How is this tested? Please see the checklist below and also
describe any other relevant tests -->
- [ ] manually tested
- [ ] added unit tests
- [x] added integration tests
- [ ] added end-to-end tests
- [ ] added performance tests
Both changes are in private methods (`_process_metric`, `_profile`) that
are only testable through the public `profile()` API, which requires a
live SparkSession. The existing integration tests in `test_profiler.py`
exercise these code paths and will validate in CI. Happy to add a
dedicated integration test if needed.
- [x] Formatting passes (`make fmt` — pylint 10/10, mypy clean)
- [x] Unit tests pass (`make test` — 939 passed)
---------
Signed-off-by: Roshan1299 <banisettirosh@gmail.com>
Co-authored-by: Marcin Wojtyczka <marcin.wojtyczka@databricks.com>
0 commit comments