fix: Make unique work for nested types and improve performance #333
Oliver Borchert (borchero) merged 2 commits into main
Codecov Report
✅ All modified and coverable lines are covered by tests.

```diff
@@           Coverage Diff           @@
##              main     #333   +/-  ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           56       56
  Lines         3408     3406    -2
=========================================
- Hits          3408     3406    -2
```
Pull request overview
This PR refactors how unique constraints are expressed and validated by moving uniqueness checks into per-column validation rules, enabling unique to work for nested column types (e.g., List/Array inner types) and improving performance for scalar columns.
Changes:
- Implement `unique` as part of `Column.validation_rules()` (and remove schema-level `unique_columns()` plumbing).
- Use `expr.is_unique()` for most columns, with a Polars workaround for `List`/`Array` (`pl.struct(expr).is_unique()`).
- Add/adjust tests for `unique` on `List`/`Array` columns and their inner types; remove the test for the removed `unique_columns()` method.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/schema/test_validate.py | Removes test coverage for deleted unique_columns() API. |
| tests/column_types/test_list.py | Adds unique and inner_unique tests for List. |
| tests/column_types/test_array.py | Adds unique and inner_unique tests for Array. |
| dataframely/columns/_base.py | Adds unique rule to base column validation rules. |
| dataframely/_base_schema.py | Removes schema-level unique column rule construction and unique_columns() method. |
| dataframely/columns/list.py | Implements unique for list columns using struct wrapper workaround. |
| dataframely/columns/array.py | Implements unique for array columns using struct wrapper workaround. |
| dataframely/columns/struct.py | Updates unique parameter documentation. |
| dataframely/columns/string.py | Updates unique parameter documentation. |
| dataframely/columns/integer.py | Updates unique parameter documentation. |
| dataframely/columns/float.py | Updates unique parameter documentation. |
| dataframely/columns/enum.py | Updates unique parameter documentation. |
| dataframely/columns/decimal.py | Updates unique parameter documentation. |
| dataframely/columns/datetime.py | Updates unique parameter documentation. |
| dataframely/columns/categorical.py | Updates unique parameter documentation. |

Indeed much cleaner than my implementation!

```python
import dataframely as dy
import polars as pl

class Schema(dy.Schema):
    a = dy.Int64()

df = Schema.sample(1000000)

%timeit _ = df.select(pl.struct("a").is_unique())
# 8.08 ms ± 495 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit _ = df.select(pl.col("a").is_unique())
# 28.6 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

This is because

Thanks for the benchmark gab23r, this is really surprising 👀 but good to know that this behavior changed since I last benchmarked this >1y ago 😅
Motivation
Follow-up to #325, see #325 (comment).
Changes
- Move `unique` to the column definitions as it is column- rather than schema-business
- Remove the `unique_columns` classmethod with the same argument
- Use `pl.col("...").is_unique()` by default instead of wrapping in a struct for improved performance