fix: Make unique work for nested types and improve performance#333

Merged
Oliver Borchert (borchero) merged 2 commits into main from refactor-unique
Apr 22, 2026

Member

@borchero Oliver Borchert (borchero) commented Apr 21, 2026

Motivation

Follow-up to #325, see #325 (comment).

Changes

  • Move unique to the column definitions, since uniqueness is a column-level rather than a schema-level concern
  • Remove the unique_columns classmethod with the same argument
  • Use pl.col("...").is_unique() by default instead of wrapping in a struct for improved performance
  • Add tests for nested uniqueness and uniqueness of complex types

codecov Bot commented Apr 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (7126c1e) to head (518dab2).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #333   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           56        56           
  Lines         3408      3406    -2     
=========================================
- Hits          3408      3406    -2     

Copilot AI left a comment

Pull request overview

This PR refactors how unique constraints are expressed and validated by moving uniqueness checks into per-column validation rules, enabling unique to work for nested column types (e.g., List/Array inner types) and improving performance for scalar columns.

Changes:

  • Implement unique as part of Column.validation_rules() (and remove schema-level unique_columns() plumbing).
  • Use expr.is_unique() for most columns, with a Polars workaround for List/Array (pl.struct(expr).is_unique()).
  • Add/adjust tests for unique on List/Array columns and their inner types; remove the test for the removed unique_columns() method.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/schema/test_validate.py | Removes test coverage for deleted unique_columns() API. |
| tests/column_types/test_list.py | Adds unique and inner_unique tests for List. |
| tests/column_types/test_array.py | Adds unique and inner_unique tests for Array. |
| dataframely/columns/_base.py | Adds unique rule to base column validation rules. |
| dataframely/_base_schema.py | Removes schema-level unique column rule construction and unique_columns() method. |
| dataframely/columns/list.py | Implements unique for list columns using struct wrapper workaround. |
| dataframely/columns/array.py | Implements unique for array columns using struct wrapper workaround. |
| dataframely/columns/struct.py | Updates unique parameter documentation. |
| dataframely/columns/string.py | Updates unique parameter documentation. |
| dataframely/columns/integer.py | Updates unique parameter documentation. |
| dataframely/columns/float.py | Updates unique parameter documentation. |
| dataframely/columns/enum.py | Updates unique parameter documentation. |
| dataframely/columns/decimal.py | Updates unique parameter documentation. |
| dataframely/columns/datetime.py | Updates unique parameter documentation. |
| dataframely/columns/categorical.py | Updates unique parameter documentation. |

@borchero Oliver Borchert (borchero) merged commit b993da6 into main Apr 22, 2026
35 of 36 checks passed
@borchero Oliver Borchert (borchero) deleted the refactor-unique branch April 22, 2026 08:02
Contributor

gab23r commented Apr 22, 2026

Indeed much cleaner than my implementation!
Just for fun: wrapping is_unique in pl.struct is actually faster in a micro-benchmark 😄

import dataframely as dy
import polars as pl


class Schema(dy.Schema):
    a = dy.Int64()


# Sample one million rows that conform to the schema.
df = Schema.sample(1000000)

# The timings below use IPython's %timeit magic.
%timeit _ = df.select(pl.struct("a").is_unique())
# 8.08 ms ± 495 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit _ = df.select(pl.col("a").is_unique())
# 28.6 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is because pl.struct("a").is_unique() dispatches to df.select("a").unnest().is_unique(), which is executed in parallel. That is not the case for Expr.is_unique.

Member Author

Thanks for the benchmark, gab23r. This is really surprising 👀, but it is good to know that this behavior has changed since I last benchmarked it more than a year ago 😅


4 participants