
Commit 0f794a9

Copilot and riggggo committed
docs: add schema validation feature guide
Co-authored-by: riggggo <90137935+riggggo@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Quantco/dataframely/sessions/bbf5b5fb-f92b-4f77-a624-1b4b02c717b5
1 parent 718d0c2 commit 0f794a9

2 files changed

Lines changed: 207 additions & 0 deletions


docs/guides/features/index.md

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@
 ```{toctree}
 :maxdepth: 1
 
+schema-validation
 column-metadata
 data-generation
 primary-keys
Lines changed: 206 additions & 0 deletions
@@ -0,0 +1,206 @@
# Schema Validation

Dataframely validates a data frame against a {class}`~dataframely.Schema` by evaluating
every validation rule as a native polars expression.
This guide explains the validation pipeline in detail, including the different validation
methods, how rules are applied, and how to inspect failures.

## How validation rules are applied

When a data frame is validated, dataframely collects the following rules and evaluates them
in a single polars pass:

1. **Column-level rules**: Automatically derived from the parameters of each column
   definition (e.g. `nullable`, `min_length`, `regex`, …). These rules are named
   `<column>|<rule>`, e.g. `zip_code|min_length` or `num_bedrooms|nullability`.

2. **Primary key rule**: If any column in the schema has `primary_key=True`, dataframely
   automatically adds a `primary_key` rule that checks for uniqueness across all primary
   key columns.

3. **Custom schema-level rules**: Rules defined on the schema class using the
   {func}`~dataframely.rule` decorator. These rules may reference any column in the schema
   and can also aggregate across rows when a `group_by` parameter is provided.

For every row, the combined result of all rules determines whether the row is valid.
A row is considered valid only when every applicable rule evaluates to `True`.
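
The way these three kinds of rules combine can be modeled in plain Python. The sketch below only illustrates the semantics described above; it is not how dataframely is implemented (dataframely evaluates polars expressions). The rows, the `id` key column, and the grouped rule `minimum_zip_code_count` are hypothetical, and the treatment of nulls in `min_length` is an assumption of the sketch.

```python
from collections import Counter

# Hypothetical input rows; None stands for a null value.
rows = [
    {"id": 1, "zip_code": "01234"},
    {"id": 2, "zip_code": "01234"},
    {"id": 2, "zip_code": "1"},
    {"id": 3, "zip_code": None},
]

id_counts = Counter(row["id"] for row in rows)
zip_counts = Counter(row["zip_code"] for row in rows)

rules = {
    # Column-level rules, named "<column>|<rule>".
    "zip_code|nullability": lambda row: row["zip_code"] is not None,
    # Assumption of this sketch: nulls are left to the nullability rule.
    "zip_code|min_length": lambda row: (
        row["zip_code"] is None or len(row["zip_code"]) >= 3
    ),
    # Primary key rule: the key must be unique across all rows.
    "primary_key": lambda row: id_counts[row["id"]] == 1,
    # Grouped custom rule: every zip code must occur at least twice.
    "minimum_zip_code_count": lambda row: zip_counts[row["zip_code"]] >= 2,
}

# One boolean per row and rule; a row is valid only if all rules pass.
results = [{name: rule(row) for name, rule in rules.items()} for row in rows]
valid = [all(result.values()) for result in results]
print(valid)  # [True, False, False, False]
```

Only the first row passes: the two rows with `id` 2 violate the primary key rule, and the last row has a null zip code.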

## Validation methods

Dataframely exposes three methods for checking a data frame against a schema.

### `validate`: strict validation

{meth}`Schema.validate() <dataframely.Schema.validate>` raises an exception if any row
fails validation:

```python
import polars as pl
import dataframely as dy


class HouseSchema(dy.Schema):
    zip_code = dy.String(nullable=False, min_length=3)
    num_bedrooms = dy.UInt8(nullable=False)
    num_bathrooms = dy.UInt8(nullable=False)
    price = dy.Float64(nullable=False)

    @dy.rule()
    def reasonable_bathroom_to_bedroom_ratio(cls) -> pl.Expr:
        ratio = pl.col("num_bathrooms") / pl.col("num_bedrooms")
        return (ratio >= 1 / 3) & (ratio <= 3)


df = pl.DataFrame({
    "zip_code": ["01234", "01234", "1", "213", "123", "213"],
    "num_bedrooms": [2, 2, 1, None, None, 2],
    "num_bathrooms": [1, 2, 1, 1, 0, 8],
    "price": [100_000, 110_000, 50_000, 80_000, 60_000, 160_000],
})

# Raises a ValidationError: `df` contains rows that violate the schema.
validated_df: dy.DataFrame[HouseSchema] = HouseSchema.validate(df, cast=True)
```

If any row is invalid, a {class}`~dataframely.ValidationError` is raised with a summary of
which rules failed and how many rows were affected:

```
ValidationError: 2 rules failed validation:
 * Column 'num_bedrooms' failed validation for 1 rules:
   - 'nullability' failed for 2 rows
 * Column 'zip_code' failed validation for 1 rules:
   - 'min_length' failed for 1 rows
```

On success, `validate` returns a data frame with the type hint
`dy.DataFrame[HouseSchema]`, indicating that the data has been validated.
Columns not defined in the schema are silently dropped.

### `filter`: soft validation

{meth}`Schema.filter() <dataframely.Schema.filter>` never raises a
{class}`~dataframely.ValidationError`.
Instead, it returns a tuple of the valid rows and a
{class}`~dataframely.FailureInfo` object carrying details about the rows that failed:

```python
valid_df, failure = HouseSchema.filter(df, cast=True)
```

`valid_df` is a `dy.DataFrame[HouseSchema]` containing only the rows that passed every
rule.
The `failure` object lets you inspect why the remaining rows were rejected (see
[Inspecting failures](#inspecting-failures) below).

### `is_valid`: boolean check

{meth}`Schema.is_valid() <dataframely.Schema.is_valid>` is a convenience method that
returns `True` when all rows pass all rules, and `False` otherwise.
It never raises a {class}`~dataframely.ValidationError`:

```python
if not HouseSchema.is_valid(df, cast=True):
    print("Data does not satisfy HouseSchema")
```

## Structural errors vs. data errors

Dataframely distinguishes two categories of errors:

- **{class}`~dataframely.SchemaError`**: raised when the *structure* of the data frame
  does not match the schema, i.e., a required column is missing or a column has the wrong
  data type and `cast=False`.
  A `SchemaError` is raised by `validate` and `filter`.

- **{class}`~dataframely.ValidationError`**: raised by `validate` when the *content*
  of the data frame violates at least one rule.

`is_valid` catches both types of errors and returns `False` instead of raising.
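
The split between the two error categories can be sketched in plain Python. This is a minimal model under stated assumptions, not dataframely's code: the exception classes are local stand-ins that merely share their names with dataframely's, and `SCHEMA`, `validate`, and `is_valid` are hypothetical helpers operating on a dict of columns.

```python
class SchemaError(Exception):
    """Stand-in for a structural error (missing column, wrong dtype)."""


class ValidationError(Exception):
    """Stand-in for a content error (a rule failed for some row)."""


# Hypothetical schema: column name -> expected Python type.
SCHEMA = {"zip_code": str, "num_bedrooms": int}


def validate(columns: dict[str, list]) -> dict[str, list]:
    # Structural checks run first and do not look at individual rules.
    missing = [name for name in SCHEMA if name not in columns]
    if missing:
        raise SchemaError(f"missing columns: {missing}")
    for name, expected in SCHEMA.items():
        values = [v for v in columns[name] if v is not None]
        if not all(isinstance(v, expected) for v in values):
            raise SchemaError(f"column {name!r} has the wrong data type")
    # Content check: a single hypothetical rule, `num_bedrooms|nullability`.
    if any(v is None for v in columns["num_bedrooms"]):
        raise ValidationError("'num_bedrooms|nullability' failed")
    return {name: columns[name] for name in SCHEMA}


def is_valid(columns: dict[str, list]) -> bool:
    # Both error categories are turned into a plain boolean.
    try:
        validate(columns)
    except (SchemaError, ValidationError):
        return False
    return True


print(is_valid({"zip_code": ["01234"], "num_bedrooms": [2]}))     # True
print(is_valid({"zip_code": ["01234"]}))                          # False (structure)
print(is_valid({"zip_code": ["01234"], "num_bedrooms": [None]}))  # False (content)
```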

## Type casting during validation

All three methods accept a `cast` keyword argument.
When `cast=True`, dataframely attempts to cast each column to the dtype defined in the
schema before running the validation rules.

```python
df_raw = pl.DataFrame({
    "zip_code": ["01234", "67890"],
    "num_bedrooms": [2, 3],
    "num_bathrooms": [1, 1],
    "price": ["100000", "200000"],  # stored as strings
})

# Cast 'price' from String to Float64 before validating
validated_df = HouseSchema.validate(df_raw, cast=True)
```

If a cast fails for a particular row (e.g. a string value cannot be converted to a
number), that row is treated as invalid and included in the `FailureInfo` with the rule
name `<column>|dtype`.
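
The idea of recording a failed cast as a per-row rule failure can be illustrated with a small plain-Python sketch. It assumes nothing about dataframely's internals; `cast_column` and the sample values are hypothetical, and only the rule name pattern `<column>|dtype` is taken from the text above.

```python
def cast_column(values, target):
    """Cast each value; return the cast values and the indices that failed."""
    cast_values, failed = [], []
    for index, value in enumerate(values):
        try:
            cast_values.append(None if value is None else target(value))
        except (TypeError, ValueError):
            cast_values.append(None)
            failed.append(index)
    return cast_values, failed


prices = ["100000", "n/a", "200000"]  # hypothetical raw input
cast_prices, failed_rows = cast_column(prices, float)

# Rows whose cast failed are reported under the rule name "<column>|dtype".
failures = {index: ["price|dtype"] for index in failed_rows}

print(cast_prices)  # [100000.0, None, 200000.0]
print(failures)     # {1: ['price|dtype']}
```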

## Inspecting failures

The {class}`~dataframely.FailureInfo` object returned by `filter` provides several ways
to understand which rows failed and why.

### `counts`: per-rule failure counts

```python
valid_df, failure = HouseSchema.filter(df, cast=True)
print(failure.counts())
```

Returns a dictionary mapping each rule that was violated to the number of rows that
failed it:

```python
{
    "reasonable_bathroom_to_bedroom_ratio": 1,
    "zip_code|min_length": 1,
    "num_bedrooms|nullability": 2,
}
```
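
These counts can be checked by hand. The following plain-Python sketch re-derives them from the example data (with `price` omitted, since no rule refers to it). It is an illustration only, not how dataframely computes them, and it assumes that a rule which cannot be computed because of a null value does not count as a failure, which is consistent with the counts shown above.

```python
from collections import Counter

# The example data from this guide, one dict per row; None is a null value.
rows = [
    {"zip_code": "01234", "num_bedrooms": 2, "num_bathrooms": 1},
    {"zip_code": "01234", "num_bedrooms": 2, "num_bathrooms": 2},
    {"zip_code": "1", "num_bedrooms": 1, "num_bathrooms": 1},
    {"zip_code": "213", "num_bedrooms": None, "num_bathrooms": 1},
    {"zip_code": "123", "num_bedrooms": None, "num_bathrooms": 0},
    {"zip_code": "213", "num_bedrooms": 2, "num_bathrooms": 8},
]


def ratio_ok(row):
    # Assumption: a null operand makes the rule pass rather than fail.
    if row["num_bedrooms"] is None:
        return True
    ratio = row["num_bathrooms"] / row["num_bedrooms"]
    return 1 / 3 <= ratio <= 3


rules = {
    "reasonable_bathroom_to_bedroom_ratio": ratio_ok,
    "zip_code|min_length": lambda row: len(row["zip_code"]) >= 3,
    "num_bedrooms|nullability": lambda row: row["num_bedrooms"] is not None,
}

# Count, per rule, the rows for which the rule evaluates to False.
counts = Counter(
    name for row in rows for name, rule in rules.items() if not rule(row)
)
print(dict(counts))
```

The last row fails the ratio rule (8 / 2 = 4 exceeds 3), the third row has a one-character zip code, and the fourth and fifth rows have a null `num_bedrooms`.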

### `invalid`: the failing rows

```python
failed_rows = failure.invalid()
```

Returns a plain `pl.DataFrame` containing the original data for each row that failed
at least one rule.

### `details`: per-row, per-rule breakdown

```python
details = failure.details()
```

Returns the same rows as `invalid` but augmented with one additional column per rule,
with values `"valid"`, `"invalid"`, or `"unknown"`:

| zip_code | num_bedrooms | … | reasonable_bathroom_to_bedroom_ratio | zip_code\|min_length | num_bedrooms\|nullability |
|----------|--------------|---|--------------------------------------|----------------------|---------------------------|
| 1        | 1            | … | valid                                | invalid              | valid                     |
| 213      | null         | … | valid                                | valid                | invalid                   |
| 123      | null         | … | valid                                | valid                | invalid                   |
| 213      | 2            | … | invalid                              | valid                | valid                     |

A value of `"unknown"` is reported when a rule could not be evaluated reliably, which
can happen when `cast=True` and dtype casting fails for a value in that row.

### `cooccurrence_counts`: co-occurring rule failures

```python
cooccurrences = failure.cooccurrence_counts()
```

Returns a mapping from *sets* of rules to the number of rows in which exactly that set of
rules failed.
This is useful for understanding whether certain rules tend to fail in combination.
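
To make "exactly that set" concrete, here is a small plain-Python sketch of the counting. The per-row failure sets are hypothetical, and the code illustrates the idea rather than dataframely's implementation.

```python
from collections import Counter

# Hypothetical per-row results: the set of rules each invalid row failed.
failed_rules_per_row = [
    {"zip_code|min_length"},
    {"num_bedrooms|nullability"},
    {"num_bedrooms|nullability"},
    {"zip_code|min_length", "num_bedrooms|nullability"},
]

# Sets are unhashable, so frozensets serve as the dictionary keys.
cooccurrences = Counter(frozenset(failed) for failed in failed_rules_per_row)

for failed, count in cooccurrences.items():
    print(sorted(failed), count)
```

The row that failed both rules is counted only under the two-rule set. It does not add to the counts of the single-rule sets, which is what distinguishes this view from per-rule counts.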

## Superfluous columns

When validating, any column in the input data frame that is not defined in the schema is
silently dropped from the output of `validate` and `filter`.
This keeps the output predictable regardless of the input shape.
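
As a rough mental model, the effect resembles selecting only the schema's columns from the input. The plain-Python sketch below uses a dict of columns; `SCHEMA_COLUMNS` and the `listing_url` column are hypothetical, and the column order of the result is a property of this sketch only.

```python
SCHEMA_COLUMNS = ["zip_code", "num_bedrooms"]  # hypothetical schema columns

frame = {
    "zip_code": ["01234", "56789"],
    "num_bedrooms": [2, 3],
    "listing_url": ["https://example.com/1", "https://example.com/2"],
}

# Keep only the columns named in SCHEMA_COLUMNS; everything else is dropped.
output = {name: frame[name] for name in SCHEMA_COLUMNS}
print(list(output))  # ['zip_code', 'num_bedrooms']
```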
