Commit 4de3bb5

docs: explain how schema validation works with Dataframely
Co-authored-by: PeterKretschmerQC <102249383+PeterKretschmerQC@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Quantco/dataframely/sessions/8961276f-2a08-4fea-a5f6-796c9df5923e
1 parent b6d8fb6 commit 4de3bb5

2 files changed

Lines changed: 156 additions & 0 deletions

docs/guides/features/index.md

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@
 ```{toctree}
 :maxdepth: 1
 
+schema-validation
 column-metadata
 data-generation
 primary-keys
Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
# Schema Validation

A {class}`~dataframely.Schema` class specifies the expected structure and content of a polars DataFrame.
It defines:

- **Columns**: the expected column names, data types, and per-column constraints
- **Rules**: additional row-level or group-level validation expressions

## Columns and column-level rules

Each column in a schema is declared by assigning a {class}`~dataframely.Column` instance to a class attribute:

```python
import dataframely as dy


class UserSchema(dy.Schema):
    id = dy.String(primary_key=True)
    age = dy.UInt8(nullable=False)
    email = dy.String(nullable=True)
```

When validating a DataFrame against this schema, dataframely verifies that:

1. **All expected columns are present** with the correct data types.
2. **Column-level constraints** hold. Common constraints include:
   - `nullable=False` (the default): the column must not contain null values.
   - `primary_key=True`: values in this column must be unique. If several columns are marked as
     primary key, their combination must be unique.
     See [Primary Keys](primary-keys.md) for details.
   - Type-specific constraints, e.g., `min_length`/`max_length`/`regex` for {class}`~dataframely.String`
     or `min`/`max` for numeric types.

```{note}
Each column type exposes its own set of constraints. Refer to the
{doc}`API reference </api/columns/index>` for a full list.
```

### The `check` parameter

For one-off constraints that do not have a dedicated parameter, every column type accepts a `check`
argument. It takes a callable that receives the column as a polars expression and must return a
boolean expression:

```python
class SalarySchema(dy.Schema):
    # Only allow salaries that are a multiple of 500.
    salary = dy.Float64(nullable=False, check=lambda col: col % 500 == 0)
```

Multiple checks can be provided as a list or a dictionary:

```python
class SalarySchema(dy.Schema):
    salary = dy.Float64(
        nullable=False,
        check={
            "multiple_of_500": lambda col: col % 500 == 0,
            "at_least_minimum_wage": lambda col: col >= 1_000,
        },
    )
```

## Schema-level validation rules

Column-level constraints only validate a single column in isolation. When you need to express
constraints that span **multiple columns** or depend on **aggregated values**, use the
{func}`@dy.rule() <dataframely.rule>` decorator:

```python
import polars as pl
import dataframely as dy


class InvoiceSchema(dy.Schema):
    admission_date = dy.Date(nullable=False)
    discharge_date = dy.Date(nullable=False)
    amount = dy.Float64(nullable=False)

    @dy.rule()
    def discharge_after_admission(cls) -> pl.Expr:
        return pl.col("discharge_date") >= pl.col("admission_date")
```

The decorated method receives the schema class as its first argument and must return a polars
`Expr` that evaluates to a **boolean value for every row**. A row is considered valid when the
expression evaluates to `True`.

```{tip}
You can reference a column by its name (e.g. `pl.col("discharge_date")`) or through the schema
attribute (e.g. `InvoiceSchema.discharge_date.col`). The latter is refactoring-safe and allows
IDEs to provide auto-completion.
```

### Group rules

Rules can also be defined on **groups of rows** by passing a `group_by` argument to
{func}`@dy.rule() <dataframely.rule>`. The expression is then evaluated per group and must return
an **aggregated boolean** (one value per group):

```python
class HouseSchema(dy.Schema):
    zip_code = dy.String(nullable=False)
    price = dy.Float64(nullable=False)

    @dy.rule(group_by=["zip_code"])
    def minimum_zip_code_count(cls) -> pl.Expr:
        # Require at least two houses per zip code.
        return pl.len() >= 2
```

All rows belonging to a group that fails a group rule are marked as invalid.

## Schema inheritance

Schemas can be extended through standard Python inheritance. The child schema inherits all columns
and rules from its parent:

```python
class BaseSchema(dy.Schema):
    id = dy.String(primary_key=True)
    created_at = dy.Datetime(nullable=False)


class UserSchema(BaseSchema):
    name = dy.String(nullable=False)
    email = dy.String(nullable=True)
```

`UserSchema.column_names()` returns `["id", "created_at", "name", "email"]`. Inheritance can be
arbitrarily deep and supports multiple inheritance, provided that the same column name is not
defined differently in more than one branch.

## Inspecting a schema

You can inspect a schema by printing it or calling `repr()` on it. This shows all columns together
with their constraints and any custom validation rules:

```python
>>> print(InvoiceSchema)
[Schema "InvoiceSchema"]
  Columns:
    - "admission_date": Date()
    - "discharge_date": Date()
    - "amount": Float64()
  Rules:
    - "discharge_after_admission": [(col("discharge_date")) >= (col("admission_date"))]
```

## Validating data

Once a schema is defined, use {meth}`Schema.validate() <dataframely.Schema.validate>` to check a
DataFrame and raise an error on any violation, or {meth}`Schema.filter() <dataframely.Schema.filter>`
for a "soft" validation that returns both the valid rows and a {class}`~dataframely.FailureInfo`
object describing which rows failed and why.

See the [Quickstart](../quickstart.md) for a step-by-step walkthrough.
