| 1 | +# Schema Validation |
| 2 | + |
| 3 | +Dataframely validates a data frame against a {class}`~dataframely.Schema` by evaluating |
| 4 | +every validation rule as a native polars expression. |
| 5 | +This guide explains the validation pipeline in detail, including the different validation |
| 6 | +methods, how rules are applied, and how to introspect failures. |
| 7 | + |
| 8 | +## How validation rules are applied |
| 9 | + |
| 10 | +When a data frame is validated, dataframely collects the following rules and evaluates them |
| 11 | +in a single polars pass: |
| 12 | + |
| 13 | +1. **Column-level rules** — Automatically derived from the parameters of each column |
| 14 | + definition (e.g. `nullable`, `min_length`, `regex`, …). These rules are named |
| 15 | + `<column>|<rule>`, e.g. `zip_code|min_length` or `num_bedrooms|nullability`. |
| 16 | + |
| 17 | +2. **Primary key rule** — If any column in the schema has `primary_key=True`, dataframely |
| 18 | + automatically adds a `primary_key` rule that checks that each combination of values |
| 19 | + across the primary key columns is unique. |
| 20 | + |
| 21 | +3. **Custom schema-level rules** — Rules defined on the schema class using the |
| 22 | + {func}`~dataframely.rule` decorator. These rules may reference any column in the schema |
| 23 | + and can also aggregate across rows when a `group_by` parameter is provided. |
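The difference between row-wise and grouped rules can be modelled in plain Python. This is a conceptual sketch, not dataframely's implementation (which evaluates polars expressions). It assumes that a grouped rule yields one verdict per group, which then applies to every row of that group, and it treats the primary key check as the special case "each key combination occurs exactly once". The helpers `grouped_rule` and `primary_key_rule`, the `house_id` column, and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Callable

Row = dict[str, object]


def grouped_rule(
    rows: list[Row],
    group_by: list[str],
    predicate: Callable[[list[Row]], bool],
) -> list[bool]:
    """Evaluate `predicate` once per group and broadcast the verdict to its rows."""
    groups: dict[tuple, list[Row]] = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in group_by)].append(row)
    verdicts = {key: predicate(members) for key, members in groups.items()}
    return [verdicts[tuple(row[c] for c in group_by)] for row in rows]


def primary_key_rule(rows: list[Row], key_columns: list[str]) -> list[bool]:
    """A row passes when its key combination occurs exactly once."""
    return grouped_rule(rows, key_columns, lambda members: len(members) == 1)


rows = [
    {"zip_code": "01234", "house_id": 1},
    {"zip_code": "01234", "house_id": 2},
    {"zip_code": "98765", "house_id": 2},
    {"zip_code": "98765", "house_id": 2},
    {"zip_code": "55555", "house_id": 3},
]

# Hypothetical grouped rule: every zip code must appear at least twice.
min_count = grouped_rule(rows, ["zip_code"], lambda members: len(members) >= 2)

# Primary key over (zip_code, house_id): the third and fourth rows collide.
pk = primary_key_rule(rows, ["zip_code", "house_id"])
```

In this model a single offending group invalidates all of its rows, which is why both colliding rows fail the primary key check rather than only the second one.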
| 24 | + |
| 25 | +For every row, the combined result of all rules determines whether the row is valid: |
| 26 | +a row is valid only when every applicable rule evaluates to `True`. |
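The "every rule must hold" semantics can be illustrated with a small plain-Python model. It is only a sketch of the logic, not how dataframely evaluates rules internally. The rule names follow the `<column>|<rule>` convention described above. Treating a rule that cannot be evaluated (because an operand is null) as passing is an assumption made for this example.

```python
from typing import Callable

Row = dict[str, object]


def ratio_ok(row: Row) -> bool:
    beds, baths = row["num_bedrooms"], row["num_bathrooms"]
    if beds is None or baths is None:
        # Assumption: a rule that cannot be evaluated does not fail the row.
        return True
    return 1 / 3 <= baths / beds <= 3


RULES: dict[str, Callable[[Row], bool]] = {
    "zip_code|min_length": lambda r: len(r["zip_code"]) >= 3,
    "num_bedrooms|nullability": lambda r: r["num_bedrooms"] is not None,
    "reasonable_bathroom_to_bedroom_ratio": ratio_ok,
}


def evaluate(row: Row) -> dict[str, bool]:
    """Return the verdict of every rule for a single row."""
    return {name: rule(row) for name, rule in RULES.items()}


columns = {
    "zip_code": ["01234", "01234", "1", "213", "123", "213"],
    "num_bedrooms": [2, 2, 1, None, None, 2],
    "num_bathrooms": [1, 2, 1, 1, 0, 8],
}
rows = [dict(zip(columns, values)) for values in zip(*columns.values())]

# A row is valid only if all of its rule verdicts are True.
valid_indices = [i for i, row in enumerate(rows) if all(evaluate(row).values())]
print(valid_indices)
```

With this data only the first two rows pass every rule. Each of the remaining rows fails exactly one rule.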
| 27 | + |
| 28 | +## Validation methods |
| 29 | + |
| 30 | +Dataframely exposes three methods for checking a data frame against a schema. |
| 31 | + |
| 32 | +### `validate` — strict validation |
| 33 | + |
| 34 | +{meth}`Schema.validate() <dataframely.Schema.validate>` raises an exception if any row |
| 35 | +fails validation: |
| 36 | + |
| 37 | +```python |
| 38 | +import polars as pl |
| 39 | +import dataframely as dy |
| 40 | + |
| 41 | + |
| 42 | +class HouseSchema(dy.Schema): |
| 43 | +    zip_code = dy.String(nullable=False, min_length=3) |
| 44 | +    num_bedrooms = dy.UInt8(nullable=False) |
| 45 | +    num_bathrooms = dy.UInt8(nullable=False) |
| 46 | +    price = dy.Float64(nullable=False) |
| 47 | + |
| 48 | +    @dy.rule() |
| 49 | +    def reasonable_bathroom_to_bedroom_ratio(cls) -> pl.Expr: |
| 50 | +        ratio = pl.col("num_bathrooms") / pl.col("num_bedrooms") |
| 51 | +        return (ratio >= 1 / 3) & (ratio <= 3) |
| 52 | + |
| 53 | + |
| 54 | +df = pl.DataFrame({ |
| 55 | +    "zip_code": ["01234", "01234", "1", "213", "123", "213"], |
| 56 | +    "num_bedrooms": [2, 2, 1, None, None, 2], |
| 57 | +    "num_bathrooms": [1, 2, 1, 1, 0, 8], |
| 58 | +    "price": [100_000, 110_000, 50_000, 80_000, 60_000, 160_000], |
| 59 | +}) |
| 60 | + |
| 61 | +validated_df: dy.DataFrame[HouseSchema] = HouseSchema.validate(df, cast=True) |
| 62 | +``` |
| 63 | + |
| 64 | +If any row is invalid, a {class}`~dataframely.ValidationError` is raised with a summary of |
| 65 | +which rules failed and how many rows were affected: |
| 66 | + |
| 67 | +``` |
| 68 | +ValidationError: 2 rules failed validation: |
| 69 | +* Column 'num_bedrooms' failed validation for 1 rules: |
| 70 | +- 'nullability' failed for 2 rows |
| 71 | +* Column 'zip_code' failed validation for 1 rules: |
| 72 | +- 'min_length' failed for 1 rows |
| 73 | +``` |
| 74 | + |
| 75 | +On success, `validate` returns a data frame with the type hint |
| 76 | +`dy.DataFrame[HouseSchema]`, indicating that the data has been validated. |
| 77 | +Columns not defined in the schema are silently dropped. |
| 78 | + |
| 79 | +### `filter` — soft validation |
| 80 | + |
| 81 | +{meth}`Schema.filter() <dataframely.Schema.filter>` never raises a |
| 82 | +{class}`~dataframely.ValidationError`. |
| 83 | +Instead, it returns a tuple of the valid rows and a |
| 84 | +{class}`~dataframely.FailureInfo` object carrying details about the rows that failed: |
| 85 | + |
| 86 | +```python |
| 87 | +valid_df, failure = HouseSchema.filter(df, cast=True) |
| 88 | +``` |
| 89 | + |
| 90 | +`valid_df` is a `dy.DataFrame[HouseSchema]` containing only the rows that passed every |
| 91 | +rule. |
| 92 | +The `failure` object lets you inspect why the remaining rows were rejected (see |
| 93 | +[Inspecting failures](#inspecting-failures) below). |
| 94 | + |
| 95 | +### `is_valid` — boolean check |
| 96 | + |
| 97 | +{meth}`Schema.is_valid() <dataframely.Schema.is_valid>` is a convenience method that |
| 98 | +returns `True` when all rows pass all rules, and `False` otherwise. |
| 99 | +It never raises a {class}`~dataframely.ValidationError`: |
| 100 | + |
| 101 | +```python |
| 102 | +if not HouseSchema.is_valid(df, cast=True): |
| 103 | +    print("Data does not satisfy HouseSchema") |
| 104 | +``` |
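The relationship between the three methods can be modelled in plain Python. This sketch is not dataframely's implementation. It assumes that the strict and boolean variants can be expressed on top of the filtering one. `RowsInvalid`, `filter_rows`, `validate_rows`, and `is_valid_rows` are hypothetical names.

```python
from typing import Callable

Row = dict[str, object]

RULES: dict[str, Callable[[Row], bool]] = {
    "zip_code|min_length": lambda r: len(r["zip_code"]) >= 3,
    "num_bedrooms|nullability": lambda r: r["num_bedrooms"] is not None,
}


class RowsInvalid(Exception):
    """Hypothetical stand-in for a data-level validation error."""


def filter_rows(rows: list[Row]) -> tuple[list[Row], list[tuple[Row, list[str]]]]:
    """Soft validation: split rows into valid ones and (row, failed rules) pairs."""
    valid, failed = [], []
    for row in rows:
        broken = [name for name, rule in RULES.items() if not rule(row)]
        if broken:
            failed.append((row, broken))
        else:
            valid.append(row)
    return valid, failed


def validate_rows(rows: list[Row]) -> list[Row]:
    """Strict validation: raise as soon as any row is rejected."""
    valid, failed = filter_rows(rows)
    if failed:
        raise RowsInvalid(f"{len(failed)} rows failed validation")
    return valid


def is_valid_rows(rows: list[Row]) -> bool:
    """Boolean check: report the outcome instead of raising."""
    try:
        validate_rows(rows)
    except RowsInvalid:
        return False
    return True


rows = [
    {"zip_code": "01234", "num_bedrooms": 2},
    {"zip_code": "1", "num_bedrooms": None},
]
valid, failed = filter_rows(rows)
```

The second row fails both rules here, so `filter_rows` keeps one row, `validate_rows(rows)` raises, and `is_valid_rows(rows)` returns `False`.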
| 105 | + |
| 106 | +## Structural errors vs. data errors |
| 107 | + |
| 108 | +Dataframely distinguishes two categories of errors: |
| 109 | + |
| 110 | +- **{class}`~dataframely.SchemaError`** — raised when the *structure* of the data frame |
| 111 | + does not match the schema, i.e., a required column is missing or a column has the wrong |
| 112 | + data type and `cast=False`. |
| 113 | + A `SchemaError` is raised by `validate` and `filter`; `is_valid` returns `False` instead. |
| 114 | + |
| 115 | +- **{class}`~dataframely.ValidationError`** — raised by `validate` when the *content* |
| 116 | + of the data frame violates at least one rule. |
| 117 | + |
| 118 | +`is_valid` catches both types of errors and returns `False` instead of raising. |
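A structural check of this kind can be sketched in plain Python. The sketch assumes that structure is checked before any rule is evaluated and that `cast=True` defers data type mismatches to the casting step. Python types stand in for polars dtypes, and `StructureMismatch`, `check_structure`, and `structurally_ok` are hypothetical names, not part of dataframely.

```python
class StructureMismatch(Exception):
    """Hypothetical stand-in for a structural (schema-level) error."""


# Python types stand in for polars dtypes in this sketch.
EXPECTED: dict[str, type] = {"zip_code": str, "num_bedrooms": int}


def check_structure(columns: dict[str, list], cast: bool) -> None:
    """Raise if a column is missing or, without casting, has the wrong type."""
    missing = [name for name in EXPECTED if name not in columns]
    if missing:
        raise StructureMismatch(f"missing columns: {missing}")
    if cast:
        return  # dtype mismatches are left to the casting step
    for name, expected in EXPECTED.items():
        if any(v is not None and not isinstance(v, expected) for v in columns[name]):
            raise StructureMismatch(f"column {name!r} has the wrong data type")


def structurally_ok(columns: dict[str, list], cast: bool) -> bool:
    """Boolean wrapper that converts the error into False."""
    try:
        check_structure(columns, cast)
    except StructureMismatch:
        return False
    return True
```

For example, `structurally_ok({"zip_code": ["01234"], "num_bedrooms": ["2"]}, cast=False)` is `False` because the bedroom counts are strings, while the same input with `cast=True` passes the structural check.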
| 119 | + |
| 120 | +## Type casting during validation |
| 121 | + |
| 122 | +All three methods accept a `cast` keyword argument. |
| 123 | +When `cast=True`, dataframely attempts to cast each column to the dtype defined in the |
| 124 | +schema before running the validation rules. |
| 125 | + |
| 126 | +```python |
| 127 | +df_raw = pl.DataFrame({ |
| 128 | +    "zip_code": ["01234", "67890"], |
| 129 | +    "num_bedrooms": [2, 3], |
| 130 | +    "num_bathrooms": [1, 1], |
| 131 | +    "price": ["100000", "200000"],  # stored as strings |
| 132 | +}) |
| 133 | + |
| 134 | +# Cast 'price' from String to Float64 before validating |
| 135 | +validated_df = HouseSchema.validate(df_raw, cast=True) |
| 136 | +``` |
| 137 | + |
| 138 | +If a cast fails for a particular row (e.g. a string value cannot be converted to a |
| 139 | +number), that row is treated as invalid and included in the `FailureInfo` with the rule |
| 140 | +name `<column>|dtype`. |
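The per-row handling of failed casts can be illustrated with a plain-Python sketch. It is a conceptual model only: `cast_column` is a hypothetical helper, Python's `float` stands in for the polars cast, and replacing a value that cannot be cast with `None` is an assumption of this example.

```python
from typing import Callable, Optional


def cast_column(
    name: str,
    values: list[Optional[str]],
    target: Callable[[str], object],
) -> tuple[list[object], list[tuple[int, str]]]:
    """Cast each value; record (row index, rule name) for every failed cast."""
    casted: list[object] = []
    failures: list[tuple[int, str]] = []
    for index, value in enumerate(values):
        try:
            # Nulls are passed through; nullability is a separate rule.
            casted.append(None if value is None else target(value))
        except (TypeError, ValueError):
            casted.append(None)
            failures.append((index, f"{name}|dtype"))
    return casted, failures


prices, failures = cast_column("price", ["100000", "abc", None, "2.5e5"], float)
```

Only the second value cannot be converted, so it is the only row reported under `price|dtype`. The original null in the third row is not a cast failure.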
| 141 | + |
| 142 | +## Inspecting failures |
| 143 | + |
| 144 | +The {class}`~dataframely.FailureInfo` object returned by `filter` provides several ways |
| 145 | +to understand which rows failed and why. |
| 146 | + |
| 147 | +### `counts` — per-rule failure counts |
| 148 | + |
| 149 | +```python |
| 150 | +valid_df, failure = HouseSchema.filter(df, cast=True) |
| 151 | +print(failure.counts()) |
| 152 | +``` |
| 153 | + |
| 154 | +Returns a dictionary mapping each rule that was violated to the number of rows that |
| 155 | +failed it: |
| 156 | + |
| 157 | +```python |
| 158 | +{ |
| 159 | +    "reasonable_bathroom_to_bedroom_ratio": 1, |
| 160 | +    "zip_code|min_length": 1, |
| 161 | +    "num_bedrooms|nullability": 2, |
| 162 | +} |
| 163 | +``` |
| 164 | + |
| 165 | +### `invalid` — the failing rows |
| 166 | + |
| 167 | +```python |
| 168 | +failed_rows = failure.invalid() |
| 169 | +``` |
| 170 | + |
| 171 | +Returns a plain `pl.DataFrame` containing the original data for each row that failed |
| 172 | +at least one rule. |
| 173 | + |
| 174 | +### `details` — per-row per-rule breakdown |
| 175 | + |
| 176 | +```python |
| 177 | +details = failure.details() |
| 178 | +``` |
| 179 | + |
| 180 | +Returns the same rows as `invalid` but augmented with one additional column per rule, |
| 181 | +with values `"valid"`, `"invalid"`, or `"unknown"`: |
| 182 | + |
| 183 | +| zip_code | num_bedrooms | … | reasonable_bathroom_to_bedroom_ratio | zip_code\|min_length | num_bedrooms\|nullability | |
| 184 | +|----------|-------------|---|--------------------------------------|----------------------|--------------------------| |
| 185 | +| 1 | 1 | … | valid | invalid | valid | |
| 186 | +| 213 | null | … | valid | valid | invalid | |
| 187 | +| 213 | 2 | … | invalid | valid | valid | |
| 188 | + |
| 189 | +A value of `"unknown"` is reported when a rule could not be evaluated reliably, which |
| 190 | +can happen when `cast=True` and dtype casting fails for a value in that row. |
| 191 | + |
| 192 | +### `cooccurrence_counts` — co-occurring rule failures |
| 193 | + |
| 194 | +```python |
| 195 | +cooccurrences = failure.cooccurrence_counts() |
| 196 | +``` |
| 197 | + |
| 198 | +Returns a mapping from *sets* of rules to the number of rows where exactly those rules |
| 199 | +all failed together. |
| 200 | +This is useful for understanding whether certain rules tend to fail in combination. |
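To see how co-occurrence counts differ from per-rule counts, consider this plain-Python sketch. The failure sets are made-up illustrative data rather than output from `HouseSchema`. Using `frozenset` keys is an assumption about how such a mapping could be represented.

```python
from collections import Counter

# One entry per invalid row: the set of rules that row failed (illustrative data).
failed_rules_per_row = [
    {"zip_code|min_length"},
    {"num_bedrooms|nullability"},
    {"num_bedrooms|nullability", "zip_code|min_length"},
    {"num_bedrooms|nullability"},
]

# Per-rule counts: a row failing two rules contributes to both rules.
counts = Counter(rule for failed in failed_rules_per_row for rule in failed)

# Co-occurrence counts: each row contributes to exactly one set of rules.
cooccurrence = Counter(frozenset(failed) for failed in failed_rules_per_row)
```

The per-rule counts sum to five although there are only four invalid rows, because the third row is counted under both rules. The co-occurrence counts always sum to the number of invalid rows.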
| 201 | + |
| 202 | +## Superfluous columns |
| 203 | + |
| 204 | +When validating, any column in the input data frame that is not defined in the schema is |
| 205 | +silently dropped from the output of `validate` and `filter`. |
| 206 | +This keeps the output predictable regardless of the input shape. |