pointblank/docs/user-guide/preprocessing.qmd at 2dc6c8b692065e85bd691023149d238ecefa7158 · posit-dev/pointblank · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
---
title: Preprocessing
jupyter: python3
toc-expand: 2
html-table-processing: none
bread-crumbs: true
---

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_footer=False)
```

While the available validation methods can do a lot for you, there's likewise a lot of things you
*can't* easily do with them. What if you wanted to validate that

- string lengths in a column are less than 10 characters?
- the median of values in a column is less than the median of values in another column?
- there are at least three instances of every categorical value in a column?

These constitute more sophisticated validation requirements, yet such examinations are quite
prevalent in practice. Rather than expanding our library to encompass every conceivable validation
scenario (a pursuit that would yield an unwieldy and potentially infinite collection) we instead
employ a more elegant approach. By transforming the table under examination through judicious
preprocessing and exposing key metrics, we may subsequently employ the existing collection of
validation methods. This compositional strategy affords us considerable analytical power while
maintaining conceptual clarity and implementation parsimony.

Central to this approach is the idea of composability. Pointblank makes it easy to safely transform
the target table for a given validation via the `pre=` argument. Any computed columns are available
for the (short) lifetime of the validation step during interrogation. This composability means:

1. we can validate on different forms of the initial dataset (e.g., validating on
aggregate forms, validating on calculated columns, etc.)
2. there's no need to start an entirely new validation process for each transformed version of the
data (i.e., one tabular report could be produced instead of several)

This compositional paradigm allows us to use data transformation effectively within our validation
workflows, maintaining both flexibility and clarity in our data quality assessments.

## Transforming Data with Lambda Functions

Now, through examples, let's look at the process of performing the validations mentioned above.
We'll use the `small_table` dataset for all of the examples. Here it is in its entirety:

```{python}
#| echo: false
pb.preview(pb.load_dataset(dataset="small_table", tbl_type="polars"), n_head=20, n_tail=20)
```

In getting to grips with the basics, we'll try to validate that string lengths in the `b` column are
less than 10 characters. We can't directly use the `~~Validate.col_vals_lt()` validation method with
that column because it is meant to be used with a column of numeric values. Let's just give that
method what it needs and create a column with string lengths!

The target table is a Polars DataFrame so we'll provide a lambda function that uses the Polars API
to add in that numeric column:

```{python}
import polars as pl

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        tbl_name="small_table",
        label="String lengths"
    )
    .col_vals_lt(

        # The generated column, via `pre=` (see below) ---
        columns="string_lengths",

        # The string length value to be less than ---
        value=10,

        # Polars expression that modifies the table ---
        pre=lambda df: df.with_columns(string_lengths=pl.col("b").str.len_chars())
    )
    .interrogate()
)
```

The validation was successfully constructed and we can see from the validation report table that all
strings in `b` had lengths less than 10 characters. Also note that the icon under the `TBL` column
is no longer a rightward-facing arrow, but one that is indicative of a transformation taking place.

Let's examine the transformation approach more closely. In the previous example, we're not directly
testing the `b` column itself. Instead, we're validating the `string_lengths` column that was
generated by the lambda function provided to `pre=`. The Polars API's `with_columns()` method does
the heavy lifting, creating numerical values that represent each string's length in the original
column.

That transformation occurs only during interrogation and only for that validation step. Any prior or
subsequent steps would normally use the as-provided `small_table`. Having the possibility for
data transformation being isolated at the step level means that you don't have to generate separate
validation plans for each form of the data, you're free to fluidly transform the target table as
necessary for perform validations on different representations of the data.

## Using Custom Functions for Preprocessing

While lambda functions work well for simple transformations, custom named functions can make your
validation code more organized and reusable, especially for complex preprocessing logic. Let's
implement the same string length validation using a dedicated function:

```{python}
def add_string_lengths(df):
    # This generates string length from a column `b`; the new column with
    # the values is called `string_lengths` (will be placed as the last column)
    return df.with_columns(string_lengths=pl.col("b").str.len_chars())

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        tbl_name="small_table",
        label="String lengths for column `b`."
    )
    .col_vals_lt(

        # Use of a column selector function to select the last column ---
        columns=pb.last_n(1),

        # The string length to be less than ---
        value=10,

        # Custom function for generating string lengths in a new column ---
        pre=add_string_lengths
    )
    .interrogate()
)
```

The column-generating logic was placed in the `add_string_lengths()` function, which is then passed
to `pre=`. Notice we're using `pb.last_n(1)` in the `columns` parameter. This is a convenient column
selector that targets the last column in the DataFrame, which in our case is the newly created
`string_lengths` column. This saves us from having to explicitly write out the column name, making
our code more adaptable if column names change. Despite not specifying the name directly, you'll
still see the actual column name (`string_lengths`) displayed in the validation report.

## Creating Parameterized Preprocessing Functions

So far we've used simple functions and lambdas, but sometimes you may want to create more flexible
preprocessing functions that can be configured with parameters. Let's create a reusable function
that can calculate string lengths for any column:

```{python}
def string_length_calculator(column_name):
    """Returns a preprocessing function that calculates string lengths for the specified column."""
    def preprocessor(df):
        return df.with_columns(string_lengths=pl.col(column_name).str.len_chars())
    return preprocessor

# Validate string lengths in column b
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        tbl_name="small_table",
        label="String lengths for column `b`."
    )
    .col_vals_lt(
        columns=pb.last_n(1),
        value=10,
        pre=string_length_calculator(column_name="b")
    )
    .interrogate()
)
```

This pattern is called a *function factory*, which is a function that creates and returns another
function. The outer function (`string_length_calculator()`) accepts parameters that customize the
behavior of the returned preprocessing function. The inner function (`preprocessor()`) is what
actually gets called during validation.

This approach offers several benefits as it:

- creates reusable, configurable preprocessing functions
- keeps your validation code DRY
- allows you to separate configuration from implementation
- enables easy application of the same transformation to different columns

You could extend this pattern to create even more sophisticated preprocessing functions with
multiple parameters, default values, and complex logic.

## Using Narwhals to Preprocess Many Types of DataFrames

In this previous example we used a Polars table. You might have a situation where you perform data
validation variously on Pandas and Polars DataFrames. This is where Narwhals becomes handy: it
provides a single, consistent API that works across multiple DataFrame types, eliminating the need
to learn and switch between different APIs depending on your data source.

Let's obtain `small_table` as a Pandas DataFrame. We'll construct a validation step to verify that
the median of column `c` is greater than the median in column `a`.

```{python}
import narwhals as nw

# Define preprocessing function using Narwhals for cross-backend compatibility
def get_median_columns_c_and_a(df):
    return nw.from_native(df).select(nw.median("c"), nw.median("a"))

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="pandas"),
        tbl_name="small_table",
        label="Median comparison.",
    )
    .col_vals_gt(
        columns="c",
        value=pb.col("a"),

        # Custom function to modify the table; generates table with columns `c` and `a` ---
        pre=get_median_columns_c_and_a
    )
    .interrogate()
)
```

The goal is to check that the median value of `c` is greater than the corresponding median of
column `a`, which we set up through the `columns=` and `value=` parameters in the
`~~Validate.col_vals_gt()` method.

There's a bit to unpack here so let's look at at the lambda function first. Narwhals can translate
a Pandas DataFrame to a Narwhals DataFrame with its `from_native()` function. After that initiating
step, you're free to use the Narwhals API (which is modeled on a subset of the Polars API) to do the
necessary data transformation. In this case, we are getting the medians of the `c` and `a` columns
and ending up with a one-row, two-column table.

We should note that the transformed table is, perhaps surprisingly, a Narwhals DataFrame (we didn't
have to go back to a Pandas DataFrame by using `.to_native()`). Pointblank is able to work directly
with the Narwhals DataFrame for validation purposes, which makes the workflow more concise.

One more thing to note: Pointblank provides a convenient syntactic sugar for working with Narwhals.
If you name the lambda parameter `dfn` instead of `df`, the system automatically applies
`nw.from_native()` to the input DataFrame first. This lets you write more concise code without
having to explicitly convert the DataFrame to a Narwhals format.

## Swapping in a Totally Different DataFrame

Sometimes data validation requires looking at completely transformed versions of your data (such as
aggregated summaries, pivoted views, or even reference tables). While this approach goes against the
typical paradigm of validating a single *target table*, there are legitimate use cases where you
might need to validate properties that only emerge after significant transformations.

Let's now try to prepare the final validation scenario, checking that there are at least three
instances of every categorical value in column `f` (which contains string values in the set of
`"low"`, `"mid"`, and `"high"`). This time, we'll prepare the transformed table (transformed by
Polars expressions) outside of the Pointblank code.

```{python}
data_original = pb.load_dataset(dataset="small_table", tbl_type="polars")
data_transformed = data_original.group_by("f").len(name="n")

data_transformed
```

Then, we'll plug in the `data_transformed` DataFrame with a preprocessing function:

```{python}
# Define preprocessing function to use the transformed data
def use_transformed_data(df):
    return data_transformed

(
    pb.Validate(
        data=data_original,
        tbl_name="small_table",
        label="Category counts.",
    )
    .col_vals_ge(
        columns="n",
        value=3,
        pre=use_transformed_data
    )
    .interrogate()
)
```

We can see from the validation report table that there are three test units. This corresponds to a
row for each of the categorical value counts. From the report, we find that two of the three test
units are passing test units (turns out there are only two instances of `"mid"` in column `f`).

Note that the swapped-in table can be any table type that Pointblank supports, like a Polars
DataFrame (as shown here), a Pandas DataFrame, a Narwhals DataFrame, or any other compatible format.
This flexibility allows you to validate properties of your data that might only be apparent after
significant reshaping or aggregation.

## Conclusion

The preprocessing capabilities in Pointblank provide the power and flexibility for validating
complex data properties beyond what's directly possible with the standard validation methods.
Through the `pre=` parameter, you can:

- transform your data on-the-fly with computed columns
- generate aggregated metrics to validate statistical properties
- work seamlessly across different DataFrame types using Narwhals
- swap in completely different tables when validating properties that emerge only after
transformation

By combining these preprocessing techniques with Pointblank's validation methods, you can create
comprehensive data quality checks that address virtually any validation scenario without needing an
endless library of specialized validation functions. This composable approach keeps your validation
code concise while allowing you to verify even the most complex data quality requirements.

Remember that preprocessing happens just for the specific validation step, keeping your validation
plan organized and maintaining the integrity of your original data throughout the rest of the
validation process.