Commit 89958e7

docs: missing MASKING-TUTORIAL.md
1 parent d7381bb commit 89958e7

1 file changed

Lines changed: 184 additions & 0 deletions

MASKING-TUTORIAL.md

@@ -0,0 +1,184 @@
# Data masking with DryRun

## Why masking exists

DryRun captures PostgreSQL planner statistics so `EXPLAIN` plans are
realistic without a connection to production. Those statistics include
`most_common_vals` and `histogram_bounds`, which are literal column
values lifted from your tables. For a `users.email` or `users.phone`
column, that is PII landing in `.dryrun/history.db`.

The masking policy is a small YAML file that names sensitive columns.
DryRun reads it at capture time and writes NULL into the offending
stats before anything touches disk.

Masking runs once, inside `init` or `snapshot take`. The masked form is
what lands in `history.db`. `push` and `pull` move bytes and do not
re-mask, so a missing or wrong policy at capture is a permanent leak
and recapture is the only fix.

By default a missing policy is a warning, so fresh checkouts work.
Projects with real PII should set `require_masks = true` in
`dryrun.toml` to turn that warning into a hard error.

The file is shared with [fixturize](https://boringsql.com/fixturize).
fixturize uses the `expr` field to rewrite rows on extract. DryRun
ignores `expr` and uses the column set to NULL planner stats.

The rest of this document shows the small case first, two columns
hand-written, and then the automated path with `fixturize analyze`.

---

## The minimum viable policy

Suppose `users.email` and `users.phone` are the only sensitive columns
you care about. Drop this file at the repo root as
`data-masking-policy.yml`:

```yaml
version: 1

databases:
  dev:
    columns:
      users.email: { expr: "'user_' || id || '@masked.test'", tags: [pii] }
      users.phone: { expr: "'+1-000-000-0000'", tags: [pii] }
```

The `dev` key must match your DryRun `database_id` (see below). The
`version` field must be `1`. `expr` is only read by fixturize, but it
is part of the shared schema, so leave it in even if you never run
fixturize.

Column keys can be `table.column` to match in any schema, or
`schema.table.column` to qualify.

### Wiring it to a profile

DryRun resolves the policy file in three ways, highest priority first:

1. `--masks-file <path>` on `dryrun init`.
2. `masks_file` in the active `dryrun.toml` profile.
3. Auto-discovery, walking up from the working directory looking for
   `data-masking-policy.yml` and stopping at `.git`.
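
The auto-discovery rule in step 3 can be pictured with a small shell sketch. It illustrates the behavior described above and is not DryRun's actual code: `find_masks_file` is a hypothetical helper, and it assumes the directory that contains `.git` is itself checked before the walk stops.

```sh
# Hypothetical helper: walk up from a start directory (default: cwd)
# looking for data-masking-policy.yml.
find_masks_file() {
  _dir=$(cd "${1:-.}" && pwd -P) || return 1
  while :; do
    if [ -f "$_dir/data-masking-policy.yml" ]; then
      printf '%s\n' "$_dir/data-masking-policy.yml"
      return 0
    fi
    # Assumption: the directory holding .git is the last one checked.
    # The filesystem root is a second stop condition so the loop ends.
    if [ -e "$_dir/.git" ] || [ "$_dir" = "/" ]; then
      return 1
    fi
    _dir=$(dirname "$_dir")
  done
}
```

Calling `find_masks_file .` prints the resolved path or returns non-zero when nothing is found. In DryRun itself, `--masks-file` and `masks_file` take precedence over this lookup.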

If the file sits at the repo root, auto-discovery is enough. To be
explicit:

```toml
require_masks = true

[project]
id = "appdb"

[profiles.dev]
db_url = "${DATABASE_URL}"
masks_file = "data-masking-policy.yml"
```

The profile's `database_id` defaults to the profile name (`dev`) and is
what DryRun uses to select a block inside the YAML. The block name in
the policy file must match it, so `databases.dev` rather than
`databases.default`.
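
Because a missing policy is only a warning unless `require_masks = true` is set, it may be worth asserting that setting in CI. The following is a rough sketch, not a DryRun feature: `requires_masks` is a hypothetical helper, and it assumes `require_masks` is a top-level key that appears before the first `[table]` header, as in the `dryrun.toml` shown above.

```sh
# Hypothetical helper: succeed only if the top level of the TOML file
# (default: dryrun.toml) sets require_masks = true.
requires_masks() {
  awk '
    # Drop comments and whitespace so "require_masks = true  # x" normalizes.
    { line = $0; sub(/#.*/, "", line); gsub(/[ \t]/, "", line) }
    # Stop at the first table header: later keys are not top-level.
    line ~ /^\[/                 { exit }
    line == "require_masks=true" { found = 1; exit }
    END { exit !found }
  ' "${1:-dryrun.toml}"
}
```

A CI step could then run `requires_masks dryrun.toml || exit 1` before any capture.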

### Capture and verify

```sh
dryrun --db "$DATABASE_URL" --profile dev init
```

The summary line names how many columns got blanked:

```
Captured schema: 24 tables, 3 views, 8 functions
Schema: .dryrun/schema.json
Planner: 24 tables, 41 indexes, 312 columns
Masked: 2 planner-stats columns
Activity: node=db-primary, 24 tables, 41 indexes
```

Confirm it on the history store directly:

```sh
sqlite3 .dryrun/history.db \
  "SELECT column_name, most_common_vals FROM planner_column_stats
   WHERE column_name IN ('email','phone');"
```

Masked columns show NULL `most_common_vals`. Non-sensitive columns keep
their statistics, so plans on those columns stay realistic.
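
The same query can be turned into a pass/fail check, for example in a pre-push hook. This sketch relies only on the table and column names used in the query above. `assert_masked` is a hypothetical helper; it interpolates its arguments into SQL, so pass only fixed column names, never untrusted input. It checks `most_common_vals` only.

```sh
# Usage: assert_masked <history.db> <column_name>...
# Hypothetical helper: fail if any named column still carries MCVs.
assert_masked() {
  db=$1; shift
  for col in "$@"; do
    n=$(sqlite3 "$db" "SELECT count(*) FROM planner_column_stats
                       WHERE column_name = '$col'
                         AND most_common_vals IS NOT NULL;") || return 2
    if [ "$n" -ne 0 ]; then
      echo "unmasked most_common_vals for column: $col" >&2
      return 1
    fi
  done
}
```

For the two-column policy in this tutorial, the call would be `assert_masked .dryrun/history.db email phone`.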

Regardless of policy, DryRun strips `jsonb` MCV values at capture as an
always-on backstop. `histogram_bounds` is not auto-stripped, so list
sensitive jsonb columns in the policy explicitly.

For a one-off opt-out, `dryrun init --no-masks` writes raw planner
stats to `history.db`. It is refused when `require_masks = true`. To
select a subset of columns by tag, group them under `policies` in the
YAML and pass `--mask-policy <name>`. Without it, every listed column
is masked.

---

## Scaling up with `fixturize analyze`

Two columns are easy to type. A real schema has dozens, and the names
drift over time. `fixturize analyze` scans the live schema, flags
columns that look like PII by name and type, and emits a policy in the
same format.

```sh
fixturize analyze \
  --connection "$DATABASE_URL" \
  --yaml \
  --output data-masking-policy.yml
```

Useful flags:

| Flag | Purpose |
|------|---------|
| `--min-confidence low\|medium\|high` | Drop low-confidence guesses. Default `low`. |
| `--root users --depth 2` | Limit the scan to tables reachable from a root table. |
| `--schema public` | Default schema for unqualified names. |

Run it once without `--yaml` to eyeball the report and confidence
levels:

```sh
fixturize analyze --connection "$DATABASE_URL" --min-confidence medium
```

The generated file looks identical to the hand-written one above, with
one caveat. `fixturize analyze` always writes the block under
`databases.default`. DryRun looks up the block by your profile's
`database_id` and fails with an error if it does not find a match.
Rename `default` to your actual `database_id`. Multi-DB projects need
one entry per captured database, even if empty.
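
The rename is a one-line edit, but it is easy to forget after regenerating the file. A minimal sketch of scripting it follows. `rename_default_block` is a hypothetical helper, and it assumes the generated YAML indents the block name by exactly two spaces under `databases:` and that the target `database_id` is a plain identifier with no `/` or `&` in it.

```sh
# Usage: rename_default_block <policy.yml> <database_id>
# Hypothetical helper: rename the generated "default" block in place.
rename_default_block() {
  file=$1 id=$2
  tmp=$(mktemp) || return 1
  # Rewrite only a line that is exactly "  default:"; all other lines,
  # including column keys that contain "default", pass through unchanged.
  sed "s/^  default:[[:space:]]*\$/  ${id}:/" "$file" > "$tmp" &&
    cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

After `rename_default_block data-masking-policy.yml dev`, the block sits under `databases.dev` and matches the `dev` profile used throughout this tutorial.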

Automated heuristics miss domain-specific PII like a `notes` text
column or a `metadata` jsonb blob, and over-flag harmless lookups, so
the output is a starting point rather than ground truth.

Because both tools key off the same column keys (`table.column` or
`schema.table.column`), the same file protects test-data extraction in
fixturize (via `expr`) and local planner-stats capture in DryRun (via
the column set).

---

## Troubleshooting

| Symptom | Fix |
|---------|-----|
| `masks file has no entry for database_id "X"` | Rename the YAML block to match the profile's `database_id`. Multi-DB projects need one entry per captured database. |
| `no data-masking-policy.yml resolved` warning | Expected on fresh checkouts. To enforce, set `require_masks = true` or pass `--masks-file`. |
| `require_masks=true ... must exist` | Provide a file via any of the three resolution paths, or unset `require_masks`. |

---

## Related documents

- [`SECURITY.md`](SECURITY.md) covers DryRun's security model and where
  masking sits in it.
- `dryrun-readonly-role.sql` is the minimum-privilege role for capture.
