|
1 | | -## Updating data |
| 1 | +# Contributing to policyengine-uk-data |
2 | 2 |
|
3 | | -If your changes present a non-bugfix change to one or more datasets which are cloud-hosted (FRS and EFRS), then please change both the filename and URL (in both the class definition file and in `storage/upload_completed_datasets.py`). This enables us to store historical versions of datasets separately and reproducibly. |
| 3 | +See the [shared PolicyEngine contribution guide](https://github.com/PolicyEngine/.github/blob/main/CONTRIBUTING.md) for cross-repo conventions (towncrier changelog fragments, `uv run`, PR description format, anti-patterns). This file covers policyengine-uk-data specifics. |
4 | 4 |
|
5 | | -## Updating the versioning |
| 5 | +## Commands |
6 | 6 |
|
7 | | -Please add to `changelog.yaml` and then run `make changelog` before committing the results ONCE in this PR. |
| 7 | +```bash |
| 8 | +make install # install deps (uv) |
| 9 | +make format # format (required) |
| 10 | +make download # download raw FRS + SPI inputs from HF (needs HUGGING_FACE_TOKEN) |
| 11 | +make data # full dataset build (impute, calibrate, upload) |
| 12 | +make test # test suite |
| 13 | +uv run pytest policyengine_uk_data/tests/path/to/test.py -v |
| 14 | +``` |
| 15 | + |
| 16 | +Python 3.13+. Default branch: `main`. Raw FRS / SPI microdata live on HuggingFace; set `HUGGING_FACE_TOKEN` before running anything that touches the dataset build. |
| 17 | + |
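As a minimal sketch of failing fast when the token is missing: the helper name `require_hf_token` is hypothetical and not part of this repo; only the `HUGGING_FACE_TOKEN` variable name comes from the text above.

```python
import os


def require_hf_token(env=None):
    """Return the Hugging Face token, or raise a clear error if it is unset."""
    # Accept an explicit mapping so the check can be exercised without touching os.environ.
    env = os.environ if env is None else env
    token = env.get("HUGGING_FACE_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HUGGING_FACE_TOKEN is not set; export it before running "
            "`make download` or `make data`."
        )
    return token


# Example with an explicit mapping, so the sketch runs without a real token.
print(require_hf_token({"HUGGING_FACE_TOKEN": "hf_example"}))
```

Calling a check like this at the top of a script surfaces a missing token immediately, rather than partway through a long build.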
| 18 | +## What lives here |
| 19 | + |
| 20 | +This repo builds the `.h5` files that feed `policyengine-uk`: |
| 21 | + |
| 22 | +- `datasets/frs.py` — raw FRS → PolicyEngine variable mapping |
| 23 | +- `datasets/imputations/` — QRF / other imputations layered on top (income, wealth, consumption, etc.) |
| 24 | +- `datasets/local_areas/` — constituency and local-authority calibration |
| 25 | +- `targets/` — calibration target sources (OBR, DWP, HMRC, ONS, SLC, etc.) |
| 26 | +- `utils/calibrate.py` — the reweighting optimiser |
| 27 | +- `storage/` — raw inputs, intermediate artefacts, published outputs |
| 28 | + |
| 29 | +## Data-protection rules — no exceptions |
| 30 | + |
| 31 | +The enhanced FRS dataset is licensed under strict UK Data Service terms. Violating them risks losing access, which would end PolicyEngine UK. |
| 32 | + |
| 33 | +- **Never upload data to any public location.** The HuggingFace repo `policyengine/policyengine-uk-data-private` is private and authenticated. |
| 34 | +- **Never modify `upload_completed_datasets.py` or `utils/data_upload.py`** to change upload destinations without explicit confirmation from the data controller (currently Nikhil Woodruff). |
| 35 | +- **Never print, log, or output individual-level records.** Aggregates (sums, means, counts, weighted totals) are fine; individual rows are not. |
| 36 | +- **If you see a private/public repo split, assume it is intentional** — ask why before changing it. |
| 37 | + |
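To illustrate the aggregates-only rule above, here is a small sketch using synthetic numbers, not real FRS records. The function name `weighted_summary` and the variable names are hypothetical; the point is that only sums, means, and counts leave the function, never individual rows.

```python
def weighted_summary(values, weights):
    """Return aggregate statistics only; the underlying rows are never returned."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    weighted_total = sum(v * w for v, w in zip(values, weights))
    weight_sum = sum(weights)
    return {
        "count": len(values),
        "weighted_total": weighted_total,
        "weighted_mean": weighted_total / weight_sum,
    }


# Synthetic stand-in data (assumption: one value and one weight per household).
income = [10_000.0, 20_000.0, 30_000.0]
household_weight = [1.0, 2.0, 1.0]

# Safe to print or log: aggregates only.
print(weighted_summary(income, household_weight))
```

By contrast, printing `income` itself, or any per-row slice of it, would be outputting individual-level records if the data were real.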
| 38 | +## Updating datasets |
| 39 | + |
| 40 | +If your change is a non-bugfix update to a cloud-hosted dataset (FRS, enhanced FRS), bump both the filename and URL in the class definition and in `storage/upload_completed_datasets.py`. That lets us store historical dataset versions separately and reproducibly. |
| 41 | + |
| 42 | +## Repo-specific anti-patterns |
| 43 | + |
| 44 | +- **Don't** hardcode dataset years in variable transforms; use `dataset.time_period` and the uprating pipeline. |
| 45 | +- **Don't** commit large binary artefacts — use HuggingFace storage. |
| 46 | +- **Don't** skip `make test` when touching the imputation or calibration pipeline. Full CI rebuilds the dataset and takes ~25 minutes, so catching failures locally first is much cheaper.
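
The first anti-pattern above can be sketched as follows. `ExampleDataset` and `tag_with_period` are hypothetical stand-ins for illustration, not this repo's classes; only the `time_period` attribute name comes from the text, and the uprating pipeline is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class ExampleDataset:
    """Hypothetical stand-in for a dataset object exposing `time_period`."""

    time_period: int


def tag_with_period(dataset, values):
    """Attach the dataset's own period to the values instead of a literal year."""
    # Anti-pattern would be: {"period": 2023, ...} with the year written inline.
    return {"period": dataset.time_period, "values": list(values)}


# The same transform works unchanged for any year's dataset.
for year in (2022, 2023):
    print(tag_with_period(ExampleDataset(time_period=year), [1.0, 2.0]))
```

Reading the period from the dataset means a transform does not need editing when a new data year is added.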