
Commit 5484a6f

MaxGhenis and claude authored
Add TRACE case study and harden reproducibility artifacts (#315)
* Add TRACE case study writeup for AEA / TRACE grant team

  Working draft describing the PolicyEngine use case for TRACE, prepared after the 2026-04-21 meeting with Lars Vilhuber, Tara Watson, John Sabelhaus, Tim Clark, and Casper. Structured around the reframe that emerged in the meeting: TRACE should wrap PolicyEngine infrastructure (the us-data build pipeline and policyengine.org webapp runs) rather than be embedded in the end-user Python package.

  Covers:
  - Which PolicyEngine surfaces warrant institutional certification
  - The precise claims a TRO lets us make (rules, data, reform, inputs, outputs including per-household frame, institutional attestation)
  - UK data as the strongest TRACE case for us
  - Three concrete implementation workstreams with linked issues
  - What TRACE gets from us as a case study (infrastructure-certifying vs author-certifying; microdata provenance; pe:* extension discipline)
  - Three open questions (per-household frame default, retention and durable addressing, signing and key rotation)

  Lars explicitly asked for this kind of writeup during the meeting to feed the TRACE grant proposal and vocabulary design work.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Apply codex review to TRACE case study: soften UK, add non-scenarios and adjacent work, clarify institution-backed self-attestation

  Codex review of the 2026-04-21 meeting transcript vs. this writeup flagged four issues:

  1. UK was oversold as 'the strongest' or 'only' TRACE case. Transcript supports 'a strong case' but not 'the strongest', and we are considering a recalibrated UK variant that would partly lift the restriction anyway.
  2. Missing explicit non-scenario section. The meeting was emphatic that researcher-laptop TRO emission, transitive dep tracing, and plain version-identification are NOT TRACE's job for us.
  3. Missing adjacent workstreams that came up but are not TRACE-solved: preservation-grade archiving (HuggingFace vs Zenodo), PolicyEngine-specific TRACE vocabulary contribution, and non-TRACE version-identification work (Casper's point).
  4. 'Institutional certification' language oversold what PolicyEngine actually provides. An institution certifying its own runs 'carries technically no difference' from an author certifying their own runs; the value comes from institutional reputation and structured evidence, not from a cryptographic equivalent of arms-length independence.

  Also: back off the per-household frame as 'the highest-value downstream artifact' claim the transcript doesn't support; flag it as open design question. Drop 'transitive Python deps' from the rules-bundle section per transcript explicitly saying TRACE has not built that in. Add three additional open-question items (retention + preservation, key trust model, production-runtime binding) surfaced by codex.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Soften assertive language in TRACE case study per codex review

  Codex flagged that the writeup slipped from 'claims we want to make' into 'claims we can make now' for service-account signatures, durable URLs, and per-household frames: three things the transcript does not actually settle.

  Changes:
  - Reframe section title from 'The precise claims a PolicyEngine TRO lets us make' to 'The claims a PolicyEngine TRO should let us make'. Every present-tense claim about what a TRO 'lets us' do is softened to what a TRO 'would let us' do, conditional on the design questions still being settled.
  - Per-household frame: drop the 'for US runs the TRO includes the full frame' assertion; replace with explicit open-design-question framing. Cite the transcript exchange for traceability.
  - Signing mechanism: remove the claim that a service-account signature is the answer. List service-account + DNS-keychain + Sigstore as options under consideration.
  - Institutional-attestation claim gains a caveat that the service-account signature is 'one implementation, not the only one.'
  - Workstream list for policyengine-api#3485 is rewritten from 'signed by a PolicyEngine service account, persisted to GCS with durable URL' (asserts design decisions that have not been made) to explicitly naming the strawman and the alternatives.
  - The two workstreams the writeup describes gain an explicit live / not-yet-live marker: us-data build TRO emission is live (us-data#746 shipped); webapp-run emission + Cite UI is not (api#3485, app#2830, api#3486).

  The open-questions section already handled this correctly; this change aligns the main body with that section so the writeup is internally consistent.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Harden TRACE reproducibility artifacts

* Format TRACE reproducibility changes

* Update US model surface snapshot

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 4096b5c commit 5484a6f

18 files changed

Lines changed: 481 additions & 105 deletions

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
Added `docs/trace-case-study.md`, a working draft describing the PolicyEngine TRACE use case for Lars Vilhuber (AEA Data Editor) and the TRACE project team. Covers which PolicyEngine surfaces warrant institutional certification, the claims a TRO should let us make, UK data as a strong case, the three concrete workstreams (us-data build TROs, policyengine-api webapp-run TROs, policyengine-app "Cite this result" UI), and open questions we want feedback on.

docs/trace-case-study.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# PolicyEngine as a TRACE case study

_Working draft, April 2026, prepared after a 2026-04-21 meeting with Lars Vilhuber (AEA Data Editor), Tara Watson (Brookings), John Sabelhaus, Tim Clark, and Casper (TRACE project)._

## What TRACE is for, in the PolicyEngine case

TRACE (Transparency Certified) defines a standards-based vocabulary, TROv 0.1 at `https://w3id.org/trace/trov/0.1#`, for documenting analytical artifacts by content hash under a SHACL-validatable JSON-LD grammar. A Transparent Research Object (TRO) binds inputs, code, and outputs so that a reader who cannot re-run the analysis can still verify that a specific set of files produced a specific set of results.

The question we walked into the meeting with was: where in the PolicyEngine stack does TRACE add real value?

The answer we walked out with is narrower and cleaner than what we had been building toward. TRACE is not a feature of the `policyengine` Python package for researchers running simulations on their own hardware. For that use case, readers who want to check a paper's numbers can just `pip install` the same pins and rerun. TRACE in that loop is documentation, not credibility.

TRACE matters in exactly the places where the reader cannot easily re-run the analysis:

1. **The calibrated microdata build.** Each `enhanced_cps_YYYY.h5` that we publish to Hugging Face is derived from inputs that the public cannot all access directly (IRS-PUF requires agreeing to the IRS's terms of use), and the build itself takes hours on Modal with specific GPU configurations. Each release emits a TRO that binds the upstream input fingerprints, the build code, and the output h5 under canonical TROv 0.1. **This is live today** (us-data PR #746 shipped the emission), though cross-linking from the Hugging Face dataset card is still in flight.

2. **Simulation runs through policyengine.org.** When a researcher uses the webapp to score a reform, we run the simulation on our infrastructure against our pinned calibrated data and return the result. A paper that cites that result is asking its readers to trust PolicyEngine's institutional attestation, not to trust that the researcher reproduced a Python pipeline faithfully on their own laptop. A TRO signed by PolicyEngine and served from our infrastructure would make that institutional attestation explicit and machine-verifiable. **This is not yet live**: backend emission is scoped in policyengine-api#3485 and the "Cite this result" UI in policyengine-app#2830, both blocked on a pe.py v4 migration (api#3486, draft in #3487). This document describes the intended shape of the workflow, not its current state.
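To make "binds inputs, code, and outputs by content hash" concrete, here is a minimal sketch, not the emitter that us-data PR #746 shipped. It streams each file through SHA-256 and collects the digests into a small JSON-LD-shaped dictionary. The `artifacts`, `role`, and `name` keys and the `trov:TransparentResearchObject` type label are assumptions for illustration and have not been checked against the TROv 0.1 SHACL shapes; only the namespace IRI comes from the text above.

```python
import hashlib
from pathlib import Path

# Namespace IRI quoted in this document.
TROV_NS = "https://w3id.org/trace/trov/0.1#"


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large h5 files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_sketch_tro(artifacts: dict[str, Path]) -> dict:
    """Return a TRO-like document mapping each artifact role to its digest.

    The key names here are hypothetical, chosen for readability only.
    """
    return {
        "@context": {"trov": TROV_NS},
        "@type": "trov:TransparentResearchObject",  # assumed class label
        "artifacts": [
            {"role": role, "name": path.name, "sha256": sha256_file(path)}
            for role, path in sorted(artifacts.items())
        ],
    }
```

A reader holding the same files can recompute the digests and compare them with the published document without re-running the build that produced them.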

## The claims a PolicyEngine TRO should let us make

Before TRACE, a paper citing a PolicyEngine result could say: "PolicyEngine-US computed an EITC expansion impact of $X using `policyengine-us==1.653.3` and `policyengine-us-data==1.85.2`." The reader had to take it on faith that those versions, run on that reform, actually produced $X, or install the pins and try it themselves, which presumes the researcher's environment was not modified.

A TRO emitted by policyengine.org would let the paper cite a URL instead. That URL would resolve to a JSON-LD document the reader can validate with a stock tool. The artifact set we are designing toward, pinned by SHA-256:

- The **rules bundle**: wheel hashes for `policyengine` and `policyengine-us` at the version resolved at run time. (We do not pin transitive Python dependencies inside the TRO: TRACE has explicitly not built that in, and a verifier who wants to reconstruct the full environment can resolve the declared dependencies against a public index.)
- The **calibrated microdata**: the `enhanced_cps_2024.h5` SHA-256 and the `DataReleaseManifest` that describes how it was built.
- The **reform**: the full reform JSON submitted by the user, content-hashed.
- The **inputs**: for a household-level simulation, the household JSON the user entered; for an economy-wide simulation, the configuration payload.
- The **outputs**: a content-hashed `results.json` carrying the aggregate metrics the webapp displays. Whether to *also* bind a full per-household weighted simulation frame is an open design question (see below): it would enable downstream custom splits without re-running the simulation, at a file-size and privacy-posture cost that varies by country.
- The **institutional attestation**: CI/deploy run URL, git SHA, cloud region, timestamp, and a cryptographic signature. The signing mechanism is not yet settled (see open questions); options under consideration include a GCP workload-identity short-lived signature, a published keychain rooted in a DNS TXT record at policyengine.org, or a Sigstore-style transparency log.

Claims we believe such a TRO *should* support, in plain language:

1. _These were the rules, this was the calibrated microdata, and these were the inputs that produced those outputs._ This is the artifact-composition claim; TROv core supports it.
2. _PolicyEngine as an institution ran this simulation; the researcher did not modify the code between our servers and their paper._ This requires the institutional-attestation design to be nailed down. The service-account signature we envision is one implementation; it is not the only one.
3. _Any future reader can recover the full per-household counterfactual frame for re-analysis, bounded only by what we legally can redistribute._ This depends on the per-household-frame default-or-opt-in design question below.

The per-household frame question deserves a specific flag: whether the webapp TRO binds the full per-household counterfactual frame by default, or only on request, is unsettled. Papers cite aggregates, while reviewers and follow-up work want distributions, state-level breakdowns, and variables the paper did not headline. But an always-on full frame has file-size and privacy-posture costs, especially in restricted-data countries. We intend to make the trade-off deliberately rather than defaulting to either extreme. Transcript note: this came up in the meeting (Sabelhaus on what the microdata contains beyond the summary, Max on whether the full frame belongs in a TRO); no consensus on "default-on" emerged.

One framing point worth being careful about: what PolicyEngine provides is *institution-backed self-attestation*, not arms-length third-party certification. The arms-length property (that the verifier of a claim is structurally independent of the party being audited) is genuinely absent when PolicyEngine both runs the simulation and signs the TRO. What the TRO buys in that case is structured evidence that a reader (or a reviewer) can query, backed by institutional reputation, not cryptographic independence. That is a real step up from "trust me, I ran it", but we should not market it as more than it is.
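The reform and input payloads in the artifact set above are JSON, and "content-hashed" only means something if two semantically identical payloads hash identically. A minimal sketch follows, assuming a simple canonicalization (sorted keys, compact separators, UTF-8). This document does not say which canonicalization the webapp TRO will use, so treat the scheme and the helper names `content_hash` and `find_mismatches` as hypothetical.

```python
import hashlib
import json


def canonical_json_bytes(payload: object) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace (assumed scheme)."""
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def content_hash(payload: object) -> str:
    """SHA-256 over the canonical serialization of a JSON-compatible payload."""
    return hashlib.sha256(canonical_json_bytes(payload)).hexdigest()


def find_mismatches(pinned: dict[str, str], payloads: dict[str, object]) -> list[str]:
    """Return artifact names whose recomputed hash is missing or differs from the pin."""
    return sorted(
        name
        for name, expected in pinned.items()
        if name not in payloads or content_hash(payloads[name]) != expected
    )


# Made-up reform payload; the parameter path is a placeholder, not a real one.
reform = {"example.parameter.path": {"2026-01-01.2026-12-31": 1000}}
pins = {"reform": content_hash(reform)}
print(find_mismatches(pins, {"reform": reform}))
```

An empty list means every pinned payload was reproduced exactly. A file-level artifact such as `results.json` could instead be hashed over its raw bytes, which avoids the canonicalization question at the cost of being sensitive to formatting.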

## UK data as a strong case for TRACE

In our US work the underlying calibrated h5 is already public on Hugging Face, so a local rerun is in principle possible. That weakens the TRACE value proposition for the US: a reader motivated enough to verify could just `pip install` the pins and try it themselves. The TRO still buys institutional attestation (the researcher did not modify the code), but re-running is not materially blocked.

In our UK work the underlying microdata is UK Data Service-licensed and cannot be redistributed. A researcher who wants to verify a UK PolicyEngine result cannot re-run it on their own machine on any reasonable timescale, because they cannot acquire the inputs easily. Institutional attestation is a particularly strong credibility path here, which is why the meeting flagged this kind of scenario as where TRACE adds the most value.

One caveat worth naming explicitly: we are considering publishing a re-calibrated UK variant derived entirely from public-use inputs, which would partially lift the restriction. If that lands, the US and UK cases converge again. Separately, the TRACE project's own plans for external-identifier pinning (UKDS study number + checksum, IRS-PUF agreement number + checksum), not yet firmed up in TROv at time of writing, would provide an even cleaner mechanism for binding restricted-input provenance without redistribution.
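Since external-identifier pinning is not yet firmed up in TROv, the following is only a sketch of the idea, under our own assumptions about its shape: a record that names the issuing authority, an identifier, and a checksum, so that a licensed holder can confirm they have the same file without anyone redistributing it. The class name, field names, and identifier value are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalInputRef:
    """Hypothetical pointer to a restricted input that is never shipped with the TRO."""

    authority: str   # issuing body, e.g. "UKDS" or "IRS"
    identifier: str  # study or agreement number (placeholder values only)
    sha256: str      # checksum of the exact file used in the build

    def matches(self, local_bytes: bytes) -> bool:
        """True when a locally held licensed copy is byte-identical to the pinned one."""
        return hashlib.sha256(local_bytes).hexdigest() == self.sha256


licensed_copy = b"stand-in bytes for a licensed file"
ref = ExternalInputRef(
    authority="UKDS",
    identifier="STUDY-NUMBER-PLACEHOLDER",
    sha256=hashlib.sha256(licensed_copy).hexdigest(),
)
print(ref.matches(licensed_copy))
```

The point of the design is that the TRO carries only the pointer and the checksum, so a verifier with their own licensed copy can still check provenance.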

## What is explicitly NOT a TRACE case for us

It is worth being equally clear about where TRACE does *not* add value for PolicyEngine, so we do not accidentally scope it there:

- **A researcher running `policyengine.py` locally and emitting their own TRO.** Readers can `pip install` the same pins and rerun themselves. A TRO there is bookkeeping, not a credibility upgrade. The TRO emission helpers in `policyengine.py` exist because they are reused by the two cases above, not because local emission is the flagship user experience.
- **Tracing transitive Python dependencies.** TRACE has, per the meeting, explicitly not built this in, and we should not either. The code documents its declared dependencies; a verifier can resolve them against a public index.
- **Anything that replaces plain version-and-vintage identification.** Much of what matters for reproducibility is just showing "they used that file with that version." That is documentation, not TRACE, and it is often enough on its own, especially for researchers running the Python package against public-use inputs.

## Adjacent workstreams TRACE does not cover

Several reproducibility commitments came up in the meeting that are TRACE-adjacent rather than TRACE-solved. Flagging them so they do not get lost:

- **Preservation-grade archiving.** Hugging Face, where our calibrated h5 artifacts are hosted today, does not publish a preservation commitment comparable to Zenodo or a CLOCKSS / LOCKSS participant. For a TRO citation URL to be durable decades from now, the artifacts it pins need to live somewhere with an explicit long-term preservation policy. Zenodo as a secondary / mirror target is worth serious consideration.
- **PolicyEngine-specific TRACE vocabulary contribution.** We already use `pe:*` extension fields; as we implement and find patterns that generalize (e.g., institution-backed self-attestation, microdata-build provenance, infrastructure-run attestation), contributing those upstream to TROv vocabulary design is in scope.
- **Plain version-identification work outside TRACE.** Version badges, shareable permalinks that resolve to the same numbers, a "why did this number move?" diff view between release pairs. These are separate deliverables that are on our app roadmap; TRACE is not the right frame for them.

Both external-identifier pinning and OS / compute-environment capture are on the TRACE roadmap and would help when they land. We will adopt them as they ship.

## What PolicyEngine is building in response

Three concrete workstreams, each tracked as a GitHub issue:

- **`policyengine-us-data`**: each `enhanced_cps_YYYY.h5` release already emits a build TRO. We will verify these TROs are published alongside the h5 and cross-linked from the Hugging Face dataset card so they are discoverable. (us-data PR #746 shipped the emission; issue #808 addresses a parallel licensing-documentation correction.)
- **`policyengine-api`**: emit a TRACE TRO for every webapp simulation run. The exact signing mechanism and persistence store are open design questions: service-account + GCS is the current strawman, but a Zenodo / Sigstore / DNS-rooted-keychain alternative is under consideration, especially for long-term durability. (Issue #3485; prerequisite v4 migration in #3487.)
- **`policyengine-app`**: surface the TRO as a "Cite this result" action with a citation download panel, an always-visible rules-vs-data version badge so the "rules changed or data changed?" question is answerable at a glance, and shareable permalinks that resolve the same numbers forever. (Issue #2830, blocked on the api work.)

Documentation for researchers is being updated (household-api-docs PR #7) to put the webapp-run citation flow ahead of the local-Python-CLI flow, matching the framing that emerged in the meeting.

## What TRACE gets from us as a case study

A few things we think are worth surfacing to the TRACE project directly:

1. **A use case that is infrastructure-certifying, not author-certifying.** The canonical TRACE scenario is a researcher bundling their code and data. Ours is a web service signing runs on behalf of researchers. The distinction matters for how institutional attestation gets represented in the vocabulary and for what SHACL shapes reject.
2. **Microdata provenance as a first-class artifact class.** Our build pipeline takes hours on specialized hardware and draws on half a dozen upstream sources with varying access levels. The TROv concept of `ArtifactComposition` handles this well, but concrete experience with a working microsimulation build may be useful input as the vocabulary evolves.
3. **A live stress test for `pe:*` extension discipline.** We have a working example of mapping institutionally-specific certification metadata (`pe:certifiedForModelVersion`, `pe:compatibilityBasis`, `pe:emittedIn`, `pe:ciRunUrl`, `pe:ciGitSha`) onto the TRACE core without polluting TROv shapes. If any of those generalize, we would contribute them upstream.

We will keep notes as the implementation proceeds. The TRACE team is welcome to any of this material as part of their grant work.
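As a rough picture of what `pe:*` extension discipline means in practice, the sketch below keeps PolicyEngine-specific metadata under its own prefix, so a consumer that only understands the core vocabulary can drop it cleanly. The five `pe:` field names come from the list above; the `pe` namespace IRI, the node layout, the `trov:sha256` key, and every value are placeholders we made up for illustration.

```python
PE_PREFIX = "pe:"

# Illustrative node; values and the "pe" namespace IRI are placeholders.
node = {
    "@context": {
        "trov": "https://w3id.org/trace/trov/0.1#",
        "pe": "https://example.org/pe-extension#",  # hypothetical IRI
    },
    "@type": "trov:TransparentResearchObject",  # assumed class label
    "trov:sha256": "0" * 64,                    # assumed property name
    "pe:certifiedForModelVersion": "1.667.1",
    "pe:compatibilityBasis": "placeholder-basis",
    "pe:emittedIn": "placeholder-pipeline",
    "pe:ciRunUrl": "https://example.org/ci/run/1",
    "pe:ciGitSha": "5484a6f",
}


def split_extension_fields(document: dict) -> tuple[dict, dict]:
    """Separate core keys from pe:-prefixed extension keys without altering either."""
    core = {k: v for k, v in document.items() if not k.startswith(PE_PREFIX)}
    extension = {k: v for k, v in document.items() if k.startswith(PE_PREFIX)}
    return core, extension


core, extension = split_extension_fields(node)
print(sorted(extension))
```

Keeping the split mechanical like this is one way to check that no PolicyEngine-specific field has leaked into the part of the document that core shapes validate.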

## Open questions

- **Per-household frame as default or opt-in.** The meeting did not reach consensus on this; we flagged it as unsettled. Default-on has downstream-analysis utility but file-size and privacy-posture costs. Default-off makes TROs smaller but forces downstream researchers to rerun the simulation for any custom split. The design choice should be made deliberately, with trade-offs listed, not defaulted to either extreme.
- **Retention and addressing of webapp-run TROs.** These become permanent citations. Commitments needed on durable URLs, content-addressing, migration policy for storage-provider changes, and whether we ever prune. Zenodo as a secondary / mirror target is worth serious consideration: Hugging Face does not publish a preservation commitment, and a TRO URL that 404s in 2040 on a host we do not control is a worse outcome than one that 404s in a PolicyEngine-controlled bucket.
- **Signing key and key trust model.** A PolicyEngine service-account signature is straightforward to implement; the harder question is how a reader in 2040 verifies the signature belongs to PolicyEngine. Options include a published keychain rooted in a DNS TXT record, a Sigstore-style transparency log, or GCP workload-identity with short-lived signatures. Chain-of-trust design deserves more thought than "we sign it with a service account."
- **Binding to the actual production runtime.** CI run URL + git SHA documents how the container that ran the simulation was *built*. The TRO should additionally bind the running container image SHA, cloud region, and pod / function instance at execution time. Otherwise the TRO only attests to a build, not a run.

Feedback welcomed from Lars, Tim, Casper, Tara, John, and anyone else reading.
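The build-versus-run distinction in the last open question can be stated as a small data model. This is a sketch under our own assumptions, not a schema anyone has agreed: the build fields mirror the CI run URL and git SHA named above, the run fields mirror the image SHA, region, and instance, and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Attestation:
    """Hypothetical attestation record separating build-time from run-time evidence."""

    # Build-time evidence: how the container was produced.
    ci_run_url: str
    git_sha: str
    # Run-time evidence: what actually executed the simulation.
    image_digest: Optional[str] = None
    cloud_region: Optional[str] = None
    instance_id: Optional[str] = None

    def scope(self) -> str:
        """Return "run" only when every run-time field is present, else "build"."""
        run_fields = (self.image_digest, self.cloud_region, self.instance_id)
        return "run" if all(run_fields) else "build"


build_only = Attestation(ci_run_url="https://example.org/ci/run/1", git_sha="5484a6f")
full = Attestation(
    ci_run_url="https://example.org/ci/run/1",
    git_sha="5484a6f",
    image_digest="sha256:" + "0" * 64,
    cloud_region="placeholder-region",
    instance_id="placeholder-instance",
)
print(build_only.scope(), full.scope())
```

Making the scope explicit would let a verifier reject a TRO that claims to attest a run while carrying only build evidence.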

pyproject.toml

Lines changed: 2 additions & 2 deletions
@@ -46,7 +46,7 @@ uk = [
 ]
 us = [
     "policyengine_core>=3.25.0",
-    "policyengine-us==1.653.3",
+    "policyengine-us==1.667.1",
 ]
 dev = [
     "pytest",
@@ -61,7 +61,7 @@ dev = [
     "ruff>=0.9.0",
     "policyengine_core>=3.25.0",
     "policyengine-uk==2.88.0",
-    "policyengine-us==1.653.3",
+    "policyengine-us==1.667.1",
     "towncrier>=24.8.0",
     "mypy>=1.11.0",
     "pytest-cov>=5.0.0",

scripts/generate_trace_tros.py

Lines changed: 8 additions & 16 deletions
@@ -3,14 +3,9 @@
 Writes ``data/release_manifests/{country}.trace.tro.jsonld`` for each
 country whose bundled manifest ships in the wheel. Run this before
 releasing a new ``policyengine.py`` version so the packaged TRO
-matches the pinned bundle. Requires HTTPS access to the data release
-manifest (and ``HUGGING_FACE_TOKEN`` for private country data).
-
-If a country previously had a TRO on disk and the new run cannot
-regenerate it (e.g. a missing secret or an unreachable HF endpoint),
-the script exits non-zero so the release workflow blocks rather than
-silently shipping a stale/missing TRO. If no bundled release manifests
-are found at all, the script exits 0 with a notice (nothing to do).
+matches the pinned bundle. The richer data release manifest is included
+when available; otherwise the TRO still binds the certified dataset
+sha256 and URI pinned in the bundled release manifest.
 """

 from __future__ import annotations
@@ -47,14 +42,11 @@ def regenerate_all() -> tuple[list[Path], list[tuple[str, Path, str]]]:
         try:
             data_release_manifest = get_data_release_manifest(country_id)
         except DataReleaseManifestUnavailableError as exc:
-            if tro_path.exists():
-                regressions.append((country_id, tro_path, str(exc)))
-            else:
-                print(
-                    f"skipped {country_id}: {exc}",
-                    file=sys.stderr,
-                )
-            continue
+            data_release_manifest = None
+            print(
+                f"warning: {country_id}: {exc}; writing limited TRO",
+                file=sys.stderr,
+            )
         tro = build_trace_tro_from_release_bundle(
             country_manifest,
             data_release_manifest,
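The hunk above swaps "skip or flag a regression" for "warn and continue with a limited TRO". The same pattern, pulled out into a standalone helper so it can be exercised without the package installed, might look like the sketch below. The exception class here is a local stand-in for the one imported from `policyengine.provenance.manifest`, and `load_manifest_or_none` is a hypothetical name, not a function in the repository.

```python
import sys
from typing import Callable, Optional


class DataReleaseManifestUnavailableError(Exception):
    """Local stand-in for the exception of the same name in the package."""


def load_manifest_or_none(
    country_id: str,
    fetch: Callable[[str], dict],
) -> Optional[dict]:
    """Return the data release manifest, or None (with a warning) if it is unavailable."""
    try:
        return fetch(country_id)
    except DataReleaseManifestUnavailableError as exc:
        # Mirrors the warning text in the diff: the caller still writes a TRO,
        # just without the richer data release manifest.
        print(
            f"warning: {country_id}: {exc}; writing limited TRO",
            file=sys.stderr,
        )
        return None
```

Only the one expected exception is caught; any other failure in `fetch` still propagates, so the fallback does not mask unrelated bugs.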

src/policyengine/cli.py

Lines changed: 5 additions & 1 deletion
@@ -19,6 +19,7 @@
 from typing import Optional, Sequence

 from policyengine.provenance.manifest import (
+    DataReleaseManifestUnavailableError,
     get_data_release_manifest,
     get_release_manifest,
 )
@@ -69,7 +70,10 @@ def _parser() -> argparse.ArgumentParser:

 def _emit_bundle_tro(country_id: str, out: Optional[Path]) -> int:
     country_manifest = get_release_manifest(country_id)
-    data_release_manifest = get_data_release_manifest(country_id)
+    try:
+        data_release_manifest = get_data_release_manifest(country_id)
+    except DataReleaseManifestUnavailableError:
+        data_release_manifest = None
     tro = build_trace_tro_from_release_bundle(
         country_manifest,
         data_release_manifest,

src/policyengine/core/tax_benefit_model_version.py

Lines changed: 10 additions & 5 deletions
@@ -7,6 +7,7 @@
 from policyengine.provenance.manifest import (
     CountryReleaseManifest,
     DataCertification,
+    DataReleaseManifestUnavailableError,
     PackageVersion,
     get_data_release_manifest,
 )
@@ -214,16 +215,20 @@ def release_bundle(self) -> dict[str, Optional[str]]:
     def trace_tro(self) -> dict:
         """Build a TRACE TRO for this certified bundle.

-        Fetches the published data release manifest so the TRO can pin
-        the exact dataset sha256. Requires a bundled release manifest.
+        Uses the published data release manifest when available. If it
+        has not been published, the TRO falls back to the certified
+        dataset sha256 and URI pinned in the bundled release manifest.
         """
         if self.release_manifest is None:
             raise ValueError(
                 "TRACE TRO export requires a bundled country release manifest."
             )
-        data_release_manifest = get_data_release_manifest(
-            self.release_manifest.country_id
-        )
+        try:
+            data_release_manifest = get_data_release_manifest(
+                self.release_manifest.country_id
+            )
+        except DataReleaseManifestUnavailableError:
+            data_release_manifest = None
         return build_trace_tro_from_release_bundle(
             self.release_manifest,
             data_release_manifest,
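The diff does not show how `build_trace_tro_from_release_bundle` treats a `None` data release manifest. One reading consistent with the new docstrings is sketched below: the dataset pin always comes from the bundled release manifest, and the richer manifest is attached only when it exists. The function name `resolve_dataset_pin` and the dictionary keys are hypothetical, and the real builder may be structured differently.

```python
from typing import Optional


def resolve_dataset_pin(bundled: dict, data_release: Optional[dict]) -> dict:
    """Pick the dataset pin for a TRO, attaching richer provenance when available.

    `bundled` stands in for the certified dataset entry in the bundled release
    manifest; `data_release` stands in for the published data release manifest.
    """
    pin = {
        "sha256": bundled["sha256"],
        "uri": bundled["uri"],
        "provenance": "bundled-release-manifest",
    }
    if data_release is not None:
        pin["provenance"] = "data-release-manifest"
        pin["build_details"] = data_release
    return pin


bundled_entry = {"sha256": "0" * 64, "uri": "https://example.org/enhanced_cps_2024.h5"}
limited = resolve_dataset_pin(bundled_entry, None)
rich = resolve_dataset_pin(bundled_entry, {"build_id": "placeholder"})
print(limited["provenance"], rich["provenance"])
```

Under this reading a limited TRO pins the same dataset bytes as a full one; what it loses is the description of how those bytes were built.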
