PUF clone half receives zero final weight in released Enhanced CPS #1139

@MaxGhenis

Summary

The released enhanced_cps_2024.h5 contains the PUF clone half, but the final calibrated household_weight assigns it exactly zero total weight. This means the PUF-imputed half is effectively unused in the final ECPS, despite being present in the file.

Live artifact checked

  • policyengine-us-data package version: 1.115.4
  • HuggingFace artifact: policyengine/policyengine-us-data/enhanced_cps_2024.h5
  • HF snapshot: 0c1409119fe197f4604a0e125999f8ebd3c73a21

Result

total_household_weight=161,309,969.280334
cps_household_weight=161,309,969.280334
puf_clone_household_weight=0.000000
cps_share=100.000000%
puf_clone_share=0.000000%
household_rows=41,314
cps_rows=20,657
puf_clone_rows=20,657
cps_positive_weight_rows=9,343
puf_positive_weight_rows=0
puf_max_weight=0.000000

Reproduction

from importlib.metadata import version
import h5py
from huggingface_hub import hf_hub_download

print(version("policyengine-us-data"))
path = hf_hub_download(
    repo_id="policyengine/policyengine-us-data",
    filename="enhanced_cps_2024.h5",
)

def read_period(f, var, period="2024"):
    # Handle both layouts: a flat dataset, or a group keyed by period.
    obj = f[var]
    if isinstance(obj, h5py.Dataset):
        return obj[:]
    return obj[period][:]

with h5py.File(path, "r") as f:
    weight = read_period(f, "household_weight").astype(float)
    clone = read_period(f, "household_is_puf_clone").astype(bool)

print("total", weight.sum())
print("cps", weight[~clone].sum())
print("puf_clone", weight[clone].sum())
print("puf positive rows", (weight[clone] > 0).sum(), "/", clone.sum())
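
The remaining fields in the Result block can be derived from the same two arrays. The helper below is a minimal sketch of one way to do that; `summarize_weights` is a hypothetical name, not part of policyengine-us-data, and it accepts any sequences (including the NumPy arrays returned by h5py).

```python
def summarize_weights(weight, clone):
    """Summarize household weight split by CPS rows vs. PUF-clone rows."""
    pairs = [(float(w), bool(c)) for w, c in zip(weight, clone)]
    cps = [w for w, c in pairs if not c]
    puf = [w for w, c in pairs if c]
    total = sum(cps) + sum(puf)
    return {
        "total_household_weight": total,
        "cps_household_weight": sum(cps),
        "puf_clone_household_weight": sum(puf),
        # Shares are undefined when the total weight is zero.
        "cps_share": sum(cps) / total if total else float("nan"),
        "puf_clone_share": sum(puf) / total if total else float("nan"),
        "household_rows": len(pairs),
        "cps_rows": len(cps),
        "puf_clone_rows": len(puf),
        "cps_positive_weight_rows": sum(w > 0 for w in cps),
        "puf_positive_weight_rows": sum(w > 0 for w in puf),
        "puf_max_weight": max(puf, default=0.0),
    }


# Tiny illustrative input (not real data): two CPS rows, two clone rows.
summary = summarize_weights([2.0, 0.0, 0.0, 0.0], [False, False, True, True])
for key, value in summary.items():
    print(f"{key}={value}")
```

Calling `summarize_weights(weight, clone)` with the arrays loaded above should report the same quantities as the Result block, assuming the file has not changed since the snapshot listed.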

Why this matters

This likely explains why PUF-only or PUF-heavy tax variables can remain far below administrative targets in the final ECPS. For example, local diagnostics showed the PUF source has donors with high long-term capital gains (LTCG), but the final calibrated ECPS places zero weight on the PUF clone half, so those records contribute nothing to weighted totals.

Suspected cause

puf_clone_dataset() intentionally creates the clone half with zero household weight. initialize_weight_priors() then turns zero weights into near-zero priors (~1e-6). Since reweighting optimizes weights in log space, those rows appear to be effectively unable to gain meaningful national weight.
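
To illustrate why a near-zero prior can stay stuck under log-space optimization, here is a toy sketch. It does not reproduce the actual loss or optimizer used in the reweighting step; `fit_log_weights` is a hypothetical function using plain gradient descent on a single sum target. The point is the chain rule: with w = exp(theta), the gradient with respect to theta is scaled by w itself, so a row starting at 1e-6 takes steps roughly a million times smaller than a row starting at 1.

```python
import math


def fit_log_weights(priors, target, lr=0.1, steps=200):
    """Gradient descent on theta = log(w) for loss (sum(w) - target)**2."""
    theta = [math.log(p) for p in priors]
    for _ in range(steps):
        w = [math.exp(t) for t in theta]
        residual = sum(w) - target
        # Chain rule: dL/dtheta_i = 2 * residual * w_i, so the step scales with w_i.
        theta = [t - lr * 2.0 * residual * wi for t, wi in zip(theta, w)]
    return [math.exp(t) for t in theta]


# Row 0 mimics a CPS row (prior 1.0); row 1 mimics a clone row (prior ~1e-6).
final = fit_log_weights(priors=[1.0, 1e-6], target=2.0)
print("cps-like weight:", final[0])
print("clone-like weight:", final[1])
```

In this toy, the row with the ordinary prior absorbs essentially all of the adjustment needed to hit the target, and the residual reaches zero before the near-zero row has moved appreciably. Whether the real calibration behaves the same way depends on its loss, regularization, and optimizer, which are not shown here.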

Fix direction

Zero-weight clone households need meaningful positive prior mass before reweighting, while retaining calibration constraints strong enough to prevent PUF-heavy variables from exploding. A simple local diagnostic that split initial prior mass 50% CPS / 50% PUF-clone did make clone rows usable, but produced an unstable full rebuild: clone rows received 59.8% of final household weight and 2024 long-term capital gains rose to about 87x the current SOI target. So the fix should combine positive clone priors with validation/guardrails for aggregate targets, especially capital gains.
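
The two ingredients above (positive clone priors plus an aggregate guardrail) could be sketched as follows. Both function names are hypothetical, and the uniform prior across clone rows, the `clone_share` parameter, and the `max_ratio` tolerance are assumptions made for illustration rather than the method used in the local diagnostic.

```python
def split_prior_mass(weights, is_clone, clone_share):
    """Rescale priors so clone rows hold `clone_share` of total prior mass."""
    if not 0.0 < clone_share < 1.0:
        raise ValueError("clone_share must be strictly between 0 and 1")
    total = sum(weights)
    cps_total = sum(w for w, c in zip(weights, is_clone) if not c)
    n_clone = sum(1 for c in is_clone if c)
    if cps_total <= 0 or n_clone == 0:
        raise ValueError("need positive CPS weight and at least one clone row")
    # Assumption: clone mass is spread uniformly across clone rows.
    clone_prior = total * clone_share / n_clone
    # CPS rows keep their relative weights but shrink to make room.
    cps_scale = total * (1.0 - clone_share) / cps_total
    return [clone_prior if c else w * cps_scale for w, c in zip(weights, is_clone)]


def within_guardrail(values, weights, target, max_ratio):
    """Return (ok, ratio), where ratio is the weighted total divided by target."""
    estimate = sum(v * w for v, w in zip(values, weights))
    ratio = estimate / target
    return (1.0 / max_ratio) <= ratio <= max_ratio, ratio


# Illustrative numbers only: two CPS rows, two zero-weight clone rows.
priors = split_prior_mass(
    [100.0, 300.0, 0.0, 0.0], [False, False, True, True], clone_share=0.1
)
ok, ratio = within_guardrail(
    [0.0, 0.0, 50.0, 50.0], priors, target=1000.0, max_ratio=1.5
)
print(priors, ok, ratio)
```

A check like `within_guardrail` would be run on the final calibrated weights rather than the priors in practice; it is applied to the priors here only to keep the sketch short. A rebuild where long-term capital gains land at roughly 87x the target would fail any reasonable `max_ratio`.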
