Commit 08261c0

Fix major underreporting bug in CA HCD data
This increases CA permits in recent years by quite a bit:

2018: 94,154 -> 97,963
2019: 101,411 -> 104,872
2020: 92,925 -> 99,806
2021: 115,382 -> 121,651
2022: 117,134 -> 127,964
2023: 111,421 -> 123,861
2024: 93,206 -> 104,306

The cause was very-low income (non-deed-restricted) and low-income (deed-restricted) projects being dropped. The technical reason was a very dumb one: those columns had a mix of numbers and strings, and .sum(numeric_only=True) drops columns that contain non-numeric values.

I'm not sure how long this has been an issue. I think when I added the pd.to_numeric(df["BP_ABOVE_MOD_INCOME"], errors="coerce") logic, above moderate income was the only column that had strings, so I only cast that column to numeric. But at some point these other two columns also started including string values. I could look through the history of housing-data-data to figure out exactly when if I wanted to.

There is still a discrepancy between my numbers and the CA HCD dashboard's numbers (https://www.hcd.ca.gov/housing-open-data-tools/apr-dashboard). Their numbers are still significantly higher:

2018: 114,513
2019: 121,428
2020: 112,314
2021: 134,619
2022: 136,578
2023: 133,568
2024: 114,930

I'll keep digging into the remaining discrepancy. It is most likely caused by my duplicate filtering logic (filtering out duplicate rows from projects that received both a permit and a COO in the same year). Hopefully my logic is still filtering out duplicates correctly and not unintentionally dropping real permit rows.
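The underreporting mechanism described above can be reproduced on a toy frame. This is a minimal sketch, not the project's code: the column names are taken from the diff, but the values are made up for illustration.

```python
import pandas as pd

# Toy data: values are invented; only the column names come from the diff.
df = pd.DataFrame(
    {
        "BP_ABOVE_MOD_INCOME": [10, 20, 30],
        # Mixed ints and strings, so pandas stores this column as object dtype.
        "BP_LOW_INCOME_DR": [1, "2", "2020-08-02"],
    }
)

# With numeric_only=True the object-dtype column is silently left out of the
# row-wise sum, so its units are never counted.
before = df.sum(axis="columns", numeric_only=True)

# Coerce every column first: parseable strings become numbers,
# unparseable ones (like the date string) become NaN.
fixed = df.copy()
for col in fixed.columns:
    fixed[col] = pd.to_numeric(fixed[col], errors="coerce")

# NaN is skipped by default, so only the unparseable cell is lost.
after = fixed.sum(axis="columns")

print(before.tolist())
print(after.tolist())
```

In this sketch the first sum only sees BP_ABOVE_MOD_INCOME, while the second also counts the low-income units wherever the value can be parsed.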
1 parent cf2a8eb commit 08261c0

1 file changed

Lines changed: 9 additions & 6 deletions

python/housing_data/california_hcd_data.py

@@ -8,6 +8,7 @@
 Because of SB 35 triggers based on the amount of housing permitted, cities have
 a greater incentive to report this data correctly.
 """
+
 from functools import lru_cache
 from pathlib import Path
 from typing import Literal, Optional
@@ -36,12 +37,12 @@ def load_california_hcd_data(
     # BPS doesn't include mobile homes, so we shouldn't include them here either
     df = df[df["UNIT_CAT"] != "MH"].copy()
 
-    # Has some values that are not numbers (e.g. "2020-08-02")
-    df["BP_ABOVE_MOD_INCOME"] = pd.to_numeric(
-        df["BP_ABOVE_MOD_INCOME"], errors="coerce"
-    )
+    # These columns are a mix of string and ints.
+    # They also have some string values that can't be parsed as numbers (e.g. "2020-08-02")
+    for col in ["BP_ABOVE_MOD_INCOME", "BP_VLOW_INCOME_NDR", "BP_LOW_INCOME_DR"]:
+        df[col] = pd.to_numeric(df[col], errors="coerce")
 
-    df["units"] = df[BUILDING_PERMIT_COLUMNS].sum(axis="columns", numeric_only=True)
+    df["units"] = df[BUILDING_PERMIT_COLUMNS].sum(axis="columns")
 
     df = df[
         (df["units"] > 0)
@@ -77,7 +78,9 @@ def load_california_hcd_data(
         None,
     )
 
-    assert df["building_type"].isnull().sum() < 50
+    assert (
+        df["building_type"].isnull().sum() < 60
+    ), f"{df['building_type'].isnull().sum()} is not less than 60"
     df = df[df["building_type"].notnull()]
 
     # Drop rows where YEAR is not parseable as an int
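Since the commit message notes that two more columns began containing strings at some unknown point, one possible safeguard is to check dtypes before summing. The helper below is hypothetical and not part of this commit: `non_numeric_columns` is an invented name, and the sample frame is made up.

```python
import pandas as pd


def non_numeric_columns(df: pd.DataFrame, columns: list[str]) -> list[str]:
    """Return the names in `columns` whose dtype in `df` is not numeric.

    Hypothetical helper for illustration; not part of the repository.
    """
    return [col for col in columns if not pd.api.types.is_numeric_dtype(df[col])]


# Made-up sample: "b" mixes an int and a string, so it is object dtype.
sample = pd.DataFrame({"a": [1, 2], "b": [1, "x"]})
bad = non_numeric_columns(sample, ["a", "b"])
print(bad)
```

An assertion such as `assert not bad, bad`, placed just before the row-wise sum, would fail loudly if another permit column starts carrying strings, instead of letting it drop out of the totals.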
