Fix major underreporting bug in CA HCD data

sid-kap · sid-kap · commit 08261c0a50f1 · 2026-01-25T01:43:20.000-05:00
This increases CA permits in recent years by quite a bit:

2018: 94,154 -> 97,963
2019: 101,411 -> 104,872
2020: 92,925 -> 99,806
2021: 115,382 -> 121,651
2022: 117,134 -> 127,964
2023: 111,421 -> 123,861
2024: 93,206 -> 104,306

The cause was very-low-income (non-deed-restricted) and low-income (deed-restricted) projects being dropped. The technical reason was very dumb: those columns had a mix of numbers and strings, and .sum(numeric_only=True) drops columns that contain non-numeric values.

I'm not sure how long this has been an issue. I think when I added the pd.to_numeric(df["BP_ABOVE_MOD_INCOME"], errors="coerce") logic, above moderate income was the only column that had strings, so I only cast that column to numeric. But at some point these other two columns also started including string values. I could look through the history of housing-data-data to figure out exactly when if I wanted to.

There is still a discrepancy between my numbers and the CA HCD dashboard's numbers (https://www.hcd.ca.gov/housing-open-data-tools/apr-dashboard). Their numbers are still significantly higher:

2018: 114,513
2019: 121,428
2020: 112,314
2021: 134,619
2022: 136,578
2023: 133,568
2024: 114,930

I'll keep digging into the remaining discrepancy. It is most likely caused by my duplicate filtering logic (filtering out duplicate rows from projects that received both a permit and a COO in the same year). Hopefully my logic is still filtering out duplicates correctly and not unintentionally dropping real permit rows.
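A minimal sketch of the failure mode described above, using a toy DataFrame rather than the real HCD data. The column names "clean" and "mixed" are made up for illustration; only the pandas calls match what the loader uses.

```python
import pandas as pd

# Toy frame; the column names are hypothetical, not real HCD columns.
df = pd.DataFrame(
    {
        "clean": [1, 2, 3],
        "mixed": [10, "20", "2020-08-02"],  # object dtype: ints mixed with strings
    }
)

# numeric_only=True silently skips the whole object-dtype column,
# so the "mixed" values never reach the row totals.
before = df.sum(axis="columns", numeric_only=True)

# Coerce to numeric first; unparseable strings become NaN, which sum() skips.
df["mixed"] = pd.to_numeric(df["mixed"], errors="coerce")
after = df.sum(axis="columns")

print(before.tolist())  # only the "clean" column is counted
print(after.tolist())   # both columns are counted; the date string contributes nothing
```

The undercount shows up as a difference between the two totals, without any error or warning, which is probably why it went unnoticed.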
diff --git a/python/housing_data/california_hcd_data.py b/python/housing_data/california_hcd_data.py
@@ -8,6 +8,7 @@
 Because of SB 35 triggers based on the amount of housing permitted, cities have
 a greater incentive to report this data correctly.
 """
+
 from functools import lru_cache
 from pathlib import Path
 from typing import Literal, Optional
@@ -36,12 +37,12 @@ def load_california_hcd_data(
     # BPS doesn't include mobile homes, so we shouldn't include them here either
     df = df[df["UNIT_CAT"] != "MH"].copy()
 
-    # Has some values that are not numbers (e.g. "2020-08-02")
-    df["BP_ABOVE_MOD_INCOME"] = pd.to_numeric(
-        df["BP_ABOVE_MOD_INCOME"], errors="coerce"
-    )
+    # These columns are a mix of string and ints.
+    # They also have some string values that can't be parsed as numbers (e.g. "2020-08-02")
+    for col in ["BP_ABOVE_MOD_INCOME", "BP_VLOW_INCOME_NDR", "BP_LOW_INCOME_DR"]:
+        df[col] = pd.to_numeric(df[col], errors="coerce")
 
-    df["units"] = df[BUILDING_PERMIT_COLUMNS].sum(axis="columns", numeric_only=True)
+    df["units"] = df[BUILDING_PERMIT_COLUMNS].sum(axis="columns")
 
     df = df[
         (df["units"] > 0)
@@ -77,7 +78,9 @@ def load_california_hcd_data(
         None,
     )
 
-    assert df["building_type"].isnull().sum() < 50
+    assert (
+        df["building_type"].isnull().sum() < 60
+    ), f"{df['building_type'].isnull().sum()} is not less than 60"
     df = df[df["building_type"].notnull()]
 
     # Drop rows where YEAR is not parseable as an int