FEAT: Integrate 2024 EAVS data and stabilize CLC cleaning pipeline#28
yashinlin wants to merge 4 commits into
Adds the clean_2024() function, creates the 2024.yaml file, and updates all year YAMLs (2020, 2022) with CLC priority variables (A-list, C1-C9, E1-E3). This also incorporates robust column filtering and essential pipeline stabilization fixes.
Implements data output into parquet, xlsx, and csv formats for both individual years and combined data. Fixes pipeline execution and updates final column mapping logic.
Introduces the calculate_sha256 utility script to ensure data integrity.
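For reviewers who have not opened the script, here is a minimal sketch of what a checksum helper of this kind might look like. Only the name calculate_sha256 comes from this PR; the signature, the chunked read, and the chunk_size parameter are assumptions for illustration, not necessarily what the script does.

```python
import hashlib
from pathlib import Path


def calculate_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    Hypothetical signature: the file is read in fixed-size chunks so that
    large raw data files do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A checksum recorded at download time can then be compared against a fresh one before cleaning, so a silently changed or truncated source file is caught early.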
Update: I’ve resolved the merge conflicts and integrated timeseries cleaning so it runs via uv run -m eavs.clean. Both per-year and timeseries pipelines were run locally.
tuanpham96 left a comment
I think things generally look OK. There still seem to be issues with the FIPS codes, though.
# Add year column and normalize FIPS
df['year'] = year
df['fips_code'] = df['fips_code'].astype(str).str.zfill(5).str[:5]
I'd suggest adding a more detailed comment here for future reference, since this has been a recurring issue and discussion.
Also, I think there's still a problem with this. Some entries from Wisconsin and Maine have FIPS codes starting with 00. For example, after cleaning, the entries from Adams County in Wisconsin have multiple FIPS codes:
    jurisdiction_name                       state      fips_code
 0  CITY OF ADAMS - ADAMS COUNTY            WISCONSIN  00275
 1  TOWN OF ADAMS - ADAMS COUNTY            WISCONSIN  00300
 2  TOWN OF BIG FLATS - ADAMS COUNTY        WISCONSIN  07300
 3  TOWN OF COLBURN - ADAMS COUNTY          WISCONSIN  16075
 4  TOWN OF DELL PRAIRIE - ADAMS COUNTY     WISCONSIN  19575
 5  TOWN OF EASTON - ADAMS COUNTY           WISCONSIN  22000
 6  VILLAGE OF FRIENDSHIP - ADAMS COUNTY    WISCONSIN  27950
 7  TOWN OF JACKSON - ADAMS COUNTY          WISCONSIN  37625
 8  TOWN OF LEOLA - ADAMS COUNTY            WISCONSIN  43425
 9  TOWN OF LINCOLN - ADAMS COUNTY          WISCONSIN  44250
10  TOWN OF MONROE - ADAMS COUNTY           WISCONSIN  53725
11  TOWN OF NEW CHESTER - ADAMS COUNTY      WISCONSIN  56525
12  TOWN OF NEW HAVEN - ADAMS COUNTY        WISCONSIN  56750
13  TOWN OF PRESTON - ADAMS COUNTY          WISCONSIN  65450
14  TOWN OF QUINCY - ADAMS COUNTY           WISCONSIN  65825
15  TOWN OF RICHFIELD - ADAMS COUNTY        WISCONSIN  67425
16  TOWN OF ROME - ADAMS COUNTY             WISCONSIN  69275
17  TOWN OF SPRINGVILLE - ADAMS COUNTY      WISCONSIN  76350
18  TOWN OF STRONGS PRAIRIE - ADAMS COUNTY  WISCONSIN  77800
According to Wikipedia, the FIPS code for Adams County is 55001.
According to the Census website for Wisconsin, the values above are likely FIPS codes for county subdivisions rather than for the county.
Is that right? Do we need county level or subdivision level for downstream analyses?
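One way to avoid having to choose up front is to keep the county and subdivision parts as separate fields. The sketch below assumes the raw EAVS value is a 10-digit code laid out as state (2) + county (3) + subdivision (5); that layout is an assumption to verify against the EAVS codebook, and split_fips and FipsParts are hypothetical names, not part of this PR. The example value 5500100275 is constructed from the county code (55001) and the first subdivision code (00275) mentioned above.

```python
from typing import NamedTuple


class FipsParts(NamedTuple):
    """Components of a jurisdiction code (hypothetical helper type)."""
    state: str        # 2-digit state FIPS
    county: str       # 3-digit county FIPS within the state
    subdivision: str  # 5-digit county-subdivision code ("00000" if none)

    @property
    def county_fips(self) -> str:
        # Standard 5-digit county FIPS: state + county.
        return self.state + self.county


def split_fips(raw, width=10):
    """Split a raw code into state, county, and subdivision parts.

    Assumes a `width`-digit layout of 2 + 3 + 5 digits. Values that lost
    leading zeros (e.g. were read as integers) are left-padded first.
    """
    text = str(raw).strip()
    if not (text.isascii() and text.isdigit()) or len(text) > width:
        raise ValueError(f"unexpected FIPS value: {raw!r}")
    padded = text.zfill(width)
    return FipsParts(padded[:2], padded[2:5], padded[5:])


parts = split_fips("5500100275")
print(parts.county_fips, parts.subdivision)
```

With both parts kept, county-level analyses can group on county_fips, while subdivision-level work still has the full identifier.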
@yashinlin sorry, were there new commits after Dec to review? And were you able to figure out the issue with the FIPS code?
No, thanks again for the review! I will look at cleaning up the FIPS column next (unless @danielle Corcoran takes it up!). Also, I guess this means Andrew will need to re-merge timeseries, because he used the FIPS code as is.
Summary of Changes
This branch integrates the 2024 EAVS data and completes a refactoring of the cleaning pipeline for consistency across all years (2020, 2022, 2024).
The history has been interactively rebased into three clean, logical commits for easy review:
Functional Outcomes & Fixes
2024 Data Integration: The pipeline successfully downloads, cleans, and integrates the 2024 EAVS data source.
Pipeline Stability & Validation: The combined pipeline runs cleanly and validates all three years (2020, 2022, 2024) end-to-end.
Fixes Included:
Schema Compatibility: Resolved fips_code column validation failure in Pandera by changing the column type from Series[str] to Series[String] (Pandas nullable string dtype).
File Reading Integrity: Resolved issues reading the 2024 Excel file caused by Windows metadata streams in WSL environments.
Integration of Timeseries Cleaner into Main Pipeline: Integrated the timeseries cleaner into the main clean pipeline so it runs automatically when executing python -m eavs.clean, while keeping the timeseries logic in its own module for standalone use. This change also cleans up schema placement and ensures consistent Parquet outputs across datasets. Both clean.py and clean_timeseries.py were run locally to confirm validation and outputs.
Testing and Verification Steps
The history of this branch was cleaned and successfully rebased onto main (Andrew's last commit: 861308f). To verify locally, reviewers can ensure the local environment is up-to-date and run the download and clean steps:
Update local main:
Bash
Bash
Expected Result: The pipeline should finish with a SUCCESS message after validating the combined data, generating new multi-format files in the data/cleaned directory.