Commit f8b89e4
Fix age calculation to match R pipeline behavior
Add automatic age correction from date of birth (DOB) to match R pipeline's
fix_age() function. This ensures data quality by always calculating age from
DOB rather than trusting potentially incorrect Excel values.
Changes:
- Add _fix_age_from_dob() function in clean/patient.py (step 5.5)
- Calculate age: tracker_year - birth_year - (1 if tracker_month < birth_month else 0)
- Log warnings and track errors via ErrorCollector for all age corrections
- Handle missing ages, mismatched ages, and negative ages (set to error value)
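A minimal sketch of the calculation and the three handled cases listed above, for illustration only. The function names, signatures, reason strings, and the `ERROR_AGE` sentinel are hypothetical; the commit message does not state the real error value or the signature of `_fix_age_from_dob()`, and the actual code logs and records corrections via ErrorCollector rather than returning a reason.

```python
from typing import Optional, Tuple

# Hypothetical sentinel: the commit message only says "error value".
ERROR_AGE = -1


def age_from_dob(tracker_year: int, tracker_month: int,
                 birth_year: int, birth_month: int) -> int:
    """Age in whole years at the tracker month, using the formula from the commit."""
    return tracker_year - birth_year - (1 if tracker_month < birth_month else 0)


def fix_age(excel_age: Optional[int], tracker_year: int, tracker_month: int,
            birth_year: int, birth_month: int) -> Tuple[int, Optional[str]]:
    """Return (age, reason); reason is None when the Excel value already matched."""
    calculated = age_from_dob(tracker_year, tracker_month, birth_year, birth_month)
    if calculated < 0:
        # DOB lies after the tracker date: flag instead of propagating it.
        return ERROR_AGE, "negative"
    if excel_age is None:
        return calculated, "missing"
    if excel_age != calculated:
        return calculated, "mismatch"
    return calculated, None


# Illustrative values only, not taken from the tracker data.
print(fix_age(21, 2025, 6, 2007, 3))
```

Because the formula compares months only, a birthday later in the tracker month counts as already reached.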
Validation:
- Tested with 2025_06_CDA tracker: 35 age errors properly corrected and tracked
- Results now match R output (e.g., patient KH_CD016: 18 years, not 21)
- Improvement over R: structured error tracking instead of logging only
Also adds:
- compare_r_vs_python.py: Comprehensive comparison tool for validation
- fastexcel dependency: Required for Excel reading in comparison scripts
Fixes critical data quality issue where incorrect ages from Excel were
propagated to final datasets. Now matches R pipeline behavior while
providing better error tracking and documentation.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 27f4baf commit f8b89e4
4 files changed
Lines changed: 567 additions & 0 deletions