Commit 91b96fe

1. Patient Data Extraction Module (src/a4d/extract/patient.py)
   - 180 lines of clean, well-tested code
   - 91% code coverage
   - Handles all edge cases from real tracker files (2024, 2019, 2018)
2. Key Features Implemented:
   - ✅ Read all month sheets from Excel trackers
   - ✅ Extract tracker year from sheet names or filename
   - ✅ Merge two-row headers with horizontal fill-forward
   - ✅ R-compatible duplicate column merging (concatenate values with commas, like tidyr::unite())
   - ✅ Apply synonym mapping for column harmonization
   - ✅ Add metadata columns (sheet_name, tracker_month, tracker_year, file_name)
   - ✅ Combine sheets with type-safe concatenation
   - ✅ Filter invalid patient rows
3. Testing: 25 comprehensive tests covering all edge cases
4. Documentation Updates:
   - Updated MIGRATION_GUIDE.md with Phase 2 progress
   - Updated CLAUDE.md with current status
   - Created memory: r_implementation_check.md - reminder to always verify against R code
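The R-compatible duplicate-column merge described above can be sketched in plain Python. The function name and data shapes here are illustrative, not the actual patient.py API:

```python
def merge_duplicate_columns(headers: list[str], rows: list[tuple]) -> dict[str, list[str]]:
    """Collapse duplicate-named columns by joining their non-empty
    values with commas, mirroring tidyr::unite() in the R pipeline."""
    # Group column indices by header name (preserves first-seen order)
    positions: dict[str, list[int]] = {}
    for idx, name in enumerate(headers):
        positions.setdefault(name, []).append(idx)

    # Build one output column per unique name
    return {
        name: [
            ",".join(str(row[i]) for i in idxs if row[i] not in (None, ""))
            for row in rows
        ]
        for name, idxs in positions.items()
    }
```

Doing this before DataFrame creation matters because Polars refuses duplicate column names (see the Key Learnings in the migration guide).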
1 parent ed2c1d4 commit 91b96fe

18 files changed

Lines changed: 1430 additions & 208 deletions

File tree

CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -57,3 +57,4 @@ Both projects use the same reference data:
 - `reference_data/provinces/` - Allowed provinces

 **Do not modify these** without testing both R and Python pipelines.
+- Always check your implementation against the original R pipeline and check if the logic is the same

a4d-python/docs/CLAUDE.md

Lines changed: 4 additions & 4 deletions
@@ -1,16 +1,15 @@
 # CLAUDE.md

-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-
 ## Project Overview

 **Python implementation** of the A4D medical tracker data processing pipeline (migrating from R).

 This project processes, cleans, and ingests medical tracker data (Excel files) for the CorrelAid A4D project.
 It extracts patient and product data from Excel trackers, validates and cleans the data, and creates structured tables for ingestion into Google BigQuery.

-**Migration Status**: Active development
+**Migration Status**: Phase 2 - Patient Extraction Complete ✅
 **See**: [Migration Guide](migration/MIGRATION_GUIDE.md) for complete migration details
+**Last Updated**: 2025-10-24

 ## Package Structure

@@ -87,7 +86,7 @@ A4D_UPLOAD_BUCKET=a4dphase2_output

 ### Data Flow

-```
+```text
 Query BigQuery → Identify changed trackers

 For each tracker (parallel):
@@ -154,3 +153,4 @@ When migrating R code:
 3. Error tracking via `ErrorCollector` class
 4. Read R scripts to understand logic, then apply Python patterns
 5. Compare outputs with R pipeline after each phase
+6. Do not migrate blindly – adapt to Pythonic idioms and performance best practices
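The synonym-based column harmonization this file refers to (a reverse mapping from every known synonym to one canonical name) can be sketched as follows. The class and method names echo the migration guide's `reference/synonyms.py` checklist, but the YAML shape and exact behavior are assumptions:

```python
class ColumnMapper:
    """Map raw tracker column names to canonical names via a synonym table."""

    def __init__(self, synonyms: dict[str, list[str]]):
        # Reverse mapping: each synonym (case-insensitive) -> canonical name
        self._reverse = {
            syn.strip().lower(): canonical
            for canonical, syns in synonyms.items()
            for syn in syns
        }

    def rename_columns(self, columns: list[str], strict: bool = False) -> list[str]:
        renamed = []
        for col in columns:
            canonical = self._reverse.get(col.strip().lower())
            if canonical is None:
                if strict:
                    raise KeyError(f"no synonym mapping for column {col!r}")
                canonical = col  # pass through unknown columns in lenient mode
            renamed.append(canonical)
        return renamed
```

The reverse dict makes each lookup O(1) regardless of how many synonyms a canonical column accumulates across tracker years.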

a4d-python/docs/migration/MIGRATION_GUIDE.md

Lines changed: 70 additions & 20 deletions
@@ -6,10 +6,11 @@ Complete guide for migrating the A4D pipeline from R to Python.

 ## Quick Reference

-**Status**: Phase 0 Complete ✅ (Project setup)
-**Next**: Phase 1 - Core Infrastructure
+**Status**: Phase 2 - Patient Extraction Complete ✅
+**Next**: Export raw parquet + Product extraction
 **Timeline**: 12-13 weeks total
 **Current Branch**: `migration`
+**Last Updated**: 2025-10-24

 ---

@@ -254,17 +255,27 @@ job.result()
 - [x] Add GitHub Actions CI
 - [x] Create basic config.py

-### Phase 1: Core Infrastructure (NEXT)
+### Phase 1: Core Infrastructure (PARTIAL)
+- [x] **reference/synonyms.py** - Column name mapping ✅
+  - Load YAML files (reuse from reference_data/)
+  - Create reverse mapping dict
+  - `rename_columns()` method with strict mode
+  - Comprehensive test coverage
+
+- [x] **reference/provinces.py** - Province validation ✅
+  - Load allowed provinces YAML
+  - Case-insensitive validation
+  - Country mapping
+
+- [x] **reference/loaders.py** - YAML loading utilities ✅
+  - Find reference_data directory
+  - Load YAML with validation
+
 - [ ] **logging.py** - loguru setup with JSON output
   - Console handler (pretty, colored)
   - File handler (JSON for BigQuery upload)
   - `file_logger()` context manager

-- [ ] **synonyms/mapper.py** - Column name mapping
-  - Load YAML files (reuse from reference_data/)
-  - Create reverse mapping dict
-  - `rename_dataframe()` method
-
 - [ ] **clean/converters.py** - Type conversion with error tracking
   - `ErrorCollector` class
   - `safe_convert_column()` function

@@ -289,20 +300,30 @@ job.result()

 - [ ] **utils/paths.py** - Path utilities

-- [ ] **Write tests** for all infrastructure
-
-### Phase 2: Script 1 - Extraction (Week 3-5)
-- [ ] **extract/patient.py**
-  - Read Excel with Polars/openpyxl
-  - Apply synonym mapping
-  - Extract from all sheets
-  - Export raw parquet
-
-- [ ] **extract/product.py**
+### Phase 2: Script 1 - Extraction (IN PROGRESS) ⚡
+- [x] **extract/patient.py** - COMPLETED ✅
+  - [x] Read Excel with openpyxl (read-only, single-pass optimization)
+  - [x] Find all month sheets automatically
+  - [x] Extract tracker year from sheet names or filename
+  - [x] Read and merge two-row headers (with horizontal fill-forward)
+  - [x] Handle merged cells creating duplicate columns (R-compatible merge with commas)
+  - [x] Apply synonym mapping with `ColumnMapper`
+  - [x] Extract from all month sheets with metadata (sheet_name, tracker_month, tracker_year, file_name)
+  - [x] Combine sheets with `diagonal_relaxed` (handles type mismatches)
+  - [x] Filter invalid rows (null patient_id, or "0"/"0" combinations)
+  - [x] 25 comprehensive tests (110 total test suite)
+  - [x] 91% code coverage for patient.py
+  - [ ] Export raw parquet (next step)
+
+- [ ] **extract/product.py** - TODO
   - Same pattern as patient

-- [ ] **Test on sample trackers**
-- [ ] **Compare outputs with R pipeline**
+- [x] **Test on sample trackers** - DONE
+  - Tested with 2024, 2019, 2018 trackers
+  - Handles format variations across years
+
+- [ ] **Compare outputs with R pipeline** - TODO
+  - Need to run both pipelines and compare parquet outputs

 ### Phase 3: Script 2 - Cleaning (Week 5-7)
 - [ ] **clean/patient.py**

@@ -638,6 +659,35 @@ No migration needed - just reference from Python code.

 ---

+## Recent Progress (2025-10-24)
+
+### ✅ Completed: Patient Data Extraction
+- **Module**: `src/a4d/extract/patient.py` (180 lines, 91% coverage)
+- **Tests**: 25 tests in `tests/test_extract/test_patient.py` (152 lines)
+- **Key Features**:
+  - Single-pass read-only Excel loading for optimal performance
+  - Automatic month sheet detection and year extraction
+  - Two-row header merging with horizontal fill-forward logic
+  - **R-compatible duplicate column handling**: Merges values with commas (like `tidyr::unite()`)
+  - Synonym-based column harmonization
+  - Multi-sheet extraction with metadata (sheet_name, tracker_month, tracker_year, file_name)
+  - Type-safe concatenation with `diagonal_relaxed`
+  - Intelligent row filtering (removes invalid patient_id patterns)
+
+### 🔑 Key Learnings
+1. **Always verify against R implementation** - Initially implemented incorrect duplicate column handling (renaming) instead of the correct approach (merging values)
+2. **Polars constraints** - Cannot have duplicate column names, must handle before DataFrame creation
+3. **Type mismatches** - Use `diagonal_relaxed` when concatenating DataFrames with schema differences
+4. **Simplicity wins** - Refactored complex nested loops to an elegant dict-based approach (26% code reduction)
+
+### 📝 Next Steps
+1. Add parquet export to `extract/patient.py`
+2. Implement `extract/product.py` (similar pattern)
+3. Compare outputs with R pipeline (run both and validate parity)
+4. Move to Phase 3: Cleaning module
+
+---
+
 ## Questions During Migration

 1. How to handle date parsing edge cases?

a4d-python/scripts/profile_extraction_detailed.py

Lines changed: 18 additions & 4 deletions
@@ -57,10 +57,24 @@ def profile_extraction_phases(tracker_file, sheet_name, year):
     header_row_2 = data_start_row - 2

     max_cols = 100
-    header_1_raw = list(ws.iter_rows(min_row=header_row_1, max_row=header_row_1,
-                                     min_col=1, max_col=max_cols, values_only=True))[0]
-    header_2_raw = list(ws.iter_rows(min_row=header_row_2, max_row=header_row_2,
-                                     min_col=1, max_col=max_cols, values_only=True))[0]
+    header_1_raw = list(
+        ws.iter_rows(
+            min_row=header_row_1,
+            max_row=header_row_1,
+            min_col=1,
+            max_col=max_cols,
+            values_only=True,
+        )
+    )[0]
+    header_2_raw = list(
+        ws.iter_rows(
+            min_row=header_row_2,
+            max_row=header_row_2,
+            min_col=1,
+            max_col=max_cols,
+            values_only=True,
+        )
+    )[0]

     # Trim to actual width
     last_col = max_cols
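The two header rows this script profiles are combined by the fill-forward merge mentioned in the commit message: merged cells in the top row only populate their first column, so the last seen top value is carried across. A sketch of that step, with a hypothetical helper name and `_` as an assumed separator:

```python
def merge_two_row_headers(top_row: list, bottom_row: list) -> list[str]:
    """Combine a two-row Excel header into single column names,
    filling merged-cell values in the top row forward horizontally."""
    merged = []
    last_top = None
    for top, bottom in zip(top_row, bottom_row):
        if top is not None:
            last_top = top  # carry the merged-cell label to its right neighbors
        parts = [str(p) for p in (last_top, bottom) if p is not None]
        merged.append("_".join(parts))
    return merged
```

For example, a top row `["Insulin", None, "Clinic"]` over `["type", "dose", None]` yields `Insulin_type`, `Insulin_dose`, `Clinic`.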

a4d-python/src/a4d/clean/converters.py

Lines changed: 4 additions & 10 deletions
@@ -11,8 +11,6 @@
 4. Replace failures with error value
 """

-from typing import Optional
-
 import polars as pl

 from a4d.config import settings
@@ -24,7 +22,7 @@ def safe_convert_column(
     column: str,
     target_type: pl.DataType,
     error_collector: ErrorCollector,
-    error_value: Optional[float | str] = None,
+    error_value: float | str | None = None,
     file_name_col: str = "file_name",
     patient_id_col: str = "patient_id",
 ) -> pl.DataFrame:
@@ -74,9 +72,7 @@ def safe_convert_column(
     df = df.with_columns(pl.col(column).alias(f"_orig_{column}"))

     # Try vectorized conversion (strict=False allows nulls for failures)
-    df = df.with_columns(
-        pl.col(column).cast(target_type, strict=False).alias(f"_conv_{column}")
-    )
+    df = df.with_columns(pl.col(column).cast(target_type, strict=False).alias(f"_conv_{column}"))

     # Detect failures: became null but wasn't null before
     failed_mask = pl.col(f"_conv_{column}").is_null() & pl.col(f"_orig_{column}").is_not_null()
@@ -129,9 +125,7 @@ def correct_decimal_sign(df: pl.DataFrame, column: str) -> pl.DataFrame:
     if column not in df.columns:
         return df

-    df = df.with_columns(
-        pl.col(column).cast(pl.Utf8).str.replace(",", ".").alias(column)
-    )
+    df = df.with_columns(pl.col(column).cast(pl.Utf8).str.replace(",", ".").alias(column))

     return df

@@ -210,7 +204,7 @@ def safe_convert_multiple_columns(
     columns: list[str],
     target_type: pl.DataType,
     error_collector: ErrorCollector,
-    error_value: Optional[float | str] = None,
+    error_value: float | str | None = None,
     file_name_col: str = "file_name",
     patient_id_col: str = "patient_id",
 ) -> pl.DataFrame:

a4d-python/src/a4d/errors.py

Lines changed: 0 additions & 1 deletion
@@ -14,7 +14,6 @@
 import polars as pl
 from pydantic import BaseModel, Field

-
 # Error code types based on R pipeline
 ErrorCode = Literal[
     "type_conversion",  # Failed to convert type (e.g., "abc" -> int)
