Commit 91b96fe

1. Patient Data Extraction Module (src/a4d/extract/patient.py)
   - 180 lines of clean, well-tested code
   - 91% code coverage
   - Handles all edge cases from real tracker files (2024, 2019, 2018)
2. Key Features Implemented:
   - ✅ Read all month sheets from Excel trackers
   - ✅ Extract tracker year from sheet names or filename
   - ✅ Merge two-row headers with horizontal fill-forward
   - ✅ R-compatible duplicate column merging (concatenate values with commas, like tidyr::unite())
   - ✅ Apply synonym mapping for column harmonization
   - ✅ Add metadata columns (sheet_name, tracker_month, tracker_year, file_name)
   - ✅ Combine sheets with type-safe concatenation
   - ✅ Filter invalid patient rows
3. Testing: 25 comprehensive tests covering all edge cases
4. Documentation Updates:
   - Updated MIGRATION_GUIDE.md with Phase 2 progress
   - Updated CLAUDE.md with current status
   - Created memory: r_implementation_check.md - reminder to always verify against R code
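The R-compatible duplicate-column merge described above can be sketched in plain Python. The function name and data shapes here are illustrative, not the actual patient.py API:

```python
def merge_duplicate_columns(headers: list[str], rows: list[tuple]) -> dict[str, list[str]]:
    """Collapse duplicate-named columns by joining their non-empty
    values with commas, mirroring tidyr::unite() in the R pipeline."""
    # Group column indices by header name (preserves first-seen order)
    positions: dict[str, list[int]] = {}
    for idx, name in enumerate(headers):
        positions.setdefault(name, []).append(idx)

    # Build one output column per unique name
    return {
        name: [
            ",".join(str(row[i]) for i in idxs if row[i] not in (None, ""))
            for row in rows
        ]
        for name, idxs in positions.items()
    }
```

Doing this before DataFrame creation matters because Polars refuses duplicate column names (see the Key Learnings in the migration guide).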
1 parent ed2c1d4 commit 91b96fe

18 files changed

Lines changed: 1430 additions & 208 deletions

File tree

CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -57,3 +57,4 @@ Both projects use the same reference data:
 - `reference_data/provinces/` - Allowed provinces

 **Do not modify these** without testing both R and Python pipelines.
+- Always check your implementation against the original R pipeline and check if the logic is the same

a4d-python/docs/CLAUDE.md

Lines changed: 4 additions & 4 deletions
@@ -1,16 +1,15 @@
 # CLAUDE.md

-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-
 ## Project Overview

 **Python implementation** of the A4D medical tracker data processing pipeline (migrating from R).

 This project processes, cleans, and ingests medical tracker data (Excel files) for the CorrelAid A4D project.
 It extracts patient and product data from Excel trackers, validates and cleans the data, and creates structured tables for ingestion into Google BigQuery.

-**Migration Status**: Active development
+**Migration Status**: Phase 2 - Patient Extraction Complete ✅
 **See**: [Migration Guide](migration/MIGRATION_GUIDE.md) for complete migration details
+**Last Updated**: 2025-10-24

 ## Package Structure

@@ -87,7 +86,7 @@ A4D_UPLOAD_BUCKET=a4dphase2_output

 ### Data Flow

-```
+```text
 Query BigQuery → Identify changed trackers

 For each tracker (parallel):
@@ -154,3 +153,4 @@ When migrating R code:
 3. Error tracking via `ErrorCollector` class
 4. Read R scripts to understand logic, then apply Python patterns
 5. Compare outputs with R pipeline after each phase
+6. Do not migrate blindly – adapt to Pythonic idioms and performance best practices
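The synonym-based column harmonization this file refers to (a reverse mapping from every known synonym to one canonical name) can be sketched as follows. The class and method names echo the migration guide's `reference/synonyms.py` checklist, but the YAML shape and exact behavior are assumptions:

```python
class ColumnMapper:
    """Map raw tracker column names to canonical names via a synonym table."""

    def __init__(self, synonyms: dict[str, list[str]]):
        # Reverse mapping: each synonym (case-insensitive) -> canonical name
        self._reverse = {
            syn.strip().lower(): canonical
            for canonical, syns in synonyms.items()
            for syn in syns
        }

    def rename_columns(self, columns: list[str], strict: bool = False) -> list[str]:
        renamed = []
        for col in columns:
            canonical = self._reverse.get(col.strip().lower())
            if canonical is None:
                if strict:
                    raise KeyError(f"no synonym mapping for column {col!r}")
                canonical = col  # pass through unknown columns in lenient mode
            renamed.append(canonical)
        return renamed
```

The reverse dict makes each lookup O(1) regardless of how many synonyms a canonical column accumulates across tracker years.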

a4d-python/docs/migration/MIGRATION_GUIDE.md

Lines changed: 70 additions & 20 deletions
@@ -6,10 +6,11 @@ Complete guide for migrating the A4D pipeline from R to Python.

 ## Quick Reference

-**Status**: Phase 0 Complete ✅ (Project setup)
-**Next**: Phase 1 - Core Infrastructure
+**Status**: Phase 2 - Patient Extraction Complete ✅
+**Next**: Export raw parquet + Product extraction
 **Timeline**: 12-13 weeks total
 **Current Branch**: `migration`
+**Last Updated**: 2025-10-24

 ---

@@ -254,17 +255,27 @@ job.result()
 - [x] Add GitHub Actions CI
 - [x] Create basic config.py

-### Phase 1: Core Infrastructure (NEXT)
+### Phase 1: Core Infrastructure (PARTIAL)
+- [x] **reference/synonyms.py** - Column name mapping ✅
+  - Load YAML files (reuse from reference_data/)
+  - Create reverse mapping dict
+  - `rename_columns()` method with strict mode
+  - Comprehensive test coverage
+
+- [x] **reference/provinces.py** - Province validation ✅
+  - Load allowed provinces YAML
+  - Case-insensitive validation
+  - Country mapping
+
+- [x] **reference/loaders.py** - YAML loading utilities ✅
+  - Find reference_data directory
+  - Load YAML with validation
+
 - [ ] **logging.py** - loguru setup with JSON output
   - Console handler (pretty, colored)
   - File handler (JSON for BigQuery upload)
   - `file_logger()` context manager

-- [ ] **synonyms/mapper.py** - Column name mapping
-  - Load YAML files (reuse from reference_data/)
-  - Create reverse mapping dict
-  - `rename_dataframe()` method
-
 - [ ] **clean/converters.py** - Type conversion with error tracking
   - `ErrorCollector` class
   - `safe_convert_column()` function

@@ -289,20 +300,30 @@ job.result()

 - [ ] **utils/paths.py** - Path utilities

-- [ ] **Write tests** for all infrastructure
-
-### Phase 2: Script 1 - Extraction (Week 3-5)
-- [ ] **extract/patient.py**
-  - Read Excel with Polars/openpyxl
-  - Apply synonym mapping
-  - Extract from all sheets
-  - Export raw parquet
-
-- [ ] **extract/product.py**
+### Phase 2: Script 1 - Extraction (IN PROGRESS) ⚡
+- [x] **extract/patient.py** - COMPLETED ✅
+  - [x] Read Excel with openpyxl (read-only, single-pass optimization)
+  - [x] Find all month sheets automatically
+  - [x] Extract tracker year from sheet names or filename
+  - [x] Read and merge two-row headers (with horizontal fill-forward)
+  - [x] Handle merged cells creating duplicate columns (R-compatible merge with commas)
+  - [x] Apply synonym mapping with `ColumnMapper`
+  - [x] Extract from all month sheets with metadata (sheet_name, tracker_month, tracker_year, file_name)
+  - [x] Combine sheets with `diagonal_relaxed` (handles type mismatches)
+  - [x] Filter invalid rows (null patient_id, or "0"/"0" combinations)
+  - [x] 25 comprehensive tests (110 total test suite)
+  - [x] 91% code coverage for patient.py
+  - [ ] Export raw parquet (next step)
+
+- [ ] **extract/product.py** - TODO
   - Same pattern as patient

-- [ ] **Test on sample trackers**
-- [ ] **Compare outputs with R pipeline**
+- [x] **Test on sample trackers** - DONE
+  - Tested with 2024, 2019, 2018 trackers
+  - Handles format variations across years
+
+- [ ] **Compare outputs with R pipeline** - TODO
+  - Need to run both pipelines and compare parquet outputs

 ### Phase 3: Script 2 - Cleaning (Week 5-7)
 - [ ] **clean/patient.py**

@@ -638,6 +659,35 @@ No migration needed - just reference from Python code.

 ---

+## Recent Progress (2025-10-24)
+
+### ✅ Completed: Patient Data Extraction
+- **Module**: `src/a4d/extract/patient.py` (180 lines, 91% coverage)
+- **Tests**: 25 tests in `tests/test_extract/test_patient.py` (152 lines)
+- **Key Features**:
+  - Single-pass read-only Excel loading for optimal performance
+  - Automatic month sheet detection and year extraction
+  - Two-row header merging with horizontal fill-forward logic
+  - **R-compatible duplicate column handling**: Merges values with commas (like `tidyr::unite()`)
+  - Synonym-based column harmonization
+  - Multi-sheet extraction with metadata (sheet_name, tracker_month, tracker_year, file_name)
+  - Type-safe concatenation with `diagonal_relaxed`
+  - Intelligent row filtering (removes invalid patient_id patterns)
+
+### 🔑 Key Learnings
+1. **Always verify against R implementation** - Initially implemented incorrect duplicate column handling (renaming) instead of the correct approach (merging values)
+2. **Polars constraints** - Cannot have duplicate column names, must handle before DataFrame creation
+3. **Type mismatches** - Use `diagonal_relaxed` when concatenating DataFrames with schema differences
+4. **Simplicity wins** - Refactored complex nested loops to an elegant dict-based approach (26% code reduction)
+
+### 📝 Next Steps
+1. Add parquet export to `extract/patient.py`
+2. Implement `extract/product.py` (similar pattern)
+3. Compare outputs with R pipeline (run both and validate parity)
+4. Move to Phase 3: Cleaning module
+
+---
+
 ## Questions During Migration

 1. How to handle date parsing edge cases?

a4d-python/scripts/profile_extraction_detailed.py

Lines changed: 18 additions & 4 deletions
@@ -57,10 +57,24 @@ def profile_extraction_phases(tracker_file, sheet_name, year):
     header_row_2 = data_start_row - 2

     max_cols = 100
-    header_1_raw = list(ws.iter_rows(min_row=header_row_1, max_row=header_row_1,
-                                     min_col=1, max_col=max_cols, values_only=True))[0]
-    header_2_raw = list(ws.iter_rows(min_row=header_row_2, max_row=header_row_2,
-                                     min_col=1, max_col=max_cols, values_only=True))[0]
+    header_1_raw = list(
+        ws.iter_rows(
+            min_row=header_row_1,
+            max_row=header_row_1,
+            min_col=1,
+            max_col=max_cols,
+            values_only=True,
+        )
+    )[0]
+    header_2_raw = list(
+        ws.iter_rows(
+            min_row=header_row_2,
+            max_row=header_row_2,
+            min_col=1,
+            max_col=max_cols,
+            values_only=True,
+        )
+    )[0]

     # Trim to actual width
     last_col = max_cols
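The two header rows this script profiles are combined by the fill-forward merge mentioned in the commit message: merged cells in the top row only populate their first column, so the last seen top value is carried across. A sketch of that step, with a hypothetical helper name and `_` as an assumed separator:

```python
def merge_two_row_headers(top_row: list, bottom_row: list) -> list[str]:
    """Combine a two-row Excel header into single column names,
    filling merged-cell values in the top row forward horizontally."""
    merged = []
    last_top = None
    for top, bottom in zip(top_row, bottom_row):
        if top is not None:
            last_top = top  # carry the merged-cell label to its right neighbors
        parts = [str(p) for p in (last_top, bottom) if p is not None]
        merged.append("_".join(parts))
    return merged
```

For example, a top row `["Insulin", None, "Clinic"]` over `["type", "dose", None]` yields `Insulin_type`, `Insulin_dose`, `Clinic`.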

a4d-python/src/a4d/clean/converters.py

Lines changed: 4 additions & 10 deletions
@@ -11,8 +11,6 @@
 4. Replace failures with error value
 """

-from typing import Optional
-
 import polars as pl

 from a4d.config import settings
@@ -24,7 +22,7 @@ def safe_convert_column(
     column: str,
     target_type: pl.DataType,
     error_collector: ErrorCollector,
-    error_value: Optional[float | str] = None,
+    error_value: float | str | None = None,
     file_name_col: str = "file_name",
     patient_id_col: str = "patient_id",
 ) -> pl.DataFrame:
@@ -74,9 +72,7 @@ def safe_convert_column(
     df = df.with_columns(pl.col(column).alias(f"_orig_{column}"))

     # Try vectorized conversion (strict=False allows nulls for failures)
-    df = df.with_columns(
-        pl.col(column).cast(target_type, strict=False).alias(f"_conv_{column}")
-    )
+    df = df.with_columns(pl.col(column).cast(target_type, strict=False).alias(f"_conv_{column}"))

     # Detect failures: became null but wasn't null before
     failed_mask = pl.col(f"_conv_{column}").is_null() & pl.col(f"_orig_{column}").is_not_null()
@@ -129,9 +125,7 @@ def correct_decimal_sign(df: pl.DataFrame, column: str) -> pl.DataFrame:
     if column not in df.columns:
         return df

-    df = df.with_columns(
-        pl.col(column).cast(pl.Utf8).str.replace(",", ".").alias(column)
-    )
+    df = df.with_columns(pl.col(column).cast(pl.Utf8).str.replace(",", ".").alias(column))

     return df

@@ -210,7 +204,7 @@ def safe_convert_multiple_columns(
     columns: list[str],
     target_type: pl.DataType,
     error_collector: ErrorCollector,
-    error_value: Optional[float | str] = None,
+    error_value: float | str | None = None,
     file_name_col: str = "file_name",
     patient_id_col: str = "patient_id",
 ) -> pl.DataFrame:

a4d-python/src/a4d/errors.py

Lines changed: 0 additions & 1 deletion
@@ -14,7 +14,6 @@
 import polars as pl
 from pydantic import BaseModel, Field

-
 # Error code types based on R pipeline
 ErrorCode = Literal[
     "type_conversion",  # Failed to convert type (e.g., "abc" -> int)
