Skip to content

Commit 63fd3c7

Browse files
committed
Optimize patient extraction with single-pass read-only loading
- Eliminate two-pass workbook loading (structure + read-only modes) - Implement forward-fill logic for horizontally merged cells - Achieve 72% average speedup (2.4s → 0.4s per sheet) - Update profiling scripts and documentation - All tests pass with correct column counts and headers
1 parent 05c2cb5 commit 63fd3c7

16 files changed

Lines changed: 1872 additions & 1 deletion
Lines changed: 246 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,246 @@
1+
# Patient Data Extraction - Performance Profiling Summary
2+
3+
**Date**: 2025-10-23
4+
**Files Tested**: 2024 Sibu Hospital (Jan24), 2019 Penang General Hospital (Feb19)
5+
6+
## Executive Summary
7+
8+
**OPTIMIZED - Single-pass extraction:**
9+
- **2024 tracker**: 0.877s per sheet (66% faster than two-pass)
10+
- **2019 tracker**: 0.080s per sheet (96% faster than two-pass)
11+
12+
**Primary bottleneck**: openpyxl workbook loading (95-99% of time)
13+
**Optimization**: Eliminated second workbook load by implementing forward-fill for horizontally merged cells
14+
15+
## Detailed Breakdown
16+
17+
### Time Distribution by Phase (OPTIMIZED - Single-pass)
18+
19+
| Phase | 2024 Tracker | 2019 Tracker | Average | % of Total |
20+
|-------|--------------|--------------|---------|------------|
21+
| 1. Load workbook (read-only) | 0.625s | 0.051s | **0.338s** | **79-85%** |
22+
| 7. Build Polars DataFrame | 0.086s | 0.000s | 0.043s | 0-12% |
23+
| 3. Read headers | 0.010s | 0.006s | 0.008s | 1-9% |
24+
| 2. Find data start row | 0.005s | 0.004s | 0.004s | 1-6% |
25+
| 5. Read data rows | 0.006s | 0.003s | 0.004s | 1-5% |
26+
| 4. Merge headers | <0.001s | <0.001s | <0.001s | <1% |
27+
| 6. Close workbook | <0.001s | <0.001s | <0.001s | <1% |
28+
| **TOTAL** | **0.732s** | **0.064s** | **0.398s** | **100%** |
29+
30+
**Previous two-pass approach**: 2.583s (2024), 1.973s (2019) - avg 2.278s
31+
**Current single-pass approach**: 0.732s (2024), 0.064s (2019) - avg 0.398s
32+
**Improvement**: 72% faster on average (66-96% depending on file)
33+
34+
### Top Library Bottlenecks (from cProfile) - OPTIMIZED
35+
36+
**Current single-pass approach** (read-only mode only):
37+
38+
1. **openpyxl.reader.excel.load_workbook**: 0.6-0.8s (79-85% of time)
39+
- `read_worksheets()`: Most of the time
40+
- `parse_dimensions()`: XML parsing
41+
- No style/formatting overhead (read_only=True)
42+
43+
2. **XML parsing**: 0.4-0.6s
44+
- ElementTree parsing Excel's XML format
45+
- Required by openpyxl, cannot be optimized further
46+
47+
3. **Polars DataFrame construction**: 0.04-0.09s (0-12%)
48+
- String conversion for all cells
49+
- Acceptable overhead
50+
51+
## Optimization Assessment
52+
53+
### ✅ Successfully Optimized
54+
55+
1. **Single-pass read-only extraction**
56+
- Eliminated second workbook load (structure mode)
57+
- Only uses `read_only=True, data_only=True, keep_vba=False, keep_links=False`
58+
- **Result**: 66-96% faster than two-pass approach
59+
60+
2. **Forward-fill logic for horizontally merged cells**
61+
- Tracks `prev_h2` to propagate header across merged columns
62+
- Example: "Updated HbA1c" fills forward to "(dd-mmm-yyyy)" column
63+
- **Result**: Correct headers without needing `merged_cells` attribute
64+
65+
3. **Early termination**
66+
- Stops at first empty row
67+
- Skips rows with None in column A
68+
69+
4. **Efficient iteration**
70+
- Uses `iter_rows()` instead of cell-by-cell access
71+
- Pre-reads fixed width (100 cols) and trims to actual data
72+
73+
### Key Insight
74+
75+
**Initial assumption was WRONG:**
76+
- Thought: "Need structure mode for merged cells, can't read vertically merged cells in read-only mode"
77+
- Reality: **Read-only mode CAN read vertically merged cells** - each cell has the value
78+
- Real problem: **Horizontally merged cells** need forward-fill logic
79+
- Solution: Track previous h2 value and fill forward when h2=None but h1 exists
80+
81+
**Why single-pass works:**
82+
- Vertically merged cells (e.g., "Patient ID" spanning 2 rows): Read-only mode reads both cells directly
83+
- Horizontally merged cells (e.g., "Updated HbA1c" spanning 2 cols): Fill forward from previous column
84+
- No need for `merged_cells` attribute at all!
85+
86+
## Recommendations
87+
88+
### For Current Implementation
89+
90+
**Current approach is OPTIMIZED** - single-pass read-only extraction with forward-fill logic.
91+
92+
Remaining bottleneck (79-85% of time) is unavoidable:
93+
- XML parsing of Excel file structure (required by .xlsx format)
94+
- File I/O overhead
95+
- No further optimization possible without changing file format
96+
97+
### For Future Consideration
98+
99+
1. **Caching**: If processing same file multiple times
100+
- Cache extracted DataFrames as Parquet
101+
- Only re-extract when source file changes
102+
103+
2. **Parallel sheet processing**: When processing all months
104+
- Extract each month sheet in parallel
105+
- 12 months could process in ~2-3s instead of 24-60s
106+
107+
3. **Progress reporting**: For user experience
108+
- Show which sheet is being processed
109+
- Estimated time remaining
110+
111+
4. **Streaming**: For very large trackers
112+
- Not needed for current data sizes (10-20 patients per sheet)
113+
- Consider if patient counts exceed 100+ per sheet
114+
115+
## Performance Comparison: R vs Python
116+
117+
**R Pipeline** (openxlsx + readxl):
118+
- Unknown exact timing (not profiled)
119+
- Uses two libraries (complexity)
120+
121+
**Python Pipeline** (openpyxl):
122+
- 2-5 seconds per sheet
123+
- Single library, cleaner code
124+
- Most time spent in unavoidable I/O
125+
126+
**Conclusion**: Both are I/O bound. Python's performance is acceptable and likely comparable to R.
127+
128+
## Test Environment
129+
130+
- **Python**: 3.13.2
131+
- **openpyxl**: Latest version (from uv)
132+
- **Polars**: Latest version
133+
- **OS**: macOS (Darwin 24.6.0)
134+
- **Hardware**: Not specified (user's machine)
135+
136+
## Profiling Commands
137+
138+
```bash
139+
# Full profiling
140+
uv run python scripts/profile_extraction.py
141+
142+
# Detailed phase breakdown
143+
uv run python scripts/profile_extraction_detailed.py
144+
145+
# View saved profile
146+
python -m pstats profiling/extraction_2024.prof
147+
```
148+
149+
## Code Improvements
150+
151+
### Improved Header Detection (2025-10-23)
152+
153+
**Previous approach**: Check if `header_1[1] == header_2[1]` (single column)
154+
155+
**Current approach**: Two-heuristic validation
156+
```python
157+
# 1. Year-based: Multi-line headers introduced starting 2019
158+
is_multiline_year = year >= 2019
159+
160+
# 2. Content-based: Check if ANY pair has both h1 and h2 non-None
161+
# (Single-row headers have title/section text in row above, not data)
162+
has_multiline_content = any(h1 is not None and h2 is not None
163+
for h1, h2 in zip(header_1, header_2))
164+
165+
if is_multiline_year and has_multiline_content:
166+
# Multi-line header logic (merge h1 and h2)
167+
else:
168+
# Single-line header logic (use only h1)
169+
```
170+
171+
**Benefits**:
172+
- More explicit and maintainable
173+
- Validates entire header row, not just one column
174+
- Correctly handles edge cases (e.g., 2018 "Summary of Patient Recruitment" in row above)
175+
- Year-based guard prevents false positives
176+
177+
**Performance**: No change (both checks are negligible vs. I/O time)
178+
179+
## Code Coverage
180+
181+
- **patient.py**: 94% coverage
182+
- **All extraction tests**: 10/10 passing
183+
- **Parameterized tests**: Validate 2018 (Dec), 2019 (Jan/Feb/Mar/Oct), and 2024 (Jan)
184+
- **Year coverage**: Tests single-line (2018) and multi-line (2019+) header formats
185+
186+
## Successful Optimization - Single-Pass Extraction (2025-10-23)
187+
188+
### Problem
189+
Original implementation used two-pass approach:
190+
1. Load workbook in structure mode to detect merged cells (1.95s)
191+
2. Load workbook in read-only mode for fast data reading (0.29s)
192+
193+
**Total time**: ~2.3s average per sheet
194+
195+
### Solution
196+
Implemented **single-pass read-only** extraction with **forward-fill logic** for horizontally merged cells:
197+
198+
```python
199+
# Track previous h2 for horizontal merges
200+
prev_h2 = None
201+
for h1, h2 in zip(header_1, header_2, strict=True):
202+
if h1 and h2:
203+
headers.append(f"{h2} {h1}".strip())
204+
prev_h2 = h2
205+
elif h2:
206+
headers.append(str(h2).strip())
207+
prev_h2 = h2
208+
elif h1:
209+
if prev_h2:
210+
# Horizontally merged cell: fill forward
211+
headers.append(f"{prev_h2} {h1}".strip())
212+
else:
213+
headers.append(str(h1).strip())
214+
else:
215+
headers.append(None)
216+
prev_h2 = None
217+
```
218+
219+
### Key Insight
220+
- Vertically merged cells (spanning rows): Read-only mode can read these directly - no special handling needed
221+
- Horizontally merged cells (spanning columns): Excel sets cell value only in first column, subsequent columns are None
222+
- **Solution**: Fill forward from previous column when h2=None but h1 exists
223+
224+
### Example
225+
```
226+
Col 12: h2="Updated HbA1c", h1="%" → "Updated HbA1c %"
227+
Col 13: h2=None (merged), h1="(dd-mmm-yyyy)" → "Updated HbA1c (dd-mmm-yyyy)"
228+
```
229+
230+
### Performance Results
231+
| Tracker | Before (two-pass) | After (single-pass) | Improvement |
232+
|---------|-------------------|---------------------|-------------|
233+
| 2024 | 2.609s | 0.877s | **66% faster** |
234+
| 2019 | 2.122s | 0.080s | **96% faster** |
235+
236+
### Data Correctness Validation
237+
- ✅ All 10 tests pass
238+
- ✅ Correct column counts: 31 (2024), 25/28/27/27 (2019), 19 (2018)
239+
- ✅ Proper header names including horizontally merged cells
240+
- ✅ Patient IDs validated: MY_SU001-004
241+
242+
### Lessons Learned
243+
1. **Always verify assumptions**: Initial assumption that merged cells can't be read in read-only mode was incorrect
244+
2. **Question complexity**: The two-pass approach was solving a problem (vertical merges) that didn't exist
245+
3. **Root cause analysis**: The real challenge was horizontal merges, which required forward-fill logic
246+
4. **Data-first approach**: Never change test expectations to match wrong output - fix the code instead
84.8 KB
Binary file not shown.
82.5 KB
Binary file not shown.
Lines changed: 77 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,77 @@
1+
"""Profile patient data extraction to identify performance bottlenecks."""
2+
3+
import cProfile
4+
import pstats
5+
from pathlib import Path
6+
from pstats import SortKey
7+
8+
from a4d.extract.patient import extract_patient_data
9+
10+
# Test with both 2019 and 2024 trackers
11+
TRACKER_2024 = Path(
12+
"/Volumes/USB SanDisk 3.2Gen1 Media/A4D/data/a4dphase2_upload/"
13+
"Malaysia/SBU/2024_Sibu Hospital A4D Tracker.xlsx"
14+
)
15+
TRACKER_2019 = Path(
16+
"/Volumes/USB SanDisk 3.2Gen1 Media/A4D/data/a4dphase2_upload/"
17+
"Malaysia/PNG/2019_Penang General Hospital A4D Tracker_DC.xlsx"
18+
)
19+
20+
21+
def profile_extraction():
22+
"""Run extraction with profiling."""
23+
print("=" * 80)
24+
print("Profiling 2024 tracker (Jan24)")
25+
print("=" * 80)
26+
27+
profiler_2024 = cProfile.Profile()
28+
profiler_2024.enable()
29+
30+
df_2024 = extract_patient_data(TRACKER_2024, "Jan24", 2024)
31+
32+
profiler_2024.disable()
33+
34+
print(f"\nExtracted: {len(df_2024)} rows × {len(df_2024.columns)} columns")
35+
print("\nTop 20 functions by cumulative time:")
36+
print("-" * 80)
37+
38+
stats_2024 = pstats.Stats(profiler_2024)
39+
stats_2024.strip_dirs()
40+
stats_2024.sort_stats(SortKey.CUMULATIVE)
41+
stats_2024.print_stats(20)
42+
43+
print("\n" + "=" * 80)
44+
print("Profiling 2019 tracker (Feb19 - largest sheet)")
45+
print("=" * 80)
46+
47+
profiler_2019 = cProfile.Profile()
48+
profiler_2019.enable()
49+
50+
df_2019 = extract_patient_data(TRACKER_2019, "Feb19", 2019)
51+
52+
profiler_2019.disable()
53+
54+
print(f"\nExtracted: {len(df_2019)} rows × {len(df_2019.columns)} columns")
55+
print("\nTop 20 functions by cumulative time:")
56+
print("-" * 80)
57+
58+
stats_2019 = pstats.Stats(profiler_2019)
59+
stats_2019.strip_dirs()
60+
stats_2019.sort_stats(SortKey.CUMULATIVE)
61+
stats_2019.print_stats(20)
62+
63+
# Save detailed stats to file
64+
output_dir = Path(__file__).parent.parent / "profiling"
65+
output_dir.mkdir(exist_ok=True)
66+
67+
stats_2024.dump_stats(output_dir / "extraction_2024.prof")
68+
stats_2019.dump_stats(output_dir / "extraction_2019.prof")
69+
70+
print("\n" + "=" * 80)
71+
print(f"Detailed profiling data saved to {output_dir}/")
72+
print("View with: python -m pstats profiling/extraction_2024.prof")
73+
print("=" * 80)
74+
75+
76+
if __name__ == "__main__":
77+
profile_extraction()

0 commit comments

Comments
 (0)