|
| 1 | +# Patient Data Extraction - Performance Profiling Summary |
| 2 | + |
| 3 | +**Date**: 2025-10-23 |
| 4 | +**Files Tested**: 2024 Sibu Hospital (Jan24), 2019 Penang General Hospital (Feb19) |
| 5 | + |
| 6 | +## Executive Summary |
| 7 | + |
| 8 | +**OPTIMIZED - Single-pass extraction:** |
| 9 | +- **2024 tracker**: 0.877s per sheet (66% faster than two-pass) |
| 10 | +- **2019 tracker**: 0.080s per sheet (96% faster than two-pass) |
| 11 | + |
| 12 | +**Primary bottleneck**: openpyxl workbook loading (95-99% of time) |
| 13 | +**Optimization**: Eliminated second workbook load by implementing forward-fill for horizontally merged cells |
| 14 | + |
| 15 | +## Detailed Breakdown |
| 16 | + |
| 17 | +### Time Distribution by Phase (OPTIMIZED - Single-pass) |
| 18 | + |
| 19 | +| Phase | 2024 Tracker | 2019 Tracker | Average | % of Total | |
| 20 | +|-------|--------------|--------------|---------|------------| |
| 21 | +| 1. Load workbook (read-only) | 0.625s | 0.051s | **0.338s** | **79-85%** | |
| 22 | +| 7. Build Polars DataFrame | 0.086s | 0.000s | 0.043s | 0-12% | |
| 23 | +| 3. Read headers | 0.010s | 0.006s | 0.008s | 1-9% | |
| 24 | +| 2. Find data start row | 0.005s | 0.004s | 0.004s | 1-6% | |
| 25 | +| 5. Read data rows | 0.006s | 0.003s | 0.004s | 1-5% | |
| 26 | +| 4. Merge headers | <0.001s | <0.001s | <0.001s | <1% | |
| 27 | +| 6. Close workbook | <0.001s | <0.001s | <0.001s | <1% | |
| 28 | +| **TOTAL** | **0.732s** | **0.064s** | **0.398s** | **100%** | |
| 29 | + |
| 30 | +**Previous two-pass approach**: 2.583s (2024), 1.973s (2019) - avg 2.278s |
| 31 | +**Current single-pass approach**: 0.732s (2024), 0.064s (2019) - avg 0.398s |
| 32 | +**Improvement**: 72% faster on average (66-96% depending on file) |
| 33 | + |
| 34 | +### Top Library Bottlenecks (from cProfile) - OPTIMIZED |
| 35 | + |
| 36 | +**Current single-pass approach** (read-only mode only): |
| 37 | + |
| 38 | +1. **openpyxl.reader.excel.load_workbook**: 0.6-0.8s (79-85% of time) |
| 39 | + - `read_worksheets()`: Most of the time |
| 40 | + - `parse_dimensions()`: XML parsing |
| 41 | + - No style/formatting overhead (read_only=True) |
| 42 | + |
| 43 | +2. **XML parsing**: 0.4-0.6s |
| 44 | + - ElementTree parsing Excel's XML format |
| 45 | + - Required by openpyxl, cannot be optimized further |
| 46 | + |
| 47 | +3. **Polars DataFrame construction**: 0.04-0.09s (0-12%) |
| 48 | + - String conversion for all cells |
| 49 | + - Acceptable overhead |
| 50 | + |
| 51 | +## Optimization Assessment |
| 52 | + |
| 53 | +### ✅ Successfully Optimized |
| 54 | + |
| 55 | +1. **Single-pass read-only extraction** |
| 56 | + - Eliminated second workbook load (structure mode) |
| 57 | + - Only uses `read_only=True, data_only=True, keep_vba=False, keep_links=False` |
| 58 | + - **Result**: 66-96% faster than two-pass approach |
| 59 | + |
| 60 | +2. **Forward-fill logic for horizontally merged cells** |
| 61 | + - Tracks `prev_h2` to propagate header across merged columns |
| 62 | + - Example: "Updated HbA1c" fills forward to "(dd-mmm-yyyy)" column |
| 63 | + - **Result**: Correct headers without needing `merged_cells` attribute |
| 64 | + |
| 65 | +3. **Early termination** |
| 66 | + - Stops at first empty row |
| 67 | + - Skips rows with None in column A |
| 68 | + |
| 69 | +4. **Efficient iteration** |
| 70 | + - Uses `iter_rows()` instead of cell-by-cell access |
| 71 | + - Pre-reads fixed width (100 cols) and trims to actual data |
| 72 | + |
| 73 | +### Key Insight |
| 74 | + |
| 75 | +**Initial assumption was WRONG:** |
| 76 | +- Thought: "Need structure mode for merged cells, can't read vertically merged cells in read-only mode" |
| 77 | +- Reality: **Read-only mode CAN read vertically merged cells** - each cell has the value |
| 78 | +- Real problem: **Horizontally merged cells** need forward-fill logic |
| 79 | +- Solution: Track previous h2 value and fill forward when h2=None but h1 exists |
| 80 | + |
| 81 | +**Why single-pass works:** |
| 82 | +- Vertically merged cells (e.g., "Patient ID" spanning 2 rows): Read-only mode reads both cells directly |
| 83 | +- Horizontally merged cells (e.g., "Updated HbA1c" spanning 2 cols): Fill forward from previous column |
| 84 | +- No need for `merged_cells` attribute at all! |
| 85 | + |
| 86 | +## Recommendations |
| 87 | + |
| 88 | +### For Current Implementation |
| 89 | + |
| 90 | +**Current approach is OPTIMIZED** - single-pass read-only extraction with forward-fill logic. |
| 91 | + |
| 92 | +Remaining bottleneck (79-85% of time) is unavoidable: |
| 93 | +- XML parsing of Excel file structure (required by .xlsx format) |
| 94 | +- File I/O overhead |
| 95 | +- No further optimization possible without changing file format |
| 96 | + |
| 97 | +### For Future Consideration |
| 98 | + |
| 99 | +1. **Caching**: If processing same file multiple times |
| 100 | + - Cache extracted DataFrames as Parquet |
| 101 | + - Only re-extract when source file changes |
| 102 | + |
| 103 | +2. **Parallel sheet processing**: When processing all months |
| 104 | + - Extract each month sheet in parallel |
| 105 | + - 12 months could process in ~2-3s instead of 24-60s |
| 106 | + |
| 107 | +3. **Progress reporting**: For user experience |
| 108 | + - Show which sheet is being processed |
| 109 | + - Estimated time remaining |
| 110 | + |
| 111 | +4. **Streaming**: For very large trackers |
| 112 | + - Not needed for current data sizes (10-20 patients per sheet) |
| 113 | + - Consider if patient counts exceed 100+ per sheet |
| 114 | + |
| 115 | +## Performance Comparison: R vs Python |
| 116 | + |
| 117 | +**R Pipeline** (openxlsx + readxl): |
| 118 | +- Unknown exact timing (not profiled) |
| 119 | +- Uses two libraries (complexity) |
| 120 | + |
| 121 | +**Python Pipeline** (openpyxl): |
| 122 | +- 2-5 seconds per sheet |
| 123 | +- Single library, cleaner code |
| 124 | +- Most time spent in unavoidable I/O |
| 125 | + |
| 126 | +**Conclusion**: Both are I/O bound. Python's performance is acceptable and likely comparable to R. |
| 127 | + |
| 128 | +## Test Environment |
| 129 | + |
| 130 | +- **Python**: 3.13.2 |
| 131 | +- **openpyxl**: Latest version (from uv) |
| 132 | +- **Polars**: Latest version |
| 133 | +- **OS**: macOS (Darwin 24.6.0) |
| 134 | +- **Hardware**: Not specified (user's machine) |
| 135 | + |
| 136 | +## Profiling Commands |
| 137 | + |
| 138 | +```bash |
| 139 | +# Full profiling |
| 140 | +uv run python scripts/profile_extraction.py |
| 141 | + |
| 142 | +# Detailed phase breakdown |
| 143 | +uv run python scripts/profile_extraction_detailed.py |
| 144 | + |
| 145 | +# View saved profile |
| 146 | +python -m pstats profiling/extraction_2024.prof |
| 147 | +``` |
| 148 | + |
| 149 | +## Code Improvements |
| 150 | + |
| 151 | +### Improved Header Detection (2025-10-23) |
| 152 | + |
| 153 | +**Previous approach**: Check if `header_1[1] == header_2[1]` (single column) |
| 154 | + |
| 155 | +**Current approach**: Two-heuristic validation |
| 156 | +```python |
| 157 | +# 1. Year-based: Multi-line headers introduced starting 2019 |
| 158 | +is_multiline_year = year >= 2019 |
| 159 | + |
| 160 | +# 2. Content-based: Check if ANY pair has both h1 and h2 non-None |
| 161 | +# (Single-row headers have title/section text in row above, not data) |
| 162 | +has_multiline_content = any(h1 is not None and h2 is not None |
| 163 | + for h1, h2 in zip(header_1, header_2)) |
| 164 | + |
| 165 | +if is_multiline_year and has_multiline_content: |
| 166 | + # Multi-line header logic (merge h1 and h2) |
| 167 | +else: |
| 168 | + # Single-line header logic (use only h1) |
| 169 | +``` |
| 170 | + |
| 171 | +**Benefits**: |
| 172 | +- More explicit and maintainable |
| 173 | +- Validates entire header row, not just one column |
| 174 | +- Correctly handles edge cases (e.g., 2018 "Summary of Patient Recruitment" in row above) |
| 175 | +- Year-based guard prevents false positives |
| 176 | + |
| 177 | +**Performance**: No change (both checks are negligible vs. I/O time) |
| 178 | + |
| 179 | +## Code Coverage |
| 180 | + |
| 181 | +- **patient.py**: 94% coverage |
| 182 | +- **All extraction tests**: 10/10 passing |
| 183 | +- **Parameterized tests**: Validate 2018 (Dec), 2019 (Jan/Feb/Mar/Oct), and 2024 (Jan) |
| 184 | +- **Year coverage**: Tests single-line (2018) and multi-line (2019+) header formats |
| 185 | + |
| 186 | +## Successful Optimization - Single-Pass Extraction (2025-10-23) |
| 187 | + |
| 188 | +### Problem |
| 189 | +Original implementation used two-pass approach: |
| 190 | +1. Load workbook in structure mode to detect merged cells (1.95s) |
| 191 | +2. Load workbook in read-only mode for fast data reading (0.29s) |
| 192 | + |
| 193 | +**Total time**: ~2.3s average per sheet |
| 194 | + |
| 195 | +### Solution |
| 196 | +Implemented **single-pass read-only** extraction with **forward-fill logic** for horizontally merged cells: |
| 197 | + |
| 198 | +```python |
| 199 | +# Track previous h2 for horizontal merges |
| 200 | +prev_h2 = None |
| 201 | +for h1, h2 in zip(header_1, header_2, strict=True): |
| 202 | + if h1 and h2: |
| 203 | + headers.append(f"{h2} {h1}".strip()) |
| 204 | + prev_h2 = h2 |
| 205 | + elif h2: |
| 206 | + headers.append(str(h2).strip()) |
| 207 | + prev_h2 = h2 |
| 208 | + elif h1: |
| 209 | + if prev_h2: |
| 210 | + # Horizontally merged cell: fill forward |
| 211 | + headers.append(f"{prev_h2} {h1}".strip()) |
| 212 | + else: |
| 213 | + headers.append(str(h1).strip()) |
| 214 | + else: |
| 215 | + headers.append(None) |
| 216 | + prev_h2 = None |
| 217 | +``` |
| 218 | + |
| 219 | +### Key Insight |
| 220 | +- Vertically merged cells (spanning rows): Read-only mode can read these directly - no special handling needed |
| 221 | +- Horizontally merged cells (spanning columns): Excel sets cell value only in first column, subsequent columns are None |
| 222 | +- **Solution**: Fill forward from previous column when h2=None but h1 exists |
| 223 | + |
| 224 | +### Example |
| 225 | +``` |
| 226 | +Col 12: h2="Updated HbA1c", h1="%" → "Updated HbA1c %" |
| 227 | +Col 13: h2=None (merged), h1="(dd-mmm-yyyy)" → "Updated HbA1c (dd-mmm-yyyy)" |
| 228 | +``` |
| 229 | + |
| 230 | +### Performance Results |
| 231 | +| Tracker | Before (two-pass) | After (single-pass) | Improvement | |
| 232 | +|---------|-------------------|---------------------|-------------| |
| 233 | +| 2024 | 2.609s | 0.877s | **66% faster** | |
| 234 | +| 2019 | 2.122s | 0.080s | **96% faster** | |
| 235 | + |
| 236 | +### Data Correctness Validation |
| 237 | +- ✅ All 10 tests pass |
| 238 | +- ✅ Correct column counts: 31 (2024), 25/28/27/27 (2019), 19 (2018) |
| 239 | +- ✅ Proper header names including horizontally merged cells |
| 240 | +- ✅ Patient IDs validated: MY_SU001-004 |
| 241 | + |
| 242 | +### Lessons Learned |
| 243 | +1. **Always verify assumptions**: Initial assumption that merged cells can't be read in read-only mode was incorrect |
| 244 | +2. **Question complexity**: The two-pass approach was solving a problem (vertical merges) that didn't exist |
| 245 | +3. **Root cause analysis**: The real challenge was horizontal merges, which required forward-fill logic |
| 246 | +4. **Data-first approach**: Never change test expectations to match wrong output - fix the code instead |
0 commit comments