|
| 1 | +# LAB100/101 fixture regression — root-cause diagnosis (Phase-B, Round 5) |
| 2 | + |
| 3 | +Read-only investigation of why un-quarantining |
| 4 | +`fixtures/30_sp_coverage/zz_lab100_101_fill.sql` regresses SIX tables to empty |
| 5 | +(covid_contact_datamart, covid_vaccination_datamart, f_contact_record_case, |
| 6 | +f_vaccination, lab100, lab_rpt_user_comment). |
| 7 | + |
| 8 | +The original "IDENTITY flood on `observation`/`act`" explanation (LESSON 10) is |
| 9 | +**wrong** — `observation` and `act` are not IDENTITY columns, and the fixture |
| 10 | +uses explicit non-colliding UIDs (22053xxx). The real cause is different and is |
| 11 | +proven below. |
| 12 | + |
| 13 | +--- |
| 14 | + |
| 15 | +## TL;DR root cause |
| 16 | + |
| 17 | +The pipeline is now **fully CDC/service-driven** (the manual `EXEC sp_*` chain |
| 18 | +functions `run_lab_chain`/`run_vaccination_chain`/`run_contact_chain`/… in |
| 19 | +`scripts/merge_and_verify.sh` are **dead code — never called by `main()`**; |
| 20 | +`main()` only applies fixtures tier-by-tier and lets the |
| 21 | +reporting-pipeline-service consume CDC events). The service batches all CDC ids |
| 22 | +that arrive in a ~20 s window (`service.fixed-delay.cached-ids = 20000`, |
| 23 | +`application.yaml:106`) and processes them in one `processIdCache(...)` pass. |
| 24 | + |
| 25 | +That pass has a **fail-fast short-circuit**: the first entity whose stored proc |
| 26 | +throws sets `processingFailed = true` and **all lower-priority entities in the |
| 27 | +same batch are skipped** (pushed to the retry/dead-letter cache), and the |
| 28 | +datamart fan-out (`dmProcessor.process(dmData)`) only ever sees the `dmData` |
| 29 | +accumulated *before* the failure. See |
| 30 | +`PostProcessingService.processIdCache` lines 749-799: |
| 31 | + |
| 32 | +```java |
| 33 | +boolean processingFailed = false; |
| 34 | +for (Entry<String,List<Long>> entry : sortedEntries) { // sorted by Entity.priority ASC |
| 35 | + if (processingFailed) { ...retryCache.add(ids); continue; } // SKIP the rest |
| 36 | + try { processEntity(...); } |
| 37 | + catch (Exception e) { processingFailed = true; buildRetryCache(...); } |
| 38 | +} |
| 39 | +... |
| 40 | +dmProcessor.process(dmData); // only the pre-failure dmData |
| 41 | +``` |
| 42 | + |
| 43 | +`Entity` priority order (`Entity.java:22-25`): |
| 44 | + |
| 45 | +| priority | entity | builds | |
| 46 | +| --- | --- | --- | |
| 47 | +| 14 | **OBSERVATION** | LAB_TEST / LAB_RESULT_VAL / **lab100** / **lab_rpt_user_comment** / lab101 (and morb report) | |
| 48 | +| 15 | **CONTACT** | D_CONTACT_RECORD / **f_contact_record_case** | |
| 49 | +| 16 | TREATMENT | treatment | |
| 50 | +| 17 | **VACCINATION** | D_VACCINATION / **f_vaccination** | |
| 51 | + |
| 52 | +`processObservation` (priority 14) runs **morb first, then lab** in one call |
| 53 | +(`PostProcessingService.java:1234-1278`). When OBSERVATION throws, CONTACT(15), |
| 54 | +TREATMENT(16) and VACCINATION(17) are all skipped for that batch, and the |
| 55 | +covid_contact / covid_vaccination datamarts (which read the |
| 56 | +D_CONTACT_RECORD / D_/F_VACCINATION dims those skipped entities build) produce 0 |
| 57 | +rows. That is exactly the 6-table set. |
| 58 | + |
| 59 | +**The throw that triggers the short-circuit already exists *without* the lab |
| 60 | +fixture.** It is `sp_nrt_morbidity_report_postprocessing` failing on the |
| 61 | +foundation morb obs: |
| 62 | + |
| 63 | +``` |
| 64 | +Step: Inserting into MORBIDITY_REPORT_EVENT |
| 65 | +Error 515: Cannot insert the value NULL into column 'PATIENT_KEY', |
| 66 | + table 'RDB_MODERN.dbo.MORBIDITY_REPORT_EVENT' |
| 67 | +``` |
| 68 | +(live `RDB_MODERN.dbo.job_flow_log`, batches at 06:35 / 06:37 / 06:44 on the |
| 69 | +CURRENT lab-quarantined run; and the matching service log |
| 70 | +`Error processing observation data with ids '…': DataProcessingException` |
| 71 | +retried attempts #1-3 then dead-lettered.) |
| 72 | + |
| 73 | +So the morb obs SP throws on *every* batch that contains it. Whether the 6 |
| 74 | +tables end up populated is therefore decided by **CDC batch timing** — i.e. whether |
| 75 | +the CONTACT/VACCINATION CDC ids happen to land in a *different* 20 s window than |
| 76 | +the failing morb-observation batch (then they process fine), or in the *same* |
| 77 | +window (then they are skipped). This is precisely the long-noted |
| 78 | +"covid_contact / f_contact / f_vaccination flakiness" (LOOP_round4 LESSON 8 area; |
| 79 | +LOOP_round5 item D). |
| 80 | + |
| 81 | +**What the lab fixture does is make the bad outcome (near-)deterministic.** It |
| 82 | +adds ~80 new lab observations (Order/Result + I_Order/I_Result/Result + 35 |
| 83 | +LABxxx children) to NBS_ODSE → CDC mirrors them to `nrt_observation` in the |
| 84 | +Tier-3 drain → they swell the OBSERVATION ids in the same drain window that |
| 85 | +also carries the morb obs + the contact/vaccination ids. The bigger, slower |
| 86 | +observation batch (a) keeps OBSERVATION failing (morb still throws, and the lab |
| 87 | +SPs add their own failure surface — see below), and (b) widens the window in |
| 88 | +which CONTACT/VACCINATION co-batch with the failing OBSERVATION entity, so they |
| 89 | +get skipped → the 6 tables go to 0 every run instead of "sometimes". Quarantining |
| 90 | +the fixture shrinks the observation batch back to the flaky-but-often-OK size. |
| 91 | + |
| 92 | +### Net: the lab fixture is a *deterministic aggravator* of a *pre-existing* |
| 93 | +### batch-failure/short-circuit bug, NOT a UID collision and NOT an IDENTITY flood. |
| 94 | + |
| 95 | +--- |
| 96 | + |
| 97 | +## Evidence per hypothesis |
| 98 | + |
| 99 | +**H1 — UID collision: NO.** The fixture's explicit UIDs (act/observation/act_id/ |
| 100 | +participation/act_relationship 22053010-22053011, 22053500-22053502, |
| 101 | +22053600-22053634) do not exist in NBS_ODSE today and do not appear in any |
| 102 | +hardcoded `@obs` list in `merge_and_verify.sh`. Its `local_id`s |
| 103 | +(`OBS22053010GA01`, `OBS22053500GA01`) do not collide with any existing |
| 104 | +observation local_id (live `NBS_ODSE.dbo.observation` survey: existing ids are |
| 105 | +OBS20000120GA01 … OBS22048216GA01). `program_jurisdiction_oid` 20053010/20053500 |
| 106 | +are unique. No key conflict feeds the 6 tables' chains. **Ruled out.** |
| 107 | + |
| 108 | +**H2 — shared-entity interference: NO (not the cause).** The fixture *does* |
| 109 | +re-use the shared foundation entities (`@pat_uid=20000000`, `@prov_uid=20000010`, |
| 110 | +`@org_uid=20000020`) for its PATSBJ/ORD/AUT/SPP participations, and adds a `role` |
| 111 | +row (SPP) scoped to the patient. But those are additive participations on *new* |
| 112 | +acts; they do not change the join cardinality of the contact/vaccination/morb |
| 113 | +chains (which key off their own acts 20120010 / 20110010 / 20080010). The 6 |
| 114 | +tables break because their *entities are skipped*, not because a shared join |
| 115 | +multiplied. **Not the mechanism.** |
| 116 | + |
| 117 | +**H3 — batch-failure / shared-SP short-circuit: YES — this is the root cause.** |
| 118 | +Proven above: `processIdCache` fail-fast (lines 749-799) + `Entity` priority |
| 119 | +14<15<16<17 + `processObservation` running the throwing morb SP at priority 14 → |
| 120 | +CONTACT/TREATMENT/VACCINATION skipped + truncated `dmData` for the datamart |
| 121 | +fan-out. Confirmed live by the recurring Error-515 in `job_flow_log` and the |
| 122 | +`DataProcessingException` retries in the service log on the CURRENT |
| 123 | +lab-quarantined run, with covid_contact_datamart=0, f_contact_record_case=0, |
| 124 | +lab_rpt_user_comment=0 right now. |
| 125 | + |
| 126 | +Additional lab-side throw surface that the fixture *adds* to the same OBSERVATION |
| 127 | +batch (so even if the morb obs were fixed, the lab fixture could still trip the |
| 128 | +short-circuit): |
| 129 | +- `sp_d_labtest_result_postprocessing` (routine 017) has a TRY/CATCH that |
| 130 | + re-raises on any error (lines ~1955+), so any lab-row defect propagates as a |
| 131 | + service exception and short-circuits the batch. |
| 132 | +- `sp_lab101_datamart_postprocessing` (routine 020) does **hard** |
| 133 | + `convert(datetime, replace(LAB.LABn,'-',' '), 0)` for the 7 date pivot |
| 134 | + columns LAB6/21/22/28/29/33/34 (lines 839-845), sourced from |
| 135 | + `LAB_RESULT_VAL.FROM_TIME` for cds LAB334/349/350/356/357/361/362 |
| 136 | + (trtd6/21/22/28/29/33/34 join filters, lines 535-679). The fixture seeds |
| 137 | + `obs_value_coded.display_name='04-21-2026'` for cds |
| 138 | + `LAB334,LAB349,LAB354,LAB360,LAB361,LAB362,LAB363` — note `LAB354/360` are |
| 139 | + **not** date columns while `LAB350/356/357` **are** and get plain text |
| 140 | + (`'Value for LAB350'` …). The fixture relies on `FROM_TIME` being NULL for the |
| 141 | + text children (so the date CONVERT sees NULL and yields NULL); that holds only |
| 142 | + because the fixture sets no `obs_value_date`. This is fragile — any future edit |
| 143 | + that lands a non-date string into a FROM_TIME-mapped LABn column would make |
| 144 | + routine 020 throw and short-circuit the batch on its own. (Not the *current* |
| 145 | + trigger — the morb obs throws first — but a latent second one.) |
| 146 | + |
| 147 | +**H4 — apply order / IF-NOT-EXISTS pre-occupation: NO.** All Tier-3 fixtures are |
| 148 | +applied in Step 8 *before* the single Step-9 drain (`main()` lines 696-701), so |
| 149 | +CDC mirrors the entire Tier-3 set together regardless of the `zz_*` |
| 150 | +alphabetical position of `zz_lab100_101_fill.sql`. The fixture's only guard |
| 151 | +(`IF NOT EXISTS … role … SPP`) touches a shared role, not anything the 6 tables |
| 152 | +read. Order is not the lever. **Ruled out.** |
| 153 | + |
| 154 | +--- |
| 155 | + |
| 156 | +## Which of the 6 break, and why |
| 157 | + |
| 158 | +| table | builder | why it goes to 0 when the OBSERVATION batch fails | |
| 159 | +| --- | --- | --- | |
| 160 | +| **lab100** | `sp_d_lab_test_postprocessing` (LAB path of `processObservation`, prio 14, runs AFTER the throwing morb SP) | morb throws first → lab SPs in the same `processObservation` call never run | |
| 161 | +| **lab_rpt_user_comment** | same `sp_d_lab_test_postprocessing` (routine 018) | same — lab path skipped | |
| 162 | +| **f_contact_record_case** | CONTACT entity (prio 15), `sp_f_contact_record_case_postprocessing` | priority 15 > 14 → skipped after OBSERVATION fails | |
| 163 | +| **f_vaccination** | VACCINATION entity (prio 17), `sp_f_vaccination_postprocessing` | priority 17 > 14 → skipped | |
| 164 | +| **covid_contact_datamart** | `nbs_Datamart` topic → dmCache, but reads D_CONTACT_RECORD built by CONTACT(15) | CONTACT skipped → no/stale dim → datamart SP yields 0 | |
| 165 | +| **covid_vaccination_datamart** | `nbs_Datamart` topic → dmCache, but reads D_/F_VACCINATION built by VACCINATION(17) | VACCINATION skipped → datamart SP yields 0 | |
| 166 | + |
| 167 | +Without the lab fixture these can populate when CONTACT/VACCINATION happen to |
| 168 | +batch separately from the failing morb obs (timing) — hence the historical |
| 169 | +"flaky +42/+39" swings. With the lab fixture the failing observation batch is |
| 170 | +big and slow enough that the co-batch (and thus the skip) is effectively every |
| 171 | +run. |
| 172 | + |
| 173 | +--- |
| 174 | + |
| 175 | +## SAFE-REWORK VERDICT |
| 176 | + |
| 177 | +**Safe to author lab100/lab101 — but ONLY with the changes below. As written + |
| 178 | +the proposed ORCH_TODO it is NOT safe**, because the regression is not about the |
| 179 | +lab data per se; it is about the lab obs joining a shared, fail-fast CDC batch |
| 180 | +whose morb obs already throws. A non-colliding UID range / dedicated act |
| 181 | +(hypotheses the brief floated) does **not** help, because the failure is the |
| 182 | +shared-batch short-circuit, not a UID/entity collision. |
| 183 | + |
| 184 | +There are two independent fixes; do the harness one first (it removes the whole |
| 185 | +class of flakiness), the lab-data hardening second. |
| 186 | + |
| 187 | +### Fix A (REQUIRED, harness — sanctioned this round): stop one throwing observation from nuking the batch |
| 188 | +The pre-existing `MORBIDITY_REPORT_EVENT.PATIENT_KEY` NULL throw is what trips |
| 189 | +the short-circuit today; the lab fixture only makes it deterministic. Options, |
| 190 | +best first: |
| 191 | +1. **Make the morb foundation obs stop throwing.** The morb obs (20080010 / |
| 192 | + 20000130) reaches `sp_nrt_morbidity_report_postprocessing` with a NULL |
| 193 | + PATIENT_KEY (no resolvable patient link), violating the NOT NULL on |
| 194 | + `MORBIDITY_REPORT_EVENT.PATIENT_KEY`. Author the missing patient link |
| 195 | + (a `PATSBJ`/subject participation resolving to D_PATIENT) for the morb obs in |
| 196 | + a Tier-1/Tier-2 fixture so the morb SP succeeds → OBSERVATION no longer |
| 197 | + throws → CONTACT/VACCINATION never get skipped → the 6 tables stop being |
| 198 | + flaky AND the lab fixture can ride along safely. This is the highest-leverage |
| 199 | + fix (also stabilizes LOOP item D). |
| 200 | +2. If the morb obs cannot be made to succeed via fixtures, **isolate lab obs |
| 201 | + from the morb/contact/vaccination batch** by introducing a dedicated post- |
| 202 | + Step-9 drain step that re-emits ONLY the lab obs ids (so they process in a |
| 203 | + batch that contains no throwing morb obs). This is a `merge_and_verify.sh` |
| 204 | + edit (sanctioned). It does NOT cure the underlying flakiness (contact/vacc |
| 205 | + can still co-batch with morb), but it makes the lab fixture coverage-additive |
| 206 | + and non-regressing. |
| 207 | + |
| 208 | + Note the dead manual-chain functions (`run_lab_chain`, etc.) are NOT executed |
| 209 | + by `main()` — adding the fixture's UIDs to them (the fixture-header ORCH_TODO) |
| 210 | + would be a **no-op** and must NOT be relied on. Any harness orchestration for |
| 211 | + lab100/101 has to be wired into `main()`’s real (CDC-drain) flow, e.g. a |
| 212 | + genuine post-Step-9 re-drain or a real `EXEC` step, not the dead helpers. |
| 213 | + |
| 214 | +### Fix B (REQUIRED, lab fixture data hardening): make the lab SPs unable to throw |
| 215 | +Even after Fix A, harden the fixture so routines 017/020 can't throw on it: |
| 216 | +- **Align the date children to routine 020's actual date columns.** The |
| 217 | + FROM_TIME (datetime) pivot columns are LAB6/21/22/28/29/33/34 = cds |
| 218 | + **LAB334, LAB349, LAB350, LAB356, LAB357, LAB361, LAB362** (routine 020 lines |
| 219 | + 535-679). The fixture currently date-flags `LAB334,LAB349,LAB354,LAB360, |
| 220 | + LAB361,LAB362,LAB363` — i.e. it dates two NON-date cols (LAB354,LAB360,LAB363) |
| 221 | + and leaves three real date cols (LAB350,LAB356,LAB357) as text. Fix the |
| 222 | + CASE list to exactly `{LAB334,LAB349,LAB350,LAB356,LAB357,LAB361,LAB362}`. |
| 223 | +- **Feed the date through a channel routine 020 actually reads as FROM_TIME.** |
| 224 | + The date columns come from `LAB_RESULT_VAL.FROM_TIME`, which |
| 225 | + `sp_d_labtest_result_postprocessing` derives from the observation's date value |
| 226 | + — not from `obs_value_coded.display_name`. Author `obs_value_date` (or the |
| 227 | + obs scalar the SP maps to FROM_TIME) for those 7 children with a valid |
| 228 | + datetime, and give the text children NO FROM_TIME. Then routine 020's |
| 229 | + `convert(datetime, …)` only ever sees real datetimes or NULL — never a stray |
| 230 | + string — so it cannot throw regardless of batch composition. |
| 231 | +- Keep `obs_value_numeric.low_range/high_range` as the current strings |
| 232 | + (`'0.00'`,`'0.90'`) — those columns are `varchar(20)` (verified), so they are |
| 233 | + safe; routine 017 only `SUBSTRING`s them. |
| 234 | + |
| 235 | +### Validation expectation |
| 236 | +With Fix A (morb link) + Fix B (lab date alignment), a clean |
| 237 | +`merge_and_verify` should: keep lab100 ≥ its current 1 and add the new |
| 238 | +demographics row, populate lab101 (>0), keep f_vaccination / f_contact_record_case / |
| 239 | +covid_vaccination_datamart / covid_contact_datamart at or above their pre-fixture |
| 240 | +counts (no longer zeroed), and show the observation batch draining to idle with |
| 241 | +**no** `DataProcessingException` / Error-515 in the service log + `job_flow_log`. |
| 242 | +If the morb link (Fix A.1) can't be done fixtures-only, use the isolation |
| 243 | +drain (Fix A.2) and accept that contact/vacc remain timing-flaky independent of |
| 244 | +the lab fixture. |
0 commit comments