Skip to content

Commit c19cc57

Browse files
committed
Loop r5: LESSON 12 - morb Error-515 + fail-fast skip is the keystone (flakiness + lab regression + #26 all root here)
1 parent 4dcbcb6 commit c19cc57

2 files changed

Lines changed: 259 additions & 0 deletions

File tree

Lines changed: 244 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,244 @@
1+
# LAB100/101 fixture regression — root-cause diagnosis (Phase-B, Round 5)
2+
3+
Read-only investigation of why un-quarantining
4+
`fixtures/30_sp_coverage/zz_lab100_101_fill.sql` regresses SIX tables to empty
5+
(covid_contact_datamart, covid_vaccination_datamart, f_contact_record_case,
6+
f_vaccination, lab100, lab_rpt_user_comment).
7+
8+
The original "IDENTITY flood on `observation`/`act`" explanation (LESSON 10) is
9+
**wrong**`observation` and `act` are not IDENTITY columns, and the fixture
10+
uses explicit non-colliding UIDs (22053xxx). The real cause is different and is
11+
proven below.
12+
13+
---
14+
15+
## TL;DR root cause
16+
17+
The pipeline is now **fully CDC/service-driven** (the manual `EXEC sp_*` chain
18+
functions `run_lab_chain`/`run_vaccination_chain`/`run_contact_chain`/… in
19+
`scripts/merge_and_verify.sh` are **dead code — never called by `main()`**;
20+
`main()` only applies fixtures tier-by-tier and lets the
21+
reporting-pipeline-service consume CDC events). The service batches all CDC ids
22+
that arrive in a ~20 s window (`service.fixed-delay.cached-ids = 20000`,
23+
`application.yaml:106`) and processes them in one `processIdCache(...)` pass.
24+
25+
That pass has a **fail-fast short-circuit**: the first entity whose stored proc
26+
throws sets `processingFailed = true` and **all lower-priority entities in the
27+
same batch are skipped** (pushed to the retry/dead-letter cache), and the
28+
datamart fan-out (`dmProcessor.process(dmData)`) only ever sees the `dmData`
29+
accumulated *before* the failure. See
30+
`PostProcessingService.processIdCache` lines 749-799:
31+
32+
```java
33+
boolean processingFailed = false;
34+
for (Entry<String,List<Long>> entry : sortedEntries) { // sorted by Entity.priority ASC
35+
if (processingFailed) { ...retryCache.add(ids); continue; } // SKIP the rest
36+
try { processEntity(...); }
37+
catch (Exception e) { processingFailed = true; buildRetryCache(...); }
38+
}
39+
...
40+
dmProcessor.process(dmData); // only the pre-failure dmData
41+
```
42+
43+
`Entity` priority order (`Entity.java:22-25`):
44+
45+
| priority | entity | builds |
46+
| --- | --- | --- |
47+
| 14 | **OBSERVATION** | LAB_TEST / LAB_RESULT_VAL / **lab100** / **lab_rpt_user_comment** / lab101 (and morb report) |
48+
| 15 | **CONTACT** | D_CONTACT_RECORD / **f_contact_record_case** |
49+
| 16 | TREATMENT | treatment |
50+
| 17 | **VACCINATION** | D_VACCINATION / **f_vaccination** |
51+
52+
`processObservation` (priority 14) runs **morb first, then lab** in one call
53+
(`PostProcessingService.java:1234-1278`). When OBSERVATION throws, CONTACT(15),
54+
TREATMENT(16) and VACCINATION(17) are all skipped for that batch, and the
55+
covid_contact / covid_vaccination datamarts (which read the
56+
D_CONTACT_RECORD / D_/F_VACCINATION dims those skipped entities build) produce 0
57+
rows. That is exactly the 6-table set.
58+
59+
**The throw that triggers the short-circuit already exists *without* the lab
60+
fixture.** It is `sp_nrt_morbidity_report_postprocessing` failing on the
61+
foundation morb obs:
62+
63+
```
64+
Step: Inserting into MORBIDITY_REPORT_EVENT
65+
Error 515: Cannot insert the value NULL into column 'PATIENT_KEY',
66+
table 'RDB_MODERN.dbo.MORBIDITY_REPORT_EVENT'
67+
```
68+
(live `RDB_MODERN.dbo.job_flow_log`, batches at 06:35 / 06:37 / 06:44 on the
69+
CURRENT lab-quarantined run; and the matching service log
70+
`Error processing observation data with ids '…': DataProcessingException`
71+
retried attempts #1-3 then dead-lettered.)
72+
73+
So the morb obs SP throws on *every* batch that contains it. Whether the 6
74+
tables end up populated is therefore decided by **CDC batch timing** — i.e. whether
75+
the CONTACT/VACCINATION CDC ids happen to land in a *different* 20 s window than
76+
the failing morb-observation batch (then they process fine), or in the *same*
77+
window (then they are skipped). This is precisely the long-noted
78+
"covid_contact / f_contact / f_vaccination flakiness" (LOOP_round4 LESSON 8 area;
79+
LOOP_round5 item D).
80+
81+
**What the lab fixture does is make the bad outcome (near-)deterministic.** It
82+
adds ~80 new lab observations (Order/Result + I_Order/I_Result/Result + 35
83+
LABxxx children) to NBS_ODSE → CDC mirrors them to `nrt_observation` in the
84+
Tier-3 drain → they swell the OBSERVATION ids in the same drain window that
85+
also carries the morb obs + the contact/vaccination ids. The bigger, slower
86+
observation batch (a) keeps OBSERVATION failing (morb still throws, and the lab
87+
SPs add their own failure surface — see below), and (b) widens the window in
88+
which CONTACT/VACCINATION co-batch with the failing OBSERVATION entity, so they
89+
get skipped → the 6 tables go to 0 every run instead of "sometimes". Quarantining
90+
the fixture shrinks the observation batch back to the flaky-but-often-OK size.
91+
92+
### Net: the lab fixture is a *deterministic aggravator* of a *pre-existing*
93+
### batch-failure/short-circuit bug, NOT a UID collision and NOT an IDENTITY flood.
94+
95+
---
96+
97+
## Evidence per hypothesis
98+
99+
**H1 — UID collision: NO.** The fixture's explicit UIDs (act/observation/act_id/
100+
participation/act_relationship 22053010-22053011, 22053500-22053502,
101+
22053600-22053634) do not exist in NBS_ODSE today and do not appear in any
102+
hardcoded `@obs` list in `merge_and_verify.sh`. Its `local_id`s
103+
(`OBS22053010GA01`, `OBS22053500GA01`) do not collide with any existing
104+
observation local_id (live `NBS_ODSE.dbo.observation` survey: existing ids are
105+
OBS20000120GA01 … OBS22048216GA01). `program_jurisdiction_oid` 20053010/20053500
106+
are unique. No key conflict feeds the 6 tables' chains. **Ruled out.**
107+
108+
**H2 — shared-entity interference: NO (not the cause).** The fixture *does*
109+
re-use the shared foundation entities (`@pat_uid=20000000`, `@prov_uid=20000010`,
110+
`@org_uid=20000020`) for its PATSBJ/ORD/AUT/SPP participations, and adds a `role`
111+
row (SPP) scoped to the patient. But those are additive participations on *new*
112+
acts; they do not change the join cardinality of the contact/vaccination/morb
113+
chains (which key off their own acts 20120010 / 20110010 / 20080010). The 6
114+
tables break because their *entities are skipped*, not because a shared join
115+
multiplied. **Not the mechanism.**
116+
117+
**H3 — batch-failure / shared-SP short-circuit: YES — this is the root cause.**
118+
Proven above: `processIdCache` fail-fast (lines 749-799) + `Entity` priority
119+
14<15<16<17 + `processObservation` running the throwing morb SP at priority 14 →
120+
CONTACT/TREATMENT/VACCINATION skipped + truncated `dmData` for the datamart
121+
fan-out. Confirmed live by the recurring Error-515 in `job_flow_log` and the
122+
`DataProcessingException` retries in the service log on the CURRENT
123+
lab-quarantined run, with covid_contact_datamart=0, f_contact_record_case=0,
124+
lab_rpt_user_comment=0 right now.
125+
126+
Additional lab-side throw surface that the fixture *adds* to the same OBSERVATION
127+
batch (so even if the morb obs were fixed, the lab fixture could still trip the
128+
short-circuit):
129+
- `sp_d_labtest_result_postprocessing` (routine 017) has a TRY/CATCH that
130+
re-raises on any error (lines ~1955+), so any lab-row defect propagates as a
131+
service exception and short-circuits the batch.
132+
- `sp_lab101_datamart_postprocessing` (routine 020) does **hard**
133+
`convert(datetime, replace(LAB.LABn,'-',' '), 0)` for the 7 date pivot
134+
columns LAB6/21/22/28/29/33/34 (lines 839-845), sourced from
135+
`LAB_RESULT_VAL.FROM_TIME` for cds LAB334/349/350/356/357/361/362
136+
(trtd6/21/22/28/29/33/34 join filters, lines 535-679). The fixture seeds
137+
`obs_value_coded.display_name='04-21-2026'` for cds
138+
`LAB334,LAB349,LAB354,LAB360,LAB361,LAB362,LAB363` — note `LAB354/360` are
139+
**not** date columns while `LAB350/356/357` **are** and get plain text
140+
(`'Value for LAB350'` …). The fixture relies on `FROM_TIME` being NULL for the
141+
text children (so the date CONVERT sees NULL and yields NULL); that holds only
142+
because the fixture sets no `obs_value_date`. This is fragile — any future edit
143+
that lands a non-date string into a FROM_TIME-mapped LABn column would make
144+
routine 020 throw and short-circuit the batch on its own. (Not the *current*
145+
trigger — the morb obs throws first — but a latent second one.)
146+
147+
**H4 — apply order / IF-NOT-EXISTS pre-occupation: NO.** All Tier-3 fixtures are
148+
applied in Step 8 *before* the single Step-9 drain (`main()` lines 696-701), so
149+
CDC mirrors the entire Tier-3 set together regardless of the `zz_*`
150+
alphabetical position of `zz_lab100_101_fill.sql`. The fixture's only guard
151+
(`IF NOT EXISTS … role … SPP`) touches a shared role, not anything the 6 tables
152+
read. Order is not the lever. **Ruled out.**
153+
154+
---
155+
156+
## Which of the 6 break, and why
157+
158+
| table | builder | why it goes to 0 when the OBSERVATION batch fails |
159+
| --- | --- | --- |
160+
| **lab100** | `sp_d_lab_test_postprocessing` (LAB path of `processObservation`, prio 14, runs AFTER the throwing morb SP) | morb throws first → lab SPs in the same `processObservation` call never run |
161+
| **lab_rpt_user_comment** | same `sp_d_lab_test_postprocessing` (routine 018) | same — lab path skipped |
162+
| **f_contact_record_case** | CONTACT entity (prio 15), `sp_f_contact_record_case_postprocessing` | priority 15 > 14 → skipped after OBSERVATION fails |
163+
| **f_vaccination** | VACCINATION entity (prio 17), `sp_f_vaccination_postprocessing` | priority 17 > 14 → skipped |
164+
| **covid_contact_datamart** | `nbs_Datamart` topic → dmCache, but reads D_CONTACT_RECORD built by CONTACT(15) | CONTACT skipped → no/stale dim → datamart SP yields 0 |
165+
| **covid_vaccination_datamart** | `nbs_Datamart` topic → dmCache, but reads D_/F_VACCINATION built by VACCINATION(17) | VACCINATION skipped → datamart SP yields 0 |
166+
167+
Without the lab fixture these can populate when CONTACT/VACCINATION happen to
168+
batch separately from the failing morb obs (timing) — hence the historical
169+
"flaky +42/+39" swings. With the lab fixture the failing observation batch is
170+
big and slow enough that the co-batch (and thus the skip) is effectively every
171+
run.
172+
173+
---
174+
175+
## SAFE-REWORK VERDICT
176+
177+
**Safe to author lab100/lab101 — but ONLY with the changes below. As written +
178+
the proposed ORCH_TODO it is NOT safe**, because the regression is not about the
179+
lab data per se; it is about the lab obs joining a shared, fail-fast CDC batch
180+
whose morb obs already throws. A non-colliding UID range / dedicated act
181+
(hypotheses the brief floated) does **not** help, because the failure is the
182+
shared-batch short-circuit, not a UID/entity collision.
183+
184+
There are two independent fixes; do the harness one first (it removes the whole
185+
class of flakiness), the lab-data hardening second.
186+
187+
### Fix A (REQUIRED, harness — sanctioned this round): stop one throwing observation from nuking the batch
188+
The pre-existing `MORBIDITY_REPORT_EVENT.PATIENT_KEY` NULL throw is what trips
189+
the short-circuit today; the lab fixture only makes it deterministic. Options,
190+
best first:
191+
1. **Make the morb foundation obs stop throwing.** The morb obs (20080010 /
192+
20000130) reaches `sp_nrt_morbidity_report_postprocessing` with a NULL
193+
PATIENT_KEY (no resolvable patient link), violating the NOT NULL on
194+
`MORBIDITY_REPORT_EVENT.PATIENT_KEY`. Author the missing patient link
195+
(a `PATSBJ`/subject participation resolving to D_PATIENT) for the morb obs in
196+
a Tier-1/Tier-2 fixture so the morb SP succeeds → OBSERVATION no longer
197+
throws → CONTACT/VACCINATION never get skipped → the 6 tables stop being
198+
flaky AND the lab fixture can ride along safely. This is the highest-leverage
199+
fix (also stabilizes LOOP item D).
200+
2. If the morb obs cannot be made to succeed via fixtures, **isolate lab obs
201+
from the morb/contact/vaccination batch** by introducing a dedicated post-
202+
Step-9 drain step that re-emits ONLY the lab obs ids (so they process in a
203+
batch that contains no throwing morb obs). This is a `merge_and_verify.sh`
204+
edit (sanctioned). It does NOT cure the underlying flakiness (contact/vacc
205+
can still co-batch with morb), but it makes the lab fixture coverage-additive
206+
and non-regressing.
207+
208+
Note the dead manual-chain functions (`run_lab_chain`, etc.) are NOT executed
209+
by `main()` — adding the fixture's UIDs to them (the fixture-header ORCH_TODO)
210+
would be a **no-op** and must NOT be relied on. Any harness orchestration for
211+
lab100/101 has to be wired into `main()`’s real (CDC-drain) flow, e.g. a
212+
genuine post-Step-9 re-drain or a real `EXEC` step, not the dead helpers.
213+
214+
### Fix B (REQUIRED, lab fixture data hardening): make the lab SPs unable to throw
215+
Even after Fix A, harden the fixture so routines 017/020 can't throw on it:
216+
- **Align the date children to routine 020's actual date columns.** The
217+
FROM_TIME (datetime) pivot columns are LAB6/21/22/28/29/33/34 = cds
218+
**LAB334, LAB349, LAB350, LAB356, LAB357, LAB361, LAB362** (routine 020 lines
219+
535-679). The fixture currently date-flags `LAB334,LAB349,LAB354,LAB360,
220+
LAB361,LAB362,LAB363` — i.e. it dates two NON-date cols (LAB354,LAB360,LAB363)
221+
and leaves three real date cols (LAB350,LAB356,LAB357) as text. Fix the
222+
CASE list to exactly `{LAB334,LAB349,LAB350,LAB356,LAB357,LAB361,LAB362}`.
223+
- **Feed the date through a channel routine 020 actually reads as FROM_TIME.**
224+
The date columns come from `LAB_RESULT_VAL.FROM_TIME`, which
225+
`sp_d_labtest_result_postprocessing` derives from the observation's date value
226+
— not from `obs_value_coded.display_name`. Author `obs_value_date` (or the
227+
obs scalar the SP maps to FROM_TIME) for those 7 children with a valid
228+
datetime, and give the text children NO FROM_TIME. Then routine 020's
229+
`convert(datetime, …)` only ever sees real datetimes or NULL — never a stray
230+
string — so it cannot throw regardless of batch composition.
231+
- Keep `obs_value_numeric.low_range/high_range` as the current strings
232+
(`'0.00'`,`'0.90'`) — those columns are `varchar(20)` (verified), so they are
233+
safe; routine 017 only `SUBSTRING`s them.
234+
235+
### Validation expectation
236+
With Fix A (morb link) + Fix B (lab date alignment), a clean
237+
`merge_and_verify` should: keep lab100 ≥ its current 1 and add the new
238+
demographics row, populate lab101 (>0), keep f_vaccination / f_contact_record_case /
239+
covid_vaccination_datamart / covid_contact_datamart at or above their pre-fixture
240+
counts (no longer zeroed), and show the observation batch draining to idle with
241+
**no** `DataProcessingException` / Error-515 in the service log + `job_flow_log`.
242+
If the morb link (Fix A.1) can't be done fixtures-only, use the isolation
243+
drain (Fix A.2) and accept that contact/vacc remain timing-flaky independent of
244+
the lab fixture.

utilities/comparison-fixtures/LOOP_round5.md

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -175,3 +175,18 @@ fixture answers the guard's (act,Q) at a different group.)
175175
MIS-attributed to IDENTITY flood, but observation/act are NOT identity columns (survey) -> real cause
176176
unknown. Re-quarantined zz_lab100_101_fill.sql (.regresses-6-obs-tables-cause-TBD); spawned read-only
177177
B-diagnosis agent. Proceeding with Phase C (COVID dedicated patient+PHC, 22055xxx) in parallel.
178+
179+
## LESSON 12 (KEYSTONE): the morb-515 throw + fail-fast skip is the real flakiness/lab-regression cause
180+
LAB_REGRESSION_DIAGNOSIS.md: the service processes each ~20s CDC batch with a FAIL-FAST short-circuit
181+
(PostProcessingService.processIdCache:749-799) — the first entity whose SP THROWS sets
182+
processingFailed=true and ALL lower-priority entities are SKIPPED (priority OBSERVATION=14 < CONTACT=15
183+
< TREATMENT=16 < VACCINATION=17). The morb foundation observation ALREADY throws Error 515
184+
(MORBIDITY_REPORT_EVENT.PATIENT_KEY cannot be NULL) EVERY run (verified in job_flow_log). So whenever
185+
contact/vaccination co-batch with that throwing OBSERVATION, they're skipped -> THIS is why covid_contact
186+
/ d_contact_record / f_vaccination / d_place are "flaky" (not benign timing), and why the lab fixture's
187+
+80 obs deterministically zeroed 6 tables (bigger batch -> guaranteed co-batch).
188+
=> KEYSTONE FIX (item D + prerequisite for B + addresses #26): author the missing patient-subject link on
189+
the morb report so sp_nrt_morbidity_report_postprocessing succeeds -> OBSERVATION stops throwing ->
190+
contact/vaccination/lab never skipped -> those tables populate DETERMINISTICALLY. Lab fixture (B) also
191+
needs its date-children remapped to {LAB334,349,350,356,357,361,362} via the FROM_TIME channel (routine
192+
020 hard convert(datetime) throws otherwise). Spawned morb-fix agent (parallel with COVID-C).

0 commit comments

Comments
 (0)