Commit 1e4dfd7

mmcdermott, rvandewater, and claude committed
Register the eICU dataset
eICU (multi-center US ICU): `MEDS_extract-eICU` from the upstream `eICU-MEDS` package handles the extraction.

Unlike the sister per-dataset PRs, eICU's upstream ships an end-to-end demo recipe (`do_demo=True`), so this PR keeps `demo_available` at its default of True and exercises the demo in the integration test matrix.

Bumped the requirements pin from the contributor's `eICU-MEDS==0.0.1` to the latest `0.0.2` to pick up any demo / MEDS_transforms compatibility fixes; the 0.0.1 demo failed end-to-end in CI on the bundled PR #299 (shard_events stage couldn't find its parquet lock under demo mode).

Also lands the shared `demo_available` registry mechanism (identical across sister per-dataset PRs).

Replicated from the closed bundled PR #299 (which itself replicated @rvandewater's #258).

Co-Authored-By: Robin P. van de Water <rvandewater@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ed1d56c commit 1e4dfd7

8 files changed

Lines changed: 250 additions & 3 deletions

src/MEDS_DEV/datasets/__init__.py

Lines changed: 9 additions & 3 deletions
@@ -38,23 +38,28 @@ class DatasetMetadata(Metadata):
         access_policy: The level of accessibility of the dataset. Limited to the values in the AccessPolicy
             enum.
         access_details: A string describing the access policy in more detail. May be empty.
+        demo_available: Whether the dataset has a usable ``build_demo`` recipe. Datasets where the upstream
+            extractor offers no public demo set this to False; integration tests skip them and they are
+            excluded from the CI dataset matrix.
 
     Examples:
         >>> DatasetMetadata(description="foo", contacts=[{"name": "bar"}]) # doctest: +NORMALIZE_WHITESPACE
         DatasetMetadata(description='foo',
                         contacts=[Contact(name='bar', email=None, github_username=None)],
                         links=None,
                         access_policy=<AccessPolicy.PRIVATE_SINGLE_USE: 'private_single_use'>,
-                        access_details=None)
+                        access_details=None,
+                        demo_available=True)
         >>> DatasetMetadata(
         ...     description="foo", contacts=[{"name": "bar"}], access_policy="public_unrestricted",
-        ...     access_details="baz"
+        ...     access_details="baz", demo_available=False
         ... ) # doctest: +NORMALIZE_WHITESPACE
         DatasetMetadata(description='foo',
                         contacts=[Contact(name='bar', email=None, github_username=None)],
                         links=None,
                         access_policy=<AccessPolicy.PUBLIC_UNRESTRICTED: 'public_unrestricted'>,
-                        access_details='baz')
+                        access_details='baz',
+                        demo_available=False)
         >>> DatasetMetadata(
         ...     description="foo", contacts=[{"name": "bar"}], access_policy="foo"
         ... ) # doctest: +NORMALIZE_WHITESPACE
@@ -74,6 +79,7 @@ class DatasetMetadata(Metadata):
 
     access_policy: AccessPolicy = AccessPolicy.PRIVATE_SINGLE_USE
     access_details: str | None = None
+    demo_available: bool = True
 
     def __post_init__(self):
         super().__post_init__()
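As a rough illustration of the new field's default, here is a minimal stand-in dataclass. `DatasetMetadataSketch` is hypothetical: it omits the real class's `Metadata` base, contacts, links, and access-policy validation, and only mirrors the `access_details` and `demo_available` defaults shown in the diff.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DatasetMetadataSketch:
    """Hypothetical, trimmed stand-in for DatasetMetadata (not the real class)."""

    description: str
    access_details: str | None = None
    # Mirrors the field added in this commit: a dataset is assumed to ship a
    # demo recipe unless it explicitly opts out.
    demo_available: bool = True


default_meta = DatasetMetadataSketch(description="foo")
opted_out = DatasetMetadataSketch(description="foo", demo_available=False)
print(default_meta.demo_available, opted_out.demo_available)
```

Because the default is True, existing dataset entries that never mention the field keep their current behavior; only datasets without a public demo need to set it.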
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+# eICU Collaborative Research Database
+
+## Description
+
+The eICU Collaborative Research Database is a multi-center database comprising de-identified health data associated with over 200,000 admissions to ICUs across the United States between 2014-2015. The database includes vital sign measurements, care plan documentation, severity of illness measures, diagnosis information, and treatment information. Data is collected through the Philips eICU program, a critical care telehealth program that delivers information to caregivers at the bedside.[1]
+
+## Access Requirements
+
+Taken from [PhysioNet](https://physionet.org/content/eicu-crd/2.0/):
+
+- **Access Policy**: Complete the credentialed data access requirements on PhysioNet[2]
+- **License (for files)**: PhysioNet Credentialed Health Data License Version 1.5.0[2]
+- **Data Use Agreement**: Agreement requires verified institutional affiliation and commitment to use data solely for lawful scientific research[2]
+- **Required training**: Valid CITI training certification in human research subject protection and HIPAA regulations[2]
+- **License Term**: 3 years from account creation date[2]
+- **Code Sharing**: Agreement to contribute code associated with publications to open research repository[2]
+
+## Supported Tasks
+
+- `tasks/mortality/in_icu/first_24h.yaml`
+
+## MEDS-transformation
+
+eICU is available in three formats: Original (compatible with benchmark repository), MEDS (Medical Event Data Standard), and OMOP (full OMOP table dumps). The MEDS version contains the same exact data as the Original dataset but in MEDS-compatible format.[1]
+
+## Sources
+
+1. [eICU Physionet Repository](https://physionet.org/content/eicu-crd/2.0/)
+2. [PhysioNet Credentialed Access](https://physionet.org/content/eicu-crd/view-license/2.0/)
+
+## Disclaimer
+
+Please refer to the data owners and the most up-to-date information when using this dataset in your research. The eICU dataset has not been reviewed or approved by the Food and Drug Administration and is for non-clinical, research and education use only.[2]
Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
metadata:
2+
description: >-
3+
A MEDS version of eICU.
4+
links:
5+
- https://github.com/Medical-Event-Data-Standard/eICU_MEDS
6+
- https://physionet.org/content/eicu-crd/
7+
contacts:
8+
- name: "Robin P. van de Water"
9+
github_username: "rvandewater"
10+
commands:
11+
build_full: >-
12+
MEDS_extract-eICU
13+
raw_input_dir="{temp_dir}/raw"
14+
pre_MEDS_dir="{temp_dir}/pre_MEDS"
15+
MEDS_cohort_dir="{output_dir}"
16+
log_dir="{output_dir}/.pipeline_logs"
17+
build_demo: >-
18+
MEDS_extract-eICU
19+
do_demo=True
20+
raw_input_dir="{temp_dir}/raw"
21+
pre_MEDS_dir="{temp_dir}/pre_MEDS"
22+
MEDS_cohort_dir="{output_dir}"
23+
log_dir="{output_dir}/.pipeline_logs"
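The `{temp_dir}` and `{output_dir}` placeholders in the build commands have to be filled in before the command runs. The diff does not show how MEDS-DEV does this, so the sketch below simply assumes `str.format`-style substitution; `render_command` and the example paths are hypothetical.

```python
import shlex

# Template copied from the build_demo entry, folded onto one line the way a
# YAML `>-` block scalar folds its lines.
BUILD_DEMO = (
    'MEDS_extract-eICU do_demo=True raw_input_dir="{temp_dir}/raw" '
    'pre_MEDS_dir="{temp_dir}/pre_MEDS" MEDS_cohort_dir="{output_dir}" '
    'log_dir="{output_dir}/.pipeline_logs"'
)


def render_command(template: str, temp_dir: str, output_dir: str) -> list[str]:
    """Fill the placeholders and split into argv tokens (hypothetical helper)."""
    return shlex.split(template.format(temp_dir=temp_dir, output_dir=output_dir))


argv = render_command(BUILD_DEMO, temp_dir="/tmp/eicu", output_dir="/data/eicu_meds")
print(argv)
```

The only difference between `build_full` and `build_demo` is the extra `do_demo=True` override; the directory arguments are identical.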
Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
+predicates:
+  hospital_admission:
+    code: { regex: "^HOSPITAL_ADMISSION//.*" }
+  hospital_discharge:
+    code: { regex: "^HOSPITAL_DISCHARGE//.*" }
+
+  ED_registration:
+    code: { regex: "^ED_REGISTRATION//.*" }
+  ED_discharge:
+    code: { regex: "^ED_OUT//.*" }
+
+  icu_admission:
+    code: { regex: "^ICU_ADMISSION//.*" }
+  icu_discharge:
+    code: { regex: "^ICU_DISCHARGE//.*" }
+
+  creatinine_1:
+    code: LAB//50912//mg/dL
+  creatinine_2:
+    code: LAB//52546//mg/dL
+  abnormally_high_creatinine_1:
+    code: LAB//50912//mg/dL
+    value_min: 1.3 # mg/dL
+    value_min_inclusive: False
+    value_max: null
+  abnormally_high_creatinine_2:
+    code: LAB//52546//mg/dL
+    value_min: 1.3 # mg/dL
+    value_min_inclusive: False
+    value_max: null
+  creatinine:
+    expr: or(creatinine_1, creatinine_2)
+  abnormally_high_creatinine:
+    expr: or(abnormally_high_creatinine_1, abnormally_high_creatinine_2)
+
+  sodium_1:
+    code: LAB//220645//mEq/L
+  sodium_2:
+    code: LAB//50983//mEq/L
+  sodium_3:
+    code: LAB//52623//mEq/L
+  abnormally_low_sodium_1:
+    code: LAB//220645//mEq/L
+    value_min: null
+    value_max: 135 # mEq/L
+    value_max_inclusive: False
+  abnormally_low_sodium_2:
+    code: LAB//50983//mEq/L
+    value_min: null
+    value_max: 135 # mEq/L
+    value_max_inclusive: False
+  abnormally_low_sodium_3:
+    code: LAB//52623//mEq/L
+    value_min: null
+    value_max: 135 # mEq/L
+    value_max_inclusive: False
+  sodium:
+    expr: or(sodium_1, sodium_2, sodium_3)
+  abnormally_low_sodium:
+    expr: or(abnormally_low_sodium_1, abnormally_low_sodium_2, abnormally_low_sodium_3)
+
+  bicarbonate_1:
+    code: LAB//227443//mEq/L
+  bicarbonate_2:
+    code: LAB//50882//mEq/L
+  abnormally_low_bicarbonate_1:
+    code: LAB//227443//mEq/L
+    value_min: null
+    value_max: 22 # mEq/L
+    value_max_inclusive: False
+  abnormally_low_bicarbonate_2:
+    code: LAB//50882//mEq/L
+    value_min: null
+    value_max: 22 # mEq/L
+    value_max_inclusive: False
+  bicarbonate:
+    expr: or(bicarbonate_1, bicarbonate_2)
+  abnormally_low_bicarbonate:
+    expr: or(abnormally_low_bicarbonate_1, abnormally_low_bicarbonate_2)
+
+  hemoglobin_1:
+    code: LAB//220228//g/dl
+  hemoglobin_2:
+    code: LAB//50811//g/dL
+  abnormally_low_hemoglobin_1:
+    code: LAB//220228//g/dl
+    value_min: null
+    value_max: 13 # g/dL
+    value_max_inclusive: False
+  abnormally_low_hemoglobin_2:
+    code: LAB//50811//g/dL
+    value_min: null
+    value_max: 13 # g/dL
+    value_max_inclusive: False
+  hemoglobin:
+    expr: or(hemoglobin_1, hemoglobin_2)
+  abnormally_low_hemoglobin:
+    expr: or(abnormally_low_hemoglobin_1, abnormally_low_hemoglobin_2)
+
+  wbc_1:
+    code: LAB//220546//K/uL
+  wbc_2:
+    code: LAB//51300//K/uL
+  abnormally_high_wbc_1:
+    code: LAB//220546//K/uL
+    value_min: 11 # K/uL
+    value_min_inclusive: False
+    value_max: null
+  abnormally_high_wbc_2:
+    code: LAB//51300//K/uL
+    value_min: 11 # K/uL
+    value_min_inclusive: False
+    value_max: null
+  wbc:
+    expr: or(wbc_1, wbc_2)
+  abnormally_high_wbc:
+    expr: or(abnormally_high_wbc_1, abnormally_high_wbc_2)
+
+  platelets_1:
+    code: LAB//227457//K/uL
+  platelets_2:
+    code: LAB//51265//K/uL
+  abnormally_low_platelets_1:
+    code: LAB//227457//K/uL
+    value_min: null
+    value_max: 150 # K/uL
+    value_max_inclusive: False
+  abnormally_low_platelets_2:
+    code: LAB//51265//K/uL
+    value_min: null
+    value_max: 150 # K/uL
+    value_max_inclusive: False
+  platelets:
+    expr: or(platelets_1, platelets_2)
+  abnormally_low_platelets:
+    expr: or(abnormally_low_platelets_1, abnormally_low_platelets_2)
+
+  map_1:
+    code: LAB//220052//mmHg
+  map_2:
+    code: LAB//220181//mmHg
+  map_3:
+    code: LAB//225312//mmHg
+  abnormally_low_map_1:
+    code: LAB//220052//mmHg
+    value_min: null
+    value_max: 65 # mmHg
+    value_max_inclusive: False
+  abnormally_low_map_2:
+    code: LAB//220181//mmHg
+    value_min: null
+    value_max: 65 # mmHg
+    value_max_inclusive: False
+  abnormally_low_map_3:
+    code: LAB//225312//mmHg
+    value_min: null
+    value_max: 65 # mmHg
+    value_max_inclusive: False
+  map:
+    expr: or(map_1, map_2, map_3)
+  abnormally_low_map:
+    expr: or(abnormally_low_map_1, abnormally_low_map_2, abnormally_low_map_3)
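To make the threshold semantics of the predicates file concrete, the sketch below evaluates one `abnormally_high_*` entry and its `or(...)` combination in plain Python. `PlainPredicate` and `any_of` are hypothetical helpers, not the task-extraction library's API; the sketch assumes exact string matching on `code`, treats bounds as exclusive unless an `*_inclusive` flag says otherwise, and does not cover the regex-based admission and discharge predicates.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PlainPredicate:
    """Hypothetical mirror of one code/value predicate entry."""

    code: str
    value_min: float | None = None
    value_max: float | None = None
    value_min_inclusive: bool = False  # assumption: exclusive by default
    value_max_inclusive: bool = False  # assumption: exclusive by default

    def matches(self, code: str, value: float | None) -> bool:
        if code != self.code:
            return False
        if self.value_min is None and self.value_max is None:
            return True  # code-only predicate, e.g. creatinine_1
        if value is None:
            return False  # a bounded predicate needs a numeric value
        if self.value_min is not None:
            below = value < self.value_min
            on_excluded_edge = value == self.value_min and not self.value_min_inclusive
            if below or on_excluded_edge:
                return False
        if self.value_max is not None:
            above = value > self.value_max
            on_excluded_edge = value == self.value_max and not self.value_max_inclusive
            if above or on_excluded_edge:
                return False
        return True


def any_of(predicates, code, value):
    """Sketch of `expr: or(...)`: true when any component predicate matches."""
    return any(p.matches(code, value) for p in predicates)


abnormally_high_creatinine = [
    PlainPredicate("LAB//50912//mg/dL", value_min=1.3),
    PlainPredicate("LAB//52546//mg/dL", value_min=1.3),
]

# 1.3 sits exactly on the exclusive lower bound, so it does not count as high.
print(any_of(abnormally_high_creatinine, "LAB//50912//mg/dL", 1.3))
print(any_of(abnormally_high_creatinine, "LAB//52546//mg/dL", 2.0))
```

Under these assumptions the first call prints False and the second prints True, which matches `value_min_inclusive: False` in the config.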
Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
@article{pollardEICUCollaborativeResearch2018,
2+
title = {The {{eICU Collaborative Research Database}}, a Freely Available Multi-Center Database for Critical Care Research},
3+
author = {Pollard, Tom J. and Johnson, Alistair E. W. and Raffa, Jesse D. and Celi, Leo A. and Mark, Roger G. and Badawi, Omar},
4+
year = 2018,
5+
month = dec,
6+
journal = {Scientific Data},
7+
volume = {5},
8+
number = {1},
9+
pages = {180178},
10+
issn = {2052-4463},
11+
doi = {10.1038/sdata.2018.178},
12+
urldate = {2022-03-09},
13+
langid = {english},
14+
keywords = {_tablet},
15+
}
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
eICU-MEDS==0.0.2

tests/conftest.py

Lines changed: 3 additions & 0 deletions
@@ -325,6 +325,9 @@ def get_opts(config, opt: str) -> list[str]:
     if isinstance(out, dict):
         out = list(out.keys())
 
+    if opt == "dataset":
+        out = [name for name in out if DATASETS[name]["metadata"].demo_available]
+
     return out
 
 
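The effect of the added filter on the dataset test matrix can be sketched with a toy registry. The `DATASETS` contents below are illustrative only (`private_cohort` is a made-up name), and `SimpleNamespace` stands in for the real `DatasetMetadata` objects.

```python
from types import SimpleNamespace

# Hypothetical registry: names and flags are illustrative, not the real
# MEDS-DEV DATASETS contents.
DATASETS = {
    "eICU": {"metadata": SimpleNamespace(demo_available=True)},
    "private_cohort": {"metadata": SimpleNamespace(demo_available=False)},
}


def demo_datasets(names):
    """Keep only datasets whose metadata advertises a demo recipe."""
    return [name for name in names if DATASETS[name]["metadata"].demo_available]


print(demo_datasets(list(DATASETS)))
```

Since eICU keeps the default of True, it stays in the matrix and its `build_demo` command is exercised by the integration tests.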

tests/test_registry_validation.py

Lines changed: 4 additions & 0 deletions
@@ -31,6 +31,10 @@ def test_all_datasets_have_commands():
     for name, dataset in DATASETS.items():
         commands = dataset.get("commands")
         assert commands is not None, f"Dataset {name} missing commands"
+        metadata = dataset.get("metadata")
+        if metadata is not None and not metadata.demo_available:
+            # Datasets that explicitly opt out of a demo recipe don't need build_demo.
+            continue
         assert "build_demo" in commands, f"Dataset {name} missing build_demo command"
 
 
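The relaxed validation rule can be sketched as a function that reports offending datasets rather than asserting on the first one. `datasets_missing_build_demo` and the registry entries are hypothetical; unlike the real test, this sketch treats a missing `commands` block as empty instead of failing on it separately.

```python
from types import SimpleNamespace


def datasets_missing_build_demo(datasets):
    """Return names that should define build_demo but do not (hypothetical helper)."""
    missing = []
    for name, dataset in datasets.items():
        commands = dataset.get("commands") or {}
        metadata = dataset.get("metadata")
        if metadata is not None and not metadata.demo_available:
            continue  # explicit opt-out: build_demo is not required
        if "build_demo" not in commands:
            missing.append(name)
    return missing


registry = {
    "with_demo": {
        "commands": {"build_full": "...", "build_demo": "..."},
        "metadata": SimpleNamespace(demo_available=True),
    },
    "opted_out": {
        "commands": {"build_full": "..."},
        "metadata": SimpleNamespace(demo_available=False),
    },
    "forgot_demo": {
        "commands": {"build_full": "..."},
        "metadata": SimpleNamespace(demo_available=True),
    },
}
print(datasets_missing_build_demo(registry))
```

Only a dataset that claims a demo but lacks `build_demo` is flagged; an opted-out dataset passes with `build_full` alone.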
