More datasets by rvandewater · Pull Request #258 · Medical-Event-Data-Standard/MEDS-DEV

rvandewater · 2025-10-16T14:55:16Z

No description provided.

rvandewater · 2025-10-16T18:59:24Z

@mmcdermott What would be the standard procedure for datasets without a demo? Should the command just be an echo statement or something to make the tests run?

mmcdermott · 2026-04-10T18:04:12Z

Review

This is a substantial PR — 6 new datasets (EHRShot, HIRID, INSPIRE, NWICU, SICdb, eICU), AUMCdb updates, task supported_datasets additions, pre-commit modernization, code cleanup, and a run_experiments.sh script.

Retargeting

This PR targets main but should target dev to follow the project's branching convention. Please retarget.

Scope

The PR bundles several distinct concerns:

New datasets (the core contribution)
Pre-commit modernization (black/isort/flake8 → ruff, hook version bumps) — this is already done on dev
Code style refactors (e.g., except KeyError: → except KeyError as e:, isinstance(..., (list, ListConfig)) → isinstance(..., list | ListConfig), ternary rewrites) — already on dev
Task YAML changes (adding supported_datasets, simplifying abnormal_lab predicates) — partially on dev
run_experiments.sh with hardcoded paths (/sc/home/robin.vandewater/datasets/meds) — should not be committed
Readmission task deletion — already done on dev

Once retargeted to dev, the diff will shrink significantly since the pre-commit, code cleanup, and readmission changes are already there.

Dataset definitions

The new datasets look well-structured overall. A few issues:

Missing access_policy in metadata for all new datasets. Per CLAUDE.md, dataset metadata should include an access_policy field (one of: public_with_approval, public_unrestricted, institutional, private_single_use, other).
Missing demo field / demo commands: Several datasets use echo "Demo not available" as build_demo, and EHRShot uses demo: False with echo as build_full. The open comment from @rvandewater about how to handle datasets without demos is still unresolved — this needs a design decision before these can be tested.
Empty __init__.py files: The new datasets add __init__.py files but AUMCdb and MIMIC-IV don't have them. These shouldn't be needed — the dataset registry is populated via importlib.resources, not Python package imports.
run_experiments.sh: Contains hardcoded user-specific paths. Should be removed from the PR or converted to a generic template.
eICU predicates file: The diff shows src/MEDS_DEV/datasets/MIMIC-IV/predicates.yaml being copied to src/MEDS_DEV/datasets/eICU/predicates.yaml — is this intentional or a copy error? eICU should have its own predicates.

Open question

@rvandewater's question about datasets without demos still needs an answer — this affects testability of most of the new datasets.

Recommendation

Retarget to dev
Remove run_experiments.sh (or .gitignore it)
Remove the __init__.py files from new datasets
Add access_policy to all dataset metadata
Resolve the demo-less dataset question
Verify eICU predicates are correct (not a MIMIC-IV copy)
After retargeting, the pre-commit/code-cleanup changes should drop out of the diff, making review much easier

mmcdermott · 2026-04-10T18:05:01Z

Detailed review (follow-up)

A deeper pass found several critical data correctness issues beyond the structural points in my earlier comment.

Critical (will produce wrong results or break existing functionality)

MIMIC-IV predicates.yaml deleted: The diff shows src/MEDS_DEV/datasets/MIMIC-IV/predicates.yaml was renamed to src/MEDS_DEV/datasets/eICU/predicates.yaml (100% similarity). This deletes MIMIC-IV's predicates entirely and gives eICU a copy of MIMIC-IV predicates with MIMIC-specific codes (LAB//50912//mg/dL, etc.). eICU needs its own predicates, and MIMIC-IV's must not be removed.
AUMCdb predicates: unit mismatch: AUMCdb codes are in umol/L and mmol/L (e.g., MEASURE//Kreatinine (bloed)//umol/l) but thresholds use mg/dL values (e.g., value_min: 1.3 # mg/dL). For creatinine, 1.3 mg/dL ≈ 115 umol/L, so the threshold is off by a factor of about 88. Comments in the file say "Todo: convert to mg/dL". The same issue affects hemoglobin (mmol/L codes with g/dL thresholds). These predicates will produce incorrect labels.
AUMCdb broken or() expressions: sodium references sodium_1, sodium_2, sodium_3 but sodium_2 and sodium_3 are commented out. Same for abnormally_low_sodium. Will fail at runtime.
Regex bugs in HIRID, INSPIRE, SICdb: Patterns like code: { regex: "^HOSPITAL_ADMISSION*" } are wrong. In a regex, * means "zero or more of the preceding character", not a wildcard. The pattern should be "^HOSPITAL_ADMISSION.*" (with the dot).
Task predicate renames without updating MIMIC-IV: All 7 abnormal_lab tasks rename predicates from unit-specific (creatinine_mgdl) to unit-agnostic (creatinine). But MIMIC-IV's predicates file (which is being deleted; see the first item in this list) would also need updating to match. Doubly broken.
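The unit mismatch and the regex issue above can both be reproduced in a few lines. This is a minimal Python sketch, not code from the PR. The factor of 88.4 umol/L per mg/dL assumes a creatinine molar mass of about 113.12 g/mol, the function name and the sample codes are hypothetical, and whether the predicate engine applies a pattern as a search or as a full match is an assumption, so both behaviors are shown.

```python
import re

# Assumption: creatinine molar mass ~113.12 g/mol, so 1 mg/dL ~ 88.4 umol/L.
CREATININE_UMOLL_PER_MGDL = 88.4


def creatinine_mgdl_to_umoll(value_mgdl: float) -> float:
    """Convert a creatinine concentration from mg/dL to umol/L."""
    return value_mgdl * CREATININE_UMOLL_PER_MGDL


# The 1.3 mg/dL threshold expressed in the unit the AUMCdb codes use.
threshold_umoll = creatinine_mgdl_to_umoll(1.3)
print(f"1.3 mg/dL is about {threshold_umoll:.0f} umol/L")

# The pattern from the review and the suggested fix.
BUGGY = re.compile(r"^HOSPITAL_ADMISSION*")
FIXED = re.compile(r"^HOSPITAL_ADMISSION.*")

# Hypothetical event codes, for illustration only.
real_code = "HOSPITAL_ADMISSION//EMERGENCY"
near_miss = "HOSPITAL_ADMISSIO_OTHER"

# Full-match semantics: the buggy pattern rejects the real code, because
# "N*" cannot consume the "//EMERGENCY" suffix.
buggy_full_real = BUGGY.fullmatch(real_code) is not None
fixed_full_real = FIXED.fullmatch(real_code) is not None

# Search semantics: the buggy pattern accepts a code that merely starts with
# "HOSPITAL_ADMISSIO", because "N*" also matches zero occurrences of "N".
buggy_search_near = BUGGY.search(near_miss) is not None
fixed_search_near = FIXED.search(near_miss) is not None

print("full match, real code:", buggy_full_real, fixed_full_real)
print("search, near miss:   ", buggy_search_near, fixed_search_near)
```

Under search semantics the buggy pattern is too permissive, and under full-match semantics it rejects real codes. In both cases the dotted form expresses what was intended.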

Significant

hydra/launcher=joblib hardcoded in tasks/__main__.py: Changes the default ACES launcher for all task extractions. This is a behavioral change affecting parallelism and error handling — should it be configurable rather than a hardcoded default?
Missing death predicate: HIRID, INSPIRE, NWICU, SICdb don't define death in their predicates. Mortality tasks reference death: ??? and need each dataset's predicates to define it.
EHRShot duplicate codes: wbc_2 and wbc_3 have identical code LOINC/736-9. Also abnormally_low_platelets_kul has a _kul suffix inconsistent with task file expectations.
Commented-out predicates: INSPIRE has bicarbonate_meql and hemoglobin_gdl commented out (metabolic_acidosis and anemia tasks won't work). SICdb has most predicates commented out — only creatinine is active.
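Several of the items above (or() expressions pointing at commented-out predicates, a missing death predicate) are the kind of thing a small consistency check could catch before merge. The following is a hedged Python sketch under stated assumptions: the predicates are assumed to be already loaded into a dict mapping each predicate name to its definition, derived predicates are assumed to carry an expr string of the form or(...) or and(...), and the function names and the sample code string are hypothetical. YAML loading is deliberately left out.

```python
import re

# Matches derived-predicate expressions such as "or(a, b, c)" or "and(a, b)".
EXPR_PATTERN = re.compile(r"^\s*(?:and|or)\((.*)\)\s*$")


def find_undefined_references(predicates: dict) -> dict:
    """Map each derived predicate to the names it references but that are not defined."""
    missing = {}
    for name, definition in predicates.items():
        expr = definition.get("expr")
        if expr is None:
            continue  # plain predicate, nothing to resolve
        match = EXPR_PATTERN.match(expr)
        if match is None:
            continue  # expression form not covered by this sketch
        referenced = [part.strip() for part in match.group(1).split(",") if part.strip()]
        undefined = [ref for ref in referenced if ref not in predicates]
        if undefined:
            missing[name] = undefined
    return missing


def find_missing_required(predicates: dict, required: list) -> list:
    """Return the required predicate names that the dataset does not define."""
    return [name for name in required if name not in predicates]


# Mirrors the AUMCdb sodium case: sodium_2 and sodium_3 are commented out in
# the file, so they are simply absent here. The code string is a placeholder.
predicates = {
    "sodium_1": {"code": "MEASURE//sodium_placeholder"},
    "sodium": {"expr": "or(sodium_1, sodium_2, sodium_3)"},
}

undefined = find_undefined_references(predicates)
missing_required = find_missing_required(predicates, ["death"])
print(undefined)
print(missing_required)
```

A check along these lines could run per dataset in CI, so that a predicates file with dangling references or without a death predicate fails early rather than at task extraction time.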

Before merge (updated checklist)

Replicates the dataset additions from #258 on top of current dev (the original branch is too far behind to merge directly; this PR keeps only the dataset/task content, not the stale infra reverts).

Datasets added (`src/MEDS_DEV/datasets/<name>/`):
- EHRShot: Stanford EHR cohort with pre-built MEDS extraction.
- HIRID: Bern ICU dataset via MEDS_extract-HIRID.
- INSPIRE: perioperative dataset via MEDS_extract-INSPIRE.
- NWICU: Northwestern ICU dataset via NWICU_MEDS.
- SICdb: Salzburg ICU dataset via MEDS_extract-SICdb.
- eICU: multi-center US ICU dataset via MEDS_extract-eICU (with demo).

AUMCdb is also completed (was previously just predicates.yaml + README): adds dataset.yaml, requirements.txt, refs.bib, and the full ICU predicate set from the upstream PR.

Tasks: mortality/in_icu/first_24h now lists AUMCdb and NWICU under supported_datasets in addition to MIMIC-IV.

MIMIC-IV/README.md: pulled the longer description + access-requirements write-up from #258 (replaces the TODO placeholders).

Each dataset.yaml has a `build_demo` command. For datasets without a real demo recipe, this is a stub echo so registry validation passes (matching the pattern HIRID already used in the source PR).

Co-Authored-By: Robin P. van de Water <rvandewater@users.noreply.github.com>
Co-Authored-By: Patrick Rockenschaub <prockenschaub@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rvandewater added 16 commits September 24, 2025 11:21

Add aumcdb file

d17d092

added nwicu and added aumcdb support

7ac5c84

Merge remote-tracking branch 'origin/dev' into dataset-nwicu

97e8e07

pre-commit

9e2650c

sicdb start

69e4dc6

correction to catch in_icu death

0684d17

Added inspire

a8007ff

starting out with hirid

aff61ec

Hirid and eicu updates

ed85f89

update predicates

e6f85d9

eicu predicates and bash script to run over all datasets, hirid readme

ff75787

add demo for eicu

204f7ee

Add ehrshot

9cfe660

added more predicates for aumcdb

29fc2a6

refs files

b1920b8

requirements and refs

a61e9d7

rvandewater changed the base branch from dev to main October 16, 2025 14:55

Merge branch 'main' of https://github.com/mmcdermott/MEDS-DEV into mo…

a9c2e0e

…re-datasets

rvandewater added 6 commits October 17, 2025 10:18

predicates indent

8f7af67

test joblib and readmes

0be40fb

predicates and readmes

db80e6b

demo command

74202d7

demo test

a28f54d

add key

853514a

mmcdermott changed the base branch from main to dev April 10, 2026 17:46

mmcdermott mentioned this pull request May 12, 2026

Add six new datasets (EHRShot, HIRID, INSPIRE, NWICU, SICdb, eICU) + complete AUMCdb #299

Closed

3 tasks

This was referenced May 13, 2026

Complete the AUMCdb dataset registration #305

Draft

Add the EHRShot dataset #306

Draft

Add the HIRID dataset #307

Draft

Add the INSPIRE dataset #308

Draft

Add the NWICU dataset #309

Draft

Add the SICdb dataset #310

Draft

Add the eICU dataset #311

Draft

rvandewater wants to merge 23 commits into Medical-Event-Data-Standard:dev from rvandewater:more-datasets
