Skip to content

More datasets#258

Open
rvandewater wants to merge 23 commits into
Medical-Event-Data-Standard:devfrom
rvandewater:more-datasets
Open

More datasets#258
rvandewater wants to merge 23 commits into
Medical-Event-Data-Standard:devfrom
rvandewater:more-datasets

Conversation

@rvandewater
Copy link
Copy Markdown
Collaborator

No description provided.

@rvandewater rvandewater changed the base branch from dev to main October 16, 2025 14:55
@rvandewater
Copy link
Copy Markdown
Collaborator Author

@mmcdermott What would be the standard procedure for datasets without a demo? Should the command just be an echo statement or something to make the tests run?

@mmcdermott mmcdermott changed the base branch from main to dev April 10, 2026 17:46
@mmcdermott
Copy link
Copy Markdown
Collaborator

Review

This is a substantial PR — 6 new datasets (EHRShot, HIRID, INSPIRE, NWICU, SICdb, eICU), AUMCdb updates, task supported_datasets additions, pre-commit modernization, code cleanup, and a run_experiments.sh script.

Retargeting

This PR targets main but should target dev to follow the project's branching convention. Please retarget.

Scope

The PR bundles several distinct concerns:

  • New datasets (the core contribution)
  • Pre-commit modernization (black/isort/flake8 → ruff, hook version bumps) — this is already done on dev
  • Code style refactors (e.g., except KeyError:except KeyError as e:, isinstance(..., (list, ListConfig))isinstance(..., list | ListConfig), ternary rewrites) — already on dev
  • Task YAML changes (adding supported_datasets, simplifying abnormal_lab predicates) — partially on dev
  • run_experiments.sh with hardcoded paths (/sc/home/robin.vandewater/datasets/meds) — should not be committed
  • Readmission task deletion — already done on dev

Once retargeted to dev, the diff will shrink significantly since the pre-commit, code cleanup, and readmission changes are already there.

Dataset definitions

The new datasets look well-structured overall. A few issues:

  1. Missing access_policy in metadata for all new datasets. Per CLAUDE.md, dataset metadata should include an access_policy field (one of: public_with_approval, public_unrestricted, institutional, private_single_use, other).

  2. Missing demo field / demo commands: Several datasets use echo "Demo not available" as build_demo, and EHRShot uses demo: False with echo as build_full. The open comment from @rvandewater about how to handle datasets without demos is still unresolved — this needs a design decision before these can be tested.

  3. Empty __init__.py files: The new datasets add __init__.py files but AUMCdb and MIMIC-IV don't have them. These shouldn't be needed — the dataset registry is populated via importlib.resources, not Python package imports.

  4. run_experiments.sh: Contains hardcoded user-specific paths. Should be removed from the PR or converted to a generic template.

  5. eICU predicates file: The diff shows src/MEDS_DEV/datasets/MIMIC-IV/predicates.yaml being copied to src/MEDS_DEV/datasets/eICU/predicates.yaml — is this intentional or a copy error? eICU should have its own predicates.

Open question

@rvandewater's question about datasets without demos still needs an answer — this affects testability of most of the new datasets.

Recommendation

  1. Retarget to dev
  2. Remove run_experiments.sh (or .gitignore it)
  3. Remove the __init__.py files from new datasets
  4. Add access_policy to all dataset metadata
  5. Resolve the demo-less dataset question
  6. Verify eICU predicates are correct (not a MIMIC-IV copy)
  7. After retargeting, the pre-commit/code-cleanup changes should drop out of the diff, making review much easier

@mmcdermott
Copy link
Copy Markdown
Collaborator

Detailed review (follow-up)

A deeper pass found several critical data correctness issues beyond the structural points in my earlier comment.

Critical (will produce wrong results or break existing functionality)

  1. MIMIC-IV predicates.yaml deleted: The diff shows src/MEDS_DEV/datasets/MIMIC-IV/predicates.yaml was renamed to src/MEDS_DEV/datasets/eICU/predicates.yaml (100% similarity). This deletes MIMIC-IV's predicates entirely and gives eICU a copy of MIMIC-IV predicates with MIMIC-specific codes (LAB//50912//mg/dL, etc.). eICU needs its own predicates, and MIMIC-IV's must not be removed.

  2. AUMCdb predicates: unit mismatch: AUMCdb codes are in umol/L and mmol/L (e.g., MEASURE//Kreatinine (bloed)//umol/l) but thresholds use mg/dL values (e.g., value_min: 1.3 # mg/dL). Creatinine 1.3 mg/dL ≈ 115 umol/L — off by ~88x. Comments in the file say "Todo: convert to mg/dL". Same issue for hemoglobin (mmol/L codes with g/dL thresholds). These predicates will produce incorrect labels.

  3. AUMCdb broken or() expressions: sodium references sodium_1, sodium_2, sodium_3 but sodium_2 and sodium_3 are commented out. Same for abnormally_low_sodium. Will fail at runtime.

  4. Regex bugs in HIRID, INSPIRE, SICdb: Patterns like code: { regex: "^HOSPITAL_ADMISSION*" } — the * means "zero or more of preceding char" in regex, not wildcard. Should be "^HOSPITAL_ADMISSION.*" (with the dot).

  5. Task predicate renames without updating MIMIC-IV: All 7 abnormal_lab tasks rename predicates from unit-specific (creatinine_mgdl) to unit-agnostic (creatinine). But MIMIC-IV's predicates file (which is being deleted — see Adding task description readmes #1) would also need updating to match. Doubly broken.

Significant

  1. hydra/launcher=joblib hardcoded in tasks/__main__.py: Changes the default ACES launcher for all task extractions. This is a behavioral change affecting parallelism and error handling — should it be configurable rather than a hardcoded default?

  2. Missing death predicate: HIRID, INSPIRE, NWICU, SICdb don't define death in their predicates. Mortality tasks reference death: ??? and need each dataset's predicates to define it.

  3. EHRShot duplicate codes: wbc_2 and wbc_3 have identical code LOINC/736-9. Also abnormally_low_platelets_kul has a _kul suffix inconsistent with task file expectations.

  4. Commented-out predicates: INSPIRE has bicarbonate_meql and hemoglobin_gdl commented out (metabolic_acidosis and anemia tasks won't work). SICdb has most predicates commented out — only creatinine is active.

Before merge (updated checklist)

  • Restore MIMIC-IV predicates.yaml (do not rename to eICU)
  • Create proper eICU-specific predicates
  • Update MIMIC-IV predicates to use new unit-agnostic names
  • Fix AUMCdb unit thresholds (convert values or fix comments)
  • Fix AUMCdb broken or() expressions referencing commented-out predicates
  • Fix regex patterns in HIRID, INSPIRE, SICdb (*.*)
  • Remove run_experiments.sh
  • Add death predicate to datasets supporting mortality tasks
  • Fix EHRShot duplicate wbc codes
  • Reconsider hardcoded hydra/launcher=joblib
  • Remove __init__.py files from new datasets
  • Retarget to dev (the pre-commit/code-cleanup diff will shrink significantly)

mmcdermott added a commit that referenced this pull request May 13, 2026
Replicates the dataset additions from #258 on top of current dev (the
original branch is too far behind to merge directly; this PR keeps only
the dataset/task content, not the stale infra reverts).

Datasets added (`src/MEDS_DEV/datasets/<name>/`):
  - EHRShot — Stanford EHR cohort with pre-built MEDS extraction.
  - HIRID — Bern ICU dataset via MEDS_extract-HIRID.
  - INSPIRE — perioperative dataset via MEDS_extract-INSPIRE.
  - NWICU — Northwestern ICU dataset via NWICU_MEDS.
  - SICdb — Salzburg ICU dataset via MEDS_extract-SICdb.
  - eICU — multi-center US ICU dataset via MEDS_extract-eICU (with demo).

AUMCdb is also completed (was previously just predicates.yaml + README):
adds dataset.yaml, requirements.txt, refs.bib, and the full ICU
predicate set from the upstream PR.

Tasks: mortality/in_icu/first_24h now lists AUMCdb and NWICU under
supported_datasets in addition to MIMIC-IV.

MIMIC-IV/README.md: pulled the longer description + access-requirements
write-up from #258 (replaces the TODO placeholders).

Each dataset.yaml has a `build_demo` command — for datasets without a
real demo recipe, this is a stub echo so registry validation passes
(matching the pattern HIRID already used in the source PR).

Co-Authored-By: Robin P. van de Water <rvandewater@users.noreply.github.com>
Co-Authored-By: Patrick Rockenschaub <prockenschaub@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants