Summary
Running MEDS_extract-MIMIC_IV end-to-end on a stock 121 GB-RAM box reliably OOM-kills the python worker during shard_events on icu/chartevents.csv.gz (~3.3 GB compressed, 433 M rows × 36 cols). Verified twice from kernel log:
Apr 25 18:17:51 kernel: Out of memory: Killed process 150828 (python) total-vm:244018756kB, anon-rss:124380180kB ...
Apr 27 11:45:17 kernel: Out of memory: Killed process 167660 (python) total-vm:210715496kB, anon-rss:124427988kB ...
(124 GB resident vs 121 GB physical → SIGKILL.) The Apr 25 OOM was severe enough that the kernel reaped the entire tmux-spawn-... cgroup that the pipeline was running in, so it took out the wrapper bash plus its monitor too.
This is the user-facing failure. The deeper "why" is in MEDS_extract (filed as a separate issue against mmcdermott/MEDS_extract) — see the cross-reference at the bottom — but anyone running this pipeline with < ~200 GB RAM hits it, and MIMIC owns the configs that point shard_events at the gzipped CSV.
Reproduction
uv venv .venv --python 3.12
uv pip install --python .venv/bin/python MIMIC-IV-MEDS==0.1.2 'meds-extract==0.5.0' 'dftly==0.1.5'
.venv/bin/MEDS_extract-MIMIC_IV root_output_dir=/path/out download_workers=8
# Download succeeds (with #51 worked around). pre_MEDS succeeds. shard_events
# starts processing chartevents and gets SIGKILL'd by the kernel ~5-10 min in.
Last log line before kill is consistently:
[...][MEDS_extract.shard_events.shard_events][WARNING] - Reading compressed CSV files may be slow and limit parallelizability.
followed by
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. ... The exit codes of the workers are {SIGKILL(-9)}
Why this happens (short version)
shard_events slices each input into row-range parquet chunks. For a .csv.gz source, polars can't seek into the gzip stream, so scan_csv(...).slice(start, length).collect() per-chunk has to decompress and parse the whole file from byte 0 — every chunk worker materializes a near-full copy of chartevents in memory (~50 GB working set). The hydra/joblib launcher then runs multiple chunks in parallel (we observed ≥2 simultaneous .lock files on chartevents subshards), so 2 × ~50 GB blows past the 121 GB cliff.
Full upstream root-cause writeup: see [issue on mmcdermott/MEDS_extract — TODO link once filed].
Workarounds (in increasing complexity)
These all sidestep the gzip-isn't-seekable problem upstream of shard_events. Order tested in real-world preference:
A. Pre-decompress the big files (simplest)
Just gunzip the offending .csv.gz files in place under pre_MEDS/. polars can slice-pushdown on uncompressed .csv, so each chunk worker only loads its own slice (~10-15 GB working set instead of ~50 GB). MEDS_extract supports .csv natively so no other code changes:
# Under pre_MEDS/, replace the symlinks for the largest files
PRE_MEDS=/path/data/pre_MEDS
for rel in icu/chartevents hosp/labevents; do
rm "$PRE_MEDS/$rel.csv.gz"
gunzip -c "$(readlink "$PRE_MEDS/$rel.csv.gz" 2>/dev/null || echo "$rel.csv.gz")" \
> "$PRE_MEDS/$rel.csv"
done
# … then re-run MEDS_extract-MIMIC_IV; shard_events will pick up the .csv files.
Disk cost: chartevents.csv.gz (3.3 GB) → chartevents.csv (~30 GB). On a box with 3.4 TB free this is fine; on a tight box it may not be.
B. Pre-convert to parquet
import polars as pl
pl.scan_csv("chartevents.csv.gz", infer_schema_length=10000) \
.sink_parquet("chartevents.parquet", compression="zstd")
# 20s wall, ~5GB peak resident, output is ~1.5GB on disk.
Then place chartevents.parquet at pre_MEDS/icu/chartevents.parquet (replacing the symlink). MEDS_extract supports .parquet natively. More memory-friendly than (A) at slice time (column-projection + row-group seek), and the on-disk size is smaller than the original .csv.gz, but does cost a one-time conversion step and per-stage parquet schema concerns.
C. Stage runner with per-stage parallelize: null
MEDS-transforms supports a stage_runner_fp YAML that lets you override per-stage launcher behavior. MIMIC plumbs this through (MIMIC_IV_MEDS/__main__.py:28, --stage_runner_fp= on the subprocess), but it doesn't ship a default and doesn't expose a Hydra knob to set one. A stage runner like:
shard_events:
parallelize: null
convert_to_subject_sharded:
parallelize: null
would force the basic sweeper (no joblib workers) for those stages, so only one chunk loads chartevents at a time — bringing peak memory back inside the box's RAM. This doesn't avoid the per-chunk full-file read on gzipped sources (that's the upstream bug), but it makes the per-chunk OOM fit instead of multiplying it by N.
The user-friendly version of this would be MIMIC shipping a runners/safe.yaml and accepting runner=safe on the CLI to wire it up. Happy to send a PR if there's interest.
What I'd suggest fixing
- Short-term: ship a default
stage_runner_fp (or runner=safe config preset) that sets parallelize: null on shard_events (and probably convert_to_subject_sharded), so the released pipeline doesn't OOM out of the box on a 121 GB host. (C) above.
- Doc: add a "Memory" note to the README pointing at workaround (A) for users who hit this and want to re-run quickly.
- Real fix: lives upstream in
MEDS_extract (cross-ref).
Cross-reference
Upstream root cause + proposed fix in mmcdermott/MEDS_extract: TODO-link.
Related compatibility issues that gated reaching this stage:
Summary
Running
MEDS_extract-MIMIC_IVend-to-end on a stock 121 GB-RAM box reliably OOM-kills the python worker duringshard_eventsonicu/chartevents.csv.gz(~3.3 GB compressed, 433 M rows × 36 cols). Verified twice from kernel log:(124 GB resident vs 121 GB physical → SIGKILL.) The Apr 25 OOM was severe enough that the kernel reaped the entire
tmux-spawn-...cgroup that the pipeline was running in, so it took out the wrapper bash plus its monitor too.This is the user-facing failure. The deeper "why" is in
MEDS_extract(filed as a separate issue againstmmcdermott/MEDS_extract) — see the cross-reference at the bottom — but anyone running this pipeline with< ~200 GBRAM hits it, and MIMIC owns the configs that pointshard_eventsat the gzipped CSV.Reproduction
Last log line before kill is consistently:
followed by
Why this happens (short version)
shard_eventsslices each input into row-range parquet chunks. For a.csv.gzsource, polars can't seek into the gzip stream, soscan_csv(...).slice(start, length).collect()per-chunk has to decompress and parse the whole file from byte 0 — every chunk worker materializes a near-full copy of chartevents in memory (~50 GB working set). The hydra/joblib launcher then runs multiple chunks in parallel (we observed ≥2 simultaneous.lockfiles on chartevents subshards), so 2 × ~50 GB blows past the 121 GB cliff.Full upstream root-cause writeup: see [issue on
mmcdermott/MEDS_extract— TODO link once filed].Workarounds (in increasing complexity)
These all sidestep the gzip-isn't-seekable problem upstream of
shard_events. Order tested in real-world preference:A. Pre-decompress the big files (simplest)
Just gunzip the offending
.csv.gzfiles in place underpre_MEDS/. polars can slice-pushdown on uncompressed.csv, so each chunk worker only loads its own slice (~10-15 GB working set instead of ~50 GB). MEDS_extract supports.csvnatively so no other code changes:Disk cost: chartevents.csv.gz (3.3 GB) → chartevents.csv (~30 GB). On a box with 3.4 TB free this is fine; on a tight box it may not be.
B. Pre-convert to parquet
Then place
chartevents.parquetatpre_MEDS/icu/chartevents.parquet(replacing the symlink). MEDS_extract supports.parquetnatively. More memory-friendly than (A) at slice time (column-projection + row-group seek), and the on-disk size is smaller than the original.csv.gz, but does cost a one-time conversion step and per-stage parquet schema concerns.C. Stage runner with per-stage
parallelize: nullMEDS-transformssupports astage_runner_fpYAML that lets you override per-stage launcher behavior. MIMIC plumbs this through (MIMIC_IV_MEDS/__main__.py:28,--stage_runner_fp=on the subprocess), but it doesn't ship a default and doesn't expose a Hydra knob to set one. A stage runner like:would force the basic sweeper (no joblib workers) for those stages, so only one chunk loads chartevents at a time — bringing peak memory back inside the box's RAM. This doesn't avoid the per-chunk full-file read on gzipped sources (that's the upstream bug), but it makes the per-chunk OOM fit instead of multiplying it by N.
The user-friendly version of this would be MIMIC shipping a
runners/safe.yamland acceptingrunner=safeon the CLI to wire it up. Happy to send a PR if there's interest.What I'd suggest fixing
stage_runner_fp(orrunner=safeconfig preset) that setsparallelize: nullonshard_events(and probablyconvert_to_subject_sharded), so the released pipeline doesn't OOM out of the box on a 121 GB host. (C) above.MEDS_extract(cross-ref).Cross-reference
Upstream root cause + proposed fix in
mmcdermott/MEDS_extract: TODO-link.Related compatibility issues that gated reaching this stage:
MEDS_transform-pipelinenot on PATH when venv isn't activated.meds-extract~=0.5constraint admits 0.6.x which is config-incompatible.