Skip to content

shard_events OOMs on chartevents/labevents on a 121 GB box (gzipped CSV + joblib launcher fan-out) #55

@mmcdermott

Description

@mmcdermott

Summary

Running MEDS_extract-MIMIC_IV end-to-end on a stock 121 GB-RAM box reliably OOM-kills the python worker during shard_events on icu/chartevents.csv.gz (~3.3 GB compressed, 433 M rows × 36 cols). Verified twice from kernel log:

Apr 25 18:17:51 kernel: Out of memory: Killed process 150828 (python) total-vm:244018756kB, anon-rss:124380180kB ...
Apr 27 11:45:17 kernel: Out of memory: Killed process 167660 (python) total-vm:210715496kB, anon-rss:124427988kB ...

(124 GB resident vs 121 GB physical → SIGKILL.) The Apr 25 OOM was severe enough that the kernel reaped the entire tmux-spawn-... cgroup that the pipeline was running in, so it took out the wrapper bash plus its monitor too.

This is the user-facing failure. The deeper "why" is in MEDS_extract (filed as a separate issue against mmcdermott/MEDS_extract) — see the cross-reference at the bottom — but anyone running this pipeline with < ~200 GB RAM hits it, and MIMIC owns the configs that point shard_events at the gzipped CSV.

Reproduction

uv venv .venv --python 3.12
uv pip install --python .venv/bin/python MIMIC-IV-MEDS==0.1.2 'meds-extract==0.5.0' 'dftly==0.1.5'
.venv/bin/MEDS_extract-MIMIC_IV root_output_dir=/path/out download_workers=8
# Download succeeds (with #51 worked around). pre_MEDS succeeds. shard_events
# starts processing chartevents and gets SIGKILL'd by the kernel ~5-10 min in.

Last log line before kill is consistently:

[...][MEDS_extract.shard_events.shard_events][WARNING] - Reading compressed CSV files may be slow and limit parallelizability.

followed by

joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. ... The exit codes of the workers are {SIGKILL(-9)}

Why this happens (short version)

shard_events slices each input into row-range parquet chunks. For a .csv.gz source, polars can't seek into the gzip stream, so scan_csv(...).slice(start, length).collect() per-chunk has to decompress and parse the whole file from byte 0 — every chunk worker materializes a near-full copy of chartevents in memory (~50 GB working set). The hydra/joblib launcher then runs multiple chunks in parallel (we observed ≥2 simultaneous .lock files on chartevents subshards), so 2 × ~50 GB blows past the 121 GB cliff.

Full upstream root-cause writeup: see [issue on mmcdermott/MEDS_extract — TODO link once filed].

Workarounds (in increasing complexity)

These all sidestep the gzip-isn't-seekable problem upstream of shard_events. Order tested in real-world preference:

A. Pre-decompress the big files (simplest)

Just gunzip the offending .csv.gz files in place under pre_MEDS/. polars can slice-pushdown on uncompressed .csv, so each chunk worker only loads its own slice (~10-15 GB working set instead of ~50 GB). MEDS_extract supports .csv natively so no other code changes:

# Under pre_MEDS/, replace the symlinks for the largest files
PRE_MEDS=/path/data/pre_MEDS
for rel in icu/chartevents hosp/labevents; do
    rm "$PRE_MEDS/$rel.csv.gz"
    gunzip -c "$(readlink "$PRE_MEDS/$rel.csv.gz" 2>/dev/null || echo "$rel.csv.gz")" \
        > "$PRE_MEDS/$rel.csv"
done
# … then re-run MEDS_extract-MIMIC_IV; shard_events will pick up the .csv files.

Disk cost: chartevents.csv.gz (3.3 GB) → chartevents.csv (~30 GB). On a box with 3.4 TB free this is fine; on a tight box it may not be.

B. Pre-convert to parquet

import polars as pl
pl.scan_csv("chartevents.csv.gz", infer_schema_length=10000) \
  .sink_parquet("chartevents.parquet", compression="zstd")
# 20s wall, ~5GB peak resident, output is ~1.5GB on disk.

Then place chartevents.parquet at pre_MEDS/icu/chartevents.parquet (replacing the symlink). MEDS_extract supports .parquet natively. More memory-friendly than (A) at slice time (column-projection + row-group seek), and the on-disk size is smaller than the original .csv.gz, but does cost a one-time conversion step and per-stage parquet schema concerns.

C. Stage runner with per-stage parallelize: null

MEDS-transforms supports a stage_runner_fp YAML that lets you override per-stage launcher behavior. MIMIC plumbs this through (MIMIC_IV_MEDS/__main__.py:28, --stage_runner_fp= on the subprocess), but it doesn't ship a default and doesn't expose a Hydra knob to set one. A stage runner like:

shard_events:
  parallelize: null
convert_to_subject_sharded:
  parallelize: null

would force the basic sweeper (no joblib workers) for those stages, so only one chunk loads chartevents at a time — bringing peak memory back inside the box's RAM. This doesn't avoid the per-chunk full-file read on gzipped sources (that's the upstream bug), but it makes the per-chunk OOM fit instead of multiplying it by N.

The user-friendly version of this would be MIMIC shipping a runners/safe.yaml and accepting runner=safe on the CLI to wire it up. Happy to send a PR if there's interest.

What I'd suggest fixing

  • Short-term: ship a default stage_runner_fp (or runner=safe config preset) that sets parallelize: null on shard_events (and probably convert_to_subject_sharded), so the released pipeline doesn't OOM out of the box on a 121 GB host. (C) above.
  • Doc: add a "Memory" note to the README pointing at workaround (A) for users who hit this and want to re-run quickly.
  • Real fix: lives upstream in MEDS_extract (cross-ref).

Cross-reference

Upstream root cause + proposed fix in mmcdermott/MEDS_extract: TODO-link.

Related compatibility issues that gated reaching this stage:

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions