shard_events OOMs on chartevents/labevents on a 121 GB box (gzipped CSV + joblib launcher fan-out)

## Summary

Running `MEDS_extract-MIMIC_IV` end-to-end on a stock 121 GB-RAM box reliably OOM-kills the python worker during `shard_events` on `icu/chartevents.csv.gz` (~3.3 GB compressed, 433 M rows × 36 cols). Verified twice from kernel log:

```
Apr 25 18:17:51 kernel: Out of memory: Killed process 150828 (python) total-vm:244018756kB, anon-rss:124380180kB ...
Apr 27 11:45:17 kernel: Out of memory: Killed process 167660 (python) total-vm:210715496kB, anon-rss:124427988kB ...
```

(124 GB resident vs 121 GB physical → SIGKILL.) The Apr 25 OOM was severe enough that the kernel reaped the entire `tmux-spawn-...` cgroup that the pipeline was running in, so it took out the wrapper bash plus its monitor too.

This is the user-facing failure. The deeper "why" is in `MEDS_extract` (filed as a separate issue against `mmcdermott/MEDS_extract`) — see the cross-reference at the bottom — but anyone running this pipeline with `< ~200 GB` RAM hits it, and MIMIC owns the configs that point `shard_events` at the gzipped CSV.

## Reproduction

```bash
uv venv .venv --python 3.12
uv pip install --python .venv/bin/python MIMIC-IV-MEDS==0.1.2 'meds-extract==0.5.0' 'dftly==0.1.5'
.venv/bin/MEDS_extract-MIMIC_IV root_output_dir=/path/out download_workers=8
# Download succeeds (with #51 worked around). pre_MEDS succeeds. shard_events
# starts processing chartevents and gets SIGKILL'd by the kernel ~5-10 min in.
```

Last log line before kill is consistently:
```
[...][MEDS_extract.shard_events.shard_events][WARNING] - Reading compressed CSV files may be slow and limit parallelizability.
```
followed by
```
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. ... The exit codes of the workers are {SIGKILL(-9)}
```

## Why this happens (short version)

`shard_events` slices each input into row-range parquet chunks. For a `.csv.gz` source, polars can't seek into the gzip stream, so `scan_csv(...).slice(start, length).collect()` per-chunk has to **decompress and parse the whole file from byte 0** — every chunk worker materializes a near-full copy of chartevents in memory (~50 GB working set). The hydra/joblib launcher then runs multiple chunks in parallel (we observed ≥2 simultaneous `.lock` files on chartevents subshards), so 2 × ~50 GB blows past the 121 GB cliff.

Full upstream root-cause writeup: see [issue on `mmcdermott/MEDS_extract` — TODO link once filed].

## Workarounds (in increasing complexity)

These all sidestep the gzip-isn't-seekable problem upstream of `shard_events`. Order tested in real-world preference:

### A. Pre-decompress the big files (simplest)

Just gunzip the offending `.csv.gz` files in place under `pre_MEDS/`. polars *can* slice-pushdown on uncompressed `.csv`, so each chunk worker only loads its own slice (~10-15 GB working set instead of ~50 GB). MEDS_extract supports `.csv` natively so no other code changes:

```bash
# Under pre_MEDS/, replace the symlinks for the largest files
PRE_MEDS=/path/data/pre_MEDS
for rel in icu/chartevents hosp/labevents; do
    rm "$PRE_MEDS/$rel.csv.gz"
    gunzip -c "$(readlink "$PRE_MEDS/$rel.csv.gz" 2>/dev/null || echo "$rel.csv.gz")" \
        > "$PRE_MEDS/$rel.csv"
done
# … then re-run MEDS_extract-MIMIC_IV; shard_events will pick up the .csv files.
```

Disk cost: chartevents.csv.gz (3.3 GB) → chartevents.csv (~30 GB). On a box with 3.4 TB free this is fine; on a tight box it may not be.

### B. Pre-convert to parquet

```python
import polars as pl
pl.scan_csv("chartevents.csv.gz", infer_schema_length=10000) \
  .sink_parquet("chartevents.parquet", compression="zstd")
# 20s wall, ~5GB peak resident, output is ~1.5GB on disk.
```

Then place `chartevents.parquet` at `pre_MEDS/icu/chartevents.parquet` (replacing the symlink). MEDS_extract supports `.parquet` natively. More memory-friendly than (A) at slice time (column-projection + row-group seek), and the on-disk size is *smaller* than the original `.csv.gz`, but does cost a one-time conversion step and per-stage parquet schema concerns.

### C. Stage runner with per-stage `parallelize: null`

`MEDS-transforms` supports a `stage_runner_fp` YAML that lets you override per-stage launcher behavior. MIMIC plumbs this through (`MIMIC_IV_MEDS/__main__.py:28`, `--stage_runner_fp=` on the subprocess), but it doesn't ship a default and doesn't expose a Hydra knob to set one. A stage runner like:

```yaml
shard_events:
  parallelize: null
convert_to_subject_sharded:
  parallelize: null
```

would force the basic sweeper (no joblib workers) for those stages, so only one chunk loads chartevents at a time — bringing peak memory back inside the box's RAM. This doesn't *avoid* the per-chunk full-file read on gzipped sources (that's the upstream bug), but it makes the per-chunk OOM fit instead of multiplying it by N.

The user-friendly version of this would be MIMIC shipping a `runners/safe.yaml` and accepting `runner=safe` on the CLI to wire it up. Happy to send a PR if there's interest.

## What I'd suggest fixing

- **Short-term**: ship a default `stage_runner_fp` (or `runner=safe` config preset) that sets `parallelize: null` on `shard_events` (and probably `convert_to_subject_sharded`), so the released pipeline doesn't OOM out of the box on a 121 GB host. (C) above.
- **Doc**: add a "Memory" note to the README pointing at workaround (A) for users who hit this and want to re-run quickly.
- **Real fix**: lives upstream in `MEDS_extract` (cross-ref).

## Cross-reference

Upstream root cause + proposed fix in `mmcdermott/MEDS_extract`: TODO-link.

Related compatibility issues that gated reaching this stage:
- #51 — `MEDS_transform-pipeline` not on PATH when venv isn't activated.
- #53 — `meds-extract~=0.5` constraint admits 0.6.x which is config-incompatible.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

shard_events OOMs on chartevents/labevents on a 121 GB box (gzipped CSV + joblib launcher fan-out) #55

Summary

Reproduction

Why this happens (short version)

Workarounds (in increasing complexity)

A. Pre-decompress the big files (simplest)

B. Pre-convert to parquet

C. Stage runner with per-stage `parallelize: null`

What I'd suggest fixing

Cross-reference

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

shard_events OOMs on chartevents/labevents on a 121 GB box (gzipped CSV + joblib launcher fan-out) #55

Description

Summary

Reproduction

Why this happens (short version)

Workarounds (in increasing complexity)

A. Pre-decompress the big files (simplest)

B. Pre-convert to parquet

C. Stage runner with per-stage parallelize: null

What I'd suggest fixing

Cross-reference

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

C. Stage runner with per-stage `parallelize: null`