Add the eICU dataset#311

Draft
mmcdermott wants to merge 1 commit into dev from feat/add-dataset-eicu

mmcdermott (Collaborator) commented May 13, 2026

Summary

Registers eICU (multi-center US ICU) in src/MEDS_DEV/datasets/eICU/. Extraction runs via MEDS_extract-eICU from the upstream eICU-MEDS package. Of the newly added datasets, eICU is the only one that ships an end-to-end demo recipe.

Files added:

  • dataset.yaml — metadata + both build_full and build_demo commands.
  • predicates.yaml — admission/discharge plus a wider lab/vital predicate set.
  • requirements.txt: bumped from the contributor's eICU-MEDS==0.0.1 to the latest release, eICU-MEDS==0.0.2. The 0.0.1 demo failed end-to-end in CI on the bundled PR #299 ("Add six new datasets (EHRShot, HIRID, INSPIRE, NWICU, SICdb, eICU) + complete AUMCdb"): the shard_events stage couldn't find its parquet lock under demo mode. The hope is that the 0.0.2 release picks up either an upstream bug fix or a MEDS_transforms compatibility update.
  • refs.bib — Pollard et al., 2018.
  • README.md.
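
As a quick local sanity check before pushing, the file list above can be verified with a few lines of Python. This is only a sketch: the file names come from the list, but the helper `missing_dataset_files` is hypothetical and not part of MEDS-DEV.

```python
from pathlib import Path

# File names copied from the "Files added" list; the helper is illustrative only.
EXPECTED_FILES = (
    "dataset.yaml",
    "predicates.yaml",
    "requirements.txt",
    "refs.bib",
    "README.md",
)


def missing_dataset_files(dataset_dir: Path) -> list[str]:
    """Return the expected file names that are absent from dataset_dir."""
    return [name for name in EXPECTED_FILES if not (dataset_dir / name).is_file()]
```

Calling `missing_dataset_files(Path("src/MEDS_DEV/datasets/eICU"))` from the repository root should return an empty list once all five files are in place.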

Since eICU declares build_demo, the integration test matrix exercises it (no opt-out needed; #312's mechanism only skips datasets that omit build_demo entirely).
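
The selection rule described above (test only datasets that declare build_demo) can be sketched as follows. This is a minimal illustration, not the actual #312 implementation: the `commands` key, the `demo_capable` helper, and the placeholder configs are all assumptions made for the example.

```python
def demo_capable(datasets: dict[str, dict]) -> list[str]:
    """Return, in sorted order, the datasets whose config declares build_demo."""
    return sorted(
        name
        for name, cfg in datasets.items()
        # "commands" is an assumed key; a missing or empty build_demo means "skip".
        if cfg.get("commands", {}).get("build_demo")
    )


# Placeholder configs: the command strings are stand-ins, not the real recipes.
configs = {
    "eICU": {"commands": {"build_full": "<full command>", "build_demo": "<demo command>"}},
    "full_only_example": {"commands": {"build_full": "<full command>"}},
}
print(demo_capable(configs))  # ['eICU']
```

Under this rule eICU lands in the demo matrix automatically, which is why the failing test in the report below is parametrized with `[eICU]`.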

eICU is not added to any task's supported_datasets in this PR (worth a follow-up to wire it into mortality/in_icu/first_24h once we've confirmed the demo extraction produces well-formed predicates).

Depends on #312

Targeted at feat/dataset-demo-availability for now; once #312 merges, this PR retargets to dev. (eICU itself doesn't need #312's relaxation, but staying stacked keeps the seven dataset PRs visually consistent.)

Test plan

Supersedes / refs

🤖 Generated with Claude Code

codecov Bot commented May 13, 2026


❌ 1 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 54 | 1 | 53 | 0 |
View the top 1 failed test(s) by shortest run time
tests.test_0_datasets::test_datasets_configured[eICU]
Stack Traces | 821s run time
request = <SubRequest 'demo_dataset' for <Function test_datasets_configured[eICU]>>
venv_cache = PosixPath('.../tmp/tmpipphar22/venvs')

    @pytest.fixture(scope="session")
    def demo_dataset(request, venv_cache: Path) -> NAME_AND_DIR:
        dataset_name = request.param
        persistent_cache_dir, (cache_datasets, _, _) = get_and_validate_cache_settings(request)
        reuse_datasets, _, _ = get_and_validate_reuse_settings(request)
    
        do_overwrite = dataset_name not in reuse_datasets
    
        with cache_dir(persistent_cache_dir if dataset_name in cache_datasets else None) as root_dir:
            root_dir = Path(root_dir)
    
            check_fp = root_dir / f".{dataset_name}.check"
            output_dir = root_dir / dataset_name
    
            data_exists = (output_dir / "data").is_dir()
            metadata_exists = (output_dir / "metadata").is_dir()
    
            already_tested = check_fp.exists() and data_exists and metadata_exists
            if do_overwrite or not already_tested:
                venv_dir = venv_cache / "datasets" / dataset_name
    
>               run_command(
                    "meds-dev-dataset",
                    test_name=f"Build {dataset_name}",
                    hydra_kwargs={
                        "dataset": dataset_name,
                        "output_dir": str(output_dir.resolve()),
                        "demo": True,
                        "venv_dir": str(venv_dir.resolve()),
                    },
                )

tests/conftest.py:380: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

script = ['meds-dev-dataset', 'dataset=eICU output_dir=.........................../tmp/tmpcx93rbxs/eICU demo=true venv_dir=....../venvs/datasets/eICU']
test_name = 'Build eICU'
hydra_kwargs = {'dataset': 'eICU', 'demo': True, 'output_dir': '.........................../tmp/tmpcx93rbxs/eICU', 'venv_dir': '....../venvs/datasets/eICU'}
config_name = None, should_error = False, want_err_msg = None
do_use_config_yaml = False

    def run_command(
        script: Path | str,
        test_name: str,
        hydra_kwargs: dict[str, str] | None = None,
        config_name: str | None = None,
        should_error: bool = False,
        want_err_msg: str | None = None,
        do_use_config_yaml: bool = False,
    ):
        script = ["python", str(script.resolve())] if isinstance(script, Path) else [script]
        command_parts = script
    
        err_cmd_lines = []
    
        if config_name is not None and not config_name.startswith("_"):
            config_name = f"_{config_name}"
    
        if hydra_kwargs is None:
            hydra_kwargs = {}
    
        if do_use_config_yaml:
            if config_name is None:
                raise ValueError("config_name must be provided if do_use_config_yaml is True.")
    
            conf = OmegaConf.create(
                {
                    "defaults": [config_name],
                    **hydra_kwargs,
                }
            )
    
            conf_dir = tempfile.TemporaryDirectory()
            conf_path = Path(conf_dir.name) / "config.yaml"
            OmegaConf.save(conf, conf_path)
    
            command_parts.extend(
                [
                    f"--config-path={conf_path.parent.resolve()!s}",
                    "--config-name=config",
                    "'hydra.searchpath=[pkg://MEDS_transforms.configs]'",
                ]
            )
            err_cmd_lines.append(f"Using config yaml:\n{OmegaConf.to_yaml(conf)}")
        else:
            if config_name is not None:
                command_parts.append(f"--config-name={config_name}")
            command_parts.append(" ".join(dict_to_hydra_kwargs(hydra_kwargs)))
    
        full_cmd = " ".join(command_parts)
        err_cmd_lines.append(f"Running command: {full_cmd}")
        command_out = subprocess.run(full_cmd, shell=True, capture_output=True)
    
        command_errored = command_out.returncode != 0
    
        stderr = command_out.stderr.decode()
        err_cmd_lines.append(f"stderr:\n{stderr}")
        stdout = command_out.stdout.decode()
        err_cmd_lines.append(f"stdout:\n{stdout}")
    
        if should_error:
            err_cmd_str = "\n".join(err_cmd_lines)
            if not command_errored:
                if do_use_config_yaml:
                    conf_dir.cleanup()
                raise AssertionError(f"{test_name} failed as command did not error when expected!\n{err_cmd_str}")
            if want_err_msg is not None and want_err_msg not in stderr:
                if do_use_config_yaml:
                    conf_dir.cleanup()
                raise AssertionError(
                    f"{test_name} failed as expected error message not found in stderr!\n{err_cmd_str}"
                )
        elif not should_error and command_errored:
            if do_use_config_yaml:
                conf_dir.cleanup()
>           raise AssertionError(
                f"{test_name} failed as command errored when not expected!\n" + "\n".join(err_cmd_lines)
            )
E           AssertionError: Build eICU failed as command errored when not expected!
E           Running command: meds-dev-dataset dataset=eICU output_dir=.........................../tmp/tmpcx93rbxs/eICU demo=true venv_dir=....../venvs/datasets/eICU
E           stderr:
E           Using CPython 3.11.15 interpreter at: .venv/bin/python
E           Creating virtual environment at: ....../venvs/datasets/eICU
E           Activate with: ....../venvs/datasets/eICU/bin/activate
E           Using Python 3.11.15 environment at: ....../venvs/datasets/eICU
E           Resolved 25 packages in 241ms
E           Downloading numpy (16.1MiB)
E           Downloading pyarrow (46.6MiB)
E           Downloading polars (33.8MiB)
E            Downloaded polars
E            Downloaded pyarrow
E            Downloaded numpy
E           Prepared 16 packages in 996ms
E           Installed 25 packages in 30ms
E            + antlr4-python3-runtime==4.9.3
E            + attrs==26.1.0
E            + beautifulsoup4==4.14.3
E            + certifi==2026.4.22
E            + charset-normalizer==3.4.7
E            + eicu-meds==0.0.2
E            + filelock==3.29.0
E            + hydra-core==1.3.2
E            + idna==3.15
E            + jsonschema==4.26.0
E            + jsonschema-specifications==2025.9.1
E            + meds==0.3.3
E            + meds-transforms==0.2.4
E            + numpy==2.4.4
E            + omegaconf==2.3.0
E            + packaging==26.2
E            + polars==1.26.0
E            + pyarrow==24.0.0
E            + pyyaml==6.0.3
E            + referencing==0.37.0
E            + requests==2.34.0
E            + rpds-py==0.30.0
E            + soupsieve==2.8.3
E            + typing-extensions==4.15.0
E            + urllib3==2.7.0
E           Error executing job with overrides: ['dataset=eICU', 'output_dir=.........................../tmp/tmpcx93rbxs/eICU', 'demo=true', 'venv_dir=....../venvs/datasets/eICU']
E           Traceback (most recent call last):
E             File ".../MEDS-DEV/MEDS-DEV/.venv/bin/meds-dev-dataset", line 10, in <module>
E               sys.exit(main())
E                        ^^^^^^
E             File ".../MEDS-DEV/MEDS-DEV/.venv/lib/python3.11.../site-packages/hydra/main.py", line 94, in decorated_main
E               _run_hydra(
E             File ".../MEDS-DEV/MEDS-DEV/.venv/lib/python3.11.../hydra/_internal/utils.py", line 394, in _run_hydra
E               _run_app(
E             File ".../MEDS-DEV/MEDS-DEV/.venv/lib/python3.11.../hydra/_internal/utils.py", line 457, in _run_app
E               run_and_report(
E             File ".../MEDS-DEV/MEDS-DEV/.venv/lib/python3.11.../hydra/_internal/utils.py", line 223, in run_and_report
E               raise ex
E             File ".../MEDS-DEV/MEDS-DEV/.venv/lib/python3.11.../hydra/_internal/utils.py", line 220, in run_and_report
E               return func()
E                      ^^^^^^
E             File ".../MEDS-DEV/MEDS-DEV/.venv/lib/python3.11.../hydra/_internal/utils.py", line 458, in <lambda>
E               lambda: hydra.run(
E                       ^^^^^^^^^^
E             File ".../MEDS-DEV/MEDS-DEV/.venv/lib/python3.11.../hydra/_internal/hydra.py", line 132, in run
E               _ = ret.return_value
E                   ^^^^^^^^^^^^^^^^
E             File ".../MEDS-DEV/MEDS-DEV/.venv/lib/python3.11.../hydra/core/utils.py", line 260, in return_value
E               raise self._return_value
E             File ".../MEDS-DEV/MEDS-DEV/.venv/lib/python3.11.../hydra/core/utils.py", line 186, in run_job
E               ret.return_value = task_function(task_cfg)
E                                  ^^^^^^^^^^^^^^^^^^^^^^^
E             File ".../MEDS_DEV/datasets/__main__.py", line 42, in main
E               run_in_env(
E             File ".../src/MEDS_DEV/utils.py", line 422, in run_in_env
E               raise RuntimeError(
E           RuntimeError: Command failed with exit code 1:
E           SCRIPT:
E           #!/bin/bash
E           set -e
E           MEDS_extract-eICU do_demo=True raw_input_dir="............/tmp/tmpehdkn2a5/raw" pre_MEDS_dir="......................../tmp/tmpehdkn2a5/pre_MEDS" MEDS_cohort_dir=".........................../tmp/tmpcx93rbxs/eICU" log_dir="................../tmpcx93rbxs/eICU/.pipeline_logs"
E           STDERR:
E           Error executing job with overrides: ['do_demo=True', 'raw_input_dir=............/tmp/tmpehdkn2a5/raw', 'pre_MEDS_dir=......................../tmp/tmpehdkn2a5/pre_MEDS', 'MEDS_cohort_dir=.........................../tmp/tmpcx93rbxs/eICU', 'log_dir=................../tmpcx93rbxs/eICU/.pipeline_logs']
E           Traceback (most recent call last):
E             File "....../venvs/datasets/eICU/lib/python3.11.../site-packages/eICU_MEDS/__main__.py", line 87, in main
E               run_command(command_parts, cfg)
E             File "....../venvs/datasets/eICU/lib/python3.11.../site-packages/eICU_MEDS/commands.py", line 80, in run_command
E               raise ValueError(f"Command failed with return code {command_out.returncode}.")
E           ValueError: Command failed with return code 1.
E           
E           Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
E           
E           STDOUT:
E           [2026-05-13 16:35:24,167][eICU_MEDS.__main__][INFO] - Downloading demo data.
E           [2026-05-13 16:43:41,758][eICU_MEDS.download][INFO] - Downloaded: ........./raw/sqlite/eicu_v2_0_1.sqlite3.gz
E           [2026-05-13 16:43:41,943][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/LICENSE.txt
E           [2026-05-13 16:43:42,013][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/SHA256SUMS.txt
E           [2026-05-13 16:43:42,545][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/admissionDx.csv.gz
E           [2026-05-13 16:43:43,149][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/admissiondrug.csv.gz
E           [2026-05-13 16:43:43,382][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/allergy.csv.gz
E           [2026-05-13 16:43:43,773][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/apacheApsVar.csv.gz
E           [2026-05-13 16:43:44,817][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/apachePatientResult.csv.gz
E           [2026-05-13 16:43:45,167][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/apachePredVar.csv.gz
E           [2026-05-13 16:43:45,577][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/carePlanCareProvider.csv.gz
E           [2026-05-13 16:43:45,639][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/carePlanEOL.csv.gz
E           [2026-05-13 16:43:47,383][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/carePlanGeneral.csv.gz
E           [2026-05-13 16:43:47,633][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/carePlanGoal.csv.gz
E           [2026-05-13 16:43:47,697][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/carePlanInfectiousDisease.csv.gz
E           [2026-05-13 16:43:47,751][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/customLab.csv.gz
E           [2026-05-13 16:43:49,599][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/diagnosis.csv.gz
E           [2026-05-13 16:43:49,656][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/hospital.csv.gz
E           [2026-05-13 16:43:51,638][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/infusiondrug.csv.gz
E           [2026-05-13 16:44:00,500][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/intakeOutput.csv.gz
E           [2026-05-13 16:44:29,561][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/lab.csv.gz
E           [2026-05-13 16:44:37,042][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/medication.csv.gz
E           [2026-05-13 16:44:37,110][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/microLab.csv.gz
E           [2026-05-13 16:44:38,505][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/note.csv.gz
E           [2026-05-13 16:44:44,229][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/nurseAssessment.csv.gz
E           [2026-05-13 16:44:47,250][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/nurseCare.csv.gz
E           [2026-05-13 16:46:17,029][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/nurseCharting.csv.gz
E           [2026-05-13 16:46:17,824][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/pastHistory.csv.gz
E           [2026-05-13 16:46:18,609][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/patient.csv.gz
E           [2026-05-13 16:46:22,973][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/physicalExam.csv.gz
E           [2026-05-13 16:46:23,427][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/respiratoryCare.csv.gz
E           [2026-05-13 16:46:36,093][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/respiratoryCharting.csv.gz
E           [2026-05-13 16:46:39,395][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/treatment.csv.gz
E           [2026-05-13 16:46:55,873][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/vitalAperiodic.csv.gz
E           [2026-05-13 16:48:52,908][eICU_MEDS.download][INFO] - Downloaded: ........./tmpehdkn2a5/raw/vitalPeriodic.csv.gz
E           [2026-05-13 16:48:52,909][eICU_MEDS.__main__][INFO] - Running pre_MEDS data wrangling.
E           [2026-05-13 16:48:52,909][eICU_MEDS.pre_MEDS][INFO] - Loading table preprocessors from ....../venvs/datasets/eICU/lib/python3.11.../eICU_MEDS/configs/table_preprocessors.yaml...
E           [2026-05-13 16:48:52,935][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for admissionDx:
E           offset_col: admitdxenteredoffset
E           pseudotime_col: admitDxEnteredTimestamp
E           output_data_cols:
E           - admitdxname
E           - admissiondxid
E           warning_items:
E           - How should we use `admitdxtest`?
E           - How should we use `admitdxpath`?
E           
E           [2026-05-13 16:48:52,935][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for allergy:
E           offset_col: allergyenteredoffset
E           pseudotime_col: allergyEnteredTimestamp
E           output_data_cols:
E           - allergytype
E           - allergyname
E           warning_items:
E           - How should we use `allergyNoteType`?
E           - How should we use `specialtyType`?
E           - How should we use `userType`?
E           - Is `drugName` the name of the drug to which the patient is allergic or the drug
E             given to the patient (docs say 'name of the selected admission drug')?
E           
E           [2026-05-13 16:48:52,936][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for carePlanGeneral:
E           offset_col: cplitemoffset
E           pseudotime_col: carePlanGeneralItemEnteredTimestamp
E           output_data_cols:
E           - cplgroup
E           - cplitemvalue
E           
E           [2026-05-13 16:48:52,936][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for carePlanEOL:
E           offset_col: cpleoldiscussionoffset
E           pseudotime_col: carePlanEolDiscussionOccurredTimestamp
E           warning_items:
E           - Is the DiscussionOffset time actually reliable? Should we fall back on the SaveOffset
E             time?
E           
E           [2026-05-13 16:48:52,937][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for carePlanGoal:
E           offset_col: cplgoaloffset
E           pseudotime_col: carePlanGoalEnteredTimestamp
E           output_data_cols:
E           - cplgoalcategory
E           - cplgoalvalue
E           - cplgoalstatus
E           
E           [2026-05-13 16:48:52,937][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for carePlanInfectiousDisease:
E           offset_col: cplinfectdiseaseoffset
E           pseudotime_col: carePlanInfectDiseaseEnteredTimestamp
E           output_data_cols:
E           - infectdiseasesite
E           - infectdiseaseassessment
E           - responsetotherapy
E           - treatment
E           
E           [2026-05-13 16:48:52,938][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for diagnosis:
E           offset_col: diagnosisoffset
E           pseudotime_col: diagnosisEnteredTimestamp
E           output_data_cols:
E           - icd9code
E           - diagnosispriority
E           - diagnosisstring
E           warning_items:
E           - Though we use it, the `diagnosisString` field documentation is unclear -- by what
E             is it separated?
E           
E           [2026-05-13 16:48:52,938][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for infusionDrug:
E           offset_col: infusionoffset
E           pseudotime_col: infusionEnteredTimestamp
E           output_data_cols:
E           - infusiondrugid
E           - drugname
E           - drugrate
E           - infusionrate
E           - drugamount
E           - volumeoffluid
E           - patientweight
E           
E           [2026-05-13 16:48:52,939][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for lab:
E           offset_col: labresultoffset
E           pseudotime_col: labResultDrawnTimestamp
E           output_data_cols:
E           - labname
E           - labresult
E           - labresulttext
E           - labmeasurenamesystem
E           - labmeasurenameinterface
E           - labtypeid
E           warning_items:
E           - Is this the time the lab was drawn? Entered? The time the result came in?
E           - We **IGNORE** the `labResultRevisedOffset` column -- this may be a mistake!
E           
E           [2026-05-13 16:48:52,940][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for medication:
E           offset_col:
E           - drugorderoffset
E           - drugstartoffset
E           - drugstopoffset
E           pseudotime_col:
E           - drugordertimestamp
E           - drugstarttimestamp
E           - drugstoptimestamp
E           output_data_cols:
E           - medicationid
E           - drugivadmixture
E           - drugname
E           - drughiclseqno
E           - dosage
E           - routeadmin
E           - frequency
E           - loadingdose
E           - prn
E           - gtc
E           warning_items:
E           - We **IGNORE** the `drugOrderCancelled` column -- this may be a mistake!
E           
E           [2026-05-13 16:48:52,940][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for nurseAssessment:
E           offset_col:
E           - nurseassessoffset
E           - nurseassessentryoffset
E           pseudotime_col:
E           - nurseAssessPerformedTimestamp
E           - nurseAssessEnteredTimestamp
E           output_data_cols:
E           - nurseassessid
E           - celllabel
E           - cellattribute
E           - cellattributevalue
E           warning_items:
E           - Should we be using `cellAttributePath` instead of `cellAttribute`?
E           - SOME MAY BE LISTS
E           
E           [2026-05-13 16:48:52,941][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for nurseCare:
E           offset_col:
E           - nursecareoffset
E           - nursecareentryoffset
E           pseudotime_col:
E           - nurseCarePerformedTimestamp
E           - nurseCareEnteredTimestamp
E           output_data_cols:
E           - nursecareid
E           - celllabel
E           - cellattribute
E           - cellattributevalue
E           warning_items:
E           - Should we be using `cellAttributePath` instead of `cellAttribute`?
E           - SOME MAY BE LISTS
E           
E           [2026-05-13 16:48:52,941][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for nurseCharting:
E           offset_col:
E           - nursingchartoffset
E           - nursingchartentryoffset
E           pseudotime_col:
E           - nursingChartPerformedTimestamp
E           - nursingChartEnteredTimestamp
E           output_data_cols:
E           - nursingchartid
E           - nursingchartcelltypecat
E           - nursingchartcelltypevalname
E           - nursingchartcelltypevallabel
E           - nursingchartvalue
E           warning_items:
E           - SOME MAY BE LISTS
E           
E           [2026-05-13 16:48:52,942][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for pastHistory:
E           offset_col:
E           - pasthistoryoffset
E           - pasthistoryenteredoffset
E           pseudotime_col:
E           - pastHistoryTakenTimestamp
E           - pastHistoryEnteredTimestamp
E           output_data_cols:
E           - pasthistoryid
E           - pasthistorynotetype
E           - pasthistorypath
E           - pasthistoryvalue
E           - pasthistoryvaluetext
E           warning_items:
E           - SOME MAY BE LISTS
E           - How should we use `pastHistoryPath` vs. `pastHistoryNoteType`?
E           - How should we use `pastHistoryValue` vs. `pastHistoryValueText`?
E           
E           [2026-05-13 16:48:52,943][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for physicalExam:
E           offset_col: physicalexamoffset
E           pseudotime_col: physicalExamEnteredTimestamp
E           output_data_cols:
E           - physicalexamid
E           - physicalexamtext
E           - physicalexampath
E           - physicalexamvalue
E           warning_items:
E           - How should we use `physicalExamValue` vs. `physicalExamText`?
E           - I believe the `physicalExamValue` is a **LIST**. This must be processed specially.
E           
E           [2026-05-13 16:48:52,943][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for respiratoryCharting:
E           offset_col:
E           - respchartoffset
E           - respchartentryoffset
E           pseudotime_col:
E           - respChartPerformedTimestamp
E           - respChartEnteredTimestamp
E           output_data_cols:
E           - respchartid
E           - respcharttypecat
E           - respchartvaluelabel
E           - respchartvalue
E           warning_items:
E           - SOME MAY BE LISTS
E           
E           [2026-05-13 16:48:52,944][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for treatment:
E           offset_col: treatmentoffset
E           pseudotime_col: treatmentEnteredTimestamp
E           output_data_cols:
E           - treatmentid
E           - treatmentstring
E           warning_items:
E           - Absence of entries in table do not indicate absence of treatments
E           
E           [2026-05-13 16:48:52,944][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for vitalAperiodic:
E           offset_col: observationoffset
E           pseudotime_col: observationEnteredTimestamp
E           output_data_cols:
E           - vitalaperiodicid
E           - noninvasivesystolic
E           - noninvasivediastolic
E           - noninvasivemean
E           - paop
E           - cardiacoutput
E           - cardiacinput
E           - svr
E           - svri
E           - pvr
E           - pvri
E           
E           [2026-05-13 16:48:52,945][eICU_MEDS.pre_MEDS][INFO] -   Adding preprocessor for vitalPeriodic:
E           offset_col: observationoffset
E           pseudotime_col: observationEnteredTimestamp
E           output_data_cols:
E           - vitalperiodicid
E           - temperature
E           - sao2
E           - heartrate
E           - respiration
E           - cvp
E           - etco2
E           - systemicsystolic
E           - systemicdiastolic
E           - systemicmean
E           - pasystolic
E           - padiastolic
E           - pamean
E           - st1
E           - st2
E           - st3
E           - icp
E           warning_items:
E           - These are 5-minute median values. There are going to be a *lot* of events.
E           
E           [2026-05-13 16:48:52,945][eICU_MEDS.pre_MEDS][INFO] - Processing patient table first...
E           [2026-05-13 16:48:52,945][eICU_MEDS.pre_MEDS][INFO] - Loading ........./tmpehdkn2a5/raw/hospital.csv.gz...
E           [2026-05-13 16:48:52,947][eICU_MEDS.pre_MEDS][INFO] - Loading ........./tmpehdkn2a5/raw/patient.csv.gz...
E           [2026-05-13 16:48:52,958][eICU_MEDS.pre_MEDS][INFO] - Processing patient table...
E           [2026-05-13 16:48:52,960][eICU_MEDS.pre_MEDS][INFO] - Checking that the 24h times are consistent. If this is extremely slow, consider refactoring to have only one `.collect()` call.
E           [2026-05-13 16:48:52,960][eICU_MEDS.pre_MEDS][INFO] - Checking that stated 24h times are consistent given offsets between {pseudotime_col.name} and hospitaldischargetime24...
E           [2026-05-13 16:48:52,961][eICU_MEDS.pre_MEDS][INFO] - Checking that stated 24h times are consistent given offsets between {pseudotime_col.name} and hospitaladmittime24...
E           [2026-05-13 16:48:52,963][eICU_MEDS.pre_MEDS][INFO] - Checking that stated 24h times are consistent given offsets between {pseudotime_col.name} and unitadmittime24...
E           [2026-05-13 16:48:52,965][eICU_MEDS.pre_MEDS][INFO] - Checking that stated 24h times are consistent given offsets between {pseudotime_col.name} and unitdischargetime24...
E           [2026-05-13 16:48:52,966][eICU_MEDS.pre_MEDS][INFO] - Validated 24h times in 0:00:00.006702
E           [2026-05-13 16:48:52,966][eICU_MEDS.pre_MEDS][WARNING] - NOT validating the `unitVisitNumber` column as that isn't implemented yet.
E           [2026-05-13 16:48:52,967][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING. Check with the eICU team:
E             - `apacheAdmissionDx` is not selected from the patients table as we grab it from `admissiondx`. Is this right?
E             - `admissionHeight` and `admissionWeight` are interpreted as **unit** admission height/weight, not hospital admission height/weight. Is this right?
E             - `age` is interpreted as the age at the time of the unit stay, not the hospital stay. Is this right?
E             - `What is the actual mean age for those > 89? Here we assume 90.
E             - Note that all the column names appear to be all in lowercase for the csv versions, vs. the docs
E           [2026-05-13 16:48:52,988][eICU_MEDS.pre_MEDS][INFO] - Processing medication...
E           [2026-05-13 16:48:53,123][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/medication.csv.gz in 0:00:00.134833
E           [2026-05-13 16:48:53,124][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING for medication table. Check with the eICU team:
E             - We **IGNORE** the `drugOrderCancelled` column -- this may be a mistake!
E           [2026-05-13 16:48:53,178][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/medication.parquet in 0:00:00.189356
E           [2026-05-13 16:48:53,178][eICU_MEDS.pre_MEDS][INFO] - Processing treatment...
E           [2026-05-13 16:48:53,209][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/treatment.csv.gz in 0:00:00.031399
E           [2026-05-13 16:48:53,210][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING for treatment table. Check with the eICU team:
E             - Absence of entries in table do not indicate absence of treatments
E           [2026-05-13 16:48:53,225][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/treatment.parquet in 0:00:00.046771
E           [2026-05-13 16:48:53,225][eICU_MEDS.pre_MEDS][INFO] - Processing careplangoal...
E           [2026-05-13 16:48:53,229][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/carePlanGoal.csv.gz in 0:00:00.003867
E           [2026-05-13 16:48:53,234][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/careplangoal.parquet in 0:00:00.009486
E           [2026-05-13 16:48:53,235][eICU_MEDS.pre_MEDS][INFO] - Processing pasthistory...
E           [2026-05-13 16:48:53,249][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/pastHistory.csv.gz in 0:00:00.014628
E           [2026-05-13 16:48:53,249][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING for pastHistory table. Check with the eICU team:
E             - SOME MAY BE LISTS
E             - How should we use `pastHistoryPath` vs. `pastHistoryNoteType`?
E             - How should we use `pastHistoryValue` vs. `pastHistoryValueText`?
E           [2026-05-13 16:48:53,259][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/pasthistory.parquet in 0:00:00.024299
E           [2026-05-13 16:48:53,259][eICU_MEDS.pre_MEDS][WARNING] - No function needed for apachepredvar. For eICU, THIS IS UNEXPECTED
E           [2026-05-13 16:48:53,259][eICU_MEDS.pre_MEDS][INFO] - Processing careplaneol...
E           [2026-05-13 16:48:53,259][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/carePlanEOL.csv.gz in 0:00:00.000310
E           [2026-05-13 16:48:53,260][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING for carePlanEOL table. Check with the eICU team:
E             - Is the DiscussionOffset time actually reliable? Should we fall back on the SaveOffset time?
E           [2026-05-13 16:48:53,262][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/careplaneol.parquet in 0:00:00.003305
E           [2026-05-13 16:48:53,263][eICU_MEDS.pre_MEDS][INFO] - Processing infusiondrug...
E           [2026-05-13 16:48:53,300][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/infusiondrug.csv.gz in 0:00:00.037671
E           [2026-05-13 16:48:53,320][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/infusiondrug.parquet in 0:00:00.057438
E           [2026-05-13 16:48:53,320][eICU_MEDS.pre_MEDS][INFO] - Processing admissiondx...
E           [2026-05-13 16:48:53,328][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/admissionDx.csv.gz in 0:00:00.007879
E           [2026-05-13 16:48:53,328][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING for admissionDx table. Check with the eICU team:
E             - How should we use `admitdxtest`?
E             - How should we use `admitdxpath`?
E           [2026-05-13 16:48:53,334][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/admissiondx.parquet in 0:00:00.013944
E           [2026-05-13 16:48:53,334][eICU_MEDS.pre_MEDS][WARNING] - No function needed for respiratorycare. For eICU, THIS IS UNEXPECTED
E           [2026-05-13 16:48:53,334][eICU_MEDS.pre_MEDS][INFO] - Processing lab...
E           [2026-05-13 16:48:53,879][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/lab.csv.gz in 0:00:00.544326
E           [2026-05-13 16:48:53,879][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING for lab table. Check with the eICU team:
E             - Is this the time the lab was drawn? Entered? The time the result came in?
E             - We **IGNORE** the `labResultRevisedOffset` column -- this may be a mistake!
E           [2026-05-13 16:48:54,016][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/lab.parquet in 0:00:00.681302
E           [2026-05-13 16:48:54,016][eICU_MEDS.pre_MEDS][WARNING] - No function needed for careplancareprovider. For eICU, THIS IS UNEXPECTED
E           [2026-05-13 16:48:54,016][eICU_MEDS.pre_MEDS][INFO] - Processing careplangeneral...
E           [2026-05-13 16:48:54,044][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/carePlanGeneral.csv.gz in 0:00:00.027849
E           [2026-05-13 16:48:54,057][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/careplangeneral.parquet in 0:00:00.040582
E           [2026-05-13 16:48:54,057][eICU_MEDS.pre_MEDS][WARNING] - No function needed for intakeoutput. For eICU, THIS IS UNEXPECTED
E           [2026-05-13 16:48:54,057][eICU_MEDS.pre_MEDS][INFO] - Processing vitalaperiodic...
E           [2026-05-13 16:48:54,292][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/vitalAperiodic.csv.gz in 0:00:00.235637
E           [2026-05-13 16:48:54,372][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/vitalaperiodic.parquet in 0:00:00.314957
E           [2026-05-13 16:48:54,372][eICU_MEDS.pre_MEDS][WARNING] - No function needed for apacheapsvar. For eICU, THIS IS UNEXPECTED
E           [2026-05-13 16:48:54,372][eICU_MEDS.pre_MEDS][INFO] - Processing physicalexam...
E           [2026-05-13 16:48:54,450][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/physicalExam.csv.gz in 0:00:00.077769
E           [2026-05-13 16:48:54,450][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING for physicalExam table. Check with the eICU team:
E             - How should we use `physicalExamValue` vs. `physicalExamText`?
E             - I believe the `physicalExamValue` is a **LIST**. This must be processed specially.
E           [2026-05-13 16:48:54,481][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/physicalexam.parquet in 0:00:00.108978
E           [2026-05-13 16:48:54,481][eICU_MEDS.pre_MEDS][INFO] - Processing careplaninfectiousdisease...
E           [2026-05-13 16:48:54,482][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/carePlanInfectiousDisease.csv.gz in 0:00:00.000565
E           [2026-05-13 16:48:54,486][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/careplaninfectiousdisease.parquet in 0:00:00.004494
E           [2026-05-13 16:48:54,486][eICU_MEDS.pre_MEDS][INFO] - Processing vitalperiodic...
E           [2026-05-13 16:48:56,394][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/vitalPeriodic.csv.gz in 0:00:01.908234
E           [2026-05-13 16:48:56,394][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING for vitalPeriodic table. Check with the eICU team:
E             - These are 5-minute median values. There are going to be a *lot* of events.
E           [2026-05-13 16:48:56,952][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/vitalperiodic.parquet in 0:00:02.466280
E           [2026-05-13 16:48:56,952][eICU_MEDS.pre_MEDS][INFO] - Processing nurseassessment...
E           [2026-05-13 16:48:57,060][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/nurseAssessment.csv.gz in 0:00:00.107457
E           [2026-05-13 16:48:57,060][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING for nurseAssessment table. Check with the eICU team:
E             - Should we be using `cellAttributePath` instead of `cellAttribute`?
E             - SOME MAY BE LISTS
E           [2026-05-13 16:48:57,096][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/nurseassessment.parquet in 0:00:00.143725
E           [2026-05-13 16:48:57,096][eICU_MEDS.pre_MEDS][WARNING] - No function needed for customlab. For eICU, THIS IS UNEXPECTED
E           [2026-05-13 16:48:57,096][eICU_MEDS.pre_MEDS][WARNING] - No function needed for microlab. For eICU, THIS IS UNEXPECTED
E           [2026-05-13 16:48:57,097][eICU_MEDS.pre_MEDS][INFO] - Processing respiratorycharting...
E           [2026-05-13 16:48:57,260][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/respiratoryCharting.csv.gz in 0:00:00.163816
E           [2026-05-13 16:48:57,261][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING for respiratoryCharting table. Check with the eICU team:
E             - SOME MAY BE LISTS
E           [2026-05-13 16:48:57,329][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/respiratorycharting.parquet in 0:00:00.232721
E           [2026-05-13 16:48:57,330][eICU_MEDS.pre_MEDS][INFO] - Processing nursecare...
E           [2026-05-13 16:48:57,379][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/nurseCare.csv.gz in 0:00:00.049365
E           [2026-05-13 16:48:57,379][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING for nurseCare table. Check with the eICU team:
E             - Should we be using `cellAttributePath` instead of `cellAttribute`?
E             - SOME MAY BE LISTS
E           [2026-05-13 16:48:57,399][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/nursecare.parquet in 0:00:00.069161
E           [2026-05-13 16:48:57,399][eICU_MEDS.pre_MEDS][WARNING] - No function needed for apachepatientresult. For eICU, THIS IS UNEXPECTED
E           [2026-05-13 16:48:57,399][eICU_MEDS.pre_MEDS][WARNING] - Skipping admissiondrug as it is not supported in this pipeline.
E           [2026-05-13 16:48:57,399][eICU_MEDS.pre_MEDS][WARNING] - Skipping note as it is not supported in this pipeline.
E           [2026-05-13 16:48:57,399][eICU_MEDS.pre_MEDS][INFO] - Processing allergy...
E           [2026-05-13 16:48:57,404][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/allergy.csv.gz in 0:00:00.004442
E           [2026-05-13 16:48:57,404][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING for allergy table. Check with the eICU team:
E             - How should we use `allergyNoteType`?
E             - How should we use `specialtyType`?
E             - How should we use `userType`?
E             - Is `drugName` the name of the drug to which the patient is allergic or the drug given to the patient (docs say 'name of the selected admission drug')?
E           [2026-05-13 16:48:57,408][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/allergy.parquet in 0:00:00.009302
E           [2026-05-13 16:48:57,409][eICU_MEDS.pre_MEDS][INFO] - Processing nursecharting...
E           [2026-05-13 16:48:59,007][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/nurseCharting.csv.gz in 0:00:01.598056
E           [2026-05-13 16:48:59,007][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING for nurseCharting table. Check with the eICU team:
E             - SOME MAY BE LISTS
E           [2026-05-13 16:48:59,508][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/nursecharting.parquet in 0:00:02.099422
E           [2026-05-13 16:48:59,508][eICU_MEDS.pre_MEDS][INFO] - Processing diagnosis...
E           [2026-05-13 16:48:59,534][eICU_MEDS.pre_MEDS][INFO] -   * Loaded raw ........./tmpehdkn2a5/raw/diagnosis.csv.gz in 0:00:00.025905
E           [2026-05-13 16:48:59,534][eICU_MEDS.pre_MEDS][WARNING] - NOT SURE ABOUT THE FOLLOWING for diagnosis table. Check with the eICU team:
E             - Though we use it, the `diagnosisString` field documentation is unclear -- by what is it separated?
E           [2026-05-13 16:48:59,546][eICU_MEDS.pre_MEDS][INFO] -   * Processed and wrote to ........./tmpehdkn2a5/pre_MEDS/diagnosis.parquet in 0:00:00.037266
E           [2026-05-13 16:48:59,546][eICU_MEDS.pre_MEDS][INFO] - Done! All dataframes processed and written to ......................../tmp/tmpehdkn2a5/pre_MEDS
E           [2026-05-13 16:48:59,547][eICU_MEDS.__main__][INFO] - Running in serial mode as N_WORKERS is not set.
E           [2026-05-13 16:48:59,547][eICU_MEDS.commands][INFO] - Running command: DATASET_NAME=eICU DATASET_VERSION=2.0:0.0.2 EVENT_CONVERSION_CONFIG_FP=....../venvs/datasets/eICU/lib/python3.11.../eICU_MEDS/configs/event_configs.yaml PRE_MEDS_DIR=......................../tmp/tmpehdkn2a5/pre_MEDS MEDS_COHORT_DIR=.........................../tmp/tmpcx93rbxs/eICU MEDS_transform-runner --config-path=....../venvs/datasets/eICU/lib/python3.11........./site-packages/eICU_MEDS/configs --config-name=runner pipeline_config_fp=....../venvs/datasets/eICU/lib/python3.11....../eICU_MEDS/configs/ETL.yaml ~parallelize 'hydra.searchpath=[pkg://MEDS_transforms.configs]' ++do_overwrite=False
E           [2026-05-13 16:49:01,023][eICU_MEDS.commands][INFO] - Command output:
E           [2026-05-13 16:49:00,063][MEDS_transforms.runner][INFO] - Running stage: shard_events
E           [2026-05-13 16:49:00,143][MEDS_transforms.runner][INFO] - Running command: MEDS_extract-shard_events --config-dir=....../venvs/datasets/eICU/lib/python3.11........./site-packages/eICU_MEDS/configs --config-name=ETL 'hydra.searchpath=[pkg://MEDS_transforms.configs]' stage=shard_events
E           [2026-05-13 16:49:00,942][MEDS_transforms.runner][INFO] - Command output:
E           [2026-05-13 16:49:00,745][MEDS_transforms.extract.shard_events][INFO] - Running with config:
E           input_dir: ${oc.env:PRE_MEDS_DIR}
E           cohort_dir: ${oc.env:MEDS_COHORT_DIR}
E           _default_description: 'This is a MEDS pipeline ETL. Please set a more detailed description
E             at the top of your specific pipeline
E           
E             configuration file.'
E           log_dir: ${stage_cfg.output_dir}/.logs
E           do_overwrite: false
E           seed: 1
E           stages:
E           - shard_events
E           - split_and_shard_subjects
E           - convert_to_sharded_events
E           - merge_to_MEDS_cohort
E           - extract_code_metadata
E           - finalize_MEDS_metadata
E           - finalize_MEDS_data
E           stage_configs:
E             shard_events:
E               row_chunksize: 200000000
E               infer_schema_length: 999999999
E               data_input_dir: ${input_dir}
E             split_and_shard_subjects:
E               is_metadata: true
E               output_dir: ${cohort_dir}/metadata
E               n_subjects_per_shard: 1000
E               external_splits_json_fp: null
E               split_fracs:
E                 train: 0.8
E                 tuning: 0.1
E                 held_out: 0.1
E             convert_to_sharded_events:
E               do_dedup_text_and_numeric: true
E             merge_to_MEDS_cohort:
E               unique_by: '*'
E               additional_sort_by: null
E             extract_code_metadata:
E               is_metadata: true
E               description_separator: '
E           
E                 '
E             finalize_MEDS_metadata:
E               is_metadata: true
E               do_retype: true
E             finalize_MEDS_data:
E               do_retype: true
E           worker: 0
E           polling_time: 300
E           stage: shard_events
E           stage_cfg: ${oc.create:${populate_stage:${stage}, ${input_dir}, ${cohort_dir}, ${stages},
E             ${stage_configs}}}
E           etl_metadata:
E             pipeline_name: ???
E             dataset_name: ${oc.env:DATASET_NAME}
E             dataset_version: ${oc.env:DATASET_VERSION}
E             package_name: ${get_package_name:}
E             package_version: ${get_package_version:}
E           etl_metadata.pipeline_name: extract
E           description: "This pipeline extracts a dataset in longitudinal, sparse form from an\
E             \ input dataset meeting\nselect criteria and converts them to the MEDS format.\n\
E             \nThis pipeline is for the ${etl_metadata.dataset_name} dataset version ${etl_metadata.dataset_version}.\n\
E             \nKey variables for this pipeline are\n  - `input_dir=$PATH_TO_INPUT_DIR`\n  - `cohort_dir=$PATH_TO_OUTPUT_DIR`.\n\
E             \  - `event_conversion_config_fp=$PATH_TO_EVENT_CONVERSION_CONFIG_FP`\n\nSee the\
E             \ MEDS-Transforms Extract documentation for configuration file details."
E           event_conversion_config_fp: ${oc.env:EVENT_CONVERSION_CONFIG_FP}
E           shards_map_fp: ${cohort_dir}/metadata/.shards.json
E           parallelize:
E             n_workers: ${oc.env:N_WORKERS}
E             launcher: joblib
E           
E           Stage: shard_events
E           
E           Stage config:
E           row_chunksize: 200000000
E           infer_schema_length: 999999999
E           data_input_dir: ......................../tmp/tmpehdkn2a5/pre_MEDS
E           is_metadata: false
E           metadata_input_dir: ................../tmpehdkn2a5/pre_MEDS/metadata
E           output_dir: ......................../tmpcx93rbxs/eICU/shard_events
E           reducer_output_dir: null
E           
E           [2026-05-13 16:49:00,752][MEDS_transforms.extract.shard_events][INFO] - Reading event conversion config from ....../venvs/datasets/eICU/lib/python3.11.../eICU_MEDS/configs/event_configs.yaml to identify needed columns.
E           [2026-05-13 16:49:00,813][MEDS_transforms.extract.shard_events][WARNING] - Skipping ........./tmpehdkn2a5/pre_MEDS/pasthistory.parquet as it is not specified in the event conversion configuration.
E           [2026-05-13 16:49:00,813][MEDS_transforms.extract.shard_events][WARNING] - Skipping ........./tmpehdkn2a5/pre_MEDS/nurseassessment.parquet as it is not specified in the event conversion configuration.
E           [2026-05-13 16:49:00,813][MEDS_transforms.extract.shard_events][WARNING] - Skipping ........./tmpehdkn2a5/pre_MEDS/physicalexam.parquet as it is not specified in the event conversion configuration.
E           [2026-05-13 16:49:00,813][MEDS_transforms.extract.shard_events][WARNING] - Skipping ........./tmpehdkn2a5/pre_MEDS/nursecharting.parquet as it is not specified in the event conversion configuration.
E           [2026-05-13 16:49:00,813][MEDS_transforms.extract.shard_events][WARNING] - Skipping ........./tmpehdkn2a5/pre_MEDS/respiratorycharting.parquet as it is not specified in the event conversion configuration.
E           [2026-05-13 16:49:00,813][MEDS_transforms.extract.shard_events][WARNING] - Skipping ........./tmpehdkn2a5/pre_MEDS/nursecare.parquet as it is not specified in the event conversion configuration.
E           [2026-05-13 16:49:00,814][MEDS_transforms.extract.shard_events][INFO] - Starting event sub-sharding. Sub-sharding 14 files:
E             * ................../tmpehdkn2a5/pre_MEDS/patient.parquet
E             * ........./tmpehdkn2a5/pre_MEDS/admissiondx.parquet
E             * ........./tmpehdkn2a5/pre_MEDS/diagnosis.parquet
E             * ........./tmpehdkn2a5/pre_MEDS/vitalperiodic.parquet
E             * ........./tmpehdkn2a5/pre_MEDS/allergy.parquet
E             * ........./tmpehdkn2a5/pre_MEDS/treatment.parquet
E             * ........./tmpehdkn2a5/pre_MEDS/careplaninfectiousdisease.parquet
E             * ........./tmpehdkn2a5/pre_MEDS/infusiondrug.parquet
E             * ........./tmpehdkn2a5/pre_MEDS/careplaneol.parquet
E             * ........./tmpehdkn2a5/pre_MEDS/vitalaperiodic.parquet
E             * ........./tmpehdkn2a5/pre_MEDS/careplangoal.parquet
E             * ........./tmpehdkn2a5/pre_MEDS/medication.parquet
E             * ........./tmpehdkn2a5/pre_MEDS/careplangeneral.parquet
E             * ........./tmpehdkn2a5/pre_MEDS/lab.parquet
E           [2026-05-13 16:49:00,818][MEDS_transforms.extract.shard_events][INFO] - Will read raw data from ......................../tmp/tmpehdkn2a5/pre_MEDS/$IN_FILE.parquet and write sub-sharded data to ......................../tmpcx93rbxs/eICU/shard_events/$IN_FILE/$ROW_START-$ROW_END.parquet
E           [2026-05-13 16:49:00,821][MEDS_transforms.extract.shard_events][INFO] - Processing ................../tmpehdkn2a5/pre_MEDS/patient.parquet to ......................../tmpcx93rbxs/eICU/shard_events/patient.
E           [2026-05-13 16:49:00,821][MEDS_transforms.extract.shard_events][INFO] - Performing preliminary read of ................../tmpehdkn2a5/pre_MEDS/patient.parquet to determine row count.
E           [2026-05-13 16:49:00,825][MEDS_transforms.extract.shard_events][INFO] - Ignoring infer_schema_length=999999999 for Parquet files.
E           [2026-05-13 16:49:00,829][MEDS_transforms.extract.shard_events][INFO] - Read 2520 rows from ................../tmpehdkn2a5/pre_MEDS/patient.parquet.
E           [2026-05-13 16:49:00,829][MEDS_transforms.extract.shard_events][INFO] - Splitting ................../tmpehdkn2a5/pre_MEDS/patient.parquet into 1 row-chunks of size 200000000.
E           [2026-05-13 16:49:00,829][MEDS_transforms.extract.shard_events][INFO] - Writing file 1/1: ................../tmpehdkn2a5/pre_MEDS/patient.parquet row-chunk [0-2520) to ......................../tmpcx93rbxs/eICU/shard_events/patient/[0-2520).parquet.
E           [2026-05-13 16:49:00,833][MEDS_transforms.mapreduce.utils][INFO] - Reading input dataframe from ................../tmpehdkn2a5/pre_MEDS/patient.parquet
E           [2026-05-13 16:49:00,833][MEDS_transforms.extract.shard_events][INFO] - Ignoring infer_schema_length=999999999 for Parquet files.
E           [2026-05-13 16:49:00,834][MEDS_transforms.mapreduce.utils][INFO] - Read dataset
E           [2026-05-13 16:49:00,834][MEDS_transforms.mapreduce.utils][INFO] - Writing final output to ......................../tmpcx93rbxs/eICU/shard_events/patient/[0-2520).parquet
E           [2026-05-13 16:49:00,840][MEDS_transforms.mapreduce.utils][INFO] - Succeeded in 0:00:00.007066
E           
E           [2026-05-13 16:49:00,942][MEDS_transforms.runner][INFO] - Command error:
E           Error executing job with overrides: ['stage=shard_events']
E           Traceback (most recent call last):
E             File "....../venvs/datasets/eICU/lib/python3.11.../MEDS_transforms/extract/shard_events.py", line 411, in main
E               rwlock_wrap(
E             File "....../venvs/datasets/eICU/lib/python3.11.../MEDS_transforms/mapreduce/utils.py", line 173, in rwlock_wrap
E               lock_fp.unlink()
E             File ".../hostedtoolcache/Python/3.11.15.../x64/lib/python3.11/pathlib.py", line 1147, in unlink
E               os.unlink(self)
E           FileNotFoundError: [Errno 2] No such file or directory: '......................../tmpcx93rbxs/eICU/shard_events/patient/[0-2520).parquet.lock'
E           
E           Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
E           
E           
E           [2026-05-13 16:49:01,023][eICU_MEDS.commands][ERROR] - Command failed with return code 1.
E           [2026-05-13 16:49:01,023][eICU_MEDS.commands][ERROR] - Command stderr:
E           Error executing job with overrides: ['pipeline_config_fp=....../venvs/datasets/eICU/lib/python3.11....../eICU_MEDS/configs/ETL.yaml', '~parallelize', '++do_overwrite=False']
E           Traceback (most recent call last):
E             File "....../venvs/datasets/eICU/lib/python3.11....../site-packages/MEDS_transforms/runner.py", line 329, in main
E               run_stage(cfg, stage, default_parallelization_cfg=default_parallelization_cfg)
E             File "....../venvs/datasets/eICU/lib/python3.11....../site-packages/MEDS_transforms/runner.py", line 271, in run_stage
E               raise ValueError(
E           ValueError: Stage shard_events failed via MEDS_extract-shard_events --config-dir=....../venvs/datasets/eICU/lib/python3.11........./site-packages/eICU_MEDS/configs --config-name=ETL 'hydra.searchpath=[pkg://MEDS_transforms.configs]' stage=shard_events with return code 1.
E           
E           Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
E           
E           
E           
E           stdout:
E           [2026-05-13 16:35:21,279][MEDS_DEV.utils][INFO] - Installing requirements from .../datasets/eICU/requirements.txt into virtual environment.
E           [2026-05-13 16:35:22,642][MEDS_DEV.utils][INFO] - Installed requirements from .../datasets/eICU/requirements.txt into virtual environment.
E           [2026-05-13 16:35:22,642][MEDS_DEV.datasets.__main__][INFO] - Considering running build command: MEDS_extract-eICU do_demo=True raw_input_dir="............/tmp/tmpehdkn2a5/raw" pre_MEDS_dir="......................../tmp/tmpehdkn2a5/pre_MEDS" MEDS_cohort_dir=".........................../tmp/tmpcx93rbxs/eICU" log_dir="................../tmpcx93rbxs/eICU/.pipeline_logs"
E           [2026-05-13 16:35:22,642][MEDS_DEV.utils][INFO] - Running command in .........................../tmpcx93rbxs/eICU/cmd.sh:
E           #!/bin/bash
E           set -e
E           MEDS_extract-eICU do_demo=True raw_input_dir="............/tmp/tmpehdkn2a5/raw" pre_MEDS_dir="......................../tmp/tmpehdkn2a5/pre_MEDS" MEDS_cohort_dir=".........................../tmp/tmpcx93rbxs/eICU" log_dir="................../tmpcx93rbxs/eICU/.pipeline_logs"

tests/utils.py:146: AssertionError
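
The traceback above ends in `lock_fp.unlink()` raising `FileNotFoundError`: by the time cleanup runs, the `.lock` file is already gone. The sketch below shows one way a cleanup step could tolerate that. It is an illustration only and is not how `MEDS_transforms` implements `rwlock_wrap`. The helper name `release_lock` and the temporary directory are hypothetical, and it does not explain why the lock disappeared.

```python
import tempfile
from pathlib import Path


def release_lock(lock_fp: Path) -> bool:
    """Delete lock_fp; return True if removed, False if it was already gone."""
    try:
        lock_fp.unlink()
        return True
    except FileNotFoundError:
        # Something else removed the lock first; treat cleanup as already done.
        return False


with tempfile.TemporaryDirectory() as tmp:
    # The file name mirrors the shape seen in the log; the directory is a throwaway.
    lock_fp = Path(tmp) / "[0-2520).parquet.lock"
    lock_fp.touch()
    first = release_lock(lock_fp)   # lock exists, so it is removed
    second = release_lock(lock_fp)  # lock is gone, so no exception is raised
```

When the return value is not needed, `Path.unlink(missing_ok=True)` (Python 3.8+) has the same effect. Whether ignoring the missing file is appropriate depends on the root cause, which this log alone does not establish.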


@mmcdermott mmcdermott changed the base branch from dev to feat/dataset-demo-availability May 13, 2026 16:34
@mmcdermott mmcdermott force-pushed the feat/add-dataset-eicu branch from 1e4dfd7 to 3260c05 Compare May 13, 2026 16:34
mmcdermott added a commit that referenced this pull request May 13, 2026
Prerequisite for the per-dataset registration PRs (#305 AUMCdb, #306
EHRShot, #307 HIRID, #308 INSPIRE, #309 NWICU, #310 SICdb, #311 eICU).
Most of those datasets' upstream extractors don't ship a publicly
installable demo, and the existing registry validation requires every
dataset to declare a build_demo command.

Switches the convention to: a dataset has a demo iff its commands
declare build_demo. Absence is the signal — no separate metadata field.

- `test_all_datasets_have_commands` now requires `build_full` (which
  every dataset still needs) and allows missing `build_demo`.
- `tests/conftest.py` drops datasets without `build_demo` from the
  integration test matrix, so a per-dataset CI lane for one collects
  zero parametrized tests and passes cleanly rather than trying to
  build data the dataset can't produce.
- `src/MEDS_DEV/datasets/__main__.py` raises a clear error when called
  with `demo=True` against a dataset that doesn't declare a
  build_demo command (instead of the previous KeyError).

No dataset.yaml files change here — those changes ship with the sister
per-dataset PRs that depend on this one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
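
The convention in the commit message above (a dataset has a demo if and only if its commands declare `build_demo`) can be sketched roughly as follows. This is a minimal illustration and not the actual MEDS_DEV code: it assumes each dataset's config is a dict with a `commands` mapping, and the function names and the registry entries `with_demo` and `full_only` are hypothetical.

```python
def has_demo(dataset_cfg: dict) -> bool:
    """A dataset has a demo iff its commands declare build_demo."""
    return "build_demo" in dataset_cfg.get("commands", {})


def datasets_with_demo(registry: dict) -> list:
    """Names to keep in a demo-based test matrix; others are dropped."""
    return sorted(name for name, cfg in registry.items() if has_demo(cfg))


def get_build_command(name: str, dataset_cfg: dict, demo: bool) -> str:
    """Return the build command, raising a clear error instead of a KeyError."""
    key = "build_demo" if demo else "build_full"
    commands = dataset_cfg.get("commands", {})
    if key not in commands:
        raise ValueError(f"Dataset {name!r} does not declare a {key} command.")
    return commands[key]


# Hypothetical registry: one dataset with both commands, one with build_full only.
registry = {
    "with_demo": {"commands": {"build_full": "full-cmd", "build_demo": "demo-cmd"}},
    "full_only": {"commands": {"build_full": "full-cmd"}},
}
demo_matrix = datasets_with_demo(registry)
```

Using the absence of the key as the signal keeps the registry free of a separate flag that could drift out of sync with the commands themselves.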
@mmcdermott mmcdermott force-pushed the feat/dataset-demo-availability branch from 17019a0 to a2d4038 Compare May 13, 2026 17:20
@mmcdermott mmcdermott force-pushed the feat/add-dataset-eicu branch from 3260c05 to 896ee12 Compare May 13, 2026 17:22
@mmcdermott mmcdermott changed the base branch from feat/dataset-demo-availability to dev May 13, 2026 17:43
@mmcdermott mmcdermott force-pushed the feat/add-dataset-eicu branch from 896ee12 to 516b8dd Compare May 13, 2026 17:43