feat: HDF5-backed dataloader for memory-efficient large-scale training #146

Merged
borauyar merged 5 commits into BIMSBbioinfo:main from amitpande74:feat/hdf5-dataloader
May 20, 2026

@amitpande74
Collaborator

Adds H5DataImporter, a drop-in subclass of DataImporter that loads expression matrices from HDF5 (via h5py) as native float32 instead of from CSV. Also includes csv_to_h5.py, a chunked-write converter for one-time format migration.

On a 118k-sample x 16k-gene compendium, this reduces peak data-loading RAM from over 60 GB (CSV path) to approximately 25 GB, enabling training on the full ARCHS4 bulk corpus without intermediate downsampling.
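For a sense of scale, the size of the dense matrix alone can be worked out from the rounded dimensions quoted above. This is only back-of-the-envelope arithmetic: it compares 4-byte float32 values with the 8-byte float64 values that pandas uses by default for parsed numeric columns, and it does not try to reproduce the measured 60 GB and 25 GB peaks, which include more than the final array. The helper name `dense_matrix_gib` is made up for this illustration.

```python
def dense_matrix_gib(n_rows, n_cols, bytes_per_value):
    """Size of a dense n_rows x n_cols matrix in GiB (2**30 bytes)."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Rounded figures taken from the PR description ("118k-sample x 16k-gene").
n_samples, n_genes = 118_000, 16_000

float64_gib = dense_matrix_gib(n_samples, n_genes, 8)  # pandas default float width
float32_gib = dense_matrix_gib(n_samples, n_genes, 4)  # width used by the HDF5 path

print(f"float64: {float64_gib:.1f} GiB, float32: {float32_gib:.1f} GiB")
# prints "float64: 14.1 GiB, float32: 7.0 GiB"
```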

Implementation:

  • flexynesis/h5_dataloader.py: H5DataImporter (subclass; overrides only read_data() and validate_data_folders(), parent untouched)
  • flexynesis/csv_to_h5.py: chunked-write CSV-to-HDF5 converter
  • flexynesis/__init__.py: lazy exports for csv_to_h5 + H5DataImporter
  • pyproject.toml: h5py>=3.10 added to dependencies

Backward compatibility: H5DataImporter falls back to CSV automatically if an .h5 file is absent in the data folder. Parent DataImporter and all downstream Flexynesis components are unchanged.
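The fallback rule described above can be sketched as a small per-file lookup. This is an illustration only, not the code in h5_dataloader.py: the function name `resolve_modality_file` and its return shape are hypothetical, and it assumes a modality named `gex` is stored as either `gex.h5` or `gex.csv` in the same folder.

```python
from pathlib import Path


def resolve_modality_file(folder, modality):
    """Return (path, fmt) for one modality, preferring HDF5 over CSV.

    Hypothetical helper: looks for <modality>.h5 first and falls back
    to <modality>.csv when no HDF5 file is present.
    """
    folder = Path(folder)
    h5_path = folder / f"{modality}.h5"
    if h5_path.is_file():
        return h5_path, "h5"
    csv_path = folder / f"{modality}.csv"
    if csv_path.is_file():
        return csv_path, "csv"
    raise FileNotFoundError(f"no {modality}.h5 or {modality}.csv in {folder}")
```

Resolving each file separately is what allows a folder to mix formats, for example one modality as HDF5 and another still as CSV.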

@amitpande74 amitpande74 requested a review from borauyar May 14, 2026 13:37
Member

@borauyar borauyar left a comment

@amitpande74 this PR is currently not acceptable. It is completely suited to your current project-specific needs and it is not yet a generic feature anyone can use on their own projects. Please see my specific comments.

Comment thread flexynesis/csv_to_h5.py Outdated
Scope B 14 GB train CSV → GUI freeze + force-shutdown. HDF5 with
lazy per-sample loading sidesteps the whole problem.

Inputs (must exist):
Member

@amitpande74 all these comments are specific to your project rather than generic information that any user would need. This non-generic content should be removed.

Collaborator Author

removed

Comment thread flexynesis/csv_to_h5.py Outdated
import numpy as np
import pandas as pd

ROOT = Path("/home/amit/Desktop/projects/flexynesis")
Member

Hard-coded user-path. This cannot work for other users.

Collaborator Author

changed/altered

Comment thread flexynesis/csv_to_h5.py Outdated
import pandas as pd

ROOT = Path("/home/amit/Desktop/projects/flexynesis")
SRC_DIR = ROOT / "processed_scaled_411k_tissue_B"
Member

again hard-coded project folders; can't work for other users

Collaborator Author

removed

Comment thread flexynesis/csv_to_h5.py Outdated

ROOT = Path("/home/amit/Desktop/projects/flexynesis")
SRC_DIR = ROOT / "processed_scaled_411k_tissue_B"
DST_DIR = ROOT / "processed_scaled_411k_tissue_B_h5"
Member

project-specific; not generic

Collaborator Author

edited

Comment thread flexynesis/csv_to_h5.py Outdated


def convert_split(split):
    src_gex = SRC_DIR / split / "gex.csv"
Member

The feature assumes there will be "gex.csv" files, whereas we impose no such naming rule on anything except clin.csv. Users can put whatever files they want in the folder, as long as they are in CSV format.

Collaborator Author

implemented
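
To illustrate the point raised in this thread, a tool can discover data matrices by extension instead of looking for a fixed name such as gex.csv. The sketch below is hypothetical and is not the merged code: the helper name `find_data_csvs` is invented, and skipping clin.csv is an assumption based on the reviewer's remark that it is the only file with a prescribed name.

```python
from pathlib import Path


def find_data_csvs(folder, reserved=("clin.csv",)):
    """List every CSV in a folder except reserved names (assumed: clin.csv).

    No modality name is hard-coded, so any <name>.csv is picked up.
    """
    return sorted(
        path for path in Path(folder).glob("*.csv") if path.name not in reserved
    )
```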

Comment thread flexynesis/csv_to_h5.py
@@ -0,0 +1,187 @@
#!/usr/bin/env python3
Member

This script altogether is not acceptable. It is not generic. It is for your specific project use-case. It would be useful if it worked on any csv file(s).

Collaborator Author

made it generic

Comment thread flexynesis/h5_dataloader.py Outdated
h5_dataloader.py — HDF5-backed DataImporter for Flexynesis.

Why this exists:
Flexynesis stock DataImporter calls pd.read_csv(gex.csv) which loads
Member

project-specific comments have to be removed. Please think of this as a tool that will be re-used by other people in different circumstances.

Comment thread flexynesis/h5_dataloader.py Outdated
Subclasses Flexynesis DataImporter; overrides read_data to use HDF5
for the gex modality (the big one), CSV for everything else.

Expects gex stored as HDF5 with this layout:
Member

shouldn't expect gex.csv at all. It should accept any CSV file.

Comment thread flexynesis/h5_dataloader.py Outdated
for file in required_files:
    file_name = os.path.splitext(file)[0]

    # GEX → HDF5 path; everything else → CSV
Member

please don't use any project-specific comments.

Comment thread flexynesis/h5_dataloader.py Outdated
columns = sample IDs (str)
values = float32

HDF5 stores samples-as-rows (118k × 16k); we transpose to
Member

please avoid any project-specific comments

Addresses review feedback on PR BIMSBbioinfo#146. The csv_to_h5 converter and the
H5DataImporter contained project-specific paths, comments, and assumptions
that prevented reuse outside the original project.

csv_to_h5.py:
  - Replace hard-coded input/output paths with argparse arguments;
    the converter now operates on any single CSV file.
  - Remove the train/test folder-pair assumption and clin.csv copying;
    one CSV is converted to one HDF5 file.
  - Remove project-specific narration from docstrings and comments.

h5_dataloader.py:
  - Remove project-specific commentary; generalise docstrings to describe
    a reusable tool rather than one project's setup.

Both files:
  - Use generic HDF5 dataset names (matrix, feature_names, sample_ids)
    in place of expression-specific names, so the format is not tied to
    gene-expression data.

Core behaviour is unchanged: H5DataImporter remains a drop-in subclass of
DataImporter overriding only read_data() and validate_data_folders(), with
automatic CSV fallback when no .h5 file is present.
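
A minimal sketch of what a generic single-file converter along these lines might look like is shown here. It is not the merged csv_to_h5.py. The dataset names (matrix, feature_names, sample_ids) come from the commit message. Everything else is an assumption made for illustration: the function names, the `chunk_rows` parameter, the argument names, and the layout (first CSV column holds feature names, the header row holds sample IDs, and the matrix is stored in the same orientation as the CSV).

```python
import argparse

import h5py
import numpy as np
import pandas as pd


def csv_to_h5(csv_path, h5_path, chunk_rows=1000):
    """Stream one CSV into one HDF5 file, chunk_rows rows at a time."""
    str_dtype = h5py.string_dtype(encoding="utf-8")
    feature_names = []
    matrix = None
    with h5py.File(h5_path, "w") as h5:
        # chunksize makes read_csv yield DataFrames instead of loading everything.
        for chunk in pd.read_csv(csv_path, index_col=0, chunksize=chunk_rows):
            values = chunk.to_numpy(dtype=np.float32)
            if matrix is None:
                n_cols = values.shape[1]
                # Resizable along axis 0 so rows can be appended chunk by chunk.
                matrix = h5.create_dataset(
                    "matrix", shape=(0, n_cols), maxshape=(None, n_cols),
                    dtype="float32", chunks=True,
                )
                sample_ids = np.array([str(c) for c in chunk.columns], dtype=object)
                h5.create_dataset("sample_ids", data=sample_ids, dtype=str_dtype)
            start = matrix.shape[0]
            matrix.resize(start + values.shape[0], axis=0)
            matrix[start:] = values
            feature_names.extend(str(name) for name in chunk.index)
        h5.create_dataset(
            "feature_names", data=np.array(feature_names, dtype=object), dtype=str_dtype
        )


def parse_args(argv=None):
    """Hypothetical CLI: input CSV, output HDF5, optional chunk size."""
    parser = argparse.ArgumentParser(description="Convert one CSV to one HDF5 file.")
    parser.add_argument("csv_path")
    parser.add_argument("h5_path")
    parser.add_argument("--chunk-rows", type=int, default=1000)
    return parser.parse_args(argv)
```

Only one chunk of parsed rows is held in memory at a time, which is the reason for writing through a resizable dataset rather than building the full array first.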
@amitpande74
Collaborator Author

Thanks for the review, Bora. You're right — I had pushed my project-specific
script as-is. Addressed in the latest commit:

  • csv_to_h5.py: hard-coded paths replaced with argparse arguments; works on
    any single CSV now. Removed the train/test folder assumption, clin.csv
    copying, and the gex.csv-specific handling.
  • Removed all project-specific comments and narration from both files.
  • Generic HDF5 dataset names (matrix / feature_names / sample_ids) so the
    format isn't tied to gene-expression data.

H5DataImporter behaviour is unchanged — drop-in DataImporter subclass with
CSV fallback. Ready for another look whenever you have time.

Member

@borauyar borauyar left a comment

@amitpande74 Thank you for the changes.
I realized while testing the code that this only works in interactive mode in a notebook. This doesn't work from the command-line.

I created a test dataset for you to test this feature:

wget https://bimsbstatic.mdc-berlin.de/akalin/buyar/flexynesis-benchmark-datasets/dataset1_h5.tgz
tar -xzvf dataset1_h5.tgz
# run the command line to test:
flexynesis --data_path dataset1_h5 --model_class DirectPred --target_variables Erlotinib --hpo_iter 1 --features_top_percentile 5 --data_types gex,cnv

Addresses Bora's second review on PR BIMSBbioinfo#146.

__main__.py:
  Add HDF5 autodetection before instantiating the training-mode
  DataImporter. If any modality file is present as .h5 in either the
  train/ or test/ split, switch to H5DataImporter; otherwise fall back
  to the stock CSV DataImporter. This makes the HDF5 path reachable
  from the command line, not only from notebooks. The CSV-only path is
  unchanged.

csv_to_h5.py:
  Fix incorrect feature-name extraction. The previous implementation
  used pd.read_csv(usecols=[0]) which treats row 0 as a header rather
  than the feature-index column, producing an array of 'nan' strings
  in the HDF5 file. Replaced with pd.read_csv(index_col=0, usecols=[0])
  so that the first column is read as the row index. Verified on the
  reviewer-supplied benchmark dataset: feature names are now unique
  and correct (A1CF, A2M, AADAC, ...).

Verified end-to-end on dataset1_h5: HDF5 detection log appears, H5
modalities are loaded as float32, per-file CSV fallback works for
modalities missing an .h5 file, and data validation passes.
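
The autodetection rule in this commit message can be sketched roughly as follows. This is not the code added to `__main__.py`: the function names are hypothetical, the importer is returned as a class name string so the sketch runs without Flexynesis installed, and it assumes each modality is stored as `<name>.h5` or `<name>.csv` under `train/` and `test/`.

```python
from pathlib import Path


def any_modality_as_h5(data_path, data_types, splits=("train", "test")):
    """True if at least one requested modality exists as .h5 in any split."""
    root = Path(data_path)
    return any(
        (root / split / f"{name}.h5").is_file()
        for split in splits
        for name in data_types
    )


def choose_importer_name(data_path, data_types):
    """Pick the importer class name; the CSV-only case keeps the stock importer."""
    if any_modality_as_h5(data_path, data_types):
        return "H5DataImporter"
    return "DataImporter"
```

With a layout such as `train/gex.h5` plus `test/cnv.h5`, a call like `choose_importer_name("dataset1_h5", ["gex", "cnv"])` would select the HDF5 importer, and the per-file CSV fallback then covers whichever modality has no .h5 file in a given split.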
@amitpande74
Collaborator Author

amitpande74 commented May 20, 2026

Thanks again, Bora — testing on your benchmark dataset was the right
thing to do; it surfaced two issues:

  1. CLI autodetect: H5DataImporter wasn't reachable from the command line, only from notebooks. Added autodetection in __main__.py: if any modality is present as .h5 in train/ or test/, the training-mode importer switches to H5DataImporter automatically. CSV-only paths behave exactly as before.

  2. csv_to_h5 feature-name bug: the converter was reading the index column with pd.read_csv(usecols=[0]), which treats row 0 as a header rather than as feature names — producing an array of 'nan' strings in the HDF5. The .h5 files in dataset1_h5 (train/gex.h5, test/cnv.h5) were built by that buggy version, which is why the validator complained about non-unique feature names and missing intersections. Replaced with pd.read_csv(index_col=0, usecols=[0]); re-running the converter on dataset1_h5/test/gex.csv now produces 3422 unique feature names (A1CF, A2M, AADAC, ... ZYX).
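
The corrected read described in item 2 can be tried on a tiny in-memory CSV. This is a sketch under assumptions, not the converter's code: the helper name `read_feature_names` and the sample values are invented, and the layout (feature names in the first column, sample IDs in the header row) is assumed. The uniqueness check mirrors the validator complaint mentioned above.

```python
import io

import pandas as pd


def read_feature_names(csv_source):
    """Read only the first column of a CSV and return it as a list of str.

    usecols=[0] limits parsing to the first column; index_col=0 makes that
    column the row index, so the result has an index but no data columns.
    """
    index_only = pd.read_csv(csv_source, index_col=0, usecols=[0])
    return [str(name) for name in index_only.index]


# Made-up three-feature example in the assumed layout.
example_csv = io.StringIO(",S1,S2\nA1CF,0.1,0.2\nA2M,0.3,0.4\nAADAC,0.5,0.6\n")
names = read_feature_names(example_csv)
all_unique = len(set(names)) == len(names)
```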

End-to-end CLI run on a corrected dataset shows the autodetect log line, both H5 modalities load as float32, per-file CSV fallback works for the asymmetric mixed-format layout, and data validation passes.

If you'd like to re-test, please regenerate the .h5 files in dataset1_h5 with the current converter; the original gex.csv / cnv.csv aren't in the archive, so I couldn't regenerate them on my end.

Regards,
Amit.

@borauyar borauyar merged commit ba9a778 into BIMSBbioinfo:main May 20, 2026
14 checks passed
@borauyar
Member

Thank you @amitpande74! This seems to work. I added a workflow test for HDF5 as well.
