feat: HDF5-backed dataloader for memory-efficient large-scale training by amitpande74 · Pull Request #146 · BIMSBbioinfo/flexynesis

amitpande74 · 2026-05-14T13:37:10Z

Adds H5DataImporter, a drop-in subclass of DataImporter that loads expression matrices from HDF5 (h5py) as native float32 instead of CSV. Includes csv_to_h5.py, a chunked-write converter for one-time format migration.

On a 118k-sample x 16k-gene compendium, this reduces peak data-loading RAM from over 60 GB (CSV path) to approximately 25 GB, enabling training on the full ARCHS4 bulk corpus without intermediate downsampling.
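For scale, here is a back-of-envelope estimate of the dense matrix alone. It assumes exactly 118,000 samples and 16,000 genes (the description gives only rounded figures) and compares float32 with float64, the dtype pandas uses by default for floating-point CSV columns. The reported peaks also include parsing and copy overhead, which this arithmetic does not model.

```python
# Rough size of a dense 118k x 16k matrix at two precisions.
# The exact shape is an assumption; the PR description gives rounded figures only.
N_SAMPLES = 118_000
N_GENES = 16_000


def dense_size_gb(n_rows: int, n_cols: int, bytes_per_value: int) -> float:
    """Return the size of a dense matrix in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9


float32_gb = dense_size_gb(N_SAMPLES, N_GENES, 4)  # 4 bytes per float32 value
float64_gb = dense_size_gb(N_SAMPLES, N_GENES, 8)  # 8 bytes per float64 value

print(f"float32: {float32_gb:.2f} GB")
print(f"float64: {float64_gb:.2f} GB")
```

Under these assumptions the float32 matrix is half the size of its float64 equivalent, so keeping the data in float32 end to end avoids one large source of overhead.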

Implementation:

- flexynesis/h5_dataloader.py: H5DataImporter (subclass; overrides only read_data() and validate_data_folders(), parent untouched)
- flexynesis/csv_to_h5.py: chunked-write CSV-to-HDF5 converter
- flexynesis/__init__.py: lazy exports for csv_to_h5 + H5DataImporter
- pyproject.toml: h5py>=3.10 added to dependencies

Backward compatibility: H5DataImporter falls back to CSV automatically if an .h5 file is absent in the data folder. Parent DataImporter and all downstream Flexynesis components are unchanged.
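The per-file fallback described above can be sketched as follows. This is a minimal illustration and not the code in h5_dataloader.py: the helper name resolve_modality_file and the modality names in the demo are hypothetical, and the only behaviour taken from the description is "prefer .h5, otherwise use .csv".

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_modality_file(folder: Path, modality: str) -> Optional[Path]:
    """Prefer <modality>.h5, fall back to <modality>.csv, else return None.

    Hypothetical helper for illustration; not part of the Flexynesis API.
    """
    for suffix in (".h5", ".csv"):
        candidate = folder / f"{modality}{suffix}"
        if candidate.is_file():
            return candidate
    return None


# Demo on a throwaway folder: gex exists in both formats, cnv only as CSV.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("gex.h5", "gex.csv", "cnv.csv"):
        (folder / name).touch()
    chosen = {m: resolve_modality_file(folder, m) for m in ("gex", "cnv", "rppa")}
    chosen_names = {m: (p.name if p else None) for m, p in chosen.items()}

print(chosen_names)
```

Resolving each file independently is what allows a folder to mix formats: only the large matrices need converting.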


borauyar

@amitpande74 this PR is currently not acceptable. It is completely suited to your current project-specific needs and it is not yet a generic feature anyone can use on their own projects. Please see my specific comments.

borauyar · 2026-05-19T10:57:42Z

+  Scope B 14 GB train CSV → GUI freeze + force-shutdown. HDF5 with
+  lazy per-sample loading sidesteps the whole problem.
+
+Inputs (must exist):


@amitpande74 all these comments are specific to your project, not generic information that any user would need. This non-generic content should be removed.

borauyar · 2026-05-19T10:58:16Z

+import numpy as np
+import pandas as pd
+
+ROOT = Path("/home/amit/Desktop/projects/flexynesis")


Hard-coded user-path. This cannot work for other users.

changed/altered

borauyar · 2026-05-19T10:58:39Z

+import pandas as pd
+
+ROOT = Path("/home/amit/Desktop/projects/flexynesis")
+SRC_DIR = ROOT / "processed_scaled_411k_tissue_B"


again hard-coded project folders; can't work for other users

borauyar · 2026-05-19T10:58:51Z

+
+ROOT = Path("/home/amit/Desktop/projects/flexynesis")
+SRC_DIR = ROOT / "processed_scaled_411k_tissue_B"
+DST_DIR = ROOT / "processed_scaled_411k_tissue_B_h5"


project-specific; not generic

borauyar · 2026-05-19T10:59:44Z

+
+
+def convert_split(split):
+    src_gex = SRC_DIR / split / "gex.csv"


the feature assumes there will be "gex.csv" files whereas we don't impose any such rule for anything except clin.csv files. The users can put whatever they want in the folder as long as it is in CSV format.

implemented

borauyar · 2026-05-19T11:00:42Z

@@ -0,0 +1,187 @@
+#!/usr/bin/env python3


This script altogether is not acceptable. It is not generic. It is for your specific project use-case. It would be useful if it worked on any csv file(s).

made it generic

borauyar · 2026-05-19T11:01:20Z

+h5_dataloader.py — HDF5-backed DataImporter for Flexynesis.
+
+Why this exists:
+  Flexynesis stock DataImporter calls pd.read_csv(gex.csv) which loads


project-specific comments have to be removed. Please think of this as a tool that will be re-used by other people in different circumstances.

borauyar · 2026-05-19T11:01:51Z

+    Subclasses Flexynesis DataImporter; overrides read_data to use HDF5
+    for the gex modality (the big one), CSV for everything else.
+
+    Expects gex stored as HDF5 with this layout:


shouldn't expect gex.csv at all. It should accept any CSV file.

borauyar · 2026-05-19T11:02:12Z

+        for file in required_files:
+            file_name = os.path.splitext(file)[0]
+
+            # GEX → HDF5 path; everything else → CSV


please don't use any project-specific comments.

borauyar · 2026-05-19T11:02:32Z

+            columns = sample IDs   (str)
+            values  = float32
+
+        HDF5 stores samples-as-rows (118k × 16k); we transpose to


please avoid any project-specific comments

Addresses review feedback on PR BIMSBbioinfo#146. The csv_to_h5 converter and the H5DataImporter contained project-specific paths, comments, and assumptions that prevented reuse outside the original project.

csv_to_h5.py:
- Replace hard-coded input/output paths with argparse arguments; the converter now operates on any single CSV file.
- Remove the train/test folder-pair assumption and clin.csv copying; one CSV is converted to one HDF5 file.
- Remove project-specific narration from docstrings and comments.

h5_dataloader.py:
- Remove project-specific commentary; generalise docstrings to describe a reusable tool rather than one project's setup.

Both files:
- Use generic HDF5 dataset names (matrix, feature_names, sample_ids) in place of expression-specific names, so the format is not tied to gene-expression data.

Core behaviour is unchanged: H5DataImporter remains a drop-in subclass of DataImporter overriding only read_data() and validate_data_folders(), with automatic CSV fallback when no .h5 file is present.
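To make the generic layout concrete, here is a small round-trip sketch using the three dataset names from the commit message (matrix, feature_names, sample_ids). Several details are assumptions rather than facts from the PR: the matrix is stored samples-as-rows (as an earlier docstring in the review suggests) and transposed back on read, strings are stored as variable-length UTF-8, and the whole matrix is written in one call instead of in chunks. The helper names write_h5 and read_h5 are hypothetical.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np
import pandas as pd


def write_h5(df: pd.DataFrame, path: Path) -> None:
    """Write a features x samples DataFrame as a samples x features float32 matrix."""
    str_dtype = h5py.string_dtype(encoding="utf-8")
    with h5py.File(path, "w") as f:
        # Assumed orientation: rows are samples, columns are features.
        f.create_dataset("matrix", data=df.to_numpy(dtype=np.float32).T)
        f.create_dataset("feature_names", data=df.index.to_numpy(dtype=object), dtype=str_dtype)
        f.create_dataset("sample_ids", data=df.columns.to_numpy(dtype=object), dtype=str_dtype)


def read_h5(path: Path) -> pd.DataFrame:
    """Read the file back into a features x samples DataFrame, keeping float32."""
    with h5py.File(path, "r") as f:
        matrix = f["matrix"][:]  # stays float32, no float64 detour
        features = f["feature_names"].asstr()[:]
        samples = f["sample_ids"].asstr()[:]
    return pd.DataFrame(matrix.T, index=features, columns=samples)


# Tiny illustrative matrix: two features, three samples.
original = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=["A1CF", "A2M"],
    columns=["s1", "s2", "s3"],
)

with tempfile.TemporaryDirectory() as tmp:
    h5_path = Path(tmp) / "gex.h5"
    write_h5(original, h5_path)
    restored = read_h5(h5_path)

print(restored)
```

Because nothing in the layout names a data type, the same file structure works for any numeric feature-by-sample matrix.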

amitpande74 · 2026-05-19T12:53:34Z

Thanks for the review, Bora. You're right: I had pushed my project-specific script as-is. Addressed in the latest commit:

- csv_to_h5.py: hard-coded paths replaced with argparse arguments; it now works on any single CSV. Removed the train/test folder assumption, clin.csv copying, and the gex.csv-specific handling.
- Removed all project-specific comments and narration from both files.
- Generic HDF5 dataset names (matrix / feature_names / sample_ids) so the format isn't tied to gene-expression data.

H5DataImporter behaviour is unchanged: it is still a drop-in DataImporter subclass with CSV fallback. Ready for another look whenever you have time.
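The move from hard-coded paths to argparse might look roughly like the sketch below. The argument names (input_csv, --output, --chunk-rows) and the default output rule are hypothetical; the thread does not show the converter's actual command-line interface.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a one-CSV-in, one-HDF5-out converter (names are illustrative)."""
    parser = argparse.ArgumentParser(description="Convert one CSV matrix to one HDF5 file.")
    parser.add_argument("input_csv", type=Path, help="CSV file to convert")
    parser.add_argument(
        "-o", "--output", type=Path, default=None,
        help="output .h5 path (default: input path with the suffix replaced by .h5)",
    )
    parser.add_argument(
        "--chunk-rows", type=int, default=1000,
        help="number of CSV rows to read per chunk",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse arguments and fill in the default output path next to the input."""
    args = build_parser().parse_args(argv)
    if args.output is None:
        args.output = args.input_csv.with_suffix(".h5")
    return args


# Example invocation with an explicit argument list instead of sys.argv.
args = parse_args(["data/train/gex.csv", "--chunk-rows", "500"])
print(args.input_csv, args.output, args.chunk_rows)
```

Taking the input path as an argument is what removes the dependency on a fixed project root or a particular file name such as gex.csv.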

borauyar

@amitpande74 Thank you for the changes.
I realized while testing the code that this only works in interactive mode in a notebook. This doesn't work from the command-line.

I created a test dataset for you to test this feature:

wget https://bimsbstatic.mdc-berlin.de/akalin/buyar/flexynesis-benchmark-datasets/dataset1_h5.tgz
tar -xzvf dataset1_h5.tgz
# run the command line to test:
flexynesis --data_path dataset1_h5 --model_class DirectPred --target_variables Erlotinib --hpo_iter 1 --features_top_percentile 5 --data_types gex,cnv

Addresses Bora's second review on PR BIMSBbioinfo#146.

__main__.py: Add HDF5 autodetection before instantiating the training-mode DataImporter. If any modality file is present as .h5 in either the train/ or test/ split, switch to H5DataImporter; otherwise fall back to the stock CSV DataImporter. This makes the HDF5 path reachable from the command line, not only from notebooks. The CSV-only path is unchanged.

csv_to_h5.py: Fix incorrect feature-name extraction. The previous implementation used pd.read_csv(usecols=[0]) which treats row 0 as a header rather than the feature-index column, producing an array of 'nan' strings in the HDF5 file. Replaced with pd.read_csv(index_col=0, usecols=[0]) so that the first column is read as the row index. Verified on the reviewer-supplied benchmark dataset: feature names are now unique and correct (A1CF, A2M, AADAC, ...).

Verified end-to-end on dataset1_h5: HDF5 detection log appears, H5 modalities are loaded as float32, per-file CSV fallback works for modalities missing an .h5 file, and data validation passes.
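The autodetection rule described in this commit can be sketched with the standard library alone. This is an illustration of the stated rule, not the code added to __main__.py: the function name any_modality_is_hdf5 is hypothetical, and the importer choice is shown as a plain string so the example does not need Flexynesis installed.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def any_modality_is_hdf5(data_path: Path, data_types: Iterable[str]) -> bool:
    """Return True if any requested modality exists as <name>.h5 in train/ or test/."""
    names = list(data_types)  # materialise once so the iterable can be reused per split
    return any(
        (data_path / split / f"{name}.h5").is_file()
        for split in ("train", "test")
        for name in names
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Illustrative mixed layout: each split has one .h5 and one .csv modality.
    for rel in ("train/gex.h5", "train/cnv.csv", "test/gex.csv", "test/cnv.h5"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    mixed = any_modality_is_hdf5(root, ["gex", "cnv"])

    # Remove the .h5 files to simulate a CSV-only data folder.
    (root / "train" / "gex.h5").unlink()
    (root / "test" / "cnv.h5").unlink()
    csv_only = any_modality_is_hdf5(root, ["gex", "cnv"])

# Stand-in for the importer selection; real code would pick the class itself.
importer_name = "H5DataImporter" if mixed else "DataImporter"
print(mixed, csv_only, importer_name)
```

Checking for file presence keeps the CLI unchanged: no new flag is needed, and a CSV-only folder takes the same path as before.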

amitpande74 · 2026-05-20T11:54:05Z

Thanks again, Bora. Testing on your benchmark dataset was the right thing to do; it surfaced two issues:

- CLI autodetect: H5DataImporter wasn't reachable from the command line, only from notebooks. Added autodetection in __main__.py: if any modality is present as .h5 in train/ or test/, the training-mode importer switches to H5DataImporter automatically. CSV-only paths behave exactly as before.
- csv_to_h5 feature-name bug: the converter was reading the index column with pd.read_csv(usecols=[0]), which treats row 0 as a header rather than as feature names, producing an array of 'nan' strings in the HDF5. The .h5 files in dataset1_h5 (train/gex.h5, test/cnv.h5) were built by that buggy version, which is why the validator complained about non-unique feature names and missing intersections. Replaced with pd.read_csv(index_col=0, usecols=[0]); re-running the converter on dataset1_h5/test/gex.csv now produces 3422 unique feature names (A1CF, A2M, AADAC, ... ZYX).

End-to-end CLI run on a corrected dataset shows the autodetect log line, both H5 modalities load as float32, per-file CSV fallback works for the asymmetric mixed-format layout, and data validation passes.

If you'd like to re-test, please regenerate the .h5 files in dataset1_h5 with the current converter; the original gex.csv / cnv.csv aren't in the archive, so I couldn't regenerate them on my end.

Regards,
Amit.
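The difference between the two read_csv calls can be seen on a tiny in-memory CSV. The sketch below only contrasts where the feature names end up under each call; the thread does not show how the earlier converter turned its result into 'nan' strings, so that step is not reproduced here. The sample values are made up for illustration.

```python
import io

import pandas as pd

# Features as rows, samples as columns, with an empty header cell over the index column.
CSV_TEXT = ",s1,s2\nA1CF,0.1,0.2\nA2M,0.3,0.4\nAADAC,0.5,0.6\n"

# usecols=[0] alone: the names are parsed as an ordinary data column,
# and the row index is just a default integer range.
as_column = pd.read_csv(io.StringIO(CSV_TEXT), usecols=[0])

# index_col=0 with usecols=[0]: the names become the row index,
# and none of the numeric columns are parsed.
as_index = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0, usecols=[0])

print(list(as_column.index))
print(list(as_index.index))
```

Reading only the first column this way retrieves the feature names without loading the numeric matrix, which suits a converter that processes the values in chunks.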

borauyar · 2026-05-20T12:45:58Z

Thank you @amitpande74 ! This seems to work. I added a workflow test for HDF5 as well.

amitpande74 requested a review from borauyar May 14, 2026 13:37

amitpande74 added 2 commits May 14, 2026 15:41

style: sort imports (isort) in h5_dataloader.py and csv_to_h5.py

0809bae

style: fix E221 (whitespace before operator) for flake8 compliance

6b3fd80

borauyar requested changes May 19, 2026


borauyar requested changes May 20, 2026


borauyar merged commit ba9a778 into BIMSBbioinfo:main May 20, 2026
14 checks passed


