feat: HDF5-backed dataloader for memory-efficient large-scale training #146
Adds H5DataImporter, a drop-in subclass of DataImporter that loads
expression matrices from HDF5 (h5py) as native float32 instead of CSV.
Includes csv_to_h5.py chunked-write converter for one-time format
migration.
On a 118k-sample x 16k-gene compendium, this reduces peak data-loading
RAM from over 60 GB (CSV path) to approximately 25 GB, enabling training
on the full ARCHS4 bulk corpus without intermediate downsampling.
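As a rough sanity check on the figures above, the raw size of the dense matrix can be worked out directly. This is only a back-of-envelope sketch: it treats "118k" and "16k" as exactly 118,000 and 16,000, and it counts array bytes only, so it does not model the parsing overhead and temporary copies that contribute to the reported peaks.

```python
# Back-of-envelope size of a dense samples x genes matrix.
# Assumption: "118k" and "16k" are taken as exactly 118,000 and 16,000.
N_SAMPLES = 118_000
N_GENES = 16_000


def dense_gb(n_rows: int, n_cols: int, bytes_per_value: int) -> float:
    """Size in GB (10**9 bytes) of a dense n_rows x n_cols array."""
    return n_rows * n_cols * bytes_per_value / 1e9


float32_gb = dense_gb(N_SAMPLES, N_GENES, 4)  # native float32 storage
float64_gb = dense_gb(N_SAMPLES, N_GENES, 8)  # pandas' default for parsed CSV floats
print(f"float32: {float32_gb:.2f} GB, float64: {float64_gb:.2f} GB")
```

The float32 array is half the size of the float64 array that `pd.read_csv` produces by default, which accounts for part of the saving. The rest of the gap depends on implementation details not shown here.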
Implementation:
- flexynesis/h5_dataloader.py: H5DataImporter (subclass; overrides only
read_data() and validate_data_folders(), parent untouched)
- flexynesis/csv_to_h5.py: chunked-write CSV-to-HDF5 converter
- flexynesis/__init__.py: lazy exports for csv_to_h5 + H5DataImporter
- pyproject.toml: h5py>=3.10 added to dependencies
Backward compatibility: H5DataImporter falls back to CSV automatically
if an .h5 file is absent in the data folder. Parent DataImporter and
all downstream Flexynesis components are unchanged.
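The per-file fallback described above could look roughly like the sketch below. It is not the PR's actual `read_data()` override: the function name `read_modality` is hypothetical, and the HDF5 layout (datasets named `matrix`, `feature_names`, `sample_ids`, with samples stored as rows) is an assumption based on the dataset names and orientation mentioned later in this thread.

```python
from pathlib import Path

import h5py
import numpy as np
import pandas as pd


def read_modality(folder: Path, name: str) -> pd.DataFrame:
    """Load one modality as a features x samples DataFrame.

    Prefers <name>.h5 and falls back to <name>.csv. The dataset names and
    the samples-as-rows storage are assumptions about the on-disk layout.
    """
    h5_path = folder / f"{name}.h5"
    if h5_path.exists():
        with h5py.File(h5_path, "r") as f:
            values = f["matrix"][...].astype(np.float32, copy=False)
            features = [s.decode() for s in f["feature_names"][...]]
            samples = [s.decode() for s in f["sample_ids"][...]]
        # Stored as samples x features; transpose to features x samples.
        return pd.DataFrame(values.T, index=features, columns=samples)
    # CSV fallback: the first column holds the feature names.
    return pd.read_csv(folder / f"{name}.csv", index_col=0)
```

A call such as `read_modality(Path("train"), "gex")` returns the same features x samples frame whichever format is on disk. If the stored orientation differs, the transpose would need to be dropped.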
borauyar left a comment
@amitpande74 this PR is currently not acceptable. It is completely suited to your current project-specific needs and it is not yet a generic feature anyone can use on their own projects. Please see my specific comments.
    Scope B 14 GB train CSV → GUI freeze + force-shutdown. HDF5 with
    lazy per-sample loading sidesteps the whole problem.

    Inputs (must exist):
@amitpande74 all of these comments are specific to your project and are not generic information that any user would need. This non-generic content should be removed.
    import numpy as np
    import pandas as pd

    ROOT = Path("/home/amit/Desktop/projects/flexynesis")
Hard-coded user-path. This cannot work for other users.
changed/altered
    import pandas as pd

    ROOT = Path("/home/amit/Desktop/projects/flexynesis")
    SRC_DIR = ROOT / "processed_scaled_411k_tissue_B"
again hard-coded project folders; can't work for other users
    ROOT = Path("/home/amit/Desktop/projects/flexynesis")
    SRC_DIR = ROOT / "processed_scaled_411k_tissue_B"
    DST_DIR = ROOT / "processed_scaled_411k_tissue_B_h5"


    def convert_split(split):
        src_gex = SRC_DIR / split / "gex.csv"
the feature assumes there will be "gex.csv" files whereas we don't impose any such rule for anything except clin.csv files. The users can put whatever they want in the folder as long as it is in CSV format.
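One way to avoid depending on a particular file name is to treat every CSV in a split folder other than `clin.csv` as a modality. The sketch below only illustrates that rule; `list_modalities` is a hypothetical helper, not a function from Flexynesis.

```python
from pathlib import Path


def list_modalities(split_dir: Path) -> list[str]:
    """Return the modality names found in one split folder.

    Every *.csv file other than clin.csv counts as a modality,
    whatever it happens to be called.
    """
    return sorted(
        p.stem
        for p in split_dir.iterdir()
        if p.suffix == ".csv" and p.name != "clin.csv"
    )
```

With this rule, a folder holding `clin.csv`, `gex.csv`, and `proteomics.csv` yields `gex` and `proteomics`, so no file name is special-cased apart from the clinical table.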
    @@ -0,0 +1,187 @@
    #!/usr/bin/env python3
This script altogether is not acceptable. It is not generic. It is for your specific project use-case. It would be useful if it worked on any csv file(s).
made it generic
    h5_dataloader.py — HDF5-backed DataImporter for Flexynesis.

    Why this exists:
    Flexynesis stock DataImporter calls pd.read_csv(gex.csv) which loads
project-specific comments have to be removed. Please think of this as a tool that will be re-used by other people in different circumstances.
    Subclasses Flexynesis DataImporter; overrides read_data to use HDF5
    for the gex modality (the big one), CSV for everything else.

    Expects gex stored as HDF5 with this layout:
shouldn't expect gex.csv at all. It should accept any CSV file.
    for file in required_files:
        file_name = os.path.splitext(file)[0]

        # GEX → HDF5 path; everything else → CSV
please don't use any project-specific comments.
    columns = sample IDs (str)
    values = float32

    HDF5 stores samples-as-rows (118k × 16k); we transpose to
please avoid any project-specific comments
Addresses review feedback on PR BIMSBbioinfo#146. The csv_to_h5 converter and the H5DataImporter contained project-specific paths, comments, and assumptions that prevented reuse outside the original project.
csv_to_h5.py:
- Replace hard-coded input/output paths with argparse arguments; the converter now operates on any single CSV file.
- Remove the train/test folder-pair assumption and clin.csv copying; one CSV is converted to one HDF5 file.
- Remove project-specific narration from docstrings and comments.
h5_dataloader.py:
- Remove project-specific commentary; generalise docstrings to describe a reusable tool rather than one project's setup.
Both files:
- Use generic HDF5 dataset names (matrix, feature_names, sample_ids) in place of expression-specific names, so the format is not tied to gene-expression data.
Core behaviour is unchanged: H5DataImporter remains a drop-in subclass of DataImporter overriding only read_data() and validate_data_folders(), with automatic CSV fallback when no .h5 file is present.
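A generic, argparse-driven converter along the lines of this commit message might be sketched as follows. This is not the code in `csv_to_h5.py`: the argument names (`csv`, `h5`, `--chunk-rows`), the default chunk size, and the features-as-rows CSV orientation are assumptions made for illustration. Only the dataset names `matrix`, `feature_names`, and `sample_ids` come from the commit message.

```python
import argparse
from pathlib import Path

import h5py
import numpy as np
import pandas as pd


def csv_to_h5(csv_path: Path, h5_path: Path, chunk_rows: int = 1000) -> None:
    """Convert one features x samples CSV to HDF5, reading it chunk by chunk."""
    # Read only the header row to get the sample IDs.
    sample_ids = list(pd.read_csv(csv_path, index_col=0, nrows=0).columns)
    feature_names = []
    with h5py.File(h5_path, "w") as f:
        # Stored as samples x features (assumed); the feature axis grows per chunk.
        matrix = f.create_dataset(
            "matrix",
            shape=(len(sample_ids), 0),
            maxshape=(len(sample_ids), None),
            dtype="float32",
            chunks=True,
        )
        for chunk in pd.read_csv(csv_path, index_col=0, chunksize=chunk_rows):
            start = matrix.shape[1]
            matrix.resize(start + len(chunk), axis=1)
            matrix[:, start:] = chunk.to_numpy(dtype=np.float32).T
            feature_names.extend(map(str, chunk.index))
        str_dtype = h5py.string_dtype()
        f.create_dataset("feature_names", data=feature_names, dtype=str_dtype)
        f.create_dataset("sample_ids", data=sample_ids, dtype=str_dtype)


def main(argv=None) -> None:
    # Hypothetical command-line interface; the flag names are illustrative only.
    parser = argparse.ArgumentParser(description="Convert one CSV matrix to HDF5.")
    parser.add_argument("csv", type=Path, help="input CSV (features x samples)")
    parser.add_argument("h5", type=Path, help="output HDF5 file")
    parser.add_argument("--chunk-rows", type=int, default=1000)
    args = parser.parse_args(argv)
    csv_to_h5(args.csv, args.h5, args.chunk_rows)
```

Calling `main(["gex.csv", "gex.h5"])` converts a single file. Because only `chunk_rows` rows are held in memory at a time, peak usage during conversion stays well below the size of the full matrix.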
Thanks for the review, Bora. You're right: I had pushed my project-specific [...]
[...] H5DataImporter behaviour is unchanged: drop-in DataImporter subclass with [...]
borauyar left a comment
@amitpande74 Thank you for the changes.
I realized while testing the code that this only works in interactive mode in a notebook. This doesn't work from the command-line.
I created a test dataset for you to test this feature:
wget https://bimsbstatic.mdc-berlin.de/akalin/buyar/flexynesis-benchmark-datasets/dataset1_h5.tgz
tar -xzvf dataset1_h5.tgz
# run the command line to test:
flexynesis --data_path dataset1_h5 --model_class DirectPred --target_variables Erlotinib --hpo_iter 1 --features_top_percentile 5 --data_types gex,cnv
Addresses Bora's second review on PR BIMSBbioinfo#146.
__main__.py: Add HDF5 autodetection before instantiating the training-mode DataImporter. If any modality file is present as .h5 in either the train/ or test/ split, switch to H5DataImporter; otherwise fall back to the stock CSV DataImporter. This makes the HDF5 path reachable from the command line, not only from notebooks. The CSV-only path is unchanged.
csv_to_h5.py: Fix incorrect feature-name extraction. The previous implementation used pd.read_csv(usecols=[0]) which treats row 0 as a header rather than the feature-index column, producing an array of 'nan' strings in the HDF5 file. Replaced with pd.read_csv(index_col=0, usecols=[0]) so that the first column is read as the row index. Verified on the reviewer-supplied benchmark dataset: feature names are now unique and correct (A1CF, A2M, AADAC, ...).
Verified end-to-end on dataset1_h5: HDF5 detection log appears, H5 modalities are loaded as float32, per-file CSV fallback works for modalities missing an .h5 file, and data validation passes.
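The difference between the two `pd.read_csv` calls can be seen on a tiny in-memory CSV. The sample values below are made up, and the snippet does not reproduce the exact 'nan' symptom, which depends on how the earlier converter consumed the result. It only shows where the feature names end up in each case.

```python
import io

import pandas as pd

# Made-up three-gene example in a features x samples layout.
CSV_TEXT = ",s1,s2\nA1CF,0.1,0.2\nA2M,0.3,0.4\nAADAC,0.5,0.6\n"

# Without index_col, the first column is an ordinary data column and the
# row index is a plain 0..n-1 range, so .index carries no feature names.
as_column = pd.read_csv(io.StringIO(CSV_TEXT), usecols=[0])

# With index_col=0, the first column becomes the row index.
as_index = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0, usecols=[0])
feature_names = as_index.index.astype(str).tolist()
print(feature_names)
```

Reading only the first column keeps this step cheap even for very wide files, since none of the sample columns are parsed.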
Thanks again, Bora. Testing on your benchmark dataset was the right [...]
End-to-end CLI run on a corrected dataset shows the autodetect log line, both H5 modalities load as float32, per-file CSV fallback works for the asymmetric mixed-format layout, and data validation passes. If you'd like to re-test, please regenerate the .h5 files in dataset1_h5 with the current converter; the original gex.csv / cnv.csv aren't in the archive, so I couldn't regenerate them on my end. Regards,
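The autodetection rule mentioned here (switch importers if any `.h5` file exists in `train/` or `test/`) can be sketched with the standard library alone. Both function names are hypothetical, and the sketch returns the importer's class name as a string so that it runs without Flexynesis installed; the real `__main__.py` may be structured differently.

```python
from pathlib import Path


def has_h5_modality(data_path: Path, splits=("train", "test")) -> bool:
    """True if any split folder contains at least one .h5 file."""
    return any(
        split_dir.is_dir() and any(split_dir.glob("*.h5"))
        for split_dir in (data_path / s for s in splits)
    )


def pick_importer_name(data_path: Path) -> str:
    """Name of the importer class a CLI entry point would instantiate."""
    return "H5DataImporter" if has_h5_modality(data_path) else "DataImporter"
```

A single `.h5` file in either split is enough to select `H5DataImporter`. Modalities that exist only as CSV are then handled by the per-file fallback, which is what makes a mixed-format layout workable.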
Thank you @amitpande74! This seems to work. I added a workflow test for HDF5 as well.