
Commit 71f4dd8

Geneformer: TransformerEngine checkpoint conversion
Signed-off-by: Ohad Mosafi <omosafi@nvidia.com>
1 parent 4bf3878 commit 71f4dd8

13 files changed

Lines changed: 2552 additions & 0 deletions

models/geneformer/Dockerfile

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
FROM nvcr.io/nvidia/pytorch:25.06-py3
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
WORKDIR /workspace/bionemo
COPY . .
RUN --mount=type=cache,target=/root/.cache/uv \
    PIP_CONSTRAINT= pip install -e .

models/geneformer/LICENSE

Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work.

To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2022 Theodoris Lab, Gladstone Institute and The HuggingFace Inc. team. All rights reserved.
Copyright 2025 NVIDIA CORPORATION. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

models/geneformer/README.md

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
# Geneformer Implemented with TE layers

Running tests:

```bash
docker build -t geneformer .
docker run --rm -it --gpus all geneformer pytest tests/
```

Generating converted Geneformer checkpoints:

```bash
docker run --rm -it --gpus all \
  -v /path/to/checkpoint_export/:/workspace/bionemo/geneformer_export \
  -v $HOME/.cache/huggingface/:/root/.cache/huggingface \
  geneformer python export.py hf-to-te --model Geneformer-V2-104M --output-path geneformer_export
```

## Model Conversion Process

This section explains how to convert between Hugging Face and Transformer Engine (TE) Geneformer model formats. Conversion is bidirectional: from Hugging Face to TE format for optimized inference, and back to Hugging Face format for sharing and deployment. The workflow has three steps:

### Step 1: Load Original Hugging Face Model

First, load the original Geneformer model from Hugging Face:

```python
from transformers import AutoModelForMaskedLM

model_hf_original = AutoModelForMaskedLM.from_pretrained(
    "ctheodoris/Geneformer", subfolder="Geneformer-V2-104M"
)
```

This loads the pre-trained Geneformer model that serves as the reference for comparison.

### Step 2: Export to Transformer Engine Format

Convert the Hugging Face model to Transformer Engine format using the high-level export API:

```python
from geneformer.export import export_hf_checkpoint
from pathlib import Path

te_checkpoint_path = Path("te_checkpoint")
export_hf_checkpoint("Geneformer-V2-104M", te_checkpoint_path / "Geneformer-V2-104M")
```

This creates a Transformer Engine checkpoint that can be used for optimized inference.

### Step 3: Export Back to Hugging Face Format

Convert the Transformer Engine checkpoint from Step 2 back to Hugging Face format:

```python
from geneformer.export import export_te_checkpoint
from pathlib import Path

hf_export_path = Path("hf_export")
exported_model_path = te_checkpoint_path / "Geneformer-V2-104M"
export_te_checkpoint(str(exported_model_path), str(hf_export_path))
```

This step creates a new Hugging Face model that should be functionally equivalent to the original.

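One way to check the round trip is to compare parameters by name. The sketch below is illustrative only and is not part of this commit: `compare_weights` is a hypothetical helper, it works on plain mappings from parameter name to a flat list of floats, and it assumes the round-tripped model exposes the same parameter names as the original.

```python
import math
from typing import Mapping, Sequence


def compare_weights(
    reference: Mapping[str, Sequence[float]],
    candidate: Mapping[str, Sequence[float]],
    atol: float = 1e-6,
) -> dict:
    """Report parameter names that are missing, extra, or numerically different."""
    missing = sorted(set(reference) - set(candidate))
    extra = sorted(set(candidate) - set(reference))
    mismatched = []
    for name in sorted(set(reference) & set(candidate)):
        ref, cand = reference[name], candidate[name]
        # A length change or any element outside the absolute tolerance is a mismatch.
        if len(ref) != len(cand) or any(
            not math.isclose(r, c, rel_tol=0.0, abs_tol=atol) for r, c in zip(ref, cand)
        ):
            mismatched.append(name)
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


# Toy stand-ins for flattened state dicts (hypothetical parameter names).
original = {"encoder.weight": [0.1, 0.2, 0.3], "encoder.bias": [0.0]}
round_trip = {"encoder.weight": [0.1, 0.2, 0.3 + 1e-9], "encoder.bias": [0.0]}
report = compare_weights(original, round_trip)
print(report)
```

With real models, each mapping could be built as `{k: v.float().flatten().tolist() for k, v in model.state_dict().items()}`; for large checkpoints, comparing tensors directly with `torch.allclose` is likely to be faster.
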
## Local development with VS Code

To run these tests from VS Code, add the following to your `.vscode/settings.json`:

```json
{
  "python.testing.pytestArgs": [
    "models/geneformer/tests"
  ],
  "python.testing.unittestEnabled": false,
  "python.testing.pytestEnabled": true
}
```

Additionally, run the following command to install the dependencies:

```bash
cd models/geneformer
PIP_CONSTRAINT= pip install -e .[test]
```

models/geneformer/export.py

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: LicenseRef-Apache2
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
from pathlib import Path

from geneformer.export import export_hf_checkpoint, export_te_checkpoint


GENEFORMER_MODELS = [
    "Geneformer-V1-10M",
    "Geneformer-V2-104M",
    "Geneformer-V2-316M",
    "Geneformer-V2-104M_CLcancer",
]


def main():
    """Convert Geneformer checkpoints between Hugging Face and Transformer Engine formats."""
    parser = argparse.ArgumentParser(
        description="Convert Geneformer checkpoints between Hugging Face and Transformer Engine formats"
    )

    subparsers = parser.add_subparsers(dest="conversion_type", required=True, help="Type of conversion to perform")

    hf_to_te_parser = subparsers.add_parser("hf-to-te", help="Convert from HuggingFace to Transformer Engine format")
    hf_to_te_parser.add_argument(
        "--model",
        type=str,
        choices=GENEFORMER_MODELS,
        help="Specific model to convert. If not provided, all models will be converted.",
    )
    hf_to_te_parser.add_argument(
        "--output-path",
        type=str,
        default="./hf_to_te_checkpoint_export",
        help="Output directory path for the converted model. Defaults to './hf_to_te_checkpoint_export'",
    )

    te_to_hf_parser = subparsers.add_parser("te-to-hf", help="Convert from Transformer Engine to HuggingFace format")
    te_to_hf_parser.add_argument(
        "--checkpoint-path", type=str, required=True, help="Path to the Transformer Engine checkpoint to convert"
    )
    te_to_hf_parser.add_argument(
        "--output-path",
        type=str,
        default="./te_to_hf_checkpoint_export",
        help="Output directory path for the converted model. Defaults to './te_to_hf_checkpoint_export'",
    )

    args = parser.parse_args()

    print(f"Performing {args.conversion_type} conversion...")
    if args.conversion_type == "hf-to-te":
        if args.model:
            if args.model not in GENEFORMER_MODELS:
                print(f"Error: '{args.model}' is not a valid model.\nAvailable models: {', '.join(GENEFORMER_MODELS)}")
                return

            print(f"Converting {args.model} from Hugging Face Hub...")
            export_hf_checkpoint(args.model, Path(args.output_path) / args.model)
        else:
            for model in GENEFORMER_MODELS:
                print(f"Converting {model} from Hugging Face Hub...")
                export_hf_checkpoint(model, Path(args.output_path) / model)
    else:
        print(f"Converting {args.checkpoint_path}...")
        export_te_checkpoint(args.checkpoint_path, Path(args.output_path))


if __name__ == "__main__":
    main()
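
The argument handling above can be exercised without running a conversion. The following is a minimal sketch, not part of the commit: `build_parser` is a hypothetical helper that mirrors the arguments defined inline in `main()`, and the checkpoint path passed to `te-to-hf` is a placeholder.

```python
import argparse

GENEFORMER_MODELS = [
    "Geneformer-V1-10M",
    "Geneformer-V2-104M",
    "Geneformer-V2-316M",
    "Geneformer-V2-104M_CLcancer",
]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: same subcommands and flags as main(), minus help text."""
    parser = argparse.ArgumentParser(description="Geneformer checkpoint conversion")
    subparsers = parser.add_subparsers(dest="conversion_type", required=True)

    hf_to_te = subparsers.add_parser("hf-to-te")
    hf_to_te.add_argument("--model", type=str, choices=GENEFORMER_MODELS)
    hf_to_te.add_argument("--output-path", type=str, default="./hf_to_te_checkpoint_export")

    te_to_hf = subparsers.add_parser("te-to-hf")
    te_to_hf.add_argument("--checkpoint-path", type=str, required=True)
    te_to_hf.add_argument("--output-path", type=str, default="./te_to_hf_checkpoint_export")
    return parser


parser = build_parser()

# One named model: args.model is set, so only that model would be converted.
single = parser.parse_args(["hf-to-te", "--model", "Geneformer-V2-104M"])

# No --model: args.model is None, which main() treats as "convert every model".
all_models = parser.parse_args(["hf-to-te"])

# Reverse direction: --checkpoint-path is required (placeholder path shown here).
back = parser.parse_args(["te-to-hf", "--checkpoint-path", "some/te_checkpoint"])
```

Because `--model` is declared with `choices=GENEFORMER_MODELS`, argparse rejects unknown model names before `main()` reaches its own membership check.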

models/geneformer/model_readme.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
---
library_name: transformers
license: apache-2.0
widget:
  - text: [<cls>, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, <eos>]
---

## Geneformer (TransformerEngine-optimized)

This version of the Geneformer model is optimized with NVIDIA's
[TransformerEngine](https://github.com/NVIDIA/TransformerEngine) library. It is based on the
[original Geneformer model](https://huggingface.co/ctheodoris/Geneformer) from Theodoris et al.,
and (within numerical precision) has identical weights and outputs.

Geneformer is a foundational transformer model pretrained on a large-scale corpus of single cell transcriptomes
representing a broad range of human tissues. It is suitable for fine-tuning on a wide range of tasks that take
gene expression data as input. For detailed information on the model architecture and training data, please refer
to the [accompanying paper](https://rdcu.be/ddrx0). You may also be interested in the
[documentation](https://geneformer.readthedocs.io) and [examples](https://huggingface.co/ctheodoris/Geneformer/tree/main/examples)
which demonstrate how to fine-tune Geneformer models on your tasks of interest.

Several Geneformer checkpoints are available in the Hub with varying sizes. Larger sizes generally have
somewhat better accuracy, but require much more memory and time to train:

| Checkpoint name                                                                                                   | Parameters | Input size | Vocabulary | Training data            |
| ----------------------------------------------------------------------------------------------------------------- | ---------- | ---------- | ---------- | ------------------------ |
| [Geneformer-V1-10M](https://huggingface.co/ctheodoris/Geneformer/tree/main/Geneformer-V1-10M)                     | 10M        | 2048       | ~25K genes | ~30M cells               |
| [Geneformer-V2-104M](https://huggingface.co/ctheodoris/Geneformer/tree/main/Geneformer-V2-104M)                   | 104M       | 4096       | ~20K genes | ~104M cells              |
| [Geneformer-V2-316M](https://huggingface.co/ctheodoris/Geneformer)                                                | 316M       | 4096       | ~20K genes | ~104M cells              |
| [Geneformer-V2-104M_CLcancer](https://huggingface.co/ctheodoris/Geneformer/tree/main/Geneformer-V2-104M_CLcancer) | 104M       | 4096       | ~20K genes | ~104M + 14M cancer cells |
