Commit 4b57e8c

update mdformat, pin python version
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
1 parent 153346d commit 4b57e8c

22 files changed

Lines changed: 199 additions & 198 deletions

.github/pull_request_template.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@ Configure CI behavior by applying the relevant labels:
 - [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
 - [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing
 
-> \[!NOTE\]
+> [!NOTE]
 > By default, the notebooks validation tests are skipped unless explicitly enabled.
 
 #### Authorizing CI Runs

.pre-commit-config.yaml

Lines changed: 2 additions & 1 deletion
@@ -14,9 +14,10 @@ repos:
 args: ["--fix"]
 - id: ruff-format
 - repo: https://github.com/executablebooks/mdformat
-rev: 0.7.17 # Use the latest stable version
+rev: 0.7.22 # Use the latest stable version
 hooks:
 - id: mdformat
+language_version: python3.13
 additional_dependencies:
 - mdformat-tables
 - mdformat-gfm
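
For reference, a sketch of how the `mdformat` hook entry reads once this hunk is applied. Only the keys and values come from the diff. The indentation and the nesting under `repos:` are assumptions, since the rendered hunk does not preserve leading whitespace, and any `additional_dependencies` entries past the two visible ones are not shown here.

```yaml
# Assumed layout: standard pre-commit nesting (repos -> repo -> hooks).
repos:
  - repo: https://github.com/executablebooks/mdformat
    rev: 0.7.22 # Use the latest stable version
    hooks:
      - id: mdformat
        # Added in this commit: run the hook's environment on Python 3.13.
        language_version: python3.13
        additional_dependencies:
          - mdformat-tables
          - mdformat-gfm
```

The `rev` bump appears to account for the bracket un-escaping seen in the Markdown files of this commit (`\[!NOTE\]` becoming `[!NOTE]`), while `language_version` pins the interpreter used to build the hook environment.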

README.md

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@ We distribute a [development container](https://devcontainers.github.io/) config
 (`.devcontainer/devcontainer.json`) that simplifies the process of local testing and development. Opening the
 bionemo-framework folder with VSCode should prompt you to re-open the folder inside the devcontainer environment.
 
-> \[!NOTE\]
+> [!NOTE]
 > The first time you launch the devcontainer, it may take a long time to build the image. Building the image locally
 > (using the command shown above) will ensure that most of the layers are present in the local docker cache.

docs/docs/main/about/background/megatron_datasets.md

Lines changed: 6 additions & 6 deletions
@@ -13,10 +13,10 @@ would apply different random masks or different data augmentation strategies eac
 provides some utilities that make multi-epoch training easier, while obeying the determinism requirements of
 megatron.
 
-The \[MultiEpochDatasetResampler\]\[bionemo.core.data.multi_epoch_dataset.MultiEpochDatasetResampler\] class simplifies the
+The [MultiEpochDatasetResampler][bionemo.core.data.multi_epoch_dataset.MultiEpochDatasetResampler] class simplifies the
 process of multi-epoch training, where the data should both be re-shuffled each epoch with different random effects
 applied each time the data is seen. To be compatible with this resampler, the provided dataset class's `__getitem__`
-method should accept a \[EpochIndex\]\[bionemo.core.data.multi_epoch_dataset.EpochIndex\] tuple that contains both an epoch
+method should accept a [EpochIndex][bionemo.core.data.multi_epoch_dataset.EpochIndex] tuple that contains both an epoch
 and index value. Random effects can then be performed by setting the torch random seed based on the epoch value:
 
 ```python
@@ -37,9 +37,9 @@ details.
 ```
 
 For deterministic datasets that still want to train for multiple epochs with epoch-level shuffling, the
-\[IdentityMultiEpochDatasetWrapper\]\[bionemo.core.data.multi_epoch_dataset.IdentityMultiEpochDatasetWrapper\] class can
+[IdentityMultiEpochDatasetWrapper][bionemo.core.data.multi_epoch_dataset.IdentityMultiEpochDatasetWrapper] class can
 simplify this process by wrapping a dataset that accepts integer indices and passes along the
-\[EpochIndex\]\[bionemo.core.data.multi_epoch_dataset.EpochIndex\] index values from the resampled dataset.
+[EpochIndex][bionemo.core.data.multi_epoch_dataset.EpochIndex] index values from the resampled dataset.
 
 ```python
 class MyDeterministicDataset:
@@ -53,7 +53,7 @@ for sample in MultiEpochDatasetResampler(dataset, num_epochs=3, shuffle=True):
 
 ## Training Resumption
 
-To ensure identical behavior with and without job interruption, BioNeMo provides \[MegatronDataModule\]\[bionemo.llm.data.datamodule.MegatronDataModule\] to save and load state dict for training resumption, and provides \[WrappedDataLoader\]\[nemo.lightning.data.WrappedDataLoader\] to add a `mode` attribute to \[DataLoader\]\[torch.utils.data.DataLoader\].
+To ensure identical behavior with and without job interruption, BioNeMo provides [MegatronDataModule][bionemo.llm.data.datamodule.MegatronDataModule] to save and load state dict for training resumption, and provides [WrappedDataLoader][nemo.lightning.data.WrappedDataLoader] to add a `mode` attribute to [DataLoader][torch.utils.data.DataLoader].
 
 ```python
 class MyDataModule(MegatronDataModule):
@@ -100,7 +100,7 @@ WARNING: 'train' is the default value of `mode` in `WrappedDataLoader`. If not s
 ## Testing Datasets for Megatron Compatibility
 
 BioNeMo also provides utility functions for test suites to validate that datasets conform to the megatron data model.
-The \[assert_dataset_compatible_with_megatron\]\[bionemo.testing.data_utils.assert_dataset_compatible_with_megatron\]
+The [assert_dataset_compatible_with_megatron][bionemo.testing.data_utils.assert_dataset_compatible_with_megatron]
 function calls the dataset with identical indices and ensures the outputs are identical, while also checking to see if
 `torch.manual_seed` was used.

docs/docs/main/about/releasenotes-fw.md

Lines changed: 16 additions & 16 deletions
@@ -169,26 +169,26 @@
 
 ### New Features
 
-- \[Documentation\] Updated, executable ESM-2nv notebooks demonstrating: Data preprocessing and model training with custom datasets, Fine-tuning on FLIP data, Inference on OAS sequences, Pre-training from scratch and continuing training
-- \[Documentation\] New notebook demonstrating Zero-Shot Protein Design Using ESM-2nv. Thank you to @awlange from A-Alpha Bio for contributing the original version of this recipe!
+- [Documentation] Updated, executable ESM-2nv notebooks demonstrating: Data preprocessing and model training with custom datasets, Fine-tuning on FLIP data, Inference on OAS sequences, Pre-training from scratch and continuing training
+- [Documentation] New notebook demonstrating Zero-Shot Protein Design Using ESM-2nv. Thank you to @awlange from A-Alpha Bio for contributing the original version of this recipe!
 
 ### Bug fixes and Improvements
 
-- \[Geneformer\] Fixed bug in preprocessing due to a relocation of dependent artifacts.
-- \[Geneformer\] Fixes bug in finetuning to use the newer preprocessing constructor.
+- [Geneformer] Fixed bug in preprocessing due to a relocation of dependent artifacts.
+- [Geneformer] Fixes bug in finetuning to use the newer preprocessing constructor.
 
 ## BioNeMo Framework v1.8
 
 ### New Features
 
-- \[Documentation\] Updated, executable MolMIM notebooks demonstrating: Training on custom data, Inference and downstream prediction, ZINC15 dataset preprocesing, and CMA-ES optimization
-- \[Dependencies\] Upgraded the framework to [NeMo v1.23](https://github.com/NVIDIA/NeMo/tree/v1.23.0), which updates PyTorch to version 2.2.0a0+81ea7a4 and CUDA to version 12.3.
+- [Documentation] Updated, executable MolMIM notebooks demonstrating: Training on custom data, Inference and downstream prediction, ZINC15 dataset preprocesing, and CMA-ES optimization
+- [Dependencies] Upgraded the framework to [NeMo v1.23](https://github.com/NVIDIA/NeMo/tree/v1.23.0), which updates PyTorch to version 2.2.0a0+81ea7a4 and CUDA to version 12.3.
 
 ### Bug fixes and Improvements
 
-- \[ESM2\] Fixed a bug in gradient accumulation in encoder fine-tuning
-- \[MegaMolBART\] Make MegaMolBART encoder finetuning respect random seed set by user
-- \[MegaMolBART\] Finetuning with val_check_interval=1 bug fix
+- [ESM2] Fixed a bug in gradient accumulation in encoder fine-tuning
+- [MegaMolBART] Make MegaMolBART encoder finetuning respect random seed set by user
+- [MegaMolBART] Finetuning with val_check_interval=1 bug fix
 
 ### Known Issues
 
@@ -204,8 +204,8 @@
 
 ### New Features
 
-- \[EquiDock\] Remove steric clashes as a post-processing step after equidock inference.
-- \[Documentation\] Updated Getting Started section which sequentially describes prerequisites, BioNeMo Framework access, startup instructions, and next steps.
+- [EquiDock] Remove steric clashes as a post-processing step after equidock inference.
+- [Documentation] Updated Getting Started section which sequentially describes prerequisites, BioNeMo Framework access, startup instructions, and next steps.
 
 ### Known Issues
 
@@ -215,11 +215,11 @@
 
 ### New Features
 
-- \[Model Fine-tuning\] `model.freeze_layers` fine-tuning config parameter added to freeze a specified number of layers. Thank you to github user [@nehap25](https://github.com/nehap25)!
-- \[ESM2\] Loading pre-trained ESM-2 weights and continue pre-training on the MLM objective on a custom FASTA dataset is now supported.
-- \[OpenFold\] MLPerf feature 3.2 bug (mha_fused_gemm) fix has merged.
-- \[OpenFold\] MLPerf feature 3.10 integrated into bionemo framework.
-- \[DiffDock\] Updated data loading module for DiffDock model training, changing from sqlite3 backend to webdataset.
+- [Model Fine-tuning] `model.freeze_layers` fine-tuning config parameter added to freeze a specified number of layers. Thank you to github user [@nehap25](https://github.com/nehap25)!
+- [ESM2] Loading pre-trained ESM-2 weights and continue pre-training on the MLM objective on a custom FASTA dataset is now supported.
+- [OpenFold] MLPerf feature 3.2 bug (mha_fused_gemm) fix has merged.
+- [OpenFold] MLPerf feature 3.10 integrated into bionemo framework.
+- [DiffDock] Updated data loading module for DiffDock model training, changing from sqlite3 backend to webdataset.
 
 ## BioNeMo Framework v1.5

docs/docs/main/contributing/code-review.md

Lines changed: 1 addition & 1 deletion
@@ -410,7 +410,7 @@ a fruitful interaction across the team members.
 that will allow other platforms to continue working.
 
 - Don't write commit messages that are vague or wouldn't make sense to
-partners that read the logs. For example, do not write "\[topic\]
+partners that read the logs. For example, do not write "[topic]
 Bugfix" as your header in the commit message. Keep links to videos
 out of the commit message. Again, partners are going to see these
 logs and it does not make sense to link to something they will not

docs/docs/main/contributing/contributing.md

Lines changed: 2 additions & 2 deletions
@@ -12,7 +12,7 @@ sufficient code review before being merged.
 ## Developer Certificate of Origin (DCO)
 
 We require that all contributors "sign-off" on their commits (not GPG signing, just adding the `-s | --signoff`
-argument, or follow the instructions below for auto-signing). This sign-off certifies that you adhere to the Developer
+argument, or follow the instructions below for auto-signing). This sign-off certifies that you adhere to the Developer
 Certificate of Origin (DCO) ([full text](https://developercertificate.org/)); in short that the contribution is your
 original work, or you have rights to submit it under the same license or a compatible license.

@@ -171,7 +171,7 @@ For both internal and external developers, the next step is opening a PR:
 Note that versioned releases of TensorRT OSS are posted to `release/` branches of the upstream repo.
 - Creation of a PR creation kicks off the code review process.
 - At least one TensorRT engineer will be assigned for the review.
-- While under review, mark your PRs as work-in-progress by prefixing the PR title with \[WIP\].
+- While under review, mark your PRs as work-in-progress by prefixing the PR title with [WIP].
 2. Once ready, CI can be started by a developer with permissions when they add a `/build-ci` comment. This must pass
 prior to merging.

docs/docs/main/datasets/uniprot.md

Lines changed: 5 additions & 5 deletions
@@ -1,7 +1,7 @@
 # UniProt Dataset
 
-The UniProt Knowledgebase (UniProtKB) is an open database of protein sequences curated from translated genomic data \[1\].
-The UniProt Reference Cluster (UniRef) databases provide clustered sets of sequences from UniProtKB \[2\], which have been
+The UniProt Knowledgebase (UniProtKB) is an open database of protein sequences curated from translated genomic data [1].
+The UniProt Reference Cluster (UniRef) databases provide clustered sets of sequences from UniProtKB [2], which have been
 used in previous large language model training studies to improve diversity in protein training data. UniRef clusters
 proteins hierarchically. At the highest level, UniRef100 groups proteins with identical primary sequences from the
 UniProt Archive (UniParc). UniRef90 clusters these unique sequences into buckets with 90% sequence similarity, selecting
@@ -10,7 +10,7 @@ UniRef90 representative sequences into groups with 50% sequence similarity.
 
 ## Data Used for ESM-2 Pre-training
 
-Since the original train/test splits from ESM-2 were not available \[3\], we replicated the ESM-2 pre-training experiments
+Since the original train/test splits from ESM-2 were not available [3], we replicated the ESM-2 pre-training experiments
 with UniProt's 2024_03 release. Following the approach described by the ESM-2 authors, we removed artificial sequences
 and reserved 0.5% of UniRef50 clusters for validation. From the 65,672,139 UniRef50 clusters, this resulted in 328,360
 validation sequences. We then ran MMSeqs to further ensure no contamination of the training set with sequences similar
@@ -22,7 +22,7 @@ randomly chosen UniRef90 sequence from each.
 ## Data Availability
 
 Two versions of the dataset are distributed, a full training dataset (~80GB) and a 10,000 UniRef50 cluster random slice
-(~150MB). To load and use the sanity dataset, use the \[bionemo.core.data.load\]\[bionemo.core.data.load.load\] function
+(~150MB). To load and use the sanity dataset, use the [bionemo.core.data.load][bionemo.core.data.load.load] function
 to materialize the sanity dataset in the BioNeMo2 cache directory:
 
 ```python
@@ -34,7 +34,7 @@ sanity_data_dir = load("esm2/testdata_esm2_pretrain:2.0")
 ### NGC Resource Links
 
 - [Sanity Dataset](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/resources/esm2_pretrain_nemo2_testdata/files)
-- \[Full Dataset\]
+- [Full Dataset]
 
 ## References

docs/docs/main/getting-started/initialization-guide.md

Lines changed: 1 addition & 1 deletion
@@ -210,7 +210,7 @@ Below we explain some common `docker run` options and how to use them as part of
 
 ### Mounting Volumes with the `-v` Option
 
-The `-v` allows you to mount a host machine's directory as a volume inside the
+The `-v` allows you to mount a host machine's directory as a volume inside the
 container. This enables data persistence even after the container is deleted or restarted. In the context of machine
 learning workflows, leveraging the `-v` option is essential for maintaining a local cache of datasets, model weights, and
 results on the host machine such that they can persist after the container terminates and be reused across container

docs/docs/main/getting-started/training-models.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ bionemo-esm2-train
 
 ### Running
 
-First off, we have a utility function for downloading full/test data and model checkpoints called `download_bionemo_data` that our following examples currently use. This will download the object if it is not already on your local system, and then return the path either way. For example if you run this twice in a row, you should expect the second time you run it to return the path almost instantly.
+First off, we have a utility function for downloading full/test data and model checkpoints called `download_bionemo_data` that our following examples currently use. This will download the object if it is not already on your local system, and then return the path either way. For example if you run this twice in a row, you should expect the second time you run it to return the path almost instantly.
 
 **NOTE**: NVIDIA employees should use `pbss` rather than `ngc` for the data source.
