Commit a7cb31c

run mdformat and bump ruff pre-commit
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
1 parent 53a81c8 commit a7cb31c

62 files changed

Lines changed: 1482 additions & 1364 deletions

.github/pull_request_template.md

Lines changed: 19 additions & 13 deletions
@@ -1,45 +1,51 @@
 ### Description
+
 <!-- Provide a detailed description of the changes in this PR -->
 
 ### Type of changes
+
 <!-- Mark the relevant option with an [x] -->
 
-- [ ] Bug fix (non-breaking change which fixes an issue)
-- [ ] New feature (non-breaking change which adds functionality)
-- [ ] Refactor
-- [ ] Documentation update
-- [ ] Other (please describe):
+- [ ] Bug fix (non-breaking change which fixes an issue)
+- [ ] New feature (non-breaking change which adds functionality)
+- [ ] Refactor
+- [ ] Documentation update
+- [ ] Other (please describe):
 
 ### CI Pipeline Configuration
+
 Configure CI behavior by applying the relevant labels:
 
 - [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
 - [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
 - [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing
 
-> [!NOTE]
+> \[!NOTE\]
 > By default, the notebooks validation tests are skipped unless explicitly enabled.
 
 #### Authorizing CI Runs
 
 We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI
 runs on NVIDIA's compute resources.
 
-* If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
+- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
 automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
-* If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
+- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
 `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.
 
 ### Usage
+
 <!--- How does a user interact with the changed code -->
+
 ```python
-TODO: Add code snippet
+# TODO: Add code snippet
 ```
 
 ### Pre-submit Checklist
+
 <!--- Ensure all items are completed before submitting -->
 
-- [ ] I have tested these changes locally
-- [ ] I have updated the documentation accordingly
-- [ ] I have added/updated tests as needed
-- [ ] All existing tests pass successfully
+- [ ] I have tested these changes locally
+- [ ] I have updated the documentation accordingly
+- [ ] I have added/updated tests as needed
+- [ ] All existing tests pass successfully

.mdformat.toml

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+number = true # options: {false, true}

.pre-commit-config.yaml

Lines changed: 10 additions & 1 deletion
@@ -7,12 +7,21 @@ repos:
       - id: check-yaml
         exclude: "mkdocs.yml"
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.9.10
+    rev: v0.12.8
     hooks:
       - id: ruff
         # 1. Attempt to automatically fix any lint issues.
         args: ["--fix"]
       - id: ruff-format
+  - repo: https://github.com/executablebooks/mdformat
+    rev: 0.7.17 # Use the latest stable version
+    hooks:
+      - id: mdformat
+        additional_dependencies:
+          - mdformat-tables
+          - mdformat-gfm
+          - mdformat-black
+          - mdformat-frontmatter
   - repo: https://github.com/Yelp/detect-secrets
     rev: v1.5.0
     hooks:

README.md

Lines changed: 1 addition & 5 deletions
@@ -21,7 +21,6 @@ The `bionemo-framework` is organized into independently installable namespace pa
 `sub-packages/` directory. Please refer to [PEP 420 – Implicit Namespace Packages](https://peps.python.org/pep-0420/)
 for details.
 
-
 ## Documentation Resources
 
 - **Official Documentation:** For user guides, API references, and troubleshooting, visit our [official documentation](https://docs.nvidia.com/bionemo-framework/latest/).
@@ -62,7 +61,6 @@ git submodule update --init --recursive
 
 Different branches of the repo can have different pinned versions of these third-party submodules. Ensure submodules are automatically updated after switching branches or pulling updates by configuring git with:
 
-
 ```bash
 git config submodule.recurse true
 ```
@@ -72,21 +70,19 @@ You will have to run the full `git submodule update --init --recursive` command
 
 #### Build the Docker Image Locally
 
-
 With a locally cloned repository and initialized submodules, build the BioNeMo container using:
 
 ```bash
 docker buildx build . -t my-container-tag
 ```
 
-
 #### VSCode Devcontainer for Interactive Debugging
 
 We distribute a [development container](https://devcontainers.github.io/) configuration for vscode
 (`.devcontainer/devcontainer.json`) that simplifies the process of local testing and development. Opening the
 bionemo-framework folder with VSCode should prompt you to re-open the folder inside the devcontainer environment.
 
-> [!NOTE]
+> \[!NOTE\]
 > The first time you launch the devcontainer, it may take a long time to build the image. Building the image locally
 > (using the command shown above) will ensure that most of the layers are present in the local docker cache.
 

SECURITY.md

Lines changed: 8 additions & 7 deletions
@@ -7,15 +7,16 @@ If you need to report a security issue, please use the appropriate contact point
 ## Reporting Potential Security Vulnerability in an NVIDIA Product
 
 To report a potential security vulnerability in any NVIDIA product:
+
 - Web: [Security Vulnerability Submission Form](https://www.nvidia.com/object/submit-security-vulnerability.html)
 - E-Mail: psirt@nvidia.com
-- We encourage you to use the following PGP key for secure email communication: [NVIDIA public PGP Key for communication](https://www.nvidia.com/en-us/security/pgp-key)
-- Please include the following information:
-- Product/Driver name and version/branch that contains the vulnerability
-- Type of vulnerability (code execution, denial of service, buffer overflow, etc.)
-- Instructions to reproduce the vulnerability
-- Proof-of-concept or exploit code
-- Potential impact of the vulnerability, including how an attacker could exploit the vulnerability
+- We encourage you to use the following PGP key for secure email communication: [NVIDIA public PGP Key for communication](https://www.nvidia.com/en-us/security/pgp-key)
+- Please include the following information:
+- Product/Driver name and version/branch that contains the vulnerability
+- Type of vulnerability (code execution, denial of service, buffer overflow, etc.)
+- Instructions to reproduce the vulnerability
+- Proof-of-concept or exploit code
+- Potential impact of the vulnerability, including how an attacker could exploit the vulnerability
 
 While NVIDIA currently does not have a bug bounty program, we do offer acknowledgement when an externally reported security issue is addressed under our coordinated vulnerability disclosure policy. Please visit our [Product Security Incident Response Team (PSIRT)](https://www.nvidia.com/en-us/security/psirt-policies/) policies page for more information.
 

docs/docs/index.md

Lines changed: 16 additions & 21 deletions
@@ -3,50 +3,45 @@ hide:
 - navigation
 ---
 
-
 **NVIDIA BioNeMo Framework** is a collection of programming tools, libraries, and models for computational drug
 discovery. It accelerates the most time-consuming and costly stages of building and adapting biomolecular AI models by
 providing domain-specific, optimized models and tooling that are easily integrated into GPU-based computational
 resources for the fastest performance on the market. You can access BioNeMo Framework as a free community resource or
 learn more about getting an enterprise license for improved expert-level support at the
 [BioNeMo homepage](https://www.nvidia.com/en-us/clara/bionemo/).
 
-
 <div class="grid cards" markdown>
 
-- :material-book-open-variant:{ .lg } __User Guide__
-
----
-
-Install BioNeMo and set up your environment to start accelerating your bioinformatics workflows.
-
-[Get Started](main/about/overview/){ .md-button .md-button }
+- :material-book-open-variant:{ .lg } __User Guide__
 
-- :material-code-greater-than:{ .lg } __API Reference__
+______________________________________________________________________
 
----
+Install BioNeMo and set up your environment to start accelerating your bioinformatics workflows.
 
-Access comprehensive documentation on BioNeMo's sub-packages, functions, and classes.
+[Get Started](main/about/overview/){ .md-button .md-button }
 
-[API Reference](main/references/API_reference/bionemo/core/api/){ .md-button .md-button }
+- :material-code-greater-than:{ .lg } __API Reference__
 
-- :material-cube-outline:{ .lg } __Models__
+______________________________________________________________________
 
----
+Access comprehensive documentation on BioNeMo's sub-packages, functions, and classes.
 
-Explore detailed instructions and best practices for using BioNeMo models in your research.
+[API Reference](main/references/API_reference/bionemo/core/api/){ .md-button .md-button }
 
-[Explore Models](models){ .md-button .md-button }
+- :material-cube-outline:{ .lg } __Models__
 
+______________________________________________________________________
 
+Explore detailed instructions and best practices for using BioNeMo models in your research.
 
-- :material-database-outline:{ .lg } __Datasets__
+[Explore Models](models){ .md-button .md-button }
 
---- 
+- :material-database-outline:{ .lg } __Datasets__
 
-Explore biomolecular datasets that come pre-packaged with the BioNeMo Framework.
+______________________________________________________________________
 
-[Explore Datasets](main/datasets/){ .md-button .md-button }
+Explore biomolecular datasets that come pre-packaged with the BioNeMo Framework.
 
+[Explore Datasets](main/datasets/){ .md-button .md-button }
 
 </div>

docs/docs/main/about/background/megatron_datasets.md

Lines changed: 27 additions & 19 deletions
@@ -6,18 +6,17 @@ consequence, ensure that the new dataset classes preserve the required determini
 augmentation and masking can cause `dataset[i]` to return random results for a given index, breaking this megatron
 contract.
 
-
 ## Multi-Epoch Training
 
 One training regime where this limitation is most apparent is multi-epoch training, where standard training recipes
 would apply different random masks or different data augmentation strategies each time the data is encountered. BioNeMo
 provides some utilities that make multi-epoch training easier, while obeying the determinism requirements of
 megatron.
 
-The [MultiEpochDatasetResampler][bionemo.core.data.multi_epoch_dataset.MultiEpochDatasetResampler] class simplifies the
+The \[MultiEpochDatasetResampler\]\[bionemo.core.data.multi_epoch_dataset.MultiEpochDatasetResampler\] class simplifies the
 process of multi-epoch training, where the data should both be re-shuffled each epoch with different random effects
 applied each time the data is seen. To be compatible with this resampler, the provided dataset class's `__getitem__`
-method should accept a [EpochIndex][bionemo.core.data.multi_epoch_dataset.EpochIndex] tuple that contains both an epoch
+method should accept a \[EpochIndex\]\[bionemo.core.data.multi_epoch_dataset.EpochIndex\] tuple that contains both an epoch
 and index value. Random effects can then be performed by setting the torch random seed based on the epoch value:
 
 ```python
@@ -30,28 +29,31 @@ class MyDataset:
 
 !!! bug "Avoid `torch.manual_seed`"
 
-Megatron-LM handles torch seeding internally. Calling `torch.cuda.manual_seed` inside the user-provided dataset
-can cause issues with model parallelism. See [megatron/core/tensor_parallel/random.py#L198-L199](
-https://github.com/NVIDIA/Megatron-LM/blob/dddecd19/megatron/core/tensor_parallel/random.py#L198-L199) for more
-details.
+```
+Megatron-LM handles torch seeding internally. Calling `torch.cuda.manual_seed` inside the user-provided dataset
+can cause issues with model parallelism. See [megatron/core/tensor_parallel/random.py#L198-L199](
+https://github.com/NVIDIA/Megatron-LM/blob/dddecd19/megatron/core/tensor_parallel/random.py#L198-L199) for more
+details.
+```
 
 For deterministic datasets that still want to train for multiple epochs with epoch-level shuffling, the
-[IdentityMultiEpochDatasetWrapper][bionemo.core.data.multi_epoch_dataset.IdentityMultiEpochDatasetWrapper] class can
+\[IdentityMultiEpochDatasetWrapper\]\[bionemo.core.data.multi_epoch_dataset.IdentityMultiEpochDatasetWrapper\] class can
 simplify this process by wrapping a dataset that accepts integer indices and passes along the
-[EpochIndex][bionemo.core.data.multi_epoch_dataset.EpochIndex] index values from the resampled dataset.
+\[EpochIndex\]\[bionemo.core.data.multi_epoch_dataset.EpochIndex\] index values from the resampled dataset.
 
 ```python
 class MyDeterministicDataset:
-    def __getitem__(self, index: int):
-        ...
+    def __getitem__(self, index: int): ...
+
 
 dataset = IdentityMultiEpochDatasetWrapper(MyDeterministicDataset())
 for sample in MultiEpochDatasetResampler(dataset, num_epochs=3, shuffle=True):
     ...
 ```
 
 ## Training Resumption
-To ensure identical behavior with and without job interruption, BioNeMo provides [MegatronDataModule][bionemo.llm.data.datamodule.MegatronDataModule] to save and load state dict for training resumption, and provides [WrappedDataLoader][nemo.lightning.data.WrappedDataLoader] to add a `mode` attribute to [DataLoader][torch.utils.data.DataLoader].
+
+To ensure identical behavior with and without job interruption, BioNeMo provides \[MegatronDataModule\]\[bionemo.llm.data.datamodule.MegatronDataModule\] to save and load state dict for training resumption, and provides \[WrappedDataLoader\]\[nemo.lightning.data.WrappedDataLoader\] to add a `mode` attribute to \[DataLoader\]\[torch.utils.data.DataLoader\].
 
 ```python
 class MyDataModule(MegatronDataModule):
@@ -83,23 +85,29 @@ class MyDataModule(MegatronDataModule):
 
 !!! note "MegatronDataModule"
 
-Users will see non-overlapping training curve if their datamodule is not inheritting from `MegatronDataModule`, unless similar logics are handled by the users. In `MegatronDataModule`, `self.update_init_global_step()` must be called right before the dataloaders are returned to ensure that training resumes with the correct sample index instead of restarting from 0 everytime. We recommend users to inherit from `MegatronDataModule` similar to the pattern above.
+```
+Users will see non-overlapping training curve if their datamodule is not inheritting from `MegatronDataModule`, unless similar logics are handled by the users. In `MegatronDataModule`, `self.update_init_global_step()` must be called right before the dataloaders are returned to ensure that training resumes with the correct sample index instead of restarting from 0 everytime. We recommend users to inherit from `MegatronDataModule` similar to the pattern above.
+```
 
 !!! note "WrappedDataLoader"
 
-The `WrappedDataLoader` class is a wrapper around the PyTorch DataLoader class that adds the `mode` attribute to the dataloader. The dataloader will resume from the last sample index only when mode is 'train'. `val_dataloader` and `test_dataloader` are unaffected.
+```
+The `WrappedDataLoader` class is a wrapper around the PyTorch DataLoader class that adds the `mode` attribute to the dataloader. The dataloader will resume from the last sample index only when mode is 'train'. `val_dataloader` and `test_dataloader` are unaffected.
 
-WARNING: 'train' is the default value of `mode` in `WrappedDataLoader`. If not set, users might find their validation/test dataloader changes behavior by resuming from a non-zero sample index.
+WARNING: 'train' is the default value of `mode` in `WrappedDataLoader`. If not set, users might find their validation/test dataloader changes behavior by resuming from a non-zero sample index.
+```
 
 ## Testing Datasets for Megatron Compatibility
 
 BioNeMo also provides utility functions for test suites to validate that datasets conform to the megatron data model.
-The [assert_dataset_compatible_with_megatron][bionemo.testing.data_utils.assert_dataset_compatible_with_megatron]
+The \[assert_dataset_compatible_with_megatron\]\[bionemo.testing.data_utils.assert_dataset_compatible_with_megatron\]
 function calls the dataset with identical indices and ensures the outputs are identical, while also checking to see if
 `torch.manual_seed` was used.
 
 !!! example "Example datasets in BioNeMo"
 
-The [ESMMaskedResidueDataset][bionemo.esm2.data.dataset.ESMMaskedResidueDataset] demonstrates one approach for
-leveraging [EpochIndex][bionemo.core.data.multi_epoch_dataset.EpochIndex] indices to perform epoch-level
-randomization within the confines of megatron's data model.
+```
+The [ESMMaskedResidueDataset][bionemo.esm2.data.dataset.ESMMaskedResidueDataset] demonstrates one approach for
+leveraging [EpochIndex][bionemo.core.data.multi_epoch_dataset.EpochIndex] indices to perform epoch-level
+randomization within the confines of megatron's data model.
+```
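
The `megatron_datasets.md` text touched by this commit describes two ideas whose code falls outside the hunks shown: deriving randomness from the epoch value, and checking that a dataset returns identical output for identical indices. The sketch below illustrates both in plain Python. It is not the BioNeMo implementation: `EpochIndex` here is a local stand-in for `bionemo.core.data.multi_epoch_dataset.EpochIndex`, `MaskedToyDataset` and `assert_deterministic` are hypothetical names, and the standard-library `random.Random` is swapped in for torch so the example runs without dependencies.

```python
import random
from typing import NamedTuple


class EpochIndex(NamedTuple):
    """Hypothetical stand-in for bionemo's EpochIndex: an (epoch, idx) pair."""

    epoch: int
    idx: int


class MaskedToyDataset:
    """Toy dataset that masks one position per sample, chosen per (epoch, idx)."""

    def __init__(self, sequences, base_seed=0):
        self.sequences = sequences
        self.base_seed = base_seed

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, index: EpochIndex):
        # A local RNG keyed on (seed, epoch, idx) leaves global RNG state untouched,
        # so the same index always produces the same mask.
        rng = random.Random(f"{self.base_seed}-{index.epoch}-{index.idx}")
        tokens = list(self.sequences[index.idx])
        tokens[rng.randrange(len(tokens))] = "<mask>"
        return tokens


def assert_deterministic(dataset, index):
    """Fetch the same index twice and require identical outputs."""
    first, second = dataset[index], dataset[index]
    assert first == second, f"non-deterministic output at {index}"


dataset = MaskedToyDataset(["MKTAYIAK", "GSHMLEDP"])
for epoch in range(3):
    for idx in range(len(dataset)):
        assert_deterministic(dataset, EpochIndex(epoch, idx))
```

Keying a local generator on the index, rather than reseeding a global one, is the design choice that matters here: it keeps `dataset[i]` reproducible without interfering with seeding handled elsewhere in the training stack.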
