NVIDIA
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
Lines changed: 19 additions & 13 deletions
diff --git a/.mdformat.toml b/.mdformat.toml
Lines changed: 1 addition & 0 deletions
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
Lines changed: 10 additions & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 1 addition & 5 deletions
diff --git a/SECURITY.md b/SECURITY.md
Lines changed: 8 additions & 7 deletions
diff --git a/docs/docs/index.md b/docs/docs/index.md
Lines changed: 16 additions & 21 deletions
diff --git a/docs/docs/main/about/background/megatron_datasets.md b/docs/docs/main/about/background/megatron_datasets.md
Lines changed: 27 additions & 19 deletions
@@ -1,45 +1,51 @@
 ### Description
+
 <!-- Provide a detailed description of the changes in this PR -->
 
 ### Type of changes
+
 <!-- Mark the relevant option with an [x] -->
 
-- [ ]  Bug fix (non-breaking change which fixes an issue)
-- [ ]  New feature (non-breaking change which adds functionality)
-- [ ]  Refactor
-- [ ]  Documentation update
-- [ ]  Other (please describe):
+- [ ] Bug fix (non-breaking change which fixes an issue)
+- [ ] New feature (non-breaking change which adds functionality)
+- [ ] Refactor
+- [ ] Documentation update
+- [ ] Other (please describe):
 
 ### CI Pipeline Configuration
+
 Configure CI behavior by applying the relevant labels:
 
 - [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
 - [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
 - [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing
 
-> [!NOTE]
+> \[!NOTE\]
 > By default, the notebooks validation tests are skipped unless explicitly enabled.
 
 #### Authorizing CI Runs
 
 We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI
 runs on NVIDIA's compute resources.
 
-* If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
+- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
   automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
-* If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
+- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
   `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.
 
 ### Usage
+
 <!--- How does a user interact with the changed code -->
+
 ```python
-TODO: Add code snippet
+# TODO: Add code snippet
 ```
 
 ### Pre-submit Checklist
+
 <!--- Ensure all items are completed before submitting -->
 
- - [ ] I have tested these changes locally
- - [ ] I have updated the documentation accordingly
- - [ ] I have added/updated tests as needed
- - [ ] All existing tests pass successfully
+- [ ] I have tested these changes locally
+- [ ] I have updated the documentation accordingly
+- [ ] I have added/updated tests as needed
+- [ ] All existing tests pass successfully
@@ -0,0 +1 @@
+number = true        # options: {false, true}
@@ -7,12 +7,21 @@ repos:
       - id: check-yaml
         exclude: "mkdocs.yml"
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.9.10
+    rev: v0.12.8
     hooks:
       - id: ruff
         # 1. Attempt to automatically fix any lint issues.
         args: ["--fix"]
       - id: ruff-format
+  - repo: https://github.com/executablebooks/mdformat
+    rev: 0.7.17             # Use the latest stable version
+    hooks:
+      - id: mdformat
+        additional_dependencies:
+        - mdformat-tables
+        - mdformat-gfm
+        - mdformat-black
+        - mdformat-frontmatter
   - repo: https://github.com/Yelp/detect-secrets
     rev: v1.5.0
     hooks:
 
@@ -21,7 +21,6 @@ The `bionemo-framework` is organized into independently installable namespace pa
 `sub-packages/` directory. Please refer to [PEP 420 – Implicit Namespace Packages](https://peps.python.org/pep-0420/)
 for details.
 
-
 ## Documentation Resources
 
 - **Official Documentation:** For user guides, API references, and troubleshooting, visit our [official documentation](https://docs.nvidia.com/bionemo-framework/latest/).
@@ -62,7 +61,6 @@ git submodule update --init --recursive
 
 Different branches of the repo can have different pinned versions of these third-party submodules. Ensure submodules are automatically updated after switching branches or pulling updates by configuring git with:
 
-
 ```bash
 git config submodule.recurse true
 ```
@@ -72,21 +70,19 @@ You will have to run the full `git submodule update --init --recursive` command
 
 #### Build the Docker Image Locally
 
-
 With a locally cloned repository and initialized submodules, build the BioNeMo container using:
 
 ```bash
 docker buildx build . -t my-container-tag
 ```
 
-
 #### VSCode Devcontainer for Interactive Debugging
 
 We distribute a [development container](https://devcontainers.github.io/) configuration for vscode
 (`.devcontainer/devcontainer.json`) that simplifies the process of local testing and development. Opening the
 bionemo-framework folder with VSCode should prompt you to re-open the folder inside the devcontainer environment.
 
-> [!NOTE]
+> \[!NOTE\]
 > The first time you launch the devcontainer, it may take a long time to build the image. Building the image locally
 > (using the command shown above) will ensure that most of the layers are present in the local docker cache.
 
 
@@ -7,15 +7,16 @@ If you need to report a security issue, please use the appropriate contact point
 ## Reporting Potential Security Vulnerability in an NVIDIA Product
 
 To report a potential security vulnerability in any NVIDIA product:
+
 - Web: [Security Vulnerability Submission Form](https://www.nvidia.com/object/submit-security-vulnerability.html)
 - E-Mail: psirt@nvidia.com
-    - We encourage you to use the following PGP key for secure email communication: [NVIDIA public PGP Key for communication](https://www.nvidia.com/en-us/security/pgp-key)
-    - Please include the following information:
-        - Product/Driver name and version/branch that contains the vulnerability
-        - Type of vulnerability (code execution, denial of service, buffer overflow, etc.)
-        - Instructions to reproduce the vulnerability
-        - Proof-of-concept or exploit code
-        - Potential impact of the vulnerability, including how an attacker could exploit the vulnerability
+  - We encourage you to use the following PGP key for secure email communication: [NVIDIA public PGP Key for communication](https://www.nvidia.com/en-us/security/pgp-key)
+  - Please include the following information:
+    - Product/Driver name and version/branch that contains the vulnerability
+    - Type of vulnerability (code execution, denial of service, buffer overflow, etc.)
+    - Instructions to reproduce the vulnerability
+    - Proof-of-concept or exploit code
+    - Potential impact of the vulnerability, including how an attacker could exploit the vulnerability
 
 While NVIDIA currently does not have a bug bounty program, we do offer acknowledgement when an externally reported security issue is addressed under our coordinated vulnerability disclosure policy. Please visit our [Product Security Incident Response Team (PSIRT)](https://www.nvidia.com/en-us/security/psirt-policies/) policies page for more information.
 
 
@@ -3,50 +3,45 @@ hide:
   - navigation
 ---
 
-
 **NVIDIA BioNeMo Framework** is a collection of programming tools, libraries, and models for computational drug
 discovery. It accelerates the most time-consuming and costly stages of building and adapting biomolecular AI models by
 providing domain-specific, optimized models and tooling that are easily integrated into GPU-based computational
 resources for the fastest performance on the market. You can access BioNeMo Framework as a free community resource or
 learn more about getting an enterprise license for improved expert-level support at the
 [BioNeMo homepage](https://www.nvidia.com/en-us/clara/bionemo/).
 
-
 <div class="grid cards" markdown>
 
--   :material-book-open-variant:{ .lg } __User Guide__
-
-    ---
-
-    Install BioNeMo and set up your environment to start accelerating your bioinformatics workflows.
-
-    [Get Started](main/about/overview/){ .md-button .md-button }
+- :material-book-open-variant:{ .lg } __User Guide__
 
--   :material-code-greater-than:{ .lg } __API Reference__
+  ______________________________________________________________________
 
-    ---
+  Install BioNeMo and set up your environment to start accelerating your bioinformatics workflows.
 
-    Access comprehensive documentation on BioNeMo's sub-packages, functions, and classes.
+  [Get Started](main/about/overview/){ .md-button .md-button }
 
-    [API Reference](main/references/API_reference/bionemo/core/api/){ .md-button .md-button }
+- :material-code-greater-than:{ .lg } __API Reference__
 
--   :material-cube-outline:{ .lg } __Models__
+  ______________________________________________________________________
 
-    ---
+  Access comprehensive documentation on BioNeMo's sub-packages, functions, and classes.
 
-    Explore detailed instructions and best practices for using BioNeMo models in your research.
+  [API Reference](main/references/API_reference/bionemo/core/api/){ .md-button .md-button }
 
-    [Explore Models](models){ .md-button .md-button }
+- :material-cube-outline:{ .lg } __Models__
 
+  ______________________________________________________________________
 
+  Explore detailed instructions and best practices for using BioNeMo models in your research.
 
--   :material-database-outline:{ .lg } __Datasets__
+  [Explore Models](models){ .md-button .md-button }
 
-    ---
+- :material-database-outline:{ .lg } __Datasets__
 
-    Explore biomolecular datasets that come pre-packaged with the BioNeMo Framework.
+  ______________________________________________________________________
 
-    [Explore Datasets](main/datasets/){ .md-button .md-button }
+  Explore biomolecular datasets that come pre-packaged with the BioNeMo Framework.
 
+  [Explore Datasets](main/datasets/){ .md-button .md-button }
 
 </div>
@@ -6,18 +6,17 @@ consequence, ensure that the new dataset classes preserve the required determini
 augmentation and masking can cause `dataset[i]` to return random results for a given index, breaking this megatron
 contract.
 
-
 ## Multi-Epoch Training
 
 One training regime where this limitation is most apparent is multi-epoch training, where standard training recipes
 would apply different random masks or different data augmentation strategies each time the data is encountered. BioNeMo
 provides some utilities that make multi-epoch training easier, while obeying the determinism requirements of
 megatron.
 
-The [MultiEpochDatasetResampler][bionemo.core.data.multi_epoch_dataset.MultiEpochDatasetResampler] class simplifies the
+The \[MultiEpochDatasetResampler\]\[bionemo.core.data.multi_epoch_dataset.MultiEpochDatasetResampler\] class simplifies the
 process of multi-epoch training, where the data should both be re-shuffled each epoch with different random effects
 applied each time the data is seen. To be compatible with this resampler, the provided dataset class's `__getitem__`
-method should accept a [EpochIndex][bionemo.core.data.multi_epoch_dataset.EpochIndex] tuple that contains both an epoch
+method should accept a \[EpochIndex\]\[bionemo.core.data.multi_epoch_dataset.EpochIndex\] tuple that contains both an epoch
 and index value. Random effects can then be performed by setting the torch random seed based on the epoch value:
 
 ```python
@@ -30,28 +29,31 @@ class MyDataset:
 
 !!! bug "Avoid `torch.manual_seed`"
 
-    Megatron-LM handles torch seeding internally. Calling `torch.cuda.manual_seed` inside the user-provided dataset
-    can cause issues with model parallelism. See [megatron/core/tensor_parallel/random.py#L198-L199](
-    https://github.com/NVIDIA/Megatron-LM/blob/dddecd19/megatron/core/tensor_parallel/random.py#L198-L199) for more
-    details.
+```
+Megatron-LM handles torch seeding internally. Calling `torch.cuda.manual_seed` inside the user-provided dataset
+can cause issues with model parallelism. See [megatron/core/tensor_parallel/random.py#L198-L199](
+https://github.com/NVIDIA/Megatron-LM/blob/dddecd19/megatron/core/tensor_parallel/random.py#L198-L199) for more
+details.
+```
 
 For deterministic datasets that still want to train for multiple epochs with epoch-level shuffling, the
-[IdentityMultiEpochDatasetWrapper][bionemo.core.data.multi_epoch_dataset.IdentityMultiEpochDatasetWrapper] class can
+\[IdentityMultiEpochDatasetWrapper\]\[bionemo.core.data.multi_epoch_dataset.IdentityMultiEpochDatasetWrapper\] class can
 simplify this process by wrapping a dataset that accepts integer indices and passes along the
-[EpochIndex][bionemo.core.data.multi_epoch_dataset.EpochIndex] index values from the resampled dataset.
+\[EpochIndex\]\[bionemo.core.data.multi_epoch_dataset.EpochIndex\] index values from the resampled dataset.
 
 ```python
 class MyDeterministicDataset:
-    def __getitem__(self, index: int):
-        ...
+    def __getitem__(self, index: int): ...
+
 
 dataset = IdentityMultiEpochDatasetWrapper(MyDeterministicDataset())
 for sample in MultiEpochDatasetResampler(dataset, num_epochs=3, shuffle=True):
     ...
 ```
 
 ## Training Resumption
-To ensure identical behavior with and without job interruption, BioNeMo provides [MegatronDataModule][bionemo.llm.data.datamodule.MegatronDataModule] to save and load state dict for training resumption, and provides [WrappedDataLoader][nemo.lightning.data.WrappedDataLoader] to add a `mode` attribute to [DataLoader][torch.utils.data.DataLoader].
+
+To ensure identical behavior with and without job interruption, BioNeMo provides \[MegatronDataModule\]\[bionemo.llm.data.datamodule.MegatronDataModule\] to save and load state dict for training resumption, and provides \[WrappedDataLoader\]\[nemo.lightning.data.WrappedDataLoader\] to add a `mode` attribute to \[DataLoader\]\[torch.utils.data.DataLoader\].
 
 ```python
 class MyDataModule(MegatronDataModule):
@@ -83,23 +85,29 @@ class MyDataModule(MegatronDataModule):
 
 !!! note "MegatronDataModule"
 
-    Users will see non-overlapping training curve if their datamodule is not inheritting from `MegatronDataModule`, unless similar logics are handled by the users. In `MegatronDataModule`, `self.update_init_global_step()` must be called right before the dataloaders are returned to ensure that training resumes with the correct sample index instead of restarting from 0 everytime. We recommend users to inherit from `MegatronDataModule` similar to the pattern above.
+```
+Users will see non-overlapping training curve if their datamodule is not inheritting from `MegatronDataModule`, unless similar logics are handled by the users. In `MegatronDataModule`, `self.update_init_global_step()` must be called right before the dataloaders are returned to ensure that training resumes with the correct sample index instead of restarting from 0 everytime. We recommend users to inherit from `MegatronDataModule` similar to the pattern above.
+```
 
 !!! note "WrappedDataLoader"
 
-    The `WrappedDataLoader` class is a wrapper around the PyTorch DataLoader class that adds the `mode` attribute to the dataloader. The dataloader will resume from the last sample index only when mode is 'train'. `val_dataloader` and `test_dataloader` are unaffected.
+```
+The `WrappedDataLoader` class is a wrapper around the PyTorch DataLoader class that adds the `mode` attribute to the dataloader. The dataloader will resume from the last sample index only when mode is 'train'. `val_dataloader` and `test_dataloader` are unaffected.
 
-    WARNING: 'train' is the default value of `mode` in `WrappedDataLoader`. If not set, users might find their validation/test dataloader changes behavior by resuming from a non-zero sample index.
+WARNING: 'train' is the default value of `mode` in `WrappedDataLoader`. If not set, users might find their validation/test dataloader changes behavior by resuming from a non-zero sample index.
+```
 
 ## Testing Datasets for Megatron Compatibility
 
 BioNeMo also provides utility functions for test suites to validate that datasets conform to the megatron data model.
-The [assert_dataset_compatible_with_megatron][bionemo.testing.data_utils.assert_dataset_compatible_with_megatron]
+The \[assert_dataset_compatible_with_megatron\]\[bionemo.testing.data_utils.assert_dataset_compatible_with_megatron\]
 function calls the dataset with identical indices and ensures the outputs are identical, while also checking to see if
 `torch.manual_seed` was used.
 
 !!! example "Example datasets in BioNeMo"
 
-    The [ESMMaskedResidueDataset][bionemo.esm2.data.dataset.ESMMaskedResidueDataset] demonstrates one approach for
-    leveraging [EpochIndex][bionemo.core.data.multi_epoch_dataset.EpochIndex] indices to perform epoch-level
-    randomization within the confines of megatron's data model.
+```
+The [ESMMaskedResidueDataset][bionemo.esm2.data.dataset.ESMMaskedResidueDataset] demonstrates one approach for
+leveraging [EpochIndex][bionemo.core.data.multi_epoch_dataset.EpochIndex] indices to perform epoch-level
+randomization within the confines of megatron's data model.
+```
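
The `megatron_datasets.md` hunks above describe a `__getitem__` that accepts an epoch-and-index tuple and derives its randomness from the epoch rather than from global seeding. A minimal, dependency-free sketch of that idea follows. It is not the BioNeMo implementation: `EpochIndex` here is a local stand-in `NamedTuple`, `MaskedDataset` and its seeding scheme are hypothetical, and Python's `random.Random` is used in place of any torch generator so that no global seed is touched.

```python
import random
from typing import NamedTuple, Sequence


class EpochIndex(NamedTuple):
    """Local stand-in for the (epoch, idx) tuple described in the docs (hypothetical)."""

    epoch: int
    idx: int


class MaskedDataset:
    """Hypothetical dataset that masks one character per sample, deterministically."""

    def __init__(self, sequences: Sequence[str], seed: int = 0) -> None:
        self.sequences = list(sequences)
        self.seed = seed

    def __len__(self) -> int:
        return len(self.sequences)

    def __getitem__(self, index: EpochIndex) -> str:
        # Build a private RNG keyed on (seed, epoch, idx). The same key always
        # yields the same mask position, and global RNG state is never modified.
        rng = random.Random(f"{self.seed}-{index.epoch}-{index.idx}")
        tokens = list(self.sequences[index.idx])
        tokens[rng.randrange(len(tokens))] = "#"
        return "".join(tokens)


dataset = MaskedDataset(["ACDEFGHIK", "LMNPQRSTV"])
first = dataset[EpochIndex(epoch=0, idx=1)]
again = dataset[EpochIndex(epoch=0, idx=1)]
```

Under these assumptions, repeated lookups with the same `EpochIndex` return the same masked string, which is the property the documentation asks datasets to preserve, while a different epoch value selects an independently seeded mask.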