Commit 91188c0

Migrate to Beans dataset (MIT License).
Signed-off-by: Cory Ye <cye@nvidia.com>
1 parent d89d170 commit 91188c0

11 files changed

Lines changed: 116 additions & 1465 deletions

recipes/vit/LICENSE

Lines changed: 22 additions & 0 deletions
@@ -199,3 +199,25 @@
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
+
+MIT License
+
+Copyright (c) 2020 AIR Lab Makerere University
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

recipes/vit/README.md

Lines changed: 3 additions & 3 deletions
@@ -38,11 +38,11 @@ To train a ViT using FSDP, execute the following command in your Docker containe
 torchrun --nproc-per-node ${NGPU} train.py --config-name vit_base_patch16_224 distributed.dp_shard=${NGPU} training.checkpoint.path=./ckpts/vit
 ```
 
-which will train on a local tiny 5-class version of [ImageNet](https://image-net.org/) ([super-tiny-imagenet-5](./data/super-tiny-imagenet-5/)) and save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
+which will train on the [`AI-Lab-Makerere/ibean`](https://github.com/AI-Lab-Makerere/ibean/) (HuggingFace: [`AI-Lab-Makerere/beans`](https://huggingface.co/datasets/AI-Lab-Makerere/beans)) dataset and save auto-resumable [Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) checkpoints to the `training.checkpoint.path` directory.
 
-[`train.py`](train.py) is the transparent entrypoint to this script that explains how to modify your own training loop for `Megatron-FSDP` ([PyPI: `megatron-fsdp`](https://pypi.org/project/megatron-fsdp/) / [Source: Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/distributed/fsdp/src)) to fully-shard your model across all devices. After executing `train.py` for the first time, the de-compressed ImageNet dataset will be available in `data/super-tiny-imagenet-5/...` (sourced from [`super-tiny-imagenet-5.tar.gz`](./data/super-tiny-imagenet-5.tar.gz)) for experimentation and review.
+[`train.py`](train.py) is the transparent entrypoint to this script that explains how to modify your own training loop for `Megatron-FSDP` ([PyPI: `megatron-fsdp`](https://pypi.org/project/megatron-fsdp/) / [Source: Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/distributed/fsdp/src)) to fully-shard your model across all devices.
 
-The TIMM-derived model code for the ViT can be found in [`vit.py`](vit.py), and data utilities for ImageNet can be found in [`imagenet_*.py`](imagenet_dataset.py).
+The TIMM-derived model code for the ViT can be found in [`vit.py`](vit.py), and data utilities for Beans can be found in [`beans.py`](beans.py).
 
 Various configuration options common in computer vision modeling can be found in [config](./config/).
 
recipes/vit/beans.py

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-Apache2
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+
+import torch
+from datasets import load_dataset
+from torch.utils.data import Dataset
+from torchvision.transforms.functional import to_tensor
+
+
+logger = logging.getLogger(__name__)
+
+
+def infinite_dataloader(dataloader, sampler):
+    """Create an infinite iterator that automatically restarts at the end of each epoch."""
+    epoch = 0
+    while True:
+        sampler.set_epoch(epoch)  # Update epoch for proper shuffling
+        for batch in dataloader:
+            yield batch
+        epoch += 1  # Increment epoch counter after completing one full pass
+
+
+class BeansDataset(Dataset):
+    """
+    Simple wrapper Dataset for AI-Lab-Makerere/beans that converts PIL images to Tensors.
+    """
+
+    def __init__(self, image_size: tuple[int, int], split: str = "train"):
+        """
+        Args:
+            image_size (tuple[int, int]): Resize 2-D image data to this size.
+            split (str): Dataset split to load. Options: ["train", "validation", "test"]
+        """
+        self.resize_dimensions = image_size
+        # Download Beans Dataset.
+        self.dataset = load_dataset("AI-Lab-Makerere/beans", split=split)
+        self.class_list = self.dataset.features["labels"].names
+        if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
+            logger.info(
+                f"[AI-Lab-Makerere/beans (Split={split})]\nDataset Size: {len(self.dataset)}\nClasses (Count={len(self.class_list)}): {self.class_list}"
+            )
+
+    def __len__(self):
+        return len(self.dataset)
+
+    def __getitem__(self, idx):
+        # Preprocess sample.
+        sample = self.dataset[idx]
+        image_tensor = to_tensor(sample["image"].resize(self.resize_dimensions).convert("RGB"))
+        label_idx = sample["labels"]
+        return image_tensor, label_idx
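
The epoch handling in `infinite_dataloader` can be checked without PyTorch. The sketch below repeats the helper's logic so it is self-contained. `RecordingSampler` and `take` are hypothetical names introduced only for this illustration: the first stands in for a sampler with a `set_epoch` method, and a plain list stands in for the `DataLoader`.

```python
class RecordingSampler:
    """Hypothetical stand-in for a distributed sampler: it only records set_epoch calls."""

    def __init__(self):
        self.epochs = []

    def set_epoch(self, epoch):
        self.epochs.append(epoch)


def infinite_dataloader(dataloader, sampler):
    """Same logic as the helper in beans.py, repeated so the sketch is self-contained."""
    epoch = 0
    while True:
        sampler.set_epoch(epoch)  # called once at the start of every pass
        for batch in dataloader:
            yield batch
        epoch += 1


def take(iterator, n):
    """Pull exactly n items from an iterator (hypothetical helper)."""
    return [next(iterator) for _ in range(n)]


sampler = RecordingSampler()
batches = ["batch_a", "batch_b", "batch_c"]  # a plain list stands in for a DataLoader
stream = infinite_dataloader(batches, sampler)

# Seven items span two full passes plus the first batch of a third pass.
first_seven = take(stream, 7)
print(first_seven)
print(sampler.epochs)
```

Because the generator never raises `StopIteration`, a step-based training loop has to bound the iteration itself, for example by calling `next(stream)` a fixed number of times.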

recipes/vit/checkpoint.py

Lines changed: 3 additions & 2 deletions
@@ -39,6 +39,7 @@ def load_torch_checkpoint(checkpoint_path, model, megatron_fsdp=False):
     checkpoint = torch.load(checkpoint_path, weights_only=False)
     # Remove the "module." prefix from the keys of checkpoints
     # derived from Megatron-FSDP.
+    # TODO(@cspades): Remove this when the Megatron-FSDP checkpoint naming is fixed.
     model_checkpoint = {(k.removeprefix("module.") if megatron_fsdp else k): v for k, v in checkpoint["model"].items()}
     # Warn about Megatron-FSDP checkpoints.
     first_key = next(iter(model_checkpoint))
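
The key renaming in this hunk can be tried in isolation. The following is a minimal sketch that assumes a plain dict in place of a real checkpoint; `strip_module_prefix` is a hypothetical name and not a function in `checkpoint.py`. It relies on `str.removeprefix`, available from Python 3.9.

```python
def strip_module_prefix(state_dict, megatron_fsdp=False):
    """Return a copy of state_dict with a leading "module." removed from each key.

    Hypothetical helper that mirrors the dict comprehension in the hunk above.
    """
    return {(k.removeprefix("module.") if megatron_fsdp else k): v for k, v in state_dict.items()}


# Toy "checkpoint": values are placeholders rather than tensors.
wrapped = {"module.head.weight": 1, "module.head.bias": 2, "submodule.weight": 3}

stripped = strip_module_prefix(wrapped, megatron_fsdp=True)
untouched = strip_module_prefix(wrapped, megatron_fsdp=False)

print(stripped)
print(untouched)
```

`removeprefix` only strips a match at the start of the key, so a key such as `submodule.weight` is left alone, which a blanket `str.replace("module.", "")` would not guarantee.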
@@ -109,7 +110,7 @@ def load_auto_resume_checkpoint(cfg, model, optimizer):
     latest_step_idx = int(latest_subdir.name.split("_")[1])
     # Load model and optimizer checkpoints.
     load_dcp_checkpoint(latest_subdir, model, optimizer)
-    if torch.distributed.get_rank() == 0:
+    if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
         _logger.info(f"Loaded latest model and optimizer checkpoints from: {latest_subdir}")
 
     # Return the auto-resumed step index for training progression.
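
The context line `int(latest_subdir.name.split("_")[1])` implies that the step index is the second underscore-separated field of the checkpoint directory name. The sketch below illustrates that parsing under an assumed naming scheme of `step_<idx>_loss_<value>`; the exact scheme and the way the recipe selects `latest_subdir` are not visible in this diff, so the directory names and the `step_index` helper are hypothetical.

```python
from pathlib import PurePath


def step_index(subdir):
    """Parse the step index from a name like "step_10_loss_0.900" (assumed format)."""
    return int(subdir.name.split("_")[1])


# Hypothetical checkpoint sub-directories; nothing is read from disk.
subdirs = [
    PurePath("ckpts/vit/step_5_loss_1.200"),
    PurePath("ckpts/vit/step_10_loss_0.900"),
    PurePath("ckpts/vit/step_0_loss_2.500"),
]

# Compare numerically: a plain string sort would rank "step_10" before "step_5".
latest = max(subdirs, key=step_index)
print(latest.name, step_index(latest))
```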
@@ -160,5 +161,5 @@ def save_auto_resumable_checkpoint(cfg, model, optimizer, step_idx, loss_value):
     # Change file perms.
     file_path = Path(dirpath) / filename
     os.chmod(file_path, mode)
-    if torch.distributed.get_rank() == 0:
+    if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
         _logger.info(f"Saved validated checkpoint to: {ckpt_dir}")
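
Both logging changes in this file swap a bare `get_rank() == 0` check for one that also passes when no process group exists, so the code can run in a single, non-distributed process. A minimal sketch of why the order of the two checks matters is shown below. `FakeDist` is a hypothetical stand-in for `torch.distributed` that exposes only the two calls used by the guard, and its error on an uninitialized group is an assumption modelled on PyTorch's behavior.

```python
class FakeDist:
    """Hypothetical stand-in for torch.distributed with only the calls the guard uses."""

    def __init__(self, initialized, rank=0):
        self._initialized = initialized
        self._rank = rank

    def is_initialized(self):
        return self._initialized

    def get_rank(self):
        if not self._initialized:
            # Assumption: mimics the failure seen when no default process group exists.
            raise ValueError("Default process group has not been initialized")
        return self._rank


def should_log(dist):
    """True for single-process runs and for rank 0 of a distributed run."""
    # `or` short-circuits, so get_rank() is never reached when uninitialized.
    return not dist.is_initialized() or dist.get_rank() == 0


single_process = should_log(FakeDist(initialized=False))
rank_zero = should_log(FakeDist(initialized=True, rank=0))
rank_three = should_log(FakeDist(initialized=True, rank=3))
print(single_process, rank_zero, rank_three)
```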

recipes/vit/config/defaults.yaml

Lines changed: 4 additions & 43 deletions
@@ -60,9 +60,9 @@ fsdp:
   preserve_fp32_weights: true
 
 training:
-  steps: 500
-  val_interval: 25
-  log_interval: 5
+  steps: 10
+  val_interval: 5
+  log_interval: 1
   checkpoint:
     path: null
     resume_from_metric: null
@@ -74,53 +74,14 @@ inference:
   megatron_fsdp: null
 
 dataset:
-  num_classes: 100000
+  num_classes: 3
   num_workers: 0
   train:
-    root: null
-    class_map: null
-    label_map: null
-    class_filter: null
     batch_size: 1
     shuffle: false
-    transform_kwargs:
-      img_size: 224
-      scale: null
-      ratio: null
-      train_crop_mode: null
-      hflip: 0.5
-      vflip: 0.
-      color_jitter: 0.4
-      color_jitter_prob: null
-      grayscale_prob: 0.
-      gaussian_blur_prob: 0.
-      interpolation: 'random'
-      re_prob: 0.
-      re_mode: 'const'
-      re_count: 1
-      re_num_splits: 0
-      normalize: True
-      separate: False
-      patch_size: 16
-      patchify: False
   val:
-    root: null
-    class_map: null
-    label_map: null
-    class_filter: null
     batch_size: 1
     shuffle: false
-    transform_kwargs:
-      img_size: 224
-      crop_pct: null
-      crop_mode: null
-      crop_border_pixels: null
-      interpolation: "bilinear"
-      mean: [0.485, 0.456, 0.406]
-      std: [0.229, 0.224, 0.225]
-      normalize: true
-      patch_size: 16
-      patchify: false
 
 random:
   seed: 42
recipes/vit/config/vit_base_patch16_224.yaml

Lines changed: 4 additions & 12 deletions
@@ -75,22 +75,14 @@ inference:
   megatron_fsdp: true
 
 dataset:
-  num_classes: 100000
+  num_classes: 3
   num_workers: 4
   train:
-    root: "./data/super-tiny-imagenet-5/train"
-    class_map: "./data/super-tiny-imagenet-5/words.txt"
-    label_map: null # Not needed, training data is labeled by directory.
-    class_filter: null
-    batch_size: 5
+    batch_size: 8
     shuffle: true
   val:
-    root: "./data/super-tiny-imagenet-5/val"
-    class_map: "./data/super-tiny-imagenet-5/words.txt"
-    label_map: "./data/super-tiny-imagenet-5/val/val_annotations.txt"
-    class_filter: null
-    batch_size: 5
-    shuffle: false
+    batch_size: 16
+    shuffle: true
 
 random:
   seed: 42

Binary file not shown (size change: -1.74 MB).
