Finish reproducibility work: set hyperparameters to the values from the paper, document reproducibility in the README, bump version to 1.3.

gabikadlecova · gabikadlecova · commit d0a62e9d1ce3 · 2022-12-05T11:37:57.000+01:00
diff --git a/Contributors.md b/Contributors.md
@@ -6,6 +6,7 @@
   - Package structure, reproducibility
 ---------
 - [@abhash-er](https://github.com/abhash-er/) (Abhash Jha)
-  - modified the model code so that cast to double is possible
+  - Modified the model code so that cast to double is possible
 - [@longerHost](https://github.com/longerHost)
-  - reproducibility of the original NAS-Bench-101 results
+  - Reproducibility of the original NAS-Bench-101
+  - Comparison of training results and API results
diff --git a/README.md b/README.md
@@ -5,6 +5,9 @@ implementation is written in TensorFlow, and this projects contains
 some files from the original repository (in the directory
 `nasbench_pytorch/model/`).
 
+**Important:** if you want to reproduce the original results, please refer to the
+[Reproducibility](#repro) section.
+
 # Overview
 A PyTorch implementation of *training* of NAS-Bench-101 dataset: [NAS-Bench-101: Towards Reproducible Neural Architecture Search](https://arxiv.org/abs/1902.09635).
 The dataset contains 423,624 unique neural networks exhaustively generated and evaluated from a fixed graph-based search space.
@@ -64,13 +67,25 @@ Then, you can train it just like the example network in `main.py`.
 Example architecture (picture from the original repository)
 ![archtecture](./assets/architecture.png)
 
+# Reproducibility <a id="repro"></a>
+The code should closely match the TensorFlow version (including the hyperparameters), but there are some differences:
+- The RMSProp implementations in TensorFlow and PyTorch are **different**
+  - For more information, see [this PyTorch issue](https://github.com/pytorch/pytorch/issues/32545) and [this one](https://github.com/pytorch/pytorch/issues/23796).
+  - Optionally, you can install pytorch-image-models, which implements a [TensorFlow-like RMSProp](https://github.com/rwightman/pytorch-image-models/blob/main/timm/optim/rmsprop_tf.py#L5)
+    - `pip install timm`
+  - Then, pass `--optimizer rmsprop_tf` to `main.py` to use it
+
+
+- The original training ran on TPUs; this code supports only GPU and CPU training
+- Input data augmentation methods are the same, but due to randomness they are not applied in the same manner
+  - Cause: Batches and images cannot be shuffled as in the original TPU training, and the augmentation seed is also different
+- Results may still differ due to TensorFlow/PyTorch implementation differences
+
+Refer to this [issue](https://github.com/romulus0914/NASBench-PyTorch/issues/6) for more information and for comparison with API results.
+
 # Disclaimer
 Modified from [NASBench: A Neural Architecture Search Dataset and Benchmark](https://github.com/google-research/nasbench).
 *graph_util.py* and *model_spec.py* are directly copied from the original repo. Original license can be found [here](https://github.com/google-research/nasbench/blob/master/LICENSE).
 
 <a id="note"></a>
 **Please note that this repo is only used to train one possible architecture in the search space, not to generate all possible graphs and train them.
-
-**Important information:** The code should closely match the TensorFlow version, but
-you may still get slightly different results due to differences in TensorFlow/PyTorch implementation.
-Moreover, input data augmentation is the same, but due to randomness isn't exactly the same.
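One way to see why the two RMSProp variants mentioned in the README can diverge is to compare a single update step. The sketch below is plain Python, not the PyTorch or timm source; the function names are hypothetical and momentum is omitted for brevity. It assumes the commonly described differences: PyTorch adds `eps` outside the square root and starts the squared-gradient average at zero, while the TensorFlow-style variant adds `eps` inside the square root and starts the average at one. The decay of 0.9 is an illustrative value; `lr=0.02` and `eps=1.0` are the new defaults in `main.py`.

```python
import math


def rmsprop_step_pytorch_style(param, grad, square_avg, lr=0.02, decay=0.9, eps=1.0):
    """One step with eps added OUTSIDE the square root (PyTorch-style, assumed)."""
    square_avg = decay * square_avg + (1 - decay) * grad * grad
    param = param - lr * grad / (math.sqrt(square_avg) + eps)
    return param, square_avg


def rmsprop_step_tf_style(param, grad, square_avg, lr=0.02, decay=0.9, eps=1.0):
    """One step with eps added INSIDE the square root (TensorFlow-style, assumed)."""
    square_avg = decay * square_avg + (1 - decay) * grad * grad
    param = param - lr * grad / math.sqrt(square_avg + eps)
    return param, square_avg


# Same parameter and gradient, but different initial state:
# zeros for the PyTorch-style variant, ones for the TensorFlow-style variant.
p_pt, s_pt = rmsprop_step_pytorch_style(param=1.0, grad=0.5, square_avg=0.0)
p_tf, s_tf = rmsprop_step_tf_style(param=1.0, grad=0.5, square_avg=1.0)

print("PyTorch-style parameter after one step:", p_pt)
print("TF-style parameter after one step:     ", p_tf)
```

With an epsilon as large as 1.0, where it enters the denominator changes the step size noticeably, which is consistent with the README's suggestion to try `--optimizer rmsprop_tf`.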
diff --git a/main.py b/main.py
@@ -41,27 +41,24 @@ def reload_checkpoint(path, device=None):
     parser = argparse.ArgumentParser(description='NASBench')
     parser.add_argument('--random_state', default=1, type=int, help='Random seed.')
     parser.add_argument('--data_root', default='./data/', type=str, help='Path where cifar will be downloaded.')
-    parser.add_argument('--module_vertices', default=7, type=int, help='#vertices in graph')
-    parser.add_argument('--max_edges', default=9, type=int, help='max edges in graph')
-    parser.add_argument('--available_ops', default=['conv3x3-bn-relu', 'conv1x1-bn-relu', 'maxpool3x3'],
-                        type=list, help='available operations performed on vertex')
     parser.add_argument('--in_channels', default=3, type=int, help='Number of input channels.')
     parser.add_argument('--stem_out_channels', default=128, type=int, help='output channels of stem convolution')
     parser.add_argument('--num_stacks', default=3, type=int, help='#stacks of modules')
     parser.add_argument('--num_modules_per_stack', default=3, type=int, help='#modules per stack')
-    parser.add_argument('--batch_size', default=128, type=int, help='batch size')
-    parser.add_argument('--test_batch_size', default=100, type=int, help='test set batch size')
-    parser.add_argument('--epochs', default=100, type=int, help='#epochs of training')
+    parser.add_argument('--batch_size', default=256, type=int, help='batch size')
+    parser.add_argument('--test_batch_size', default=256, type=int, help='test set batch size')
+    parser.add_argument('--epochs', default=108, type=int, help='#epochs of training')
     parser.add_argument('--validation_size', default=10000, type=int, help="Size of the validation set to split off.")
     parser.add_argument('--num_workers', default=0, type=int, help="Number of parallel workers for the train dataset.")
-    parser.add_argument('--learning_rate', default=0.025, type=float, help='base learning rate')
+    parser.add_argument('--learning_rate', default=0.02, type=float, help='base learning rate')
     parser.add_argument('--lr_decay_method', default='COSINE_BY_STEP', type=str, help='learning decay method')
-    parser.add_argument('--optimizer', default='sgd', type=str, help='Optimizer (sgd or rmsprop)')
-    parser.add_argument('--rmsprop_eps', default=1e-08, type=float, help='RMSProp eps parameter.')
+    parser.add_argument('--optimizer', default='rmsprop', type=str, help='Optimizer (sgd, rmsprop or rmsprop_tf)')
+    parser.add_argument('--rmsprop_eps', default=1.0, type=float, help='RMSProp eps parameter.')
     parser.add_argument('--momentum', default=0.9, type=float, help='momentum')
     parser.add_argument('--weight_decay', default=1e-4, type=float, help='L2 regularization weight')   
     parser.add_argument('--grad_clip', default=5, type=float, help='gradient clipping')
-    parser.add_argument('--batch_norm_momentum', default=0.1, type=float, help='Batch normalization momentum')
+    parser.add_argument('--grad_clip_off', action='store_true', help='If set, turn off gradient clipping.')
+    parser.add_argument('--batch_norm_momentum', default=0.997, type=float, help='Batch normalization momentum')
     parser.add_argument('--batch_norm_eps', default=1e-5, type=float, help='Batch normalization epsilon')
     parser.add_argument('--load_checkpoint', default='', type=str, help='Reload model from checkpoint')
     parser.add_argument('--num_labels', default=10, type=int, help='#classes')
@@ -110,7 +107,8 @@ def reload_checkpoint(path, device=None):
                           weight_decay=args.weight_decay, **optimizer_kwargs)
     scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, args.epochs)
 
-    result = train(net, train_loader, loss=criterion, optimizer=optimizer, scheduler=scheduler, grad_clip=args.grad_clip,
+    result = train(net, train_loader, loss=criterion, optimizer=optimizer, scheduler=scheduler,
+                   grad_clip=args.grad_clip if not args.grad_clip_off else None,
                    num_epochs=args.epochs, num_validation=args.validation_size, validation_loader=valid_loader,
                    device=args.device, print_frequency=args.print_freq)
 
diff --git a/nasbench_pytorch/datasets/cifar10.py b/nasbench_pytorch/datasets/cifar10.py
@@ -28,7 +28,7 @@ def seed_worker(seed, worker_id):
     random.seed(worker_seed)
 
 
-def prepare_dataset(batch_size, test_batch_size=100, root='./data/', use_validation=True, split_from_end=True,
+def prepare_dataset(batch_size, test_batch_size=256, root='./data/', use_validation=True, split_from_end=True,
                     validation_size=10000, random_state=None, set_global_seed=False, no_valid_transform=True,
                     num_workers=0, num_val_workers=0, num_test_workers=0):
     """
diff --git a/nasbench_pytorch/model/model.py b/nasbench_pytorch/model/model.py
@@ -26,7 +26,7 @@
 
 class Network(nn.Module):
     def __init__(self, spec, num_labels=10, in_channels=3, stem_out_channels=128, num_stacks=3, num_modules_per_stack=3,
-                 momentum=0.1, eps=1e-5, tf_like=False):
+                 momentum=0.997, eps=1e-5, tf_like=False):
         """
 
         Args:
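The batch-norm momentum value deserves a second look because the two frameworks define it in opposite directions. In `torch.nn.BatchNorm2d`, `momentum` is the weight given to the new batch statistic, whereas the TensorFlow-style parameter is the weight given to the old moving average. The sketch below is plain Python with hypothetical function names; it only illustrates the two conventions. Whether `Network` converts the value internally is not visible in this diff, so treat the equivalence as something to check rather than as a description of the code.

```python
import math


def running_update_pytorch_style(running, batch_stat, momentum):
    """PyTorch convention: momentum weights the NEW batch statistic."""
    return (1 - momentum) * running + momentum * batch_stat


def moving_update_tf_style(moving, batch_stat, decay):
    """TensorFlow convention: the parameter weights the OLD moving average."""
    return decay * moving + (1 - decay) * batch_stat


running, batch_stat = 0.0, 1.0

# A TensorFlow-style value of 0.997 matches a PyTorch-style value of 1 - 0.997.
tf_result = moving_update_tf_style(running, batch_stat, decay=0.997)
pt_equivalent = running_update_pytorch_style(running, batch_stat, momentum=1 - 0.997)

# Passing 0.997 unchanged under the PyTorch convention makes the running
# statistic follow the latest batch almost completely.
pt_unconverted = running_update_pytorch_style(running, batch_stat, momentum=0.997)

print(tf_result, pt_equivalent, pt_unconverted)
```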
diff --git a/nasbench_pytorch/trainer.py b/nasbench_pytorch/trainer.py
@@ -74,7 +74,8 @@ def train(net, train_loader, loss=None, optimizer=None, scheduler=None, grad_cli
             optimizer.zero_grad()
             curr_loss = loss(outputs, targets)
             curr_loss.backward()
-            nn.utils.clip_grad_norm_(net.parameters(), grad_clip)
+            if grad_clip is not None:
+                nn.utils.clip_grad_norm_(net.parameters(), grad_clip)
             optimizer.step()
 
             # metrics
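A boolean switch such as `--grad_clip_off` needs some care with `argparse`: `type=bool` calls `bool()` on the raw string, so any non-empty value, including `False`, parses as `True`, while `action='store_true'` gives a plain on/off flag. The sketch below uses a hypothetical parser (the `--bool_typed` option and the `effective_grad_clip` helper are made up for illustration) to show the pitfall and how a switch can be mapped to the `None` value that disables clipping in `train`.

```python
import argparse


def build_parser():
    """Hypothetical parser for illustration; not the parser from main.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--grad_clip', default=5.0, type=float)
    # Pitfall: bool('False') is True, because the string is non-empty.
    parser.add_argument('--bool_typed', default=False, type=bool)
    # A switch that is False unless the flag is present.
    parser.add_argument('--grad_clip_off', action='store_true')
    return parser


def effective_grad_clip(args):
    """Return None when clipping is switched off, else the clipping norm."""
    return None if args.grad_clip_off else args.grad_clip


parser = build_parser()
pitfall = parser.parse_args(['--bool_typed', 'False'])
default = parser.parse_args([])
switched_off = parser.parse_args(['--grad_clip_off'])

print(pitfall.bool_typed)
print(effective_grad_clip(default))
print(effective_grad_clip(switched_off))
```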
diff --git a/setup.py b/setup.py
@@ -2,8 +2,8 @@
 
 setuptools.setup(
     name='nasbench_pytorch',
-    version='1.2.3',
+    version='1.3',
     license='Apache License 2.0',
-    author='Romulus Hong, Gabriela Suchopárová',
+    author='Romulus Hong, Gabriela Kadlecová',
     packages=setuptools.find_packages()
 )