Update feature statuses and add reproducibility section
Updated the status of several features from 'prototype' to 'supported' in the README, and added a new section for reproducibility and extensibility features.
README.md: 17 additions & 6 deletions
@@ -275,11 +275,12 @@ In the following, we list the most important features of Modalities.
| Memmap for efficient data loading | supported | Optimizes the data pipeline to reduce I/O bottlenecks. |
| Activation Checkpointing | supported | Saves intermediate activations to memory only at certain points during the forward pass and recomputes them during the backward pass, reducing memory usage at the cost of additional computation. |
| Flash Attention | supported | A highly optimized attention mechanism that significantly reduces the computational burden and memory footprint of attention calculations, enabling faster training and inference on large models. |
- | Tensor Parallelism | prototype | Implementing vertical model sharding, as an efficient model parallelism technique |
- | Sequence Parallelism | prototype | Variant of Tensor Parallelism that shard on the sequence dimension |
- | FSDP 2 | prototype | Improved version of the original FSDP |
- | Torch Compile | prototype | Speeds up tensor operations by JIT compiling tensor operations into optimized kernels |
- | Deferred Initialisation | prototype | Instead of instantiating the model in CPU RAM, the modules are instantiated as fake tensors and operations are recorded. Once sharded (e.g., via FSDP), each rank only instantiates the local tensors by replaying the tensor operations. |
+ | Tensor Parallelism | supported | Implements vertical model sharding as an efficient model parallelism technique. |
+ | Sequence Parallelism | supported | Variant of Tensor Parallelism that shards along the sequence dimension. |
+ | Pipeline Parallelism | supported | Supports GPipe. Alternative schedules such as (interleaved) 1F1B are being implemented. |
+ | FSDP 2 | supported | Improved version of the original FSDP. |
+ | Torch Compile | supported | Speeds up tensor operations by JIT-compiling them into optimized kernels. |
+ | Deferred Initialisation | supported | Instead of instantiating the model in CPU RAM, the modules are instantiated as fake tensors and their operations are recorded. Once sharded (e.g., via FSDP), each rank instantiates only its local tensors by replaying the recorded operations. |
| Adaptive Batch Size Exploration | planned | Dynamically increases the training batch size during the training process to identify the maximum batch size that can be accommodated by a given GPU setup without causing memory overflow or performance degradation. |
| Node Failure Recovery | planned | Implements mechanisms to automatically detect and recover from failures (e.g., node or GPU failures) in distributed training environments, ensuring that training can continue with minimal interruption even if one or more nodes / GPUs in the cluster fail. |
| Loss Parallelism | planned | Reduces memory footprint and communication overhead by computing the loss locally on each rank. |
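The column-parallel sharding behind the Tensor Parallelism row can be illustrated with a toy sketch in plain Python (illustrative only, not Modalities' implementation; the rank count and matrix sizes are made up — in practice the shards live on different GPUs as torch tensors):

```python
# Toy column-parallel linear layer: each "rank" holds a slice of the
# weight's output columns; concatenating the per-rank outputs matches
# the unsharded matmul, which is the core invariant of tensor parallelism.

def matmul(x, w):
    # x: m x k matrix (list of rows), w: k x n matrix -> m x n result
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def shard_columns(w, ranks):
    # Split the weight's output dimension (columns) into `ranks` slices.
    n = len(w[0])
    step = n // ranks
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(ranks)]

x = [[1.0, 2.0]]            # one input row, k = 2
w = [[1.0, 2.0, 3.0, 4.0],  # k x n weight, n = 4
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)
shards = shard_columns(w, ranks=2)
# Each rank computes its partial output; concatenating row-wise
# recovers the full result (the "all-gather" step, done locally here).
parallel = [sum((matmul(x, ws)[i] for ws in shards), []) for i in range(len(x))]
assert parallel == full
```

Sequence parallelism applies the same idea along the sequence dimension of the activations instead of the weight columns.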
@@ -302,6 +303,16 @@ In the following, we list the most important features of Modalities.
| Knowledge Distillation | planned | Transfers knowledge from a larger, complex model to a smaller, more efficient model, improving the smaller model's performance without the computational cost of the larger model. |
| Hyperparameter Optimization | planned | Grid search for various hyperparameters such as LR, optimizer arguments, etc. The integration of µP might also be interesting. |
+ | Self-contained Configurations | supported | Every experiment configuration fully specifies all components, hyperparameters, and seeds, ensuring that experiments are reproducible by design without requiring external context or hidden state. |
+ | Registry for Custom Components | supported | Modalities uses a registry-based architecture where all components implement generic interfaces, enabling seamless replacement or extension with (custom) modules at runtime. |
+ | Generic Benchmarking | supported | Supports systematic grid searches over arbitrary parameters to benchmark throughput, memory footprint, and downstream performance across model, data, and system configurations. |
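The registry-based extension mechanism described in the added rows can be sketched in plain Python (a minimal, hypothetical sketch — the class and key names are made up and Modalities' actual registry API differs):

```python
# Minimal component registry: classes are registered under a string key
# and resolved at runtime, so a user-defined component can replace a
# built-in one without modifying framework code.

class Registry:
    def __init__(self):
        self._components = {}

    def register(self, key):
        # Decorator that records the class under `key`.
        def decorator(cls):
            self._components[key] = cls
            return cls
        return decorator

    def build(self, key, **kwargs):
        # Instantiate the component registered under `key`.
        if key not in self._components:
            raise KeyError(f"no component registered under {key!r}")
        return self._components[key](**kwargs)

registry = Registry()

@registry.register("loss/cross_entropy")
class CrossEntropyLoss:
    def __init__(self, label_smoothing=0.0):
        self.label_smoothing = label_smoothing

# A custom component slots in through the same generic interface:
@registry.register("loss/custom")
class MyCustomLoss:
    def __init__(self, weight=1.0):
        self.weight = weight

loss = registry.build("loss/custom", weight=0.5)
assert isinstance(loss, MyCustomLoss) and loss.weight == 0.5
```

Because a self-contained configuration only names registry keys and their arguments, swapping a component is a one-line config change rather than a code change.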
## Scaling Experiments
@@ -394,4 +405,4 @@ Thanks so much to all of our amazing contributors!