
Commit e462f57

Merge pull request #425 from Modalities/monitoring_improvements
Monitoring improvements
2 parents c775efa + 4234f55 commit e462f57

86 files changed: 1888 additions & 713 deletions


.gitignore

Lines changed: 4 additions & 1 deletion
```diff
@@ -172,4 +172,7 @@ config_files/instruction_tuning
 data/lorem_ipsum_instruct.jsonl
 tutorials/scaling_up/logs*
 tutorials/scaling_up/experiments_old/*
-results/*
+results/*
+tutorials/einsum_transformer/experiments/*
+tutorials/warmstart/experiments/*
+
```

README.md

Lines changed: 15 additions & 10 deletions
````diff
@@ -21,6 +21,8 @@
 
 Modalities is a PyTorch-native framework for distributed training of Large Language Models (LLMs) and Foundation Models (FMs) at scale. Given the complexity of distributed training and rapid advancements in the field, we aim to provide a flexible and easy-to-use framework that enables researchers and practitioners to train and evaluate LLMs and FMs efficiently. Modalities is built on top of PyTorch and leverages the latest advancements in distributed training, such as Fully Sharded Data Parallel (FSDP), mixed precision training, Flash Attention and many more, to achieve state-of-the-art performance and throughput.
 
+For a technical report on the architecture and latest benchmarks, check out our [Modalities pre-print](https://arxiv.org/abs/2602.08387).
+
 We successfully scaled Modalities up to 2048 GPUs on two HPC centers, namely [Leonardo Booster](https://leonardo-supercomputer.cineca.eu/hpc-system/) and [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), featuring Nvidia A100 and H100 GPUs, respectively. The results of our scaling experiments can be found [here](#scaling-experiments).
 
 Besides its scalability, Modalities allows you to seamlessly integrate new components and features, such as custom attention mechanisms, loss functions, optimizers or models. We provide a series of tutorials to help you get started with training and evaluating models using Modalities. We achieve this level of extensibility by having clear interfaces for each component type (e.g., model, optimizer, etc.) that a component must implement to be registered within Modalities at runtime.
@@ -277,7 +279,7 @@ In the following, we list the most important features of Modalities.
 | Flash Attention | supported | A highly optimized attention mechanism that significantly reduces the computational burden and memory footprint of attention calculations, enabling faster training and inference on large models. |
 | Tensor Parallelism | supported | Implements vertical model sharding, an efficient model parallelism technique |
 | Sequence Parallelism | supported | Variant of Tensor Parallelism that shards on the sequence dimension |
-| Pipeline Parallelism | supported | Support for GPipe. Alternative schedules such as (interleaved) 1F1B are being implemented. |
+| Pipeline Parallelism | supported | Beta-level support for schedules such as GPipe, (interleaved) 1F1B and DualPipe. |
 | FSDP 2 | supported | Improved version of the original FSDP |
 | Torch Compile | supported | Speeds up tensor operations by JIT compiling tensor operations into optimized kernels |
 | Deferred Initialisation | supported | Instead of instantiating the model in CPU RAM, the modules are instantiated as fake tensors and operations are recorded. Once sharded (e.g., via FSDP), each rank only instantiates the local tensors by replaying the tensor operations. |
@@ -390,19 +392,22 @@ Further scaling results can be found at [MareNostrum5 Scaling Experiments](https
 Modalities welcomes your contributions! Please check out our
 [contributing](CONTRIBUTING.md) guidelines regarding the details on formatting, testing,
 etc.<br/><br/><br/>
-Thanks so much to all of our amazing contributors!
+Thanks so much to all of our contributors and collaborators!
 
 <a href="https://github.com/modalities/modalities/graphs/contributors">
   <img src="https://contrib.rocks/image?repo=modalities/modalities&r=" width="800px"/>
 </a>
 
 ## Citation
 
-@misc{modalities,
-      title={Modalities: A PyTorch-native framework for distributed and reproducible foundation model training.},
-      author={Lübbering, Max and Ali, Mehdi and Stollenwerk, Felix and Fromm, Michael and Weber, Alexander Arno and Rutmann, Richard},
-      year={2024},
-      howpublished={\url{https://github.com/Modalities/modalities}},
-      url="https://github.com/Modalities/modalities",
-}
+```
+@misc{luebbering2026modalitiespytorchnativeframeworklargescale,
+      title={Modalities, a PyTorch-native Framework For Large-scale LLM Training and Research},
+      author={Max Lübbering and Timm Ruland and Richard Rutmann and Felix Stollenwerk and David Fitzek and Michael Fromm and Alexander Weber and Rafet Sifa and Nicolas Flores-Herr and Joachim Köhler and Mehdi Ali},
+      year={2026},
+      eprint={2602.08387},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2602.08387},
+}
+```
````

config_files/training/config_lorem_ipsum_long_fsdp2.yaml

Lines changed: 6 additions & 3 deletions
```diff
@@ -71,7 +71,7 @@ train_dataset:
   config:
     raw_data_path: ${settings.paths.train_dataset_path}
     sequence_length: ${settings.step_profile.sequence_length}
-    sample_key: ${settings.referencing_keys.sample_key}
+    sample_key: ${settings.referencing_keys.sample_key}
 
 train_dataloader:
   component_key: data_loader
@@ -195,7 +195,7 @@ app_state:
   component_key: app_state
   variant_key: raw
   config:
-    model:
+    model:
       instance_key: initialized_model
       pass_type: BY_REFERENCE
     optimizer:
@@ -305,7 +305,7 @@ optimizer:
     eps: 1e-8
     weight_decay: 1e-1
    weight_decay_groups_excluded: [embedding, layernorm]
-    wrapped_model:
+    wrapped_model:
      instance_key: initialized_model
      pass_type: BY_REFERENCE
@@ -318,6 +318,9 @@ gradient_clipper:
      pass_type: BY_REFERENCE
    norm_type: P2_NORM
    max_norm: 1.0
+  device_mesh:
+    instance_key: device_mesh
+    pass_type: BY_REFERENCE
 
 progress_subscriber:
   component_key: progress_subscriber
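The new `device_mesh` entry relies on the config system's reference passing: rather than building a second mesh, the gradient clipper is handed the already-instantiated `device_mesh` component. The sketch below illustrates how a `pass_type: BY_REFERENCE` lookup could work; it is an illustration under assumptions, not Modalities' actual resolver, and `resolve_reference` and the `components` dict are hypothetical names:

```python
from typing import Any


def resolve_reference(components: dict[str, Any], ref_config: dict[str, str]) -> Any:
    """Return the already-instantiated component named by `instance_key`.

    Hypothetical helper: BY_REFERENCE means "share the existing object"
    instead of constructing a new component from a nested config.
    """
    if ref_config["pass_type"] != "BY_REFERENCE":
        raise ValueError(f"Unsupported pass_type: {ref_config['pass_type']}")
    return components[ref_config["instance_key"]]


components = {"device_mesh": object()}  # stand-in for an instantiated torch DeviceMesh
mesh_for_clipper = resolve_reference(
    components, {"instance_key": "device_mesh", "pass_type": "BY_REFERENCE"}
)
assert mesh_for_clipper is components["device_mesh"]  # same object, no copy
```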

src/modalities/__main__.py

Lines changed: 70 additions & 75 deletions
```diff
@@ -54,10 +54,10 @@ def main() -> None:
     help="Path to the YAML training config file.",
 )
 @click.option(
-    "--test_comm",
-    is_flag=True,
-    default=False,
-    help="If set, run a communication test before training.",
+    "--experiments_root_path",
+    type=click_pathlib.Path(exists=True),
+    required=True,
+    help="Path to the root directory where experiment folders will be created.",
 )
 @click.option(
     "--experiment_id",
@@ -71,61 +71,51 @@
     default=None,
     help="Optional path to a folder where error logs will be written.",
 )
+@click.option(
+    "--test_comm",
+    is_flag=True,
+    default=False,
+    help="If set, run a communication test before training.",
+)
 def CMD_entry_point_run_modalities(
     config_file_path: Path,
-    test_comm: bool = False,
+    experiments_root_path: Path,
     experiment_id: Optional[str] = None,
     error_log_folder: Optional[Path] = None,
+    test_comm: bool = False,
 ):
     """Entrypoint to run the model training.
 
     Args:
         config_file_path (Path): Path to the YAML training config file.
-        test_comm (bool): If set, run a communication test before training.
+        experiments_root_path (Path): Path to the root directory where experiment folders will be created.
         experiment_id (Optional[str]): Optional experiment ID to use for this run.
             If not provided it will be generated. Default is None.
         error_log_folder (Optional[Path]): Optional path to a folder where error logs will be written.
+        test_comm (bool): If set, run a communication test before training.
     """
 
-    def _format_exception_as_json(e: Exception, environment: dict[str, Any]) -> str:
-        # Format an exception into a structured JSON string with error message, type, and stack trace.
-        error = {
-            "error": str(e),
-            "type": type(e).__name__,
-            "stacktrace": traceback.format_exception(type(e), e, e.__traceback__),
-        }
-
-        return json.dumps({"environment": environment, "error": error}, indent=2)
-
     try:
         with CudaEnv(process_group_backend=ProcessGroupBackendType.nccl):
             if test_comm:
                 print_rank_0("Running communication test...")
                 run_communication_test()
                 print_rank_0("Communication test succeeded.")
 
-            main_obj = Main(config_file_path, experiment_id=experiment_id)
+            main_obj = Main(config_file_path, experiments_root_path=experiments_root_path, experiment_id=experiment_id)
             components = main_obj.build_components(components_model_type=TrainingComponentsInstantiationModel)
             main_obj.run(components)
     except Exception as e:
-        if error_log_folder is not None:
-            environment = {
-                "rank": int(os.environ["RANK"] if "RANK" in os.environ else -1),
-                "local_rank": int(os.environ["LOCAL_RANK"] if "LOCAL_RANK" in os.environ else -1),
-                "world_size": int(os.environ["WORLD_SIZE"] if "WORLD_SIZE" in os.environ else -1),
-                "hostname": socket.gethostname(),
-            }
-            error_log_folder = (
-                error_log_folder / f"error_logs_{environment['hostname']}_{environment['local_rank']}.log"
-            )
-            error_log_folder.parent.mkdir(parents=True, exist_ok=True)
-            with open(error_log_folder, "w", encoding="utf-8") as f:
-                f.write(_format_exception_as_json(e, environment))
-
-        raise RuntimeError(f"An error occurred while running the training: {e}. ") from e
+        _exception_handling(e, error_log_folder)
 
 
 @main.command(name="warmstart")
+@click.option(
+    "--experiments_root_path",
+    type=click_pathlib.Path(exists=True),
+    required=True,
+    help="Path to the root directory where experiment folders will be created.",
+)
 @click.option(
     "--config_file_path",
     type=click_pathlib.Path(exists=True),
@@ -138,10 +128,22 @@ def _format_exception_as_json(e: Exception, environment: dict[str, Any]) -> str:
     required=True,
     help="Path to the file containing the model and optimizer checkpoint paths from the last successful checkpoint.",
 )
-def CMD_entry_point_warmstart_modalities(config_file_path: Path, last_checkpoint_info_file_path: Path):
+@click.option(
+    "--error_log_folder",
+    type=click_pathlib.Path(),
+    default=None,
+    help="Optional path to a folder where error logs will be written.",
+)
+def CMD_entry_point_warmstart_modalities(
+    experiments_root_path: Path,
+    config_file_path: Path,
+    last_checkpoint_info_file_path: Path,
+    error_log_folder: Optional[Path] = None,
+):
     """Entrypoint to run the model warmstart.
 
     Args:
+        experiments_root_path (Path): Path to the root directory where experiment folders will be created.
         config_file_path (Path): Path to the YAML warmstart config file.
         last_checkpoint_info_file_path (Path): Path to the file containing the model and
             optimizer checkpoint paths from the last successful checkpoint.
@@ -159,10 +161,15 @@ def get_last_checkpoint_resolver_fun(var_name: str, last_checkpoint_info_file_pa
             get_last_checkpoint_resolver_fun, last_checkpoint_info_file_path=last_checkpoint_info_file_path
         )
     }
-    with CudaEnv(process_group_backend=ProcessGroupBackendType.nccl):
-        main_obj = Main(config_file_path, additional_resolver_funs=resolver_funs)
-        components = main_obj.build_components(components_model_type=TrainingComponentsInstantiationModel)
-        main_obj.run(components)
+    try:
+        with CudaEnv(process_group_backend=ProcessGroupBackendType.nccl):
+            main_obj = Main(
+                config_file_path, experiments_root_path=experiments_root_path, additional_resolver_funs=resolver_funs
+            )
+            components = main_obj.build_components(components_model_type=TrainingComponentsInstantiationModel)
+            main_obj.run(components)
+    except Exception as e:
+        _exception_handling(e, error_log_folder)
 
 
 @main.command(name="generate_text")
@@ -705,54 +712,42 @@ def profile():
     required=True,
     help="Path to the experiment output directory.",
 )
-@click.option(
-    "--num_wait_steps",
-    type=int,
-    default=1,
-    show_default=True,
-    help="Number of wait steps to skip in profiling.",
-)
-@click.option(
-    "--num_warmup_steps",
-    type=int,
-    default=1,
-    show_default=True,
-    help="Number of warmup steps to skip in profiling. Already recording but dropping the data.",
-)
-@click.option(
-    "--num_measurement_steps",
-    type=int,
-    default=3,
-    show_default=True,
-    help="Number of steps to measure during profiling.",
-)
-@click.option(
-    "--profiled_ranks",
-    type=str,
-    default="0",
-    help="Comma-separated list of profiled ranks (must not have spaces), e.g. --profiled_ranks '2,4,8'",
-)
 def CMD_entry_point_run_train_step_profiler(
     config_file_path: Path,
     experiment_root_path: Path,
-    num_wait_steps: int,
-    num_warmup_steps: int,
-    num_measurement_steps: int,
-    profiled_ranks: str,
 ):
     """Run train step profiler and write result to JSON if RANK=0."""
-    profiled_ranks_list = [int(i) for i in profiled_ranks.split(",")] if profiled_ranks != "" else [0]
-    logger.info(f"Running distributed profiling on ranks {profiled_ranks_list}")
-
     ModalitiesProfilerStarter.run_distributed(
         config_file_path=config_file_path,
-        num_measurement_steps=num_measurement_steps,
-        num_wait_steps=num_wait_steps,
-        num_warmup_steps=num_warmup_steps,
         experiment_root_path=experiment_root_path,
-        profiled_ranks=profiled_ranks_list,
     )
 
 
+def _format_exception_as_json(e: Exception, environment: dict[str, Any]) -> str:
+    # Format an exception into a structured JSON string with error message, type, and stack trace.
+    error = {
+        "error": str(e),
+        "type": type(e).__name__,
+        "stacktrace": traceback.format_exception(type(e), e, e.__traceback__),
+    }
+    return json.dumps({"environment": environment, "error": error}, indent=2)
+
+
+def _exception_handling(e: Exception, error_log_folder: Path | None):
+    if error_log_folder is not None:
+        environment = {
+            "rank": int(os.environ["RANK"] if "RANK" in os.environ else -1),
+            "local_rank": int(os.environ["LOCAL_RANK"] if "LOCAL_RANK" in os.environ else -1),
+            "world_size": int(os.environ["WORLD_SIZE"] if "WORLD_SIZE" in os.environ else -1),
+            "hostname": socket.gethostname(),
+        }
+        error_log_folder = error_log_folder / f"error_logs_{environment['hostname']}_{environment['local_rank']}.log"
+        error_log_folder.parent.mkdir(parents=True, exist_ok=True)
+        with open(error_log_folder, "w", encoding="utf-8") as f:
+            f.write(_format_exception_as_json(e, environment))
+
+    raise RuntimeError(f"An error occurred while running the training: {e}. ") from e
+
+
 if __name__ == "__main__":
     main()
```
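For orientation, the snippet below reproduces the `_format_exception_as_json` logic standalone to show the shape of the per-rank error log that `_exception_handling` writes to `error_logs_<hostname>_<local_rank>.log`. The environment values (`node-17`, rank 3, world size 8) are made up for the example:

```python
import json
import traceback


def format_exception_as_json(e: Exception, environment: dict) -> str:
    # Same structure as _format_exception_as_json in the diff above.
    error = {
        "error": str(e),
        "type": type(e).__name__,
        "stacktrace": traceback.format_exception(type(e), e, e.__traceback__),
    }
    return json.dumps({"environment": environment, "error": error}, indent=2)


try:
    raise ValueError("simulated training failure")
except ValueError as e:
    env = {"rank": 3, "local_rank": 3, "world_size": 8, "hostname": "node-17"}
    print(format_exception_as_json(e, env))
    # -> {"environment": {...}, "error": {"error": "...", "type": "ValueError", "stacktrace": [...]}}
```

Because the file name embeds hostname and local rank, each failing rank writes its own log instead of all ranks racing on one file.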

src/modalities/checkpointing/fsdp/fsdp_checkpoint_saving.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -80,7 +80,7 @@ def _get_checkpointing_path(
             num_target_tokens=str(num_target_tokens),
         )
 
-        full_path = Path(self.checkpoint_path, experiment_id, entity_file_name)
+        full_path = Path(self.checkpoint_path, entity_file_name)
         return full_path
 
     @torch.no_grad()
@@ -224,7 +224,7 @@ def _get_checkpointing_folder_path(
             num_target_steps=str(num_target_steps),
             num_target_tokens=str(num_target_tokens),
         )
-        full_path = Path(self.checkpoint_path, experiment_id, entity_file_name)
+        full_path = Path(self.checkpoint_path, entity_file_name)
         return full_path
 
     @torch.no_grad()
```
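Both checkpoint path helpers drop the extra `experiment_id` segment, presumably because with the new `--experiments_root_path` handling `self.checkpoint_path` already points inside the experiment folder. A sketch of the effect, with made-up paths (the folder layout shown is an assumption, not taken from the repository):

```python
from pathlib import Path

# Hypothetical values for illustration only.
checkpoint_path = Path("/experiments/2026-02-10__12-00-00_abc123/checkpoints")
experiment_id = "2026-02-10__12-00-00_abc123"
entity_file_name = "model-seen_steps_64.bin"

old_path = Path(checkpoint_path, experiment_id, entity_file_name)
# .../checkpoints/2026-02-10__12-00-00_abc123/model-seen_steps_64.bin  (id nested twice)
new_path = Path(checkpoint_path, entity_file_name)
# .../checkpoints/model-seen_steps_64.bin
```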
