[2026-03-06 08:56:47,366][osuT5.utils.train_utils][INFO] - {'train/loss': 0.10812649875879288, 'train/grad_l2':
0.23848098367452622, 'train/weights_l2': 323.96180091922207, 'train/lr': 0.0010721064521583958,
'train/seconds_per_step': 1.6197779893875122, 'train/epoch': 97, 'step': 5000}
[2026-03-06 08:56:47,368][accelerate.accelerator][INFO] - Saving current state to checkpoint-5000
[2026-03-06 08:56:48,079][accelerate.checkpointing][INFO] - Model weights saved in
checkpoint-5000/pytorch_model.bin
[2026-03-06 08:56:48,161][accelerate.checkpointing][INFO] - Optimizer state saved in
checkpoint-5000/optimizer.bin
[2026-03-06 08:56:48,162][accelerate.checkpointing][INFO] - Scheduler state saved in
checkpoint-5000/scheduler.bin
[2026-03-06 08:56:48,163][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in
checkpoint-5000/sampler.bin
[2026-03-06 08:56:48,164][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in
checkpoint-5000/sampler_1.bin
[2026-03-06 08:56:48,165][accelerate.checkpointing][INFO] - Random states saved in
checkpoint-5000/random_states_0.pkl
[2026-03-06 08:56:48,166][accelerate.checkpointing][INFO] - Saving the state of Tokenizer to
checkpoint-5000/custom_checkpoint_0.pkl
Error executing job with overrides: ['data.train_dataset_path=/test/', 'data.test_dataset_path=/test/', 'compile=false',
'data.train_dataset_start=0', 'data.train_dataset_end=66', 'data.test_dataset_start=0',
'data.test_dataset_end=0', 'optim.base_lr=0.0025', 'optim.base_lr_2=0.0005', 'optim.batch_size=8',
'optim.grad_acc=1', 'optim.total_steps=9000', 'optim.warmup_steps=200']
Traceback (most recent call last):
File "/workspace/Mapperatorinator/osuT5/train.py", line 111, in main
func(
File "/workspace/Mapperatorinator/osuT5/osuT5/utils/train_utils.py", line 377, in train
maybe_save_checkpoint(model, accelerator, args, shared)
File "/workspace/Mapperatorinator/osuT5/osuT5/utils/train_utils.py", line 70, in maybe_save_checkpoint
art = wandb.Artifact(
^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/wandb/sdk/artifacts/artifact.py", line 217, in __init__
self._metadata: dict[str, Any] = validate_metadata(metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/functools.py", line 909, in wrapper
return dispatch(args[0].__class__)(*args, **kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/wandb/sdk/artifacts/_validators.py", line 218, in _
metadata = from_json(json.dumps(json_friendly_val(metadata), allow_nan=False))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/json/__init__.py", line 238, in dumps
**kw).encode(obj)
^^^^^^^^^^^
File "/opt/conda/lib/python3.11/json/encoder.py", line 200, in encode
chunks = self.iterencode(o, _one_shot=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/json/encoder.py", line 258, in iterencode
return _iterencode(o, 0)
^^^^^^^^^^^^^^^^^
ValueError: Out of range float values are not JSON compliant
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
wandb:
wandb: 🚀 View run wandering-wood-117 at:
wandb: Find logs at: ...
Traceback:
Workaround:
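Resume training from the last saved checkpoint. The log above shows that checkpoint-5000 (model weights, optimizer, scheduler, sampler, random states, and tokenizer state) was written before the wandb.Artifact call failed, so the run can continue from step 5000.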
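
A possible longer-term fix, sketched under assumptions: the traceback shows that wandb validates artifact metadata with `json.dumps(..., allow_nan=False)`, which rejects NaN and infinite floats. The metadata dict built in `maybe_save_checkpoint` is not shown in the log, so the dict below and the helper name `sanitize_metadata` are hypothetical. The idea is to replace non-finite floats with `None` before the metadata is handed to `wandb.Artifact`.

```python
import json
import math
from typing import Any


def sanitize_metadata(value: Any) -> Any:
    """Recursively replace NaN/inf floats with None so the result is strict JSON."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: sanitize_metadata(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize_metadata(item) for item in value]
    return value


# Hypothetical metadata: the real dict passed to wandb.Artifact is not in the log.
metadata = {"step": 5000, "best_loss": float("inf"), "history": [0.1, float("nan")]}

clean = sanitize_metadata(metadata)

# Same strict encoding that fails in the traceback; it succeeds once the
# non-finite floats have been replaced.
encoded = json.dumps(clean, allow_nan=False)
```

If this matches the actual cause, passing `sanitize_metadata(...)` of the real metadata to `wandb.Artifact(..., metadata=...)` should avoid the `ValueError`. Which metadata field is non-finite would still need to be checked in `train_utils.py`.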