Commit 63eeb11
fix: Validate fp16.loss_scale is finite and non-negative (#7889)
Validate fp16.loss_scale is finite and non-negative
Add a Pydantic field validator to DeepSpeedFP16Config to reject
NaN/inf/-inf and negative values for fp16.loss_scale (while keeping 0 as
dynamic loss scaling). This prevents invalid configs from silently
initializing and causing NaNs during training.
Test:
Run pytest -q tests/unit/runtime/test_precision_config_loss_scale.py
Result:
```
root@72170d0458e9:/home/DeepSpeed_woo# pytest -q tests/unit/runtime/test_precision_config_loss_scale.py
=================================================================== test session starts ===================================================================
platform linux -- Python 3.11.10, pytest-8.3.5, pluggy-1.6.0 -- /usr/bin/python
cachedir: .pytest_cache
Using --randomly-seed=1526199052
rootdir: /home/DeepSpeed_woo/tests
configfile: pytest.ini
plugins: xdist-3.8.0, randomly-4.0.1, forked-1.6.0, anyio-4.6.0
collected 10 items
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[3] PASSED [ 10%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[0] PASSED [ 20%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[inf] PASSED [ 30%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[1] PASSED [ 40%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[nan] PASSED [ 50%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_accepts_valid_values[2.0] PASSED [ 60%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[True] PASSED [ 70%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_invalid_type_has_clear_error[loss_scale0] PASSED [ 80%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_rejects_invalid_values[-1] PASSED [ 90%]
tests/unit/runtime/test_precision_config_loss_scale.py::test_fp16_loss_scale_invalid_type_has_clear_error[loss_scale1] PASSED [100%]
(30 durations < 1s hidden. Use -vv to show these durations.)
============================================================= 10 passed, 16 warnings in 4.18s =============================================================
```
Fix issue #7852
---------
Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>1 parent 784cc26 commit 63eeb11
2 files changed
Lines changed: 48 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
3 | 3 | | |
4 | 4 | | |
5 | 5 | | |
| 6 | + | |
| 7 | + | |
6 | 8 | | |
7 | 9 | | |
8 | 10 | | |
| |||
113 | 115 | | |
114 | 116 | | |
115 | 117 | | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
116 | 133 | | |
117 | 134 | | |
118 | 135 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
0 commit comments