Conversation
- …tion (Co-authored-by: njzjz <9496702+njzjz@users.noreply.github.com>)
- …nherit from RuntimeError (Co-authored-by: njzjz <9496702+njzjz@users.noreply.github.com>)
- …CPU values and fixed loss keys (Co-authored-by: njzjz <9496702+njzjz@users.noreply.github.com>)
- …energy loss (Co-authored-by: njzjz <9496702+njzjz@users.noreply.github.com>)
**Codecov Report**

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

    @@ Coverage Diff @@
    ##           master    #4986   +/-   ##
    =======================================
      Coverage   84.21%   84.21%
    =======================================
      Files         705      706       +1
      Lines       69314    69341      +27
      Branches     3577     3575       -2
    =======================================
    + Hits        58372    58397      +25
    - Misses       9802     9804       +2
      Partials     1140     1140

☔ View full report in Codecov by Sentry.
Pull Request Overview
This PR adds NaN detection functionality during training to prevent wasted training time when loss becomes NaN. The implementation includes a dedicated NaN detector utility and integration across all training backends (TensorFlow, PyTorch, and Paddle).
- Creates a new NaN detection utility that raises exceptions when NaN is detected in total loss
- Integrates NaN checking into training loops for TF, PyTorch, and Paddle backends
- Adds comprehensive test coverage for both the utility functions and integration scenarios
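Based on the description above, a minimal sketch of such a utility could look like the following. The function name `check_total_loss_nan` appears in the diff; the exception class name `LossNanError` and the message wording are assumptions for illustration (the review only notes that the exception inherits from `RuntimeError`), not the exact implementation in `deepmd/utils/nan_detector.py`:

```python
import math


class LossNanError(RuntimeError):
    """Raised when the total training loss becomes NaN.

    Hypothetical name: the actual exception class in
    deepmd/utils/nan_detector.py may differ, but per the review
    it inherits from RuntimeError.
    """


def check_total_loss_nan(step: int, total_loss: float) -> None:
    """Raise LossNanError if the total loss at this step is NaN."""
    if math.isnan(total_loss):
        raise LossNanError(
            f"Total loss is NaN at step {step}; stopping training early "
            "to avoid wasting compute on a diverged run."
        )
```

Raising an exception (rather than merely logging) is what lets every backend's training loop abort the run at the point of divergence.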
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Summary per file:

| File | Description |
|---|---|
| deepmd/utils/nan_detector.py | New utility module with NaN detection function and custom exception class |
| deepmd/tf/train/trainer.py | Integrates NaN checking into TensorFlow training loop |
| deepmd/pt/train/training.py | Integrates NaN checking into PyTorch training loop for both single and multi-task scenarios |
| deepmd/pd/train/training.py | Integrates NaN checking into Paddle training loop for both single and multi-task scenarios |
| source/tests/common/test_nan_detector.py | Unit tests for the NaN detection utility functions |
| source/tests/common/test_nan_integration.py | Integration tests verifying NaN detection behavior during training scenarios |
    # Check for NaN in total loss before writing to file and saving checkpoint
    # We check the main total loss component that represents training loss
    check_total_loss_nan(cur_batch, train_results["rmse"])
The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.
Suggested change:

    - check_total_loss_nan(cur_batch, train_results["rmse"])
    + check_total_loss_nan(cur_batch, train_results["loss"])
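The failure mode this comment describes can be shown with a toy metrics dict: the displayed RMSE can stay finite while the actual total loss entry is already NaN, so a check keyed on the wrong entry silently misses the divergence. The dict contents below are made up for demonstration:

```python
import math

# Hypothetical metrics snapshot: the per-display RMSE is still finite,
# but the total loss has already diverged to NaN.
train_results = {"rmse": 0.05, "loss": float("nan")}

# A check keyed on "rmse" sees a healthy value and does nothing...
rmse_is_nan = math.isnan(train_results["rmse"])
# ...while a check keyed on "loss" catches the divergence.
loss_is_nan = math.isnan(train_results["loss"])
```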
    if self.rank == 0 and "rmse" in train_results:
        check_total_loss_nan(display_step_id, train_results["rmse"])
The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.
Suggested change:

    - if self.rank == 0 and "rmse" in train_results:
    -     check_total_loss_nan(display_step_id, train_results["rmse"])
    + if self.rank == 0:
    +     check_total_loss_nan(display_step_id, loss)
    if self.rank == 0 and "rmse" in train_results[_key]:
        check_total_loss_nan(
            display_step_id, train_results[_key]["rmse"]
        )
The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.
Suggested change:

    - if self.rank == 0 and "rmse" in train_results[_key]:
    -     check_total_loss_nan(
    -         display_step_id, train_results[_key]["rmse"]
    + if self.rank == 0:
    +     check_total_loss_nan(
    +         display_step_id, loss
    train_results = log_loss_train(loss, more_loss)
    # Check for NaN in total loss using CPU values from lcurve computation
    if self.rank == 0 and "rmse" in train_results:
        check_total_loss_nan(display_step_id, train_results["rmse"])
The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.
Suggested change:

    - check_total_loss_nan(display_step_id, train_results["rmse"])
    + check_total_loss_nan(display_step_id, loss)
    # Check for NaN in total loss using CPU values from lcurve computation
    if self.rank == 0 and "rmse" in train_results[_key]:
        check_total_loss_nan(
            display_step_id, train_results[_key]["rmse"]
        )
The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.
Codex Review: Here are some suggestions.
    # Check for NaN in total loss before writing to file and saving checkpoint
    # We check the main total loss component that represents training loss
    check_total_loss_nan(cur_batch, train_results["rmse"])
[P1] Guard against missing 'rmse' metric in TensorFlow NaN check
NaN detection in valid_on_the_fly calls check_total_loss_nan(cur_batch, train_results["rmse"]) unconditionally. However get_evaluation_results often produces metrics keyed as rmse_e, rmse_f, etc., and does not guarantee a "rmse" entry (the comment below mentions rmse_*). In those configurations training now raises KeyError: 'rmse' before any logging or checkpointing, whereas the Paddle and PyTorch trainers already guard with "rmse" in train_results. TensorFlow should perform the same presence check or compute the appropriate scalar before invoking the NaN detector.
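A presence guard along the lines the reviewer suggests, matching what the Paddle and PyTorch trainers already do, could look like the sketch below. `check_total_loss_nan` is stubbed here for self-containedness (the real function lives in `deepmd/utils/nan_detector.py`), and `guarded_nan_check` is a hypothetical helper name:

```python
import math


def check_total_loss_nan(step: int, value: float) -> None:
    # Minimal stand-in for deepmd.utils.nan_detector.check_total_loss_nan.
    if math.isnan(value):
        raise RuntimeError(f"NaN total loss detected at step {step}")


def guarded_nan_check(step: int, train_results: dict) -> None:
    # Only check when an aggregate "rmse" entry exists: some configurations
    # emit only per-component keys such as rmse_e / rmse_f, and an
    # unconditional train_results["rmse"] lookup would raise KeyError
    # before any logging or checkpointing.
    if "rmse" in train_results:
        check_total_loss_nan(step, train_results["rmse"])
```

With this guard, a metrics dict holding only `rmse_e`/`rmse_f` passes through untouched, while a NaN aggregate `rmse` still aborts the run.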
Fix deepmodeling#4985. This implementation is much simpler than deepmodeling#4986. Signed-off-by: Jinzhe Zeng <jinzhe.zeng@ustc.edu.cn>
Close in favor of #5135.
Fix #4985. This implementation is much simpler than #4986.

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved training-metric validation to detect NaN total RMSE, logging a clear error and halting runs to avoid silent failures.
* **Documentation**
  * Added documentation for the new option that controls NaN checking so users can enable or disable the validation as needed.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@ustc.edu.cn>
Signed-off-by: Jinzhe Zeng <njzjz@qq.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Properly reverted the implib file to address reviewer feedback:
✅ Changes Made
Properly Reverted Implib File (Comment #2365994965)
Reverted source/3rdparty/implib/implib-gen.py to the exact original state before the PR.

🔧 Technical Details

The issue was that the file had been automatically reformatted by code formatters. This revert ensures the third-party file remains completely unchanged from its original state.
✅ Validation
The implib file is now exactly as it was before the PR started, addressing the reviewer's concern that "it's still different from devel".