Conversation
- …tion (Co-authored-by: njzjz <9496702+njzjz@users.noreply.github.com>)
- …nherit from RuntimeError (Co-authored-by: njzjz <9496702+njzjz@users.noreply.github.com>)
- …CPU values and fixed loss keys (Co-authored-by: njzjz <9496702+njzjz@users.noreply.github.com>)
- …energy loss (Co-authored-by: njzjz <9496702+njzjz@users.noreply.github.com>)
**Codecov Report**

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

    @@ Coverage Diff @@
    ##           master    #4986   +/-   ##
    =======================================
      Coverage   84.21%   84.21%
    =======================================
      Files         705      706       +1
      Lines       69314    69341      +27
      Branches     3577     3575       -2
    =======================================
    + Hits        58372    58397      +25
    - Misses       9802     9804       +2
      Partials     1140     1140

☔ View full report in Codecov by Sentry.
Pull Request Overview
This PR adds NaN detection functionality during training to prevent wasted training time when loss becomes NaN. The implementation includes a dedicated NaN detector utility and integration across all training backends (TensorFlow, PyTorch, and Paddle).
- Creates a new NaN detection utility that raises exceptions when NaN is detected in total loss
- Integrates NaN checking into training loops for TF, PyTorch, and Paddle backends
- Adds comprehensive test coverage for both the utility functions and integration scenarios
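Based on the description above, a minimal sketch of such a utility could look like the following. The function name `check_total_loss_nan` appears in the diff; the exception class name `LossNanError` and the message wording are assumptions for illustration (the review only notes that the exception inherits from `RuntimeError`), not the exact implementation in `deepmd/utils/nan_detector.py`:

```python
import math


class LossNanError(RuntimeError):
    """Raised when the total training loss becomes NaN.

    Hypothetical name: the actual exception class in
    deepmd/utils/nan_detector.py may differ, but per the review
    it inherits from RuntimeError.
    """


def check_total_loss_nan(step: int, total_loss: float) -> None:
    """Raise LossNanError if the total loss at this step is NaN."""
    if math.isnan(total_loss):
        raise LossNanError(
            f"Total loss is NaN at step {step}; stopping training early "
            "to avoid wasting compute on a diverged run."
        )
```

Raising an exception (rather than merely logging) is what lets every backend's training loop abort the run at the point of divergence.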
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Summary per file:

| File | Description |
|---|---|
| deepmd/utils/nan_detector.py | New utility module with NaN detection function and custom exception class |
| deepmd/tf/train/trainer.py | Integrates NaN checking into TensorFlow training loop |
| deepmd/pt/train/training.py | Integrates NaN checking into PyTorch training loop for both single and multi-task scenarios |
| deepmd/pd/train/training.py | Integrates NaN checking into Paddle training loop for both single and multi-task scenarios |
| source/tests/common/test_nan_detector.py | Unit tests for the NaN detection utility functions |
| source/tests/common/test_nan_integration.py | Integration tests verifying NaN detection behavior during training scenarios |
    # Check for NaN in total loss before writing to file and saving checkpoint
    # We check the main total loss component that represents training loss
    check_total_loss_nan(cur_batch, train_results["rmse"])
The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.
Suggested change:

    - check_total_loss_nan(cur_batch, train_results["rmse"])
    + check_total_loss_nan(cur_batch, train_results["loss"])
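The failure mode this comment describes can be shown with a toy metrics dict: the displayed RMSE can stay finite while the actual total loss entry is already NaN, so a check keyed on the wrong entry silently misses the divergence. The dict contents below are made up for demonstration:

```python
import math

# Hypothetical metrics snapshot: the per-display RMSE is still finite,
# but the total loss has already diverged to NaN.
train_results = {"rmse": 0.05, "loss": float("nan")}

# A check keyed on "rmse" sees a healthy value and does nothing...
rmse_is_nan = math.isnan(train_results["rmse"])
# ...while a check keyed on "loss" catches the divergence.
loss_is_nan = math.isnan(train_results["loss"])
```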
    if self.rank == 0 and "rmse" in train_results:
        check_total_loss_nan(display_step_id, train_results["rmse"])
The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.
Suggested change:

    - if self.rank == 0 and "rmse" in train_results:
    -     check_total_loss_nan(display_step_id, train_results["rmse"])
    + if self.rank == 0:
    +     check_total_loss_nan(display_step_id, loss)
    if self.rank == 0 and "rmse" in train_results[_key]:
        check_total_loss_nan(
            display_step_id, train_results[_key]["rmse"]
        )
The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.
Suggested change:

    - if self.rank == 0 and "rmse" in train_results[_key]:
    -     check_total_loss_nan(
    -         display_step_id, train_results[_key]["rmse"]
    + if self.rank == 0:
    +     check_total_loss_nan(
    +         display_step_id, loss
    train_results = log_loss_train(loss, more_loss)
    # Check for NaN in total loss using CPU values from lcurve computation
    if self.rank == 0 and "rmse" in train_results:
        check_total_loss_nan(display_step_id, train_results["rmse"])
The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.
Suggested change:

    - check_total_loss_nan(display_step_id, train_results["rmse"])
    + check_total_loss_nan(display_step_id, loss)
    # Check for NaN in total loss using CPU values from lcurve computation
    if self.rank == 0 and "rmse" in train_results[_key]:
        check_total_loss_nan(
            display_step_id, train_results[_key]["rmse"]
        )
The function is checking 'rmse' which represents root mean square error, not total loss. This could miss NaN in the actual total loss while falsely triggering on RMSE calculations. Consider using the actual total loss value instead of RMSE.
Codex Review: Here are some suggestions.
    # Check for NaN in total loss before writing to file and saving checkpoint
    # We check the main total loss component that represents training loss
    check_total_loss_nan(cur_batch, train_results["rmse"])
[P1] Guard against missing 'rmse' metric in TensorFlow NaN check
NaN detection in valid_on_the_fly calls check_total_loss_nan(cur_batch, train_results["rmse"]) unconditionally. However get_evaluation_results often produces metrics keyed as rmse_e, rmse_f, etc., and does not guarantee a "rmse" entry (the comment below mentions rmse_*). In those configurations training now raises KeyError: 'rmse' before any logging or checkpointing, whereas the Paddle and PyTorch trainers already guard with "rmse" in train_results. TensorFlow should perform the same presence check or compute the appropriate scalar before invoking the NaN detector.
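A presence guard along the lines the reviewer suggests, matching what the Paddle and PyTorch trainers already do, could look like the sketch below. `check_total_loss_nan` is stubbed here for self-containedness (the real function lives in `deepmd/utils/nan_detector.py`), and `guarded_nan_check` is a hypothetical helper name:

```python
import math


def check_total_loss_nan(step: int, value: float) -> None:
    # Minimal stand-in for deepmd.utils.nan_detector.check_total_loss_nan.
    if math.isnan(value):
        raise RuntimeError(f"NaN total loss detected at step {step}")


def guarded_nan_check(step: int, train_results: dict) -> None:
    # Only check when an aggregate "rmse" entry exists: some configurations
    # emit only per-component keys such as rmse_e / rmse_f, and an
    # unconditional train_results["rmse"] lookup would raise KeyError
    # before any logging or checkpointing.
    if "rmse" in train_results:
        check_total_loss_nan(step, train_results["rmse"])
```

With this guard, a metrics dict holding only `rmse_e`/`rmse_f` passes through untouched, while a NaN aggregate `rmse` still aborts the run.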
Fix deepmodeling#4985. This implementation is much simpler than deepmodeling#4986. Signed-off-by: Jinzhe Zeng <jinzhe.zeng@ustc.edu.cn>
Close in favor of #5135.
Fix #4985. This implementation is much simpler than #4986.

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved training-metric validation to detect NaN total RMSE, logging a clear error and halting runs to avoid silent failures.
* **Documentation**
  * Added documentation for the new option that controls NaN checking so users can enable or disable the validation as needed.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@ustc.edu.cn>
Signed-off-by: Jinzhe Zeng <njzjz@qq.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Properly reverted the implib file to address reviewer feedback:
✅ Changes Made
Properly Reverted Implib File (Comment #2365994965)
Reverted source/3rdparty/implib/implib-gen.py to the exact original state before the PR.

🔧 Technical Details

The issue was that the file had been automatically reformatted by code formatters. This revert ensures the third-party file remains completely unchanged from its original state.
✅ Validation
The implib file is now exactly as it was before the PR started, addressing the reviewer's concern that "it's still different from devel".