Commit 813bbc8
refactor(training): Average training loss for smoother and more representative logging (#4850)
This pull request modifies the training loop to improve the quality and
readability of the reported training loss.
## Summary of Changes
Previously, the training loss and associated metrics (e.g., `rmse_e_trn`)
reported in `lcurve.out` and the console log at each `disp_freq` step
were the instantaneous values from that single training batch. The
reported curve was therefore noisy, varying strongly with the specific
batch sampled.
This PR introduces an accumulator for the training loss. The key changes
are:
- During each training step, the loss values are accumulated.
- When a display step is reached, the accumulated values are averaged
  over the number of steps in that interval.
- The averaged loss is reported in the log and in `lcurve.out`.
- The accumulators are reset for the next interval.
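The accumulate-average-reset cycle above can be sketched as a minimal
training loop. This is an illustration only, not the actual deepmd-kit
code: `train_one_step` and `log` are hypothetical callbacks standing in
for the real trainer and logger.

```python
def run_training(steps, disp_freq, train_one_step, log):
    """Run `steps` training steps, logging interval-averaged losses.

    train_one_step: callable returning a dict of loss components,
                    e.g. {"rmse_e_trn": 0.01} (hypothetical signature).
    log:            callable receiving (step, averaged_losses).
    """
    acc = {}   # running sums of each loss component
    n_acc = 0  # number of steps accumulated in the current interval
    for step in range(1, steps + 1):
        losses = train_one_step()
        for key, value in losses.items():
            acc[key] = acc.get(key, 0.0) + value
        n_acc += 1
        if step % disp_freq == 0:
            # Average over the interval instead of reporting the
            # instantaneous single-batch value.
            averaged = {k: v / n_acc for k, v in acc.items()}
            log(step, averaged)
            acc, n_acc = {}, 0  # reset for the next interval
```

With `disp_freq = 2` and per-step losses 1, 2, 3, 4, this logs 1.5 at
step 2 and 3.5 at step 4, rather than the raw values 2 and 4.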
The validation logic remains unchanged, continuing to provide a periodic
snapshot of model performance, which is the standard and efficient
approach.
## Significance and Benefits
Reporting the averaged training loss provides a much smoother and more
representative training curve. The benefits include:
- **Reduced Noise:** Eliminates high-frequency fluctuations, making it
  easier to see the true learning trend.
- **Improved Readability:** Plotted learning curves from `lcurve.out` are
  cleaner and more interpretable.
- **Better Comparability:** Simplifies the comparison of model
  performance across different training runs, as the impact of
  single-batch anomalies is minimized.
## A Note on Formatting
Please note that due to automatic code formatters (e.g., black, isort),
some minor, purely stylistic changes may appear in the diff that are not
directly related to the core logic.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Training loss values displayed during training are now averaged over
the display interval, providing more stable and representative loss
metrics for both single-task and multi-task modes.
* Added an option to enable or disable averaging of training loss
display via a new configuration setting.
* **Improvements**
* Enhanced training loss reporting for improved monitoring and analysis
during model training.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
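The release notes above mention a new configuration setting that toggles
the averaged display, but its actual name is not stated in this summary.
The fragment below is purely illustrative: `avg_training_loss` is a
hypothetical key, shown alongside the existing `disp_freq` option, to
indicate where such a toggle would live in a training configuration.

```python
# Hypothetical training-config fragment; the real option name introduced
# by this PR is not given in the summary above.
training_config = {
    "disp_freq": 100,           # display interval (existing option)
    "avg_training_loss": True,  # hypothetical name for the new toggle
}
```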
---------
Signed-off-by: LI TIANCHENG <137472077+OutisLi@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jinzhe Zeng <njzjz@qq.com>
1 parent 1dc1248 commit 813bbc8
2 files changed: +121 −25 lines changed