debug: revert PW entry probe (diagnostic complete)

runwangdl · runwangdl · commit f28e418a960b · 2026-04-19T22:58:04.000Z
Finding logged: at kernel entry, blocks.0.pw X[0], dY[0], X_sum, X_sq
and dY_sq all match PyTorch autograd up to accumulation rounding noise,
not bit-exact (X_sum: sim 7229.462 vs ref 7229.449 = 0.00018% diff;
X_sq: sim 8970.664 vs ref 8970.658; dY_sq summed across 4 tiles:
sim 588.2 vs ref 588.3 = 0.02% noise).
So the kernel RECEIVES correct input but computes wrong dW[0] (1.77e-3
vs ref 2.95e-3, 40% off).
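
For follow-up, the sim-vs-ref comparison above can be made mechanical
with a relative-difference check. This is a minimal sketch; rel_diff,
within_tol and the tolerance value are hypothetical names chosen for
illustration and are not part of the existing tooling.

```c
#include <math.h>

/* Relative difference |sim - ref| / |ref|, returned as a fraction
 * (multiply by 100 for percent). Hypothetical helper. */
static double rel_diff(double sim, double ref) {
  return fabs(sim - ref) / fabs(ref);
}

/* Returns 1 when sim is within rel_tol of ref, 0 otherwise.
 * rel_tol is an assumed threshold, to be tuned per quantity. */
static int within_tol(double sim, double ref, double rel_tol) {
  return rel_diff(sim, ref) <= rel_tol;
}
```

With a tolerance of 1e-3, the X_sum pair from this log passes while the
dW[0] pair (1.77e-3 vs 2.95e-3) fails by a wide margin.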

Additional test: standalone ConvGradW_PW_block_0_Cout5 kernel test
(C_in=8, C_out=5, HW=48×48, matching the integrated tile shape) agrees
with the reference: 2/40 values differ, at FP32 rounding noise (1e-6
relative); the rest are bit-exact.
So the kernel itself works correctly with C_out=5 inputs.
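
For reference, the quantity under test can be sketched as a scalar
loop. This assumes a single tile, CHW layout, and a [C_out][C_in]
weight layout; it overwrites dW rather than accumulating across tiles,
so it does not model the integrated schedule. The function name and
signature are hypothetical and differ from the PULP kernel's.

```c
#include <stdint.h>

/* Scalar reference for the pointwise (1x1) weight gradient:
 *   dW[co][ci] = sum over hw of dY[co][hw] * X[ci][hw]
 * dY is [C_out][HW], X is [C_in][HW], dW is [C_out][C_in].
 * Accumulates in double to limit rounding in the reference itself. */
static void ref_pw_conv_grad_w(const float *dY, const float *X,
                               uint32_t C_out, uint32_t C_in,
                               uint32_t HW, float *dW) {
  for (uint32_t co = 0; co < C_out; co++) {
    for (uint32_t ci = 0; ci < C_in; ci++) {
      double acc = 0.0;
      for (uint32_t i = 0; i < HW; i++) {
        acc += (double)dY[co * HW + i] * (double)X[ci * HW + i];
      }
      dW[co * C_in + ci] = (float)acc; /* overwrite, no tile accumulation */
    }
  }
}
```

Running this on the tensors captured at kernel entry would give a
per-tile expected dW to compare against the integrated result.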

Conclusion: the bug is either in mm_add scheduling under multi-kernel
concurrent execution, or a subtle L1 buffer aliasing issue that only
occurs in the integrated TrainingNetwork tile schedule. Out of scope
for this session; tooling left in place for follow-up.
diff --git a/TargetLibraries/PULPOpen/src/ConvGrad.c b/TargetLibraries/PULPOpen/src/ConvGrad.c
@@ -732,12 +732,6 @@ void PULP_PWConvGradW2d_fp32_fp32_fp32_CHW(
     uint32_t C_out, const float *__restrict__ pInput, uint32_t H_in,
     uint32_t W_in, uint32_t C_in, float *__restrict__ pGradWeight) {
 
-  if (pi_core_id() == 0) {
-    static int __pw_entry = 0;
-    printf("[PWGRADW_ENTRY call=%d C_in=%u C_out=%u H=%u W=%u X[0]=%.9e dY[0]=%.9e]\r\n",
-           __pw_entry++, C_in, C_out, H_in, W_in, pInput[0], pGradOut[0]);
-  }
-
   struct blob input_blob = {0};
   struct blob output_blob = {0};
   struct blob coeff_blob = {0};