Commit f28e418

debug: revert PW entry probe (diagnostic complete)
Finding logged: at kernel entry, blocks.0.pw X[0], dY[0], X_sum, X_sq, and dY_sq all match PyTorch autograd to within FP32 rounding noise (X_sum: sim 7229.462 vs ref 7229.449 = 0.00018% diff; X_sq: sim 8970.664 vs ref 8970.658; dY_sq summed across 4 tiles: sim 588.2 vs ref 588.3 = 0.02% diff). So the kernel receives correct input but computes a wrong dW[0] (1.77e-3 vs ref 2.95e-3, 40% low).

Additional test: the standalone ConvGradW_PW_block_0_Cout5 kernel test (C_in=8, C_out=5, HW=48×48, matching the integrated tile shape) passes, with 2/40 values off only at FP32 rounding noise (1e-6 relative). So the kernel itself works correctly with C_out=5 inputs.

Conclusion: the bug is either in mm_add scheduling under concurrent multi-kernel execution, or a subtle L1 buffer aliasing issue that occurs only in the integrated TrainingNetwork tile schedule. Out of scope for this session; tooling left in place for follow-up.
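The checksum comparison described in the message can be sketched roughly as follows. This is a minimal host-side illustration, not the tooling used in the session: the names `tensor_checksum`, `rel_diff`, and `matches_ref` are hypothetical, and accumulating in double is an assumption made here to keep the checksum's own rounding error small.

```c
#include <assert.h>
#include <stddef.h>

/* Checksums of one tensor: plain sum and sum of squares
 * (the X_sum / X_sq style quantities from the commit message). */
typedef struct {
  double sum;
  double sum_sq;
} tensor_checksum_t;

/* Hypothetical helper: accumulate in double so the checksum itself
 * contributes little rounding error. */
static tensor_checksum_t tensor_checksum(const float *data, size_t n) {
  tensor_checksum_t c = {0.0, 0.0};
  for (size_t i = 0; i < n; ++i) {
    double v = (double)data[i];
    c.sum += v;
    c.sum_sq += v * v;
  }
  return c;
}

/* Relative difference |sim - ref| / |ref|; ref must be nonzero. */
static double rel_diff(double sim, double ref) {
  assert(ref != 0.0);
  double d = sim - ref;
  double r = ref;
  if (d < 0.0) d = -d;
  if (r < 0.0) r = -r;
  return d / r;
}

/* Returns 1 if sim agrees with ref within the relative tolerance tol. */
static int matches_ref(double sim, double ref, double tol) {
  return rel_diff(sim, ref) <= tol;
}
```

With the figures quoted above, `rel_diff(7229.462, 7229.449)` is about 1.8e-6 (the 0.00018% for X_sum), while `rel_diff(1.77e-3, 2.95e-3)` is 0.4, which is why the inputs count as matching and dW[0] does not. The tolerance that separates "noise" from "wrong" is a judgment call; the commit does not state one.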
1 parent d4596f8 commit f28e418

1 file changed

Lines changed: 0 additions & 6 deletions

TargetLibraries/PULPOpen/src/ConvGrad.c
@@ -732,12 +732,6 @@ void PULP_PWConvGradW2d_fp32_fp32_fp32_CHW(
     uint32_t C_out, const float *__restrict__ pInput, uint32_t H_in,
     uint32_t W_in, uint32_t C_in, float *__restrict__ pGradWeight) {
 
-  if (pi_core_id() == 0) {
-    static int __pw_entry = 0;
-    printf("[PWGRADW_ENTRY call=%d C_in=%u C_out=%u H=%u W=%u X[0]=%.9e dY[0]=%.9e]\r\n",
-           __pw_entry++, C_in, C_out, H_in, W_in, pInput[0], pGradOut[0]);
-  }
-
   struct blob input_blob = {0};
   struct blob output_blob = {0};
   struct blob coeff_blob = {0};
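
One possible alternative to adding and then reverting a probe like the one removed here is to keep it behind a compile-time switch. The sketch below is an assumption for illustration, not code from the repository: `PW_ENTRY_PROBE`, `PW_ENTRY_LOG`, and `pw_entry_format` are hypothetical names, and the `pi_core_id() == 0` guard from the removed code is left out so the example builds on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical build flag (not in the repository): define as 1 to enable. */
#ifndef PW_ENTRY_PROBE
#define PW_ENTRY_PROBE 0
#endif

/* Formats one probe line in the same layout as the removed printf.
 * Returns the snprintf result (number of characters the line needs). */
static int pw_entry_format(char *buf, size_t len, int call, uint32_t C_in,
                           uint32_t C_out, uint32_t H_in, uint32_t W_in,
                           float x0, float dy0) {
  assert(buf != NULL && len > 0);
  return snprintf(buf, len,
                  "[PWGRADW_ENTRY call=%d C_in=%u C_out=%u H=%u W=%u "
                  "X[0]=%.9e dY[0]=%.9e]\r\n",
                  call, (unsigned)C_in, (unsigned)C_out, (unsigned)H_in,
                  (unsigned)W_in, (double)x0, (double)dy0);
}

#if PW_ENTRY_PROBE
/* Enabled: format the line and write it to stdout. */
#define PW_ENTRY_LOG(...)                              \
  do {                                                 \
    char pw_line[160];                                 \
    pw_entry_format(pw_line, sizeof pw_line, __VA_ARGS__); \
    fputs(pw_line, stdout);                            \
  } while (0)
#else
/* Disabled: expands to nothing, so the arguments are never evaluated. */
#define PW_ENTRY_LOG(...) ((void)0)
#endif
```

On the target, the call would presumably sit inside the same `if (pi_core_id() == 0)` guard as the removed code, so that only one core prints. Because the disabled macro discards its arguments, a call counter passed as `counter++` would not advance when the probe is off.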
