Add gradient-divergence debug toolkit

runwangdl · runwangdl · commit 065259d3333e · 2026-04-18T21:11:15.000Z
Branch-local tooling to diagnose the MobileNet step-1..3 loss divergence (open since 45694c4, which fixed the L1 OOB but not the numerical drift).

debug/gen_pytorch_reference.py
  Rebuilds the Onnx4Deeploy MobileNetV1 in PyTorch, loads init weights + mb0 sample from inputs.npz, runs forward + cross_entropy + backward, dumps per-parameter grads to debug/ref_out/ref_grads_step0.{npz,json}. Sanity: PyTorch step-0 loss matches the vendored reference (0.798021) to ~6 decimal places, so the ref model is aligned.

debug/inject_probes.py
  Adds an idempotent, removable probe block to DeeployTest/Platforms/Siracusa/src/deeploytraintest.c that (after step 0 backward, before optimizer) walks DeeployNetwork_inputs[85..167] (the 83 gradient accumulator buffers) and prints one greppable [PROBE i=N len=M first=<%08x,...> last=<...> sum=%.9e sq=%.9e] line per parameter. Float bit patterns are emitted as hex for lossless cross-check.

debug/diff_grads.py
  Parses [PROBE ...] lines from the gvsoc log, pairs by index with the PyTorch reference (json), prints a PASS/BAD table per parameter. Exit 1 if any BAD.

debug/run_probe.sh
  Orchestrator: ref -> regen -> inject -> build -> gvsoc run -> diff. Always strips the probe on exit so the harness stays clean.

debug/README.md
  Usage + diagnosis walkthrough.

debug/.gitignore
  Excludes ref_out/ (generated artefacts).

Everything here is meant to be dropped when the numerical fix lands; this branch (mlperftiny_loop_debug_grad) should not merge back. Cherry-pick just the fix commit onto mlperftiny_loop and delete this branch.
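
Reviewer note: the hex encoding in the [PROBE ...] line is what makes the cross-check lossless. The snippet below is a minimal sketch of that round trip, not part of the toolkit. The regex mirrors the one in debug/diff_grads.py; the `float_to_hex` helper and the sample line are hypothetical and its values are invented for illustration.

```python
import re
import struct

# Same field layout as the [PROBE ...] line described above.
PROBE_RE = re.compile(
    r"\[PROBE i=(?P<i>\d+) len=(?P<len>\d+) "
    r"first=(?P<first>[0-9a-fA-F,]+) "
    r"last=(?P<last>[0-9a-fA-F,]+) "
    r"sum=(?P<sum>[0-9.\-+eE]+) sq=(?P<sq>[0-9.\-+eE]+)\]"
)


def float_to_hex(x: float) -> str:
    """Bit pattern of x as a float32, printed as 8 hex digits (like C's %08x)."""
    return f"{struct.unpack('<I', struct.pack('<f', x))[0]:08x}"


def hex_to_float(h: str) -> float:
    """Inverse of float_to_hex: reinterpret 32 bits as an IEEE-754 float32."""
    return struct.unpack("<f", struct.pack("<I", int(h, 16)))[0]


# Hypothetical sample line (invented values, not real gvsoc output).
sample = (
    "[PROBE i=82 len=2 first=3f800000,bf000000 "
    "last=3f800000,bf000000 sum=5.000000000e-01 sq=1.250000000e+00]"
)

m = PROBE_RE.search(sample)
first = [hex_to_float(h) for h in m["first"].split(",")]
# Re-encoding the decoded values must reproduce the exact bit patterns.
roundtrip = [float_to_hex(v) for v in first]
print(int(m["i"]), first, roundtrip, float(m["sum"]))
```

Decimal `%.9e` output is enough for the sum and squared-sum statistics, but only the hex fields let a simulated value be compared against the reference bit for bit.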
diff --git a/debug/.gitignore b/debug/.gitignore
@@ -0,0 +1 @@
+ref_out/
diff --git a/debug/README.md b/debug/README.md
@@ -0,0 +1,104 @@
+# MobileNet gradient-divergence debug toolkit
+
+Branch: `mlperftiny_loop_debug_grad` (branched from `mlperftiny_loop` @ 45694c47).
+
+## Goal
+
+`mlperftiny_loop` head runs MobileNetV1 through 4 gvsoc training steps
+**without crashing** (OOB fix in 45694c47), but the losses at steps 1–3 diverge
+from the reference by 1.7–2.4% (TOL=1%):
+
+| step | computed | ref      | abs diff |
+|------|----------|----------|----------|
+| 0    | 0.798015 | 0.798021 | 6e-6 ✓   |
+| 1    | 0.771037 | 0.753528 | 0.018 ✗  |
+| 2    | 0.655666 | 0.666991 | 0.011 ✗  |
+| 3    | 0.640864 | 0.625697 | 0.015 ✗  |
+
+The step 0 loss matches the reference to within 1e-5 (close, though not
+bit-exact), so the divergence is introduced during **step 0 backward** (or the
+optimizer step). Since ResNet8 passes bit-exact on the same branch, the bug is almost certainly in a DW-conv-specific backward path.
+
+## Approach — one step at a time
+
+Compare **per-parameter gradients** after step 0 backward:
+  - PyTorch reference (this tooling)
+  - gvsoc-simulated C kernel (instrumented with the probe block)
+
+If the N-th parameter's gradient diverges, the bug is in the backward op that
+writes to that parameter's grad accumulator. Walk top-down (classifier →
+last block → first block) and stop at the first mismatch.
+
+## Files
+
+| file | role |
+|------|------|
+| `gen_pytorch_reference.py` | Rebuilds PyTorch MobileNetV1 (from Onnx4Deeploy), loads initial weights + mb0 input/label from `inputs.npz`, runs forward + `cross_entropy` + `backward()`, dumps per-param grads. |
+| `inject_probes.py` | Adds a probe block to `deeploytraintest.c` that (after step 0 backward, before optimizer) prints one `[PROBE i=N len=M first=... last=... sum=... sq=...]` line per gradient buffer. Idempotent and removable. |
+| `diff_grads.py` | Parses `[PROBE ...]` lines from gvsoc log, pairs with PyTorch ref by index, prints BAD/PASS per parameter. |
+| `run_probe.sh` | End-to-end: ref → regen → inject → build → run → diff. Always strips the probe block on exit. |
+
+## Usage
+
+```bash
+cd /home/agent/work/Deeploy-mlperftiny
+bash debug/run_probe.sh
+```
+
+Output lives in:
+- `debug/ref_out/ref_grads_step0.{npz,json}` — PyTorch reference
+- `/tmp/gvsoc_probe.log` — full gvsoc log (greppable)
+- stdout — per-param PASS/BAD table
+
+### Iterating on the probe
+
+The probe block is defined inline in `inject_probes.py::PROBE_BLOCK`. Edit
+that string, then re-run `run_probe.sh` — it calls `inject_probes.py --remove`
+on exit, so next run starts from a clean harness.
+
+### Iterating on the reference
+
+If you suspect the Onnx4Deeploy PyTorch MobileNetV1 differs from the exporter's
+actual training logic (SGD momentum, BN running stats, etc.), compare
+`ref_grads_step0.json['step0_loss_pytorch']` against `['step0_loss_vendored']`
+— they should match to ~6 decimal places. They do today (0.798021 both), so
+the ref model is aligned.
+
+### Narrowing the diagnosis
+
+The probe dumps all 83 params. Backward runs from the classifier toward the
+stem, so the BAD row with the highest index is the main clue:
+
+- **BAD already at i=82 (classifier_bias)**: SCE/Gemm/pool backward issue.
+- **PASS at i=81..82, BAD at a DW-related index**: DW backward numeric bug.
+- **All DW PASS, BAD at a PW index**: regular ConvGrad regression (would
+  also break ResNet8, so unlikely).
+
+## Parameter ordering
+
+Matches PyTorch `model.named_parameters()` = ONNX `graph.input` order:
+
+```
+0  stem_0_weight
+1  stem_1_weight            (BN stem gamma)
+2  stem_1_bias              (BN stem beta)
+3  blocks_0_dw_weight
+4  blocks_0_bn_dw_weight
+5  blocks_0_bn_dw_bias
+6  blocks_0_pw_weight
+... (6 params per block × 13 blocks = 78) ...
+81 classifier_weight
+82 classifier_bias
+```
+
+## When the bug is fixed
+
+1. Apply the fix commit on top of this branch.
+2. Rerun `bash debug/run_probe.sh` — expect all PASS.
+3. Cherry-pick just the fix commit onto `mlperftiny_loop`:
+   ```bash
+   git checkout mlperftiny_loop
+   git cherry-pick <fix-sha>
+   ```
+4. This branch (`mlperftiny_loop_debug_grad`) can be archived or deleted —
+   the debug tooling is not meant to ship.
diff --git a/debug/diff_grads.py b/debug/diff_grads.py
@@ -0,0 +1,129 @@
+#!/usr/bin/env python3
+"""Compare gvsoc-emitted [PROBE ...] lines against the PyTorch reference.
+
+Pairs each `[PROBE i=N ...]` line with the N-th PyTorch parameter (same
+ordering as `model.named_parameters()`, which matches ONNX graph input order)
+and prints a compact side-by-side diff highlighting divergences.
+
+A gradient entry is flagged "BAD" when:
+  * sum or sq_sum differs from reference by > --rtol   (default 1e-3)
+  * OR any of the 16 sampled values (first 8 + last 8) differs beyond tolerance
+
+Usage:
+  python debug/diff_grads.py \\
+      --log  /tmp/gvsoc_mobilenet_probe.log \\
+      --ref  debug/ref_out/ref_grads_step0.json
+
+Exit code: 0 if all PASS, 1 if any BAD.
+"""
+import argparse
+import json
+import math
+import re
+import struct
+import sys
+from pathlib import Path
+
+
+_PROBE_RE = re.compile(
+    r"\[PROBE i=(?P<i>\d+) len=(?P<len>\d+) "
+    r"first=(?P<first>[0-9a-fA-F,]+) "
+    r"last=(?P<last>[0-9a-fA-F,]+) "
+    r"sum=(?P<sum>[0-9.\-+eE]+) sq=(?P<sq>[0-9.\-+eE]+)\]"
+)
+
+
+def hex_to_float(h: str) -> float:
+    return struct.unpack("<f", struct.pack("<I", int(h, 16)))[0]
+
+
+def parse_probe_log(path: Path) -> dict:
+    out = {}
+    for line in path.read_text(errors="ignore").splitlines():
+        m = _PROBE_RE.search(line)
+        if not m:
+            continue
+        idx = int(m["i"])
+        out[idx] = {
+            "len": int(m["len"]),
+            "first": [hex_to_float(h) for h in m["first"].split(",")],
+            "last": [hex_to_float(h) for h in m["last"].split(",")],
+            "sum": float(m["sum"]),
+            "sq_sum": float(m["sq"]),
+        }
+    return out
+
+
+def param_names_ordered(ref: dict) -> list:
+    # json preserves insertion order
+    return list(ref["params"].keys())
+
+
+def compare(
+    probes: dict, ref: dict, rtol: float = 1e-3
+) -> tuple[int, int]:
+    names = param_names_ordered(ref)
+    pass_count = 0
+    bad_count = 0
+    print(f"{'idx':>3} {'name':<48} {'len':>8} {'sum ref':>14} {'sum sim':>14} {'diff %':>8}  verdict")
+    print("-" * 110)
+    for idx, name in enumerate(names):
+        if idx not in probes:
+            print(f"{idx:>3} {name:<48} {'—':>8} {'—':>14} {'—':>14} {'—':>8}  NO_PROBE")
+            bad_count += 1
+            continue
+        pr = probes[idx]
+        rf = ref["params"][name]
+        # length sanity
+        if pr["len"] != rf["len"]:
+            print(
+                f"{idx:>3} {name:<48} {pr['len']:>8} "
+                f"{'LENMISMATCH':>14} ref.len={rf['len']}"
+            )
+            bad_count += 1
+            continue
+        # relative error of sum and sq_sum (denominator floored to avoid /0)
+        rel = abs(pr["sum"] - rf["sum"]) / max(abs(rf["sum"]), 1e-12)
+        rel_sq = abs(pr["sq_sum"] - rf["sq_sum"]) / max(abs(rf["sq_sum"]), 1e-12)
+        # sampled first/last values: decoded from hex, compared with a tolerance
+        hex_diff = sum(
+            1 for a, b in zip(pr["first"], rf["first"])
+            if not (math.isclose(a, b, rel_tol=rtol, abs_tol=1e-7))
+        ) + sum(
+            1 for a, b in zip(pr["last"], rf["last"])
+            if not (math.isclose(a, b, rel_tol=rtol, abs_tol=1e-7))
+        )
+        verdict = "PASS" if (max(rel, rel_sq) < rtol and hex_diff == 0) else "BAD"
+        if verdict == "PASS":
+            pass_count += 1
+        else:
+            bad_count += 1
+        print(
+            f"{idx:>3} {name:<48} {pr['len']:>8} "
+            f"{rf['sum']:>14.6e} {pr['sum']:>14.6e} {rel*100:>7.2f}%  {verdict}"
+            + (f"  hex_mismatches={hex_diff}" if hex_diff else "")
+        )
+    print("-" * 110)
+    print(f"pass={pass_count} bad={bad_count}")
+    return pass_count, bad_count
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--log", required=True, type=Path)
+    ap.add_argument("--ref", required=True, type=Path)
+    ap.add_argument("--rtol", type=float, default=1e-3)
+    args = ap.parse_args()
+
+    probes = parse_probe_log(args.log)
+    if not probes:
+        print(f"ERROR: no [PROBE ...] lines found in {args.log}", file=sys.stderr)
+        sys.exit(2)
+    ref = json.loads(args.ref.read_text())
+
+    _, bad = compare(probes, ref, rtol=args.rtol)
+    sys.exit(1 if bad else 0)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/debug/gen_pytorch_reference.py b/debug/gen_pytorch_reference.py
@@ -0,0 +1,151 @@
+#!/usr/bin/env python3
+"""Generate PyTorch reference gradients for MobileNetV1 step 0.
+
+Loads the initial weights + first mini-batch input from the vendored
+`inputs.npz`, runs forward + CE-loss + backward in PyTorch on the
+Onnx4Deeploy MobileNetV1 arch, dumps per-parameter gradient tensors + stats
+for side-by-side comparison with gvsoc-simulated gradients.
+
+Outputs:
+  ref_grads_step0.npz    — full grad tensors, keyed by PyTorch param name
+  ref_grads_step0.json   — per-param {first[8], last[8], sum, sq_sum, len}
+                           (easy to grep + cross-check with [PROBE ...] lines)
+
+Usage:
+  python debug/gen_pytorch_reference.py \\
+      --test-dir DeeployTest/Tests/Models/Training/MobileNetV1/mobilenetv1_train \\
+      --out-dir  debug/ref_out
+"""
+import argparse
+import json
+import sys
+from pathlib import Path
+
+import numpy as np
+import torch
+import torch.nn.functional as F
+
+
+def load_mobilenet_model():
+    sys.path.insert(0, "/home/agent/work/Onnx4Deeploy")
+    from onnx4deeploy.models.pytorch_models.mobilenet.mobilenetv1 import mobilenet_v1
+    return mobilenet_v1(num_classes=2, width_mult=0.25, input_channels=3)
+
+
+def load_initial_weights(model: torch.nn.Module, inputs_npz_path: Path) -> None:
+    """inputs.npz layout:
+       arr_0000 = input image (mb0), arr_0001 = label (mb0),
+       arr_0002 .. arr_00(N+1) = trainable params in ONNX graph.input order.
+       The PyTorch model.named_parameters() order matches the ONNX order
+       (verified by shape: stem first, then blocks in order, then classifier).
+    """
+    data = np.load(inputs_npz_path)
+    param_list = list(model.named_parameters())
+    for i, (name, p) in enumerate(param_list):
+        arr = data[f"arr_{i + 2:04d}"]
+        if tuple(arr.shape) != tuple(p.shape):
+            raise RuntimeError(
+                f"Shape mismatch loading {name}: npz has {arr.shape}, "
+                f"PyTorch expects {tuple(p.shape)}"
+            )
+        p.data = torch.from_numpy(arr.copy()).float()
+    # sanity: total params
+    total = sum(p.numel() for _, p in param_list)
+    print(f"[ref] loaded {len(param_list)} params, total elements={total}")
+
+
+def eval_bn_or_train_bn(model: torch.nn.Module, mode: str) -> None:
+    """MLPerfTiny training uses train-mode BN (batch stats).
+    The Deeploy-generated graph uses BatchNormInternal which is train-mode BN.
+    Leave that as train() by default; eval() mode is optional for probing."""
+    if mode == "train":
+        model.train()
+    elif mode == "eval":
+        model.eval()
+    else:
+        raise ValueError(mode)
+
+
+def compute_step0_gradients(model: torch.nn.Module, inputs_npz_path: Path) -> tuple[dict, float]:
+    data = np.load(inputs_npz_path)
+    x = torch.from_numpy(data["arr_0000"].copy()).float()
+    y = torch.from_numpy(data["arr_0001"].copy()).long()
+
+    model.zero_grad(set_to_none=True)
+    logits = model(x)
+    loss = F.cross_entropy(logits, y, reduction="mean")
+    loss.backward()
+    print(f"[ref] step 0 loss = {loss.item():.6f}")
+
+    grads = {}
+    for name, p in model.named_parameters():
+        if p.grad is None:
+            print(f"[ref] WARN: {name} has no grad")
+            continue
+        grads[name] = p.grad.detach().cpu().numpy().astype(np.float32)
+    return grads, loss.item()
+
+
+def summarise(grad_tensors: dict) -> dict:
+    out = {}
+    for name, g in grad_tensors.items():
+        flat = g.reshape(-1)
+        out[name] = {
+            "len": int(flat.size),
+            "first": [float(v) for v in flat[:8].tolist()],
+            "last": [float(v) for v in flat[-8:].tolist()],
+            "sum": float(flat.sum()),
+            "sq_sum": float((flat * flat).sum()),
+            "max_abs": float(np.abs(flat).max()),
+        }
+    return out
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument(
+        "--test-dir",
+        default="DeeployTest/Tests/Models/Training/MobileNetV1/mobilenetv1_train",
+    )
+    ap.add_argument("--out-dir", default="debug/ref_out")
+    ap.add_argument("--bn-mode", default="train", choices=["train", "eval"])
+    args = ap.parse_args()
+
+    test_dir = Path(args.test_dir)
+    out_dir = Path(args.out_dir)
+    out_dir.mkdir(parents=True, exist_ok=True)
+
+    inputs_npz = test_dir / "inputs.npz"
+    outputs_npz = test_dir / "outputs.npz"
+    if not inputs_npz.exists():
+        raise SystemExit(f"missing {inputs_npz}")
+
+    # Reference losses from outputs.npz (for cross-check of forward)
+    ref_losses = np.load(outputs_npz)["loss"]
+    print(f"[ref] vendored reference losses: {ref_losses.tolist()}")
+
+    model = load_mobilenet_model()
+    load_initial_weights(model, inputs_npz)
+    eval_bn_or_train_bn(model, args.bn_mode)
+
+    grads, step0_loss = compute_step0_gradients(model, inputs_npz)
+
+    # Full tensors
+    npz_out = out_dir / "ref_grads_step0.npz"
+    np.savez(npz_out, **grads)
+    print(f"[ref] wrote full grads: {npz_out}  ({len(grads)} tensors)")
+
+    # Compact summaries
+    summary = {
+        "step0_loss_pytorch": step0_loss,
+        "step0_loss_vendored": float(ref_losses[0]),
+        "bn_mode": args.bn_mode,
+        "params": summarise(grads),
+    }
+    json_out = out_dir / "ref_grads_step0.json"
+    json_out.write_text(json.dumps(summary, indent=2))
+    print(f"[ref] wrote summary: {json_out}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/debug/inject_probes.py b/debug/inject_probes.py
diff --git a/debug/run_probe.sh b/debug/run_probe.sh