Commit 065259d

Add gradient-divergence debug toolkit
Branch-local tooling to diagnose the MobileNet step-1..3 loss divergence (open since 45694c4, which fixed the L1 OOB but not the numerical drift).

debug/gen_pytorch_reference.py
  Rebuilds the Onnx4Deeploy MobileNetV1 in PyTorch, loads init weights + mb0 sample from inputs.npz, runs forward + cross_entropy + backward, dumps per-parameter grads to debug/ref_out/ref_grads_step0.{npz,json}. Sanity: PyTorch step-0 loss matches the vendored reference (0.798021) to ~6 decimal places, so the ref model is aligned.

debug/inject_probes.py
  Adds an idempotent, removable probe block to DeeployTest/Platforms/Siracusa/src/deeploytraintest.c that (after step 0 backward, before optimizer) walks DeeployNetwork_inputs[85..167], the 83 gradient accumulator buffers, and prints one greppable [PROBE i=N len=M first=<%08x,...> last=<...> sum=%.9e sq=%.9e] line per parameter. Float bit patterns are emitted as hex for lossless cross-check.

debug/diff_grads.py
  Parses [PROBE ...] lines from the gvsoc log, pairs by index with the PyTorch reference (json), prints a PASS/BAD table per parameter. Exit 1 if any BAD.

debug/run_probe.sh
  Orchestrator: ref -> regen -> inject -> build -> gvsoc run -> diff. Always strips the probe on exit so the harness stays clean.

debug/README.md
  Usage + diagnosis walkthrough.

debug/.gitignore
  Excludes ref_out/ (generated artefacts).

Everything here is meant to be dropped when the numerical fix lands: this branch (mlperftiny_loop_debug_grad) should not merge back. Cherry-pick just the fix commit onto mlperftiny_loop and delete this branch.
1 parent 45694c4 commit 065259d

6 files changed

Lines changed: 626 additions & 0 deletions

debug/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
ref_out/

debug/README.md

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# MobileNet gradient-divergence debug toolkit

Branch: `mlperftiny_loop_debug_grad` (branched from `mlperftiny_loop` @ 45694c47).

## Goal

`mlperftiny_loop` head runs MobileNetV1 through 4 gvsoc training steps
**without crashing** (OOB fix in 45694c47), but the losses at steps 1–3 diverge
from the reference by roughly 1.7–2.4% (TOL=1%):

| step | computed | ref      | diff     |
|------|----------|----------|----------|
| 0    | 0.798015 | 0.798021 | 5e-6 ✓   |
| 1    | 0.771037 | 0.753528 | 0.017 ✗  |
| 2    | 0.655666 | 0.666991 | 0.011 ✗  |
| 3    | 0.640864 | 0.625697 | 0.015 ✗  |

The step 0 forward pass matches the reference to within ~5e-6, so the divergence
is introduced during **step 0 backward** (or the optimizer step). Since ResNet8
passes bit-exact on the same branch, the bug is almost certainly in a
DW-conv-specific backward path.

## Approach: one step at a time

Compare **per-parameter gradients** after step 0 backward:
- PyTorch reference (this tooling)
- gvsoc-simulated C kernel (instrumented with the probe block)

If the N-th parameter's gradient diverges, the bug is in the backward op that
writes to that parameter's grad accumulator. Walk top-down (classifier →
last block → first block) and stop at the first mismatch.

## Files

| file | role |
|------|------|
| `gen_pytorch_reference.py` | Rebuilds PyTorch MobileNetV1 (from Onnx4Deeploy), loads initial weights + mb0 input/label from `inputs.npz`, runs forward + `cross_entropy` + `backward()`, dumps per-param grads. |
| `inject_probes.py` | Adds a probe block to `deeploytraintest.c` that (after step 0 backward, before optimizer) prints one `[PROBE i=N len=M first=... last=... sum=... sq=...]` line per gradient buffer. Idempotent and removable. |
| `diff_grads.py` | Parses `[PROBE ...]` lines from the gvsoc log, pairs them with the PyTorch ref by index, prints BAD/PASS per parameter. |
| `run_probe.sh` | End-to-end: ref → regen → inject → build → run → diff. Always strips the probe block on exit. |

## Usage

```bash
cd /home/agent/work/Deeploy-mlperftiny
bash debug/run_probe.sh
```

Output lives in:
- `debug/ref_out/ref_grads_step0.{npz,json}`: PyTorch reference
- `/tmp/gvsoc_probe.log`: full gvsoc log (greppable)
- stdout: per-param PASS/BAD table

### Iterating on the probe

The probe block is defined inline in `inject_probes.py::PROBE_BLOCK`. Edit
that string, then re-run `run_probe.sh`. The script calls
`inject_probes.py --remove` on exit, so the next run starts from a clean harness.

### Iterating on the reference

If you suspect the Onnx4Deeploy PyTorch MobileNetV1 differs from the exporter's
actual training logic (SGD momentum, BN running stats, etc.), compare
`ref_grads_step0.json['step0_loss_pytorch']` against `['step0_loss_vendored']`.
They should match to ~6 decimal places. They do today (0.798021 both), so
the ref model is aligned.
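
The loss cross-check above can be scripted. The following is a minimal sketch, not part of the toolkit: `losses_aligned` is a hypothetical helper, and the `1e-6` tolerance is an assumption chosen to reflect "~6 decimal places". Only the two JSON keys come from `gen_pytorch_reference.py`.

```python
import json
import math


def losses_aligned(summary: dict, abs_tol: float = 1e-6) -> bool:
    """Return True when the PyTorch and vendored step-0 losses agree.

    `summary` is the parsed content of ref_grads_step0.json; the
    tolerance is an illustrative assumption, not a project setting.
    """
    return math.isclose(
        summary["step0_loss_pytorch"],
        summary["step0_loss_vendored"],
        rel_tol=0.0,
        abs_tol=abs_tol,
    )


# In practice the dict would be read from debug/ref_out/ref_grads_step0.json;
# an inline JSON string keeps this example self-contained.
example = json.loads(
    '{"step0_loss_pytorch": 0.798021, "step0_loss_vendored": 0.798021}'
)
print(losses_aligned(example))
```

If this check fails, the reference model itself is suspect, and the per-parameter gradient diff should not be trusted until the two losses agree again.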

### Narrowing the diagnosis

The probe dumps ALL 83 params. The diff table makes the first BAD row the
main clue:

- **BAD at i=82 (classifier_bias) and upward**: SCE/Gemm/pool backward issue.
- **PASS at i≥82, BAD at a DW-related index**: DW backward numeric bug.
- **All DW PASS, BAD at a PW index**: regular ConvGrad regression (would
  also break ResNet8, so unlikely).
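
The top-down walk can be expressed in a few lines. This is a sketch under stated assumptions: `classify_index` and `first_bad_top_down` are hypothetical helpers, the index layout (3 stem params, 13 blocks of 6, 2 classifier params) follows the parameter ordering documented in this README, and treating the BN that follows each DW conv as "DW-related" is an assumption.

```python
from typing import Dict, Optional, Tuple

NUM_PARAMS = 83  # 3 stem + 13 blocks * 6 + 2 classifier


def classify_index(i: int) -> str:
    """Map a probe index to the parameter family it belongs to."""
    if not 0 <= i < NUM_PARAMS:
        raise ValueError(f"index out of range: {i}")
    if i <= 2:
        return "stem"
    if i >= 81:
        return "classifier"
    offset = (i - 3) % 6  # position inside the 6-parameter block
    # Offsets 0..2: DW conv weight and its BN gamma/beta.
    # Offsets 3..5: PW conv weight and (assumed) its BN gamma/beta.
    return "dw" if offset <= 2 else "pw"


def first_bad_top_down(verdicts: Dict[int, str]) -> Optional[Tuple[int, str]]:
    """Walk from the classifier (i=82) towards the stem.

    Returns (index, family) of the first non-PASS entry, or None when
    every parameter passes. A missing index counts as a failure.
    """
    for i in range(NUM_PARAMS - 1, -1, -1):
        if verdicts.get(i, "NO_PROBE") != "PASS":
            return i, classify_index(i)
    return None


# Synthetic verdicts: everything passes except two DW weights.
verdicts = {i: "PASS" for i in range(NUM_PARAMS)}
verdicts[3] = "BAD"
verdicts[9] = "BAD"
print(first_bad_top_down(verdicts))
```

With these synthetic verdicts the walk stops at index 9, the DW weight of block 1, because it is the first failure encountered when moving down from the classifier.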

## Parameter ordering

Matches PyTorch `model.named_parameters()` = ONNX `graph.input` order:

```
 0  stem_0_weight
 1  stem_1_weight  (BN stem gamma)
 2  stem_1_bias    (BN stem beta)
 3  blocks_0_dw_weight
 4  blocks_0_bn_dw_weight
 5  blocks_0_bn_dw_bias
 6  blocks_0_pw_weight
... (6 params per block × 13 blocks = 78) ...
81  classifier_weight
82  classifier_bias
```
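
For quick index lookups, the ordering above can be regenerated programmatically. This is a sketch only: the two `bn_pw` names are assumptions (the listing elides them), and the base offset 85 is taken from the commit message, which states that the gradient accumulators occupy `DeeployNetwork_inputs[85..167]`.

```python
GRAD_BUFFER_BASE = 85  # first gradient accumulator, per the commit message


def param_names(num_blocks: int = 13) -> list:
    """Rebuild the 83-entry parameter name list in probe-index order."""
    names = ["stem_0_weight", "stem_1_weight", "stem_1_bias"]
    for b in range(num_blocks):
        names += [
            f"blocks_{b}_dw_weight",
            f"blocks_{b}_bn_dw_weight",
            f"blocks_{b}_bn_dw_bias",
            f"blocks_{b}_pw_weight",
            f"blocks_{b}_bn_pw_weight",  # assumed name, elided in the listing
            f"blocks_{b}_bn_pw_bias",    # assumed name, elided in the listing
        ]
    names += ["classifier_weight", "classifier_bias"]
    return names


def grad_buffer_index(i: int) -> int:
    """Translate probe index i into its DeeployNetwork_inputs slot."""
    return GRAD_BUFFER_BASE + i


names = param_names()
print(len(names), names[6], names[81], grad_buffer_index(82))
```

The authoritative names remain the keys of `ref_grads_step0.json`, which come straight from `model.named_parameters()`.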

## When the bug is fixed

1. Apply the fix commit on top of this branch.
2. Rerun `bash debug/run_probe.sh` and expect all PASS.
3. Cherry-pick just the fix commit onto `mlperftiny_loop`:
   ```bash
   git checkout mlperftiny_loop
   git cherry-pick <fix-sha>
   ```
4. This branch (`mlperftiny_loop_debug_grad`) can be archived or deleted;
   the debug tooling is not meant to ship.

debug/diff_grads.py

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
#!/usr/bin/env python3
"""Compare gvsoc-emitted [PROBE ...] lines against the PyTorch reference.

Pairs each `[PROBE i=N ...]` line with the N-th PyTorch parameter (same
ordering as `model.named_parameters()`, which matches ONNX graph input order)
and prints a compact side-by-side diff highlighting divergences.

A gradient entry is flagged "BAD" when:
  * sum or sq_sum differs from reference by > --rtol (default 1e-3)
  * OR any of the 16 sampled values (8 first + 8 last, decoded from their
    hex bit patterns) differs from the reference beyond --rtol

Usage:
    python debug/diff_grads.py \\
        --log /tmp/gvsoc_mobilenet_probe.log \\
        --ref debug/ref_out/ref_grads_step0.json

Exit code: 0 if all PASS, 1 if any BAD, 2 if the log has no probe lines.
"""
import argparse
import json
import math
import re
import struct
import sys
from pathlib import Path


_PROBE_RE = re.compile(
    r"\[PROBE i=(?P<i>\d+) len=(?P<len>\d+) "
    r"first=(?P<first>[0-9a-fA-F,]+) "
    r"last=(?P<last>[0-9a-fA-F,]+) "
    r"sum=(?P<sum>[0-9.\-+eE]+) sq=(?P<sq>[0-9.\-+eE]+)\]"
)


def hex_to_float(h: str) -> float:
    # Reinterpret a 32-bit hex bit pattern as an IEEE-754 float32.
    return struct.unpack("<f", struct.pack("<I", int(h, 16)))[0]


def parse_probe_log(path: Path) -> dict:
    # Maps probe index -> parsed fields of the matching [PROBE ...] line.
    out = {}
    for line in path.read_text(errors="ignore").splitlines():
        m = _PROBE_RE.search(line)
        if not m:
            continue
        idx = int(m["i"])
        out[idx] = {
            "len": int(m["len"]),
            "first": [hex_to_float(h) for h in m["first"].split(",")],
            "last": [hex_to_float(h) for h in m["last"].split(",")],
            "sum": float(m["sum"]),
            "sq_sum": float(m["sq"]),
        }
    return out


def param_names_ordered(ref: dict) -> list:
    # json preserves insertion order
    return list(ref["params"].keys())


def compare(
    probes: dict, ref: dict, rtol: float = 1e-3
) -> tuple[int, int]:
    names = param_names_ordered(ref)
    pass_count = 0
    bad_count = 0
    print(f"{'idx':>3} {'name':<48} {'len':>8} {'sum ref':>14} {'sum sim':>14} {'diff %':>8} verdict")
    print("-" * 110)
    for idx, name in enumerate(names):
        if idx not in probes:
            print(f"{idx:>3} {name:<48} {'—':>8} {'—':>14} {'—':>14} {'—':>8} NO_PROBE")
            bad_count += 1
            continue
        pr = probes[idx]
        rf = ref["params"][name]
        # length sanity
        if pr["len"] != rf["len"]:
            print(
                f"{idx:>3} {name:<48} {pr['len']:>8} "
                f"{'LENMISMATCH':>14} ref.len={rf['len']}"
            )
            bad_count += 1
            continue
        # relative error of sum and sq_sum (denominator clamped to avoid /0)
        rel = abs(pr["sum"] - rf["sum"]) / max(abs(rf["sum"]), 1e-12)
        rel_sq = abs(pr["sq_sum"] - rf["sq_sum"]) / max(abs(rf["sq_sum"]), 1e-12)
        # also compare the sampled first/last values for element-level info
        hex_diff = sum(
            1 for a, b in zip(pr["first"], rf["first"])
            if not math.isclose(a, b, rel_tol=rtol, abs_tol=1e-7)
        ) + sum(
            1 for a, b in zip(pr["last"], rf["last"])
            if not math.isclose(a, b, rel_tol=rtol, abs_tol=1e-7)
        )
        ok = rel < rtol and rel_sq < rtol and hex_diff == 0
        verdict = "PASS" if ok else "BAD"
        if verdict == "PASS":
            pass_count += 1
        else:
            bad_count += 1
        print(
            f"{idx:>3} {name:<48} {pr['len']:>8} "
            f"{rf['sum']:>14.6e} {pr['sum']:>14.6e} {rel*100:>7.2f}% {verdict}"
            + (f" sq_diff={rel_sq*100:.2f}%" if rel_sq >= rtol else "")
            + (f" hex_mismatches={hex_diff}" if hex_diff else "")
        )
    print("-" * 110)
    print(f"pass={pass_count} bad={bad_count}")
    return pass_count, bad_count


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--log", required=True, type=Path)
    ap.add_argument("--ref", required=True, type=Path)
    ap.add_argument("--rtol", type=float, default=1e-3)
    args = ap.parse_args()

    probes = parse_probe_log(args.log)
    if not probes:
        print(f"ERROR: no [PROBE ...] lines found in {args.log}", file=sys.stderr)
        sys.exit(2)
    ref = json.loads(args.ref.read_text())

    _, bad = compare(probes, ref, rtol=args.rtol)
    sys.exit(1 if bad else 0)


if __name__ == "__main__":
    main()
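
To see what the parser in `diff_grads.py` extracts, it helps to run the same regex over one line by hand. The sketch below repeats the pattern and `hex_to_float` so that it is self-contained; the probe line is synthetic (invented values, only two samples per side), not real gvsoc output.

```python
import re
import struct

# Same pattern as _PROBE_RE in diff_grads.py, repeated for self-containment.
PROBE_RE = re.compile(
    r"\[PROBE i=(?P<i>\d+) len=(?P<len>\d+) "
    r"first=(?P<first>[0-9a-fA-F,]+) "
    r"last=(?P<last>[0-9a-fA-F,]+) "
    r"sum=(?P<sum>[0-9.\-+eE]+) sq=(?P<sq>[0-9.\-+eE]+)\]"
)


def hex_to_float(h: str) -> float:
    """Reinterpret a 32-bit hex bit pattern as an IEEE-754 float32."""
    return struct.unpack("<f", struct.pack("<I", int(h, 16)))[0]


# Synthetic example line; real lines carry 8 samples per side.
line = (
    "[PROBE i=3 len=72 first=3f800000,c0000000 "
    "last=3f000000,00000000 sum=1.250000000e+00 sq=5.250000000e+00]"
)

m = PROBE_RE.search(line)
first = [hex_to_float(h) for h in m["first"].split(",")]
last = [hex_to_float(h) for h in m["last"].split(",")]
print(int(m["i"]), int(m["len"]), first, last, float(m["sum"]), float(m["sq"]))
```

Because the samples travel as bit patterns, the decoded values are exactly the float32 values the kernel held, with no decimal rounding in between.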

debug/gen_pytorch_reference.py

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
#!/usr/bin/env python3
"""Generate PyTorch reference gradients for MobileNetV1 step 0.

Loads the initial weights and the first mini-batch input from `inputs.npz`,
runs forward + CE-loss + backward in PyTorch on the Onnx4Deeploy MobileNetV1
arch, and dumps per-parameter gradient tensors + stats for side-by-side
comparison with gvsoc-simulated gradients.

Outputs:
    ref_grads_step0.npz  - full grad tensors, keyed by PyTorch param name
    ref_grads_step0.json - per-param {first[8], last[8], sum, sq_sum, max_abs, len}
                           (easy to grep + cross-check with [PROBE ...] lines)

Usage:
    python debug/gen_pytorch_reference.py \\
        --test-dir DeeployTest/Tests/Models/Training/MobileNetV1/mobilenetv1_train \\
        --out-dir debug/ref_out
"""
import argparse
import json
import sys
from pathlib import Path

import numpy as np
import torch
import torch.nn.functional as F


def load_mobilenet_model():
    sys.path.insert(0, "/home/agent/work/Onnx4Deeploy")
    from onnx4deeploy.models.pytorch_models.mobilenet.mobilenetv1 import mobilenet_v1
    return mobilenet_v1(num_classes=2, width_mult=0.25, input_channels=3)


def load_initial_weights(model: torch.nn.Module, inputs_npz_path: Path) -> None:
    """inputs.npz layout:
    arr_0000 = input image (mb0), arr_0001 = label (mb0),
    arr_0002 .. arr_00(N+1) = trainable params in ONNX graph.input order.
    The PyTorch model.named_parameters() order matches the ONNX order
    (verified by shape: stem first, then blocks in order, then classifier).
    """
    data = np.load(inputs_npz_path)
    param_list = list(model.named_parameters())
    for i, (name, p) in enumerate(param_list):
        # Params start at arr_0002; arr_0000/arr_0001 hold the input and label.
        arr = data[f"arr_{i + 2:04d}"]
        if tuple(arr.shape) != tuple(p.shape):
            raise RuntimeError(
                f"Shape mismatch loading {name}: npz has {arr.shape}, "
                f"PyTorch expects {tuple(p.shape)}"
            )
        p.data = torch.from_numpy(arr.copy()).float()
    # sanity: total params
    total = sum(p.numel() for _, p in param_list)
    print(f"[ref] loaded {len(param_list)} params, total elements={total}")


def eval_bn_or_train_bn(model: torch.nn.Module, mode: str) -> None:
    """MLPerfTiny training uses train-mode BN (batch stats).
    The Deeploy-generated graph uses BatchNormInternal which is train-mode BN.
    Leave that as train() by default; eval() mode is optional for probing."""
    if mode == "train":
        model.train()
    elif mode == "eval":
        model.eval()
    else:
        raise ValueError(mode)
69+
def compute_step0_gradients(model: torch.nn.Module, inputs_npz_path: Path) -> dict:
70+
data = np.load(inputs_npz_path)
71+
x = torch.from_numpy(data["arr_0000"].copy()).float()
72+
y = torch.from_numpy(data["arr_0001"].copy()).long()
73+
74+
model.zero_grad(set_to_none=True)
75+
logits = model(x)
76+
loss = F.cross_entropy(logits, y, reduction="mean")
77+
loss.backward()
78+
print(f"[ref] step 0 loss = {loss.item():.6f}")
79+
80+
grads = {}
81+
for name, p in model.named_parameters():
82+
if p.grad is None:
83+
print(f"[ref] WARN: {name} has no grad")
84+
continue
85+
grads[name] = p.grad.detach().cpu().numpy().astype(np.float32)
86+
return grads, loss.item()
87+
88+
89+
def summarise(grad_tensors: dict) -> dict:
90+
out = {}
91+
for name, g in grad_tensors.items():
92+
flat = g.reshape(-1)
93+
out[name] = {
94+
"len": int(flat.size),
95+
"first": [float(v) for v in flat[:8].tolist()],
96+
"last": [float(v) for v in flat[-8:].tolist()],
97+
"sum": float(flat.sum()),
98+
"sq_sum": float((flat * flat).sum()),
99+
"max_abs": float(np.abs(flat).max()),
100+
}
101+
return out
102+
103+
104+
def main():
105+
ap = argparse.ArgumentParser()
106+
ap.add_argument(
107+
"--test-dir",
108+
default="DeeployTest/Tests/Models/Training/MobileNetV1/mobilenetv1_train",
109+
)
110+
ap.add_argument("--out-dir", default="debug/ref_out")
111+
ap.add_argument("--bn-mode", default="train", choices=["train", "eval"])
112+
args = ap.parse_args()
113+
114+
test_dir = Path(args.test_dir)
115+
out_dir = Path(args.out_dir)
116+
out_dir.mkdir(parents=True, exist_ok=True)
117+
118+
inputs_npz = test_dir / "inputs.npz"
119+
outputs_npz = test_dir / "outputs.npz"
120+
if not inputs_npz.exists():
121+
raise SystemExit(f"missing {inputs_npz}")
122+
123+
# Reference losses from outputs.npz (for cross-check of forward)
124+
ref_losses = np.load(outputs_npz)["loss"]
125+
print(f"[ref] vendored reference losses: {ref_losses.tolist()}")
126+
127+
model = load_mobilenet_model()
128+
load_initial_weights(model, inputs_npz)
129+
eval_bn_or_train_bn(model, args.bn_mode)
130+
131+
grads, step0_loss = compute_step0_gradients(model, inputs_npz)
132+
133+
# Full tensors
134+
npz_out = out_dir / "ref_grads_step0.npz"
135+
np.savez(npz_out, **grads)
136+
print(f"[ref] wrote full grads: {npz_out} ({len(grads)} tensors)")
137+
138+
# Compact summaries
139+
summary = {
140+
"step0_loss_pytorch": step0_loss,
141+
"step0_loss_vendored": float(ref_losses[0]),
142+
"bn_mode": args.bn_mode,
143+
"params": summarise(grads),
144+
}
145+
json_out = out_dir / "ref_grads_step0.json"
146+
json_out.write_text(json.dumps(summary, indent=2))
147+
print(f"[ref] wrote summary: {json_out}")
148+
149+
150+
if __name__ == "__main__":
151+
main()
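
For eyeballing a single parameter, a reference gradient can also be rendered in the same textual shape as a probe line, so that the two can be compared with `grep` or `diff`. This is a sketch, not part of the toolkit: `float_to_hex` and `format_probe_line` are hypothetical helpers, and the sums here accumulate in Python double precision, which may differ in the last digits from whatever accumulation the C probe uses.

```python
import struct


def float_to_hex(value: float) -> str:
    """Return the IEEE-754 float32 bit pattern of `value` as 8 hex digits."""
    return f"{struct.unpack('<I', struct.pack('<f', value))[0]:08x}"


def format_probe_line(index: int, values: list, n: int = 8) -> str:
    """Render a flat list of gradient values in [PROBE ...] form.

    Only the hex samples are lossless; sum and sq are accumulated in
    double precision here (an assumption about what is convenient, not
    a statement about the C-side probe).
    """
    first = ",".join(float_to_hex(v) for v in values[:n])
    last = ",".join(float_to_hex(v) for v in values[-n:])
    total = sum(values)
    sq = sum(v * v for v in values)
    return (
        f"[PROBE i={index} len={len(values)} first={first} last={last} "
        f"sum={total:.9e} sq={sq:.9e}]"
    )


line = format_probe_line(0, [1.0, -2.0, 0.5])
print(line)
```

In a real session the values would come from `ref_grads_step0.npz`, for example via `grads[name].reshape(-1).tolist()`.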
