Commit bb205d3
docs(examples): CLAUDE.md for the NeuralNetwork.NumSharp project
Project-specific CLAUDE.md at examples/NeuralNetwork.NumSharp/.claude/ so future agents working in the example project get the right context without needing to rediscover everything from the code. Contents (~280 lines):

* Build / Run — csproj setup (Exe, net8+net10, AllowUnsafeBlocks), InternalsVisibleTo scope, where to drop real MNIST IDX files, current demo defaults (epochs=100, batch=128, Adam lr=1e-3, synthetic sigma=2.5, eval cadence min(5, epochs)).
* Directory Map — every file with a one-line purpose.
* MnistMlp fusion — the three NpyExpr trees that collapse the post-matmul element-wise chunks into single NpyIter kernels (forward ReLU bias+activation, forward linear bias-only, backward ReLU gradient mask).
* Layer/Cost/Optimizer contract — what every BaseLayer subclass must populate (Input/Output/Grads/InputGrad, Parameters["w"/"b"]).
* Sharp edges — 8 gotchas: historical np.dot strided 100x cliff (now fixed by the stride-aware GEMM), 2-index `x[i,j]` vs slice, argmax needing axis, np.allclose mutating its arguments via astype(copy:false), argmax returning Int64 not Int32, Adam's step counter needing monotonic iteration, pre-fix FC weight init, slice dtype.
* Perf characteristics — 100-epoch run numbers, fusion probe, kernel cache + delegate-slot instrumentation.
* Testing — the in-line `dotnet_run` smoke-test pattern.
* Q&A — why Accuacy/BinaryAccuacy keep the typo, why SoftmaxCrossEntropy lives in MnistMlp/ rather than Cost/, when to use NeuralNet.Train vs MlpTrainer, real-MNIST expected accuracy.
* Known limitations — no shuffling, no validation split, Adam re-allocates per step, no serialization, string-vs-enum activation inconsistency between FullyConnected and FullyConnectedFused.
1 parent 51ad43c commit bb205d3

1 file changed

Lines changed: 277 additions & 0 deletions

File tree

  • examples/NeuralNetwork.NumSharp/.claude
@@ -0,0 +1,277 @@

# NeuralNetwork.NumSharp Example Project

A small Keras-style neural-network framework built on top of NumSharp, plus an
end-to-end MNIST 2-layer MLP demo that fuses the post-matmul element-wise work
into a single NpyIter per layer via NpyExpr.

Dual purpose:
1. **Library scaffolding**: `BaseLayer`, `BaseActivation`, `BaseCost`,
   `BaseOptimizer`, `BaseMetric`, `NeuralNet` (sequential model runner).
2. **Runnable MLP demo**: `MnistMlp/Program.cs` trains a 784 → 128 ReLU → 10
   classifier on real MNIST (if IDX files present) or learnable synthetic
   data (fallback).

---

## Build / Run

```bash
cd examples/NeuralNetwork.NumSharp
dotnet build -v q --nologo "-clp:NoSummary;ErrorsOnly" -p:WarningLevel=0
dotnet run --no-build --framework net8.0   # or --framework net10.0
```

The csproj is an **Exe** (not a library) with `OutputType=Exe` and
`AllowUnsafeBlocks=true`, and it multi-targets `net8.0;net10.0`. NumSharp.Core
declares `InternalsVisibleTo("NeuralNetwork.NumSharp")` in
`src/NumSharp.Core/Assembly/Properties.cs`, so `NpyIterRef`, `NpyExpr`,
`ILKernelGenerator.InnerLoopCachedCount`, and `DelegateSlots.RegisteredCount`
are all accessible from this project.

Current demo defaults (in `MnistMlp/Program.cs`):
- `Epochs = 100`, `BatchSize = 128`
- Adam lr=1e-3
- Synthetic-data noise sigma = 2.5 (in `MnistMlp/MnistLoader.cs`)
- Test evaluation every `min(5, epochs)` epochs

Place real MNIST at `examples/NeuralNetwork.NumSharp/data/`:
- `train-images.idx3-ubyte`, `train-labels.idx1-ubyte` (60k train)
- `t10k-images.idx3-ubyte`, `t10k-labels.idx1-ubyte` (10k test)

---

## Directory Map

```
examples/NeuralNetwork.NumSharp/
├── NeuralNet.cs                  Sequential model (forward / backward / Train /
│                                 Predict). Uses BaseLayer list + BaseCost +
│                                 BaseOptimizer. Train now slices correctly.
├── Util.cs                       int counter for layer-name uniqueness.

├── Layers/
│   ├── BaseLayer.cs              Abstract: Input, Output, Parameters["w"/"b"],
│   │                             Grads[...], InputGrad. Subclasses override
│   │                             Forward/Backward.
│   ├── FullyConnected.cs         Dense layer with bias + He/Xavier init (float32).
│   │                             Composes an optional BaseActivation by name.
│   └── Activations/
│       ├── BaseActivation.cs     Get(name): resolves "relu"/"sigmoid" by name.
│       ├── ReLU.cs               (NDArray > 0) * NDArray formulation (works).
│       ├── Sigmoid.cs            1/(1+exp(-x)); Backward uses cached Output.
│       └── Softmax.cs            Numerically-stable row-wise softmax;
│                                 Backward = Output * (grad - Σ(grad*Output, axis=1, keepdims)).

├── Cost/
│   ├── BaseCost.cs               Abstract: Forward, Backward, float Epsilon.
│   ├── CategoricalCrossentropy.cs  L = -Σ(y*log(clip(p))) / batch;
│   │                             dL/dp = -y / clip(p) / batch.
│   ├── BinaryCrossEntropy.cs     mean(-y*log(clip(p)) - (1-y)*log(1-clip(p)));
│   │                             dL/dp = (p - y) / (p*(1-p)) / N.
│   └── MeanSquaredError.cs       mean((preds - labels)²); ∇ = 2*(preds-labels)/batch.

├── Metrics/
│   ├── BaseMetric.cs             Abstract: Calculate(preds, labels) → NDArray.
│   ├── Accuracy.cs               class Accuacy (typo preserved). argmax(preds,1)
│   │                             == argmax(labels,1), mean.
│   ├── BinaryAccuacy.cs          round(clip(preds, 0, 1)) == labels, mean.
│   └── MeanAbsoluteError.cs      mean(|preds - labels|).

├── Optimizers/
│   ├── BaseOptimizer.cs          Abstract. Get("sgd") / Get("adam") resolvers.
│   ├── SGD.cs                    Vanilla SGD; classical momentum; inverse-time
│   │                             LR decay.
│   └── Adam.cs                   First/second moments with proper np.zeros init.
│                                 Step counter must be monotonic across run.

├── MnistMlp/                     The runnable experiment. Files described below.

├── Open.snk                      Strong-name key shared with NumSharp.Core.
└── NeuralNetwork.NumSharp.csproj Exe, net8.0+net10.0, AllowUnsafeBlocks.
```

---

## MnistMlp — fused forward + backward

All fusion happens in `FullyConnectedFused`. The idea: every post-matmul
element-wise chunk (bias-add + ReLU, bias-add only, ReLU gradient mask)
collapses into **one NpyIter kernel**, compiled once per process and
cache-hit on every subsequent forward/backward pass.

| Stage | NpyExpr tree | Inputs → Output |
|---|---|---|
| Forward ReLU | `Max(Input(0) + Input(1), Const(0f))` | (preact, bias) → y |
| Forward linear | `Input(0) + Input(1)` | (preact, bias) → y |
| Backward ReLU | `Input(0) * Greater(Input(1), Const(0f))` | (gradOut, y) → gradPreact |
| Backward linear | — (pass-through) | gradOut → gradPreact |
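
For orientation, here is what the unfused version of the forward-ReLU row computes with plain NumSharp ops. A minimal sketch (shapes and variable names chosen here for illustration; the actual fused kernels live in `FullyConnectedFused.cs`):

```csharp
using NumSharp;

// Illustrative shapes only: batch=128, a 784 -> 128 layer, float32 throughout.
NDArray x = np.random.normal(0, 1, 128, 784).astype(NPTypeCode.Single);
NDArray w = np.random.normal(0, 1, 784, 128).astype(NPTypeCode.Single);
NDArray b = np.zeros(1, 128).astype(NPTypeCode.Single);

// Unfused path: matmul, then bias-add and ReLU as separate element-wise passes.
// (The perf probe's naive branch uses np.add + np.maximum for the same work.)
NDArray preact = np.dot(x, w) + b;
NDArray y = (preact > 0) * preact;   // the (NDArray > 0) * NDArray ReLU formulation from ReLU.cs

// Fused path: FullyConnectedFused evaluates Max(Input(0) + Input(1), Const(0f))
// over (preact, bias) in a single NpyIter pass instead.
```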

**`MnistMlp/` files:**

| File | What it does |
|---|---|
| `Program.cs` | Entry point. Loads data, builds 2-FC model, runs fusion probe, trains via MlpTrainer, reports IL-kernel cache + delegate-slot counts. |
| `MnistLoader.cs` | IDX parser (big-endian) + learnable synthetic fallback (shared class templates across train/test, sigma=2.5 noise). |
| `FullyConnectedFused.cs` | FC with bias + optional fused activation. Three NpyIter kernels (two forward, one backward), cache keys are stable strings. |
| `SoftmaxCrossEntropy.cs` | Combined loss — numerically stable softmax forward, cached softmax, (softmax-labels)/batch backward. Also ships `OneHot` helper. |
| `MlpTrainer.cs` | Explicit train loop (`NeuralNet.Train` replacement). Periodic test eval (`min(5, epochs)` cadence). Returns per-epoch loss/train_acc + list of (epoch, test_acc) pairs. |
| `FusedMlp.cs`, `NaiveMlp.cs` | Side-by-side forward implementations for the correctness probe at Program startup. |
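
The IDX files the loader parses are a big-endian header (a magic number plus one 32-bit size per dimension) followed by raw bytes. A header-reading sketch, not the actual `MnistLoader.cs` code (the helper name is made up here):

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Reads the big-endian header of a train-images.idx3-ubyte style file.
// Magic 0x00000803 means unsigned-byte data with 3 dimensions (count, rows, cols).
static (int count, int rows, int cols) ReadIdx3Header(string path)
{
    using var fs = File.OpenRead(path);
    Span<byte> header = stackalloc byte[16];
    fs.ReadExactly(header);

    int magic = BinaryPrimitives.ReadInt32BigEndian(header);        // 0x00000803
    int count = BinaryPrimitives.ReadInt32BigEndian(header[4..]);   // 60000 / 10000
    int rows  = BinaryPrimitives.ReadInt32BigEndian(header[8..]);   // 28
    int cols  = BinaryPrimitives.ReadInt32BigEndian(header[12..]);  // 28
    if (magic != 0x00000803) throw new InvalidDataException($"not an idx3-ubyte file: {path}");

    // Pixel bytes follow immediately: count * rows * cols unsigned bytes.
    return (count, rows, cols);
}
```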

---

## Layer / Cost / Optimizer contract

Every BaseLayer subclass MUST populate on Forward:
- `this.Input = x` (via `base.Forward(x)`)
- `this.Output = result`

And on Backward:
- `this.Grads[key] = ∂L/∂param` for every entry in `this.Parameters`
- `this.InputGrad = ∂L/∂x` (consumed by the previous layer)

Optimizers iterate `layer.Parameters.ToList()` and expect `layer.Grads[paramKey]`
to be populated by Backward. Param-name convention is `"w"` / `"b"`.

BaseCost contract:
- `Forward(preds, labels)` → scalar NDArray (the loss)
- `Backward(preds, labels)` → NDArray shape-matched to preds (the first
  incoming gradient for the network's output layer)

BaseMetric contract:
- `Calculate(preds, labels)` → scalar NDArray in [0, 1]
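
To make the contract concrete, a standalone sketch that mirrors those members for a bias-only layer (this is not the real `BaseLayer` API; member and method signatures are assumed from the description above):

```csharp
using System.Collections.Generic;
using NumSharp;

// Not the real BaseLayer: a self-contained sketch that mirrors the members the
// contract requires (Input / Output / Parameters / Grads / InputGrad).
public class BiasOnlySketch
{
    public NDArray Input, Output, InputGrad;
    public Dictionary<string, NDArray> Parameters = new(), Grads = new();

    public BiasOnlySketch(int units)
    {
        Parameters["b"] = np.zeros(1, units).astype(NPTypeCode.Single);
    }

    public NDArray Forward(NDArray x)
    {
        Input = x;                            // BaseLayer does this via base.Forward(x)
        Output = x + Parameters["b"];         // contract: set Output
        return Output;
    }

    public NDArray Backward(NDArray gradOutput)
    {
        Grads["b"] = np.sum(gradOutput, 0);   // contract: one Grads entry per Parameters entry
        InputGrad = gradOutput;               // contract: ∂L/∂x for the previous layer
        return InputGrad;
    }
}
```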

---

## Sharp edges that bit us

### 1. np.dot + strided operands (historical)
Before the stride-aware GEMM shipped in `f5c05a7f`, `np.dot(x.T, grad)` with
non-contiguous operands was **~100x slower** than contiguous (240 ms vs 2.5 ms
on the layer-1 backward shapes). The workaround was `.transpose().copy()` before
the dot. Now removed — the stride-aware kernel handles transposed views
directly and is ~1.4x slower than fully contiguous (normal stride overhead).
Don't add `.copy()` back.

### 2. `x[i, j]` is 2-index element selection, NOT a slice
`NeuralNet.Train` originally did `x[currentIndex, currentIndex + batchSize]`,
which reads a single element, not a batch. Correct form:
`x[$"{start}:{end}"]` — string-slicing the outer dim returns a view.
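
A minimal sketch of the two forms (shapes picked for illustration):

```csharp
using NumSharp;

NDArray x = np.zeros(60000, 784).astype(NPTypeCode.Single);
int start = 0, batchSize = 128, end = start + batchSize;

NDArray wrong = x[start, end];        // 2-index element selection: one element, not a batch
NDArray batch = x[$"{start}:{end}"];  // string slice over the outer dim: a (128, 784) view
```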

### 3. `np.argmax(x)` without axis returns a scalar
For batched predictions you need `axis: 1`. The metrics previously compared two
scalar argmaxes (one over all of preds, one over all of labels) — meaningless
for batches.

### 4. `np.allclose` mutates its arguments
`np.allclose` calls `astype(Double, copy: false)` on both operands, which
flips their dtype in place from Single to Double. Use a manual max-abs-diff
loop if you need the operands untouched. (This is a NumSharp core library
bug — not fixed here.)
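
A sketch of such a loop, assuming `ToArray<float>()` to flatten both operands (helper name and tolerance are illustrative):

```csharp
using System;
using NumSharp;

// Compares two float32 NDArrays without np.allclose (which would silently
// astype both operands to Double in place).
static bool MaxAbsDiffClose(NDArray a, NDArray b, float tol = 1e-5f)
{
    float[] fa = a.ToArray<float>();
    float[] fb = b.ToArray<float>();
    if (fa.Length != fb.Length) return false;

    float maxDiff = 0f;
    for (int i = 0; i < fa.Length; i++)
        maxDiff = Math.Max(maxDiff, Math.Abs(fa[i] - fb[i]));
    return maxDiff <= tol;
}
```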

### 5. `np.argmax(preds, axis:1)` returns Int64
When comparing against `labels.GetByte(i)`, use `predIdx.GetInt64(i)`;
calling `GetInt32` on Int64 storage throws `Memory corruption expected`.
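
Putting sharp edges 3 and 5 together, a batched accuracy check might look like this (shapes and names are illustrative):

```csharp
using NumSharp;

// preds: (batch, 10) float32 scores; labels: (batch,) byte class ids.
NDArray preds  = np.zeros(128, 10).astype(NPTypeCode.Single);
NDArray labels = np.zeros(128).astype(NPTypeCode.Byte);

NDArray predIdx = np.argmax(preds, axis: 1);        // Int64 result, one index per row
int correct = 0;
for (int i = 0; i < 128; i++)
    if (predIdx.GetInt64(i) == labels.GetByte(i))   // read Int64 storage with GetInt64, not GetInt32
        correct++;
double accuracy = correct / 128.0;
```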

### 6. Adam step counter MUST be monotonic across the full run
Don't reset per epoch. Adam's `1 - β^t` bias correction needs `t` to increase
monotonically across the whole training run; otherwise the first batch of
each epoch is re-corrected with the same broken divisor `1 - β^1 = 1 - β`,
which is tiny for β close to 1 and blows the correction factor back up on
already-warmed-up moments.
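
A plain-C# sketch of the bias correction with a single run-wide counter (not the actual `Adam.cs` code):

```csharp
using System;

// Bias-corrected divisors with beta1 = 0.9, beta2 = 0.999 (Adam defaults).
// t must keep increasing across the whole run; resetting it to 1 each epoch
// makes (1 - beta^t) tiny again and inflates the update.
double beta1 = 0.9, beta2 = 0.999;
long t = 0;                                    // one counter for the whole training run

for (int epoch = 0; epoch < 100; epoch++)
{
    for (int batch = 0; batch < 46; batch++)   // 6000 samples / 128 ≈ 46 batches
    {
        t++;                                    // NOT reset per epoch
        double mCorr = 1 - Math.Pow(beta1, t);  // divisor for the first-moment estimate
        double vCorr = 1 - Math.Pow(beta2, t);  // divisor for the second-moment estimate
        // update = lr * (m / mCorr) / (sqrt(v / vCorr) + eps)
    }
}
```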

### 7. FullyConnected weight init was `normal(0.5, 1, ...)` (wrong)
Float64 dtype, mean=0.5. Now He-normal for ReLU, Xavier/Glorot otherwise,
all float32. If you see the class still using that init, you're looking at
a pre-fix checkout.

### 8. Slice view dtype
`images[$"0:{BatchSize}"]` preserves dtype, and feeding the slice directly to
`np.dot` works. The `np.dot` result dtype follows the input dtypes —
float32 × float32 → float32, as expected. Do call `.astype(NPTypeCode.Single)`
on the result of `np.random.normal(...)`, which returns float64 by default.
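
A sketch of the float32 He-style init that sharp edges 7 and 8 point at (shapes and the exact stddev formula are illustrative, not copied from `FullyConnected.cs`):

```csharp
using System;
using NumSharp;

int fanIn = 784, fanOut = 128;

// He-normal for a ReLU layer: stddev = sqrt(2 / fanIn), float32.
// np.random.normal returns float64, so cast explicitly (sharp edge #8).
double stddev = Math.Sqrt(2.0 / fanIn);
NDArray w = np.random.normal(0, stddev, fanIn, fanOut).astype(NPTypeCode.Single);
NDArray b = np.zeros(1, fanOut).astype(NPTypeCode.Single);
```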

---

## Perf characteristics

**100-epoch training on 6000 synthetic / 1000 test (batch=128, Adam, sigma=2.5):**
- Epoch 1: loss ≈ 1.12, train_acc ≈ 73% (random init → partial fit)
- Epoch 2: loss ≈ 0.009, train_acc ≈ 99.9%
- Epoch 100: loss ≈ 0, test_acc ≈ 99.89%
- Total training time: ~70 s (net8.0)

**Fusion probe on post-matmul bias+ReLU, batch (128, 128) fp32:**
- Fused (1 NpyIter): ~0.14 ms
- Naive (np.add + np.maximum): ~0.36 ms
- Speedup: ~2.5x

**Instrumentation (after a 100-epoch run):**
- IL kernel cache entries: delta of 6 (all unique fused expressions)
- NpyExpr delegate slots: 0 (pure DSL, no captured lambdas)

---

## Testing

No dedicated MSTest project. The **smoke test** for the NN scaffolding lives
in-line as a `dotnet run` stdin script — 29 checks covering:
- Softmax forward + backward (finite-difference gradient check)
- Sigmoid (saturation limits)
- CCE / BCE (loss values + backward components)
- Accuracy / BinaryAccuacy (argmax + round)
- FullyConnected with bias (shape checks)
- SGD vanilla + momentum (hand-computed trajectories)
- `BaseOptimizer.Get("sgd")` / `Get("adam")`

Run pattern for ad-hoc sanity checks:
```bash
cat /tmp/script.cs | dotnet_run
```
where the script references the two projects via `#:project`.
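
A skeleton for such a script; the `#:project` paths below are assumptions about the repo layout (adjust to the real relative paths), and the check itself is just a throwaway example using public NumSharp calls:

```csharp
#:project ../../src/NumSharp.Core/NumSharp.Core.csproj
#:project ../NeuralNetwork.NumSharp.csproj

using System;
using NumSharp;

// One throwaway check: argmax with axis behaves as expected on a tiny batch.
NDArray preds = np.array(0.1f, 0.9f, 0.8f, 0.2f).reshape(2, 2);
NDArray idx = np.argmax(preds, axis: 1);
Console.WriteLine(idx.GetInt64(0) == 1 && idx.GetInt64(1) == 0 ? "PASS" : "FAIL");
```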

---

## Q&A

**Why do we have both `FullyConnected` and `FullyConnectedFused`?**
`FullyConnected` is the vanilla version that goes through `np.dot + (x + b) +
activation` as separate ops. `FullyConnectedFused` collapses bias+activation
into a single NpyIter — the fusion demo's point. Both share the BaseLayer
contract and are interchangeable in a NeuralNet pipeline.

**Why do the metric classes have typos in their names?**
`Accuacy`, `BinaryAccuacy` — misspelled in the original scaffolding, kept
for backward compat with any external caller. Fixing the implementation
without renaming the class is the lower-risk path.

**Why is SoftmaxCrossEntropy in `MnistMlp/` instead of `Cost/`?**
It's the combined-form loss — assumes softmax is applied internally, not by
a separate Softmax layer. The standalone `Softmax` + `CategoricalCrossentropy`
chain still works and is numerically fine for most cases; SCE is faster and
slightly more stable for the MLP demo's specific pipeline.

**Is `NeuralNet.Train` usable now?**
Yes — the slicing bug is fixed (uses `$"{start}:{end}"` string-slice) and
the optimizer step counter is monotonic. But `MnistMlp/MlpTrainer.cs` is
still the richer path (periodic test eval, per-epoch timing output). Use
`NeuralNet` for simple cases, `MlpTrainer` when you want instrumentation.

**Can we train on real MNIST?**
Yes — drop the four IDX files into `examples/NeuralNetwork.NumSharp/data/`.
The loader auto-detects them and switches off the synthetic fallback. Real-MNIST
accuracy with this 2-layer MLP should land around 97-98% after 10-20 epochs.

---

## Known limitations

- **No data shuffling.** `MlpTrainer` iterates batches in order. Works fine
  for synthetic data and MNIST (which is pre-shuffled) but would hurt
  generalization on ordered datasets; an index-shuffle sketch follows this list.
- **No validation split.** Train / test is a fixed split; no held-out
  validation for early stopping.
- **Adam re-allocates per step.** Each Adam update allocates ~14 temp
  NDArrays per parameter. For a 2-layer FC this is ~200 ms/epoch of GC
  pressure. Fixable by fusing Adam's update into NpyIter like the rest,
  but out of scope for the current demo.
- **No model serialization.** Parameters can't be saved / loaded yet.
- **Activation resolution by string only.** `FullyConnected` takes `act =
  "relu"` etc.; `FullyConnectedFused` uses an enum (`FusedActivation`) —
  the two are slightly inconsistent.
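
If shuffling is ever added, a per-epoch permutation of row indices in `MlpTrainer` would probably be enough. A plain-C# sketch (not existing code):

```csharp
using System;

// Fisher-Yates shuffle of row indices once per epoch; batches are then built
// from idx[start..end] instead of a contiguous [start..end) range.
int n = 6000;
int[] idx = new int[n];
for (int i = 0; i < n; i++) idx[i] = i;

var rng = new Random(42);
for (int i = n - 1; i > 0; i--)
{
    int j = rng.Next(i + 1);
    (idx[i], idx[j]) = (idx[j], idx[i]);
}
// e.g. gather rows idx[0..128) of the training NDArray for the first batch
```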
