# NeuralNetwork.NumSharp Example Project

A small Keras-style neural-network framework built on top of NumSharp, plus an
end-to-end MNIST 2-layer MLP demo that fuses the post-matmul element-wise work
into a single NpyIter per layer via NpyExpr.

Dual purpose:
1. **Library scaffolding** — `BaseLayer`, `BaseActivation`, `BaseCost`,
   `BaseOptimizer`, `BaseMetric`, `NeuralNet` (sequential model runner).
2. **Runnable MLP demo** — `MnistMlp/Program.cs` trains a 784 → 128 ReLU → 10
   classifier on real MNIST (if the IDX files are present) or on learnable
   synthetic data (fallback).

---

## Build / Run

```bash
cd examples/NeuralNetwork.NumSharp
dotnet build -v q --nologo "-clp:NoSummary;ErrorsOnly" -p:WarningLevel=0
dotnet run --no-build --framework net8.0   # or --framework net10.0
```

The csproj is an **Exe** (not a library): `OutputType=Exe`,
`AllowUnsafeBlocks=true`, multi-targeting `net8.0;net10.0`. NumSharp.Core
declares `InternalsVisibleTo("NeuralNetwork.NumSharp")` in
`src/NumSharp.Core/Assembly/Properties.cs`, so `NpyIterRef`, `NpyExpr`,
`ILKernelGenerator.InnerLoopCachedCount`, and `DelegateSlots.RegisteredCount`
are all accessible from this project.

Current demo defaults (in `MnistMlp/Program.cs`):
- `Epochs = 100`, `BatchSize = 128`
- Adam, lr = 1e-3
- Synthetic-data noise sigma = 2.5 (in `MnistMlp/MnistLoader.cs`)
- Test evaluation every `min(5, epochs)` epochs

Place real MNIST at `examples/NeuralNetwork.NumSharp/data/`:
- `train-images.idx3-ubyte`, `train-labels.idx1-ubyte` (60k train)
- `t10k-images.idx3-ubyte`, `t10k-labels.idx1-ubyte` (10k test)

---

## Directory Map

```
examples/NeuralNetwork.NumSharp/
├── NeuralNet.cs                   Sequential model (forward / backward / Train /
│                                  Predict). Uses BaseLayer list + BaseCost +
│                                  BaseOptimizer. Train now slices correctly.
├── Util.cs                        int counter for layer-name uniqueness.
│
├── Layers/
│   ├── BaseLayer.cs               Abstract: Input, Output, Parameters["w"/"b"],
│   │                              Grads[...], InputGrad. Subclasses override
│   │                              Forward/Backward.
│   ├── FullyConnected.cs          Dense layer with bias + He/Xavier init (float32).
│   │                              Composes an optional BaseActivation by name.
│   └── Activations/
│       ├── BaseActivation.cs      Get(name): resolves "relu"/"sigmoid" by name.
│       ├── ReLU.cs                (NDArray > 0) * NDArray formulation (works).
│       ├── Sigmoid.cs             1/(1+exp(-x)); Backward uses cached Output.
│       └── Softmax.cs             Numerically-stable row-wise softmax; Backward =
│                                  Output * (grad - Σ(grad*Output, axis=1, keepdims)).
│
├── Cost/
│   ├── BaseCost.cs                Abstract: Forward, Backward, float Epsilon.
│   ├── CategoricalCrossentropy.cs L = -Σ(y*log(clip(p))) / batch;
│   │                              dL/dp = -y / clip(p) / batch.
│   ├── BinaryCrossEntropy.cs      mean(-y*log(clip(p)) - (1-y)*log(1-clip(p)));
│   │                              dL/dp = (p - y) / (p*(1-p)) / N.
│   └── MeanSquaredError.cs        mean((preds - labels)²); ∇ = 2*(preds-labels)/batch.
│
├── Metrics/
│   ├── BaseMetric.cs              Abstract: Calculate(preds, labels) → NDArray.
│   ├── Accuracy.cs                class Accuacy (typo preserved). argmax(preds,1)
│   │                              == argmax(labels,1), mean.
│   ├── BinaryAccuacy.cs           round(clip(preds, 0, 1)) == labels, mean.
│   └── MeanAbsoluteError.cs       mean(|preds - labels|).
│
├── Optimizers/
│   ├── BaseOptimizer.cs           Abstract. Get("sgd") / Get("adam") resolvers.
│   ├── SGD.cs                     Vanilla SGD; classical momentum; inverse-time
│   │                              LR decay.
│   └── Adam.cs                    First/second moments with proper np.zeros init.
│                                  Step counter must be monotonic across run.
│
├── MnistMlp/                      The runnable experiment. Files described below.
│
├── Open.snk                       Strong-name key shared with NumSharp.Core.
└── NeuralNetwork.NumSharp.csproj  Exe, net8.0+net10.0, AllowUnsafeBlocks.
```

---

## MnistMlp — fused forward + backward

All fusion happens in `FullyConnectedFused`. The idea: every post-matmul
element-wise chunk (bias-add + ReLU, bias-add only, ReLU gradient mask)
collapses into **one NpyIter kernel**, compiled once per process and
cache-hit on every subsequent forward/backward pass.

| Stage | NpyExpr tree | Inputs → Output |
|---|---|---|
| Forward ReLU | `Max(Input(0) + Input(1), Const(0f))` | (preact, bias) → y |
| Forward linear | `Input(0) + Input(1)` | (preact, bias) → y |
| Backward ReLU | `Input(0) * Greater(Input(1), Const(0f))` | (gradOut, y) → gradPreact |
| Backward linear | — (pass-through) | gradOut → gradPreact |

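In code, the Forward-ReLU row corresponds to an expression tree along the
lines of the sketch below. This is illustrative only: the NpyExpr builder is
internal to NumSharp.Core, only the `Max`/`Greater`/`Input`/`Const` node names
come from the table, and the `Run(...)` entry point is an assumption, not the
real method name.

```csharp
// Hypothetical sketch of the Forward-ReLU fusion; node names taken from
// the table above, execution entry point assumed.
var expr = NpyExpr.Max(
    NpyExpr.Input(0) + NpyExpr.Input(1),  // preact + bias
    NpyExpr.Const(0f));                   // clamp at zero = ReLU
// Compiled once per process under a stable string cache key, then
// cache-hit on every subsequent forward pass:
NDArray y = expr.Run(preact, bias);       // assumed API
```
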
**`MnistMlp/` files:**

| File | What it does |
|---|---|
| `Program.cs` | Entry point. Loads data, builds the 2-FC model, runs the fusion probe, trains via MlpTrainer, reports IL-kernel cache + delegate-slot counts. |
| `MnistLoader.cs` | IDX parser (big-endian) + learnable synthetic fallback (shared class templates across train/test, sigma=2.5 noise). |
| `FullyConnectedFused.cs` | FC with bias + optional fused activation. Three NpyIter kernels (two forward, one backward); cache keys are stable strings. |
| `SoftmaxCrossEntropy.cs` | Combined loss — numerically stable softmax forward, cached softmax, (softmax - labels)/batch backward. Also ships a `OneHot` helper. |
| `MlpTrainer.cs` | Explicit train loop (`NeuralNet.Train` replacement). Periodic test eval (`min(5, epochs)` cadence). Returns per-epoch loss/train_acc + a list of (epoch, test_acc) pairs. |
| `FusedMlp.cs`, `NaiveMlp.cs` | Side-by-side forward implementations for the correctness probe at Program startup. |

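For orientation, IDX is a small big-endian format: a 4-byte magic code, one
big-endian int32 per dimension, then raw bytes. A self-contained sketch of the
header read follows (the real parser lives in `MnistLoader.cs` and may differ
in detail):

```csharp
using System.IO;

// Big-endian int32 reader; BinaryReader.ReadInt32 is little-endian on x86.
static int ReadBigEndianInt32(BinaryReader r)
{
    var b = r.ReadBytes(4);
    return (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3];
}

using var img = new BinaryReader(File.OpenRead("data/train-images.idx3-ubyte"));
int magic = ReadBigEndianInt32(img);  // 0x00000803: unsigned bytes, 3 dims
int count = ReadBigEndianInt32(img);  // 60000
int rows  = ReadBigEndianInt32(img);  // 28
int cols  = ReadBigEndianInt32(img);  // 28
byte[] pixels = img.ReadBytes(count * rows * cols);  // row-major, 1 byte/pixel
```
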
---

## Layer / Cost / Optimizer contract

Every BaseLayer subclass MUST populate on Forward:
- `this.Input = x` (via `base.Forward(x)`)
- `this.Output = result`

And on Backward:
- `this.Grads[key] = ∂L/∂param` for every entry in `this.Parameters`
- `this.InputGrad = ∂L/∂x` (consumed by the previous layer)

Optimizers iterate `layer.Parameters.ToList()` and expect `layer.Grads[paramKey]`
to be populated by Backward. Param-name convention is `"w"` / `"b"`. A minimal
layer obeying the contract is sketched after these lists.

BaseCost contract:
- `Forward(preds, labels)` → scalar NDArray (the loss)
- `Backward(preds, labels)` → NDArray shape-matched to preds (the first
  incoming gradient for the network's output layer)

BaseMetric contract:
- `Calculate(preds, labels)` → scalar NDArray in [0, 1]

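A minimal layer satisfying the contract, as a sketch. `ScaleLayer` does not
exist in the repo, and the base-constructor signature plus the exact NumSharp
calls are assumptions:

```csharp
// Hypothetical elementwise-scale layer obeying the BaseLayer contract.
public class ScaleLayer : BaseLayer
{
    public ScaleLayer() : base("scale")        // base ctor signature assumed
    {
        Parameters["w"] = np.ones(1).astype(NPTypeCode.Single);
    }

    public override void Forward(NDArray x)
    {
        base.Forward(x);                       // stores this.Input = x
        Output = x * Parameters["w"];          // y = w * x
    }

    public override void Backward(NDArray gradOutput)
    {
        Grads["w"] = np.sum(gradOutput * Input);    // dL/dw for Parameters["w"]
        InputGrad  = gradOutput * Parameters["w"];  // dL/dx for the previous layer
    }
}
```
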
---

## Sharp edges that bit us

### 1. np.dot + strided operands (historical)
Before the stride-aware GEMM shipped in `f5c05a7f`, `np.dot(x.T, grad)` with
non-contiguous operands was **~100x slower** than contiguous (240 ms vs 2.5 ms
on the layer-1 backward shapes). The workaround was `.transpose().copy()`
before the dot. Now removed — the stride-aware kernel handles transposed views
directly and is ~1.4x slower than fully-contiguous (normal stride overhead).
Don't add `.copy()` back.

### 2. `x[i, j]` is 2-index element selection, NOT a slice
`NeuralNet.Train` originally did `x[currentIndex, currentIndex + batchSize]`,
which read a single element, not a batch. Correct form:
`x[$"{start}:{end}"]` — string-slicing the outer dim returns a view.

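The difference in a sketch (`x` assumed to be the (60000, 784) training
matrix, `start`/`end` as in the text above):

```csharp
var wrong = x[0, 128];            // 2-index element selection: a single scalar
var batch = x["0:128"];           // string slice: view of rows 0..127, (128, 784)
var next  = x[$"{start}:{end}"];  // the pattern NeuralNet.Train now uses
```
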
### 3. `np.argmax(x)` without axis returns a scalar
For batched predictions you need `axis: 1`. The metrics previously compared
two scalar argmaxes — broken for batched inputs.

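Roughly what the fixed metrics do (argument style as used elsewhere in this
README):

```csharp
// Per-row class indices, shape (batch,), not a single scalar.
var predIdx  = np.argmax(preds, axis: 1);
var labelIdx = np.argmax(labels, axis: 1);
// np.argmax(preds) with no axis flattens first and returns one index.
```
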
### 4. `np.allclose` mutates its arguments
`np.allclose` calls `astype(Double, copy: false)` on both operands, which
in-place flips their dtype from Single to Double. Use a manual max-abs-diff
loop if you need the operands untouched. (This is a NumSharp core library
bug — not fixed here.)

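A sketch of the workaround, assuming 1-D float32 operands and a `GetSingle(i)`
accessor (by analogy with the `GetByte`/`GetInt64` accessors in edge 5 below):

```csharp
// Compare without handing the operands to np.allclose, which would
// astype them to Double in place.
static float MaxAbsDiff(NDArray a, NDArray b)
{
    float worst = 0f;
    for (int i = 0; i < a.size; i++)
        worst = Math.Max(worst, Math.Abs(a.GetSingle(i) - b.GetSingle(i)));
    return worst;
}
// Usage: flatten/ravel 2-D operands first, then check MaxAbsDiff(...) < 1e-5f.
```
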
### 5. `np.argmax(preds, axis:1)` returns Int64
When comparing against `labels.GetByte(i)`, use `predIdx.GetInt64(i)` —
calling `GetInt32` on Int64 storage throws `Memory corruption expected`.

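The pattern that works (`n` and `correct` assumed in scope):

```csharp
var predIdx = np.argmax(preds, axis: 1);  // Int64 storage
for (int i = 0; i < n; i++)
{
    // GetInt64 matches the storage dtype; GetInt32 here throws.
    if (predIdx.GetInt64(i) == labels.GetByte(i)) correct++;
}
```
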
### 6. Adam step counter MUST be monotonic across the full run
Don't reset it per epoch. Adam's `1 - β^t` bias correction needs `t` to
increase monotonically across the whole training run. If `t` restarts at 1
every epoch, the divisor collapses back to `1 - β`, which is tiny since β is
close to 1, so the first batches of each epoch get an oversized correction
factor.

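The step with the correction in place, as a sketch. This is textbook Adam
with illustrative variable names, not the repo's actual code:

```csharp
// `step` is a field on the optimizer instance: incremented on every
// batch, never reset at an epoch boundary.
step++;
m = beta1 * m + (1 - beta1) * grad;            // first moment
v = beta2 * v + (1 - beta2) * grad * grad;     // second moment
var mHat = m / (1.0 - Math.Pow(beta1, step));  // bias correction needs the
var vHat = v / (1.0 - Math.Pow(beta2, step));  //   global step, not a per-epoch one
w -= lr * mHat / (np.sqrt(vHat) + epsilon);
```
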
### 7. FullyConnected weight init was `normal(0.5, 1, ...)` (wrong)
Float64 dtype, mean = 0.5. It is now He-normal for ReLU and Xavier/Glorot
otherwise, all float32. If you see the class still using that init, you're
looking at a pre-fix checkout.

### 8. Slice view dtype
`images[$"0:{BatchSize}"]` preserves dtype, and feeding the slice directly to
`np.dot` works. The `np.dot` result dtype follows the input dtypes:
float32 × float32 → float32, as expected. Do call `.astype(NPTypeCode.Single)`
after `np.random.normal(...)`, which returns float64 by default.

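Edges 7 and 8 reduce to one pattern. A sketch of the He-normal init with the
float32 cast (`fanIn`/`fanOut` assumed; the `np.random.normal` overload shape
may differ):

```csharp
// np.random.normal returns float64 by default, so cast explicitly.
var std = Math.Sqrt(2.0 / fanIn);              // He-normal scale for ReLU
var w = np.random.normal(0, std, fanIn, fanOut)
          .astype(NPTypeCode.Single);          // keep the whole net in fp32
```
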
---

## Perf characteristics

**100-epoch training on 6000 synthetic / 1000 test (batch=128, Adam, sigma=2.5):**
- Epoch 1: loss ≈ 1.12, train_acc ≈ 73% (random init → partial fit)
- Epoch 2: loss ≈ 0.009, train_acc ≈ 99.9%
- Epoch 100: loss ≈ 0, test_acc ≈ 99.89%
- Total training time: ~70 s (net8.0)

**Fusion probe on post-matmul bias+ReLU, batch (128, 128) fp32:**
- Fused (1 NpyIter): ~0.14 ms
- Naive (np.add + np.maximum): ~0.36 ms
- Speedup: ~2.5x

**Instrumentation (after a 100-epoch run):**
- IL kernel cache entries: delta of 6 (all unique fused expressions)
- NpyExpr delegate slots: 0 (pure DSL, no captured lambdas)

---

## Testing

No dedicated MSTest project. The **smoke test** for the NN scaffolding lives
in-line as a `dotnet run` stdin script — 29 checks covering:
- Softmax forward + backward (finite-difference gradient check)
- Sigmoid (saturation limits)
- CCE / BCE (loss values + backward components)
- Accuracy / BinaryAccuacy (argmax + round)
- FullyConnected with bias (shape checks)
- SGD vanilla + momentum (hand-computed trajectories)
- `BaseOptimizer.Get("sgd")` / `Get("adam")`

Run pattern for ad-hoc sanity checks:
```bash
cat /tmp/script.cs | dotnet_run
```
where the script references the two projects via `#:project`.

---

## Q&A

**Why do we have both `FullyConnected` and `FullyConnectedFused`?**
`FullyConnected` is the vanilla version that runs `np.dot`, the bias add, and
the activation as separate ops. `FullyConnectedFused` collapses bias +
activation into a single NpyIter, which is the point of the fusion demo. Both
share the BaseLayer contract and are interchangeable in a NeuralNet pipeline.

**Why do the metric classes have typos in their names?**
`Accuacy`, `BinaryAccuacy` — misspelled in the original scaffolding, kept
for backward compat with any external caller. Fixing the implementation
without renaming the class is the lower-risk path.

**Why is SoftmaxCrossEntropy in `MnistMlp/` instead of `Cost/`?**
It's the combined-form loss — it assumes softmax is applied internally, not by
a separate Softmax layer. The standalone `Softmax` + `CategoricalCrossentropy`
chain still works and is numerically fine for most cases; SCE is faster and
slightly more stable for the MLP demo's specific pipeline.

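The combined form as a sketch. The axis/keepdims overloads are assumed to
exist, mirroring the `Σ(…, axis=1, keepdims)` notation used in the directory
map; the real implementation is `MnistMlp/SoftmaxCrossEntropy.cs`:

```csharp
// Forward: numerically stable softmax (subtract the row max before exp).
var shifted = logits - np.max(logits, axis: 1, keepdims: true);
var e       = np.exp(shifted);
var softmax = e / np.sum(e, axis: 1, keepdims: true);  // cached for backward
// Backward: the entire gradient in one expression.
var grad = (softmax - labels) / batchSize;  // labels one-hot, via the OneHot helper
```
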
**Is `NeuralNet.Train` usable now?**
Yes — the slicing bug is fixed (uses the `$"{start}:{end}"` string-slice) and
the optimizer step counter is monotonic. But `MnistMlp/MlpTrainer.cs` is
still the richer path (periodic test eval, per-epoch timing output). Use
`NeuralNet` for simple cases, `MlpTrainer` when you want instrumentation.

**Can we train on real MNIST?**
Yes — drop the four IDX files into `examples/NeuralNetwork.NumSharp/data/`.
The loader auto-detects them and switches off the synthetic fallback.
Real-MNIST accuracy with this 2-layer MLP should land around 97-98% after
10-20 epochs.

---

## Known limitations

- **No data shuffling.** `MlpTrainer` iterates batches in order. Works fine
  for synthetic data and MNIST (which is pre-shuffled) but would hurt
  generalization on ordered datasets.
- **No validation split.** Train / test is a fixed split; no held-out
  validation for early stopping.
- **Adam re-allocates per step.** Each Adam update allocates ~14 temp
  NDArrays per parameter. For a 2-layer FC this is ~200 ms/epoch of GC
  pressure. Fixable by fusing Adam's update into NpyIter like the rest,
  but out of scope for the current demo.
- **No model serialization.** Parameters can't be saved / loaded yet.
- **Activation resolution by string only.** `FullyConnected` takes `act =
  "relu"` etc.; `FullyConnectedFused` uses an enum (`FusedActivation`) —
  the two are slightly inconsistent.