Commit b86b348

refactor(iterators): NDIterator fully backed by NpyIter state
Replaces the lazy-but-standalone ValueOffsetIncrementor path with one that constructs an NpyIter state and drives MoveNext / HasNext / Reset directly off that state. NDIterator is now an honest thin wrapper over NpyIter — the same traversal machinery used by all the Phase 2 production call sites — rather than reimplementing the coord-walk logic with legacy incrementors.

How it works
------------

- ctor calls NpyIterRef.New(arr, NPY_CORDER) to build the state, then transfers ownership of the NpyIterState* pointer out of the ref struct (see NpyIterRef.ReleaseState / FreeState below). The class holds that pointer for its lifetime and frees it in Dispose (or in the finalizer as a safety net).
- MoveNext reads `*(TOut*)state->DataPtrs[0]` then calls `state->Advance()`. IterIndex tracks position, IterEnd bounds the non-AutoReset case, and `state->Reset()` restarts from IterStart on AutoReset wraparound and on explicit Reset.
- Cross-dtype iteration wraps the same read with a Converts.FindConverter<TSrc, TOut> lookup — one switch at construction picks the typed helper, so the per-element hot path is still just one read + one converter delegate call. MoveNextReference throws when casting is in play, matching the legacy contract.
- NPY_CORDER is explicit so iterating a transposed view yields the logical row-major order the old NDIterator provided. Without it, KEEPORDER would traverse in memory order (which e.g. `b.T.AsIterator<int>()` would surface as `0 1 2 ... 11` instead of the expected `0 4 8 1 5 9 2 6 10 3 7 11`).

NpyIter additions
-----------------

- NpyIterRef.ReleaseState(): hands the owned NpyIterState* to a caller who needs it across a non-ref-struct boundary (e.g. a class field). Marks the ref struct as non-owning so its Dispose is a no-op.
- NpyIterRef.FreeState(NpyIterState*): static tear-down mirror of Dispose's cleanup path — frees buffers (when BUFFER is set), calls FreeDimArrays, and NativeMemory.Free's the state pointer. The long-lived owner calls this from its own Dispose/finalizer.

Bug fixes along the way
-----------------------

NpyIter initialization previously computed base pointers as `(byte*)arr.Address + (shape.offset * arr.dtypesize)` in two places (initial broadcast setup on line 340 and ResetBasePointers on line 1972). `arr.dtypesize` goes through `Marshal.SizeOf`, which reports 4 for `bool` because bool is marshaled to a Win32 BOOL, but the in-memory `bool[]` storage is 1 byte per element. For strided bool arrays this produced a base pointer 4x too far into the buffer. Switched both sites to `arr.GetTypeCode.SizeOf()`, which returns the actual in-memory size (1 for bool).

Surfaced by `Boolean_Strided_Odd` once NDIterator started routing through NpyIter — previously latent because the legacy NDIterator path computed offsets in element units, not bytes, and sidestepped the NpyIter init.

Test impact: 6,748 / 6,748 passing on net8.0 and net10.0 (CI filter: TestCategory!=OpenBugs&TestCategory!=HighMemory). Smoke tests of same-type contig / cross-type / strided / transposed / broadcast / AutoReset / Reset / foreach all produce the expected element sequences.
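The marshaled-vs-in-memory size mismatch behind the bool fix can be reproduced in isolation. This is a standalone sketch, not code from the commit; `ElementByteOffset` is a hypothetical helper:

```csharp
using System.Runtime.InteropServices;

// Standalone demonstration of the size mismatch: Marshal.SizeOf reports the
// interop size of bool (a 4-byte Win32 BOOL), while the in-memory bool[]
// element size is sizeof(bool) == 1 byte.
public static class BoolSizeDemo
{
    // Byte offset of element `i` of a bool[] computed both ways.
    public static (int Wrong, int Correct) ElementByteOffset(int i)
        => (i * Marshal.SizeOf<bool>(),  // 4 * i (what arr.dtypesize produced)
            i * sizeof(bool));           // 1 * i (actual in-memory layout)
}
// ElementByteOffset(3) -> (12, 3): the old base pointer landed 4x too deep.
```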
1 parent fb4b7dc commit b86b348

8 files changed

Lines changed: 3740 additions & 170 deletions

docs/DEFAULTENGINE_ILKERNEL_PLAYBOOK.md

Lines changed: 407 additions & 0 deletions
Lines changed: 177 additions & 0 deletions
# DefaultEngine + ILKernelGenerator Rulebook

This document captures the implicit implementation rules currently used across `DefaultEngine` and `ILKernelGenerator`.

Scope:

- `src/NumSharp.Core/Backends/Default/*`
- `src/NumSharp.Core/Backends/Kernels/ILKernelGenerator*.cs`
- `src/NumSharp.Core/View/Shape*.cs`
## 1) Ownership and call boundaries

- `ILKernelGenerator` is backend infrastructure; access should flow through `TensorEngine` / `DefaultEngine`, not directly from top-level APIs.
  - See: `src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs` (class summary and architecture comments).
- `DefaultEngine` owns high-level semantics (dtype rules, shape/broadcast behavior, keepdims, edge cases); kernels own tight loops.
## 2) Standard dispatch pipeline (elementwise ops)

For binary/unary/comparison operations, the repeated flow is:

1. Resolve dtype semantics first.
2. Handle the scalar/scalar fast path.
3. Broadcast or normalize shapes.
4. Allocate a contiguous output shape (`Shape.Clean()` / fresh `Shape` from dims).
5. Classify the execution path (contiguous / scalar-broadcast / chunk / general).
6. Build the kernel key.
7. Get-or-generate the kernel from the cache.
8. Execute the kernel with pointer + strides + shape.
9. Use the fallback path or throw an explicit `NotSupportedException` if no kernel is available.

Primary references:

- `src/NumSharp.Core/Backends/Default/Math/DefaultEngine.BinaryOp.cs`
- `src/NumSharp.Core/Backends/Default/Math/DefaultEngine.UnaryOp.cs`
- `src/NumSharp.Core/Backends/Default/Math/DefaultEngine.CompareOp.cs`
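In outline, the nine steps reduce to a skeleton like the following. This is a hypothetical sketch with illustrative names and a contiguous `double[]` stand-in for NDArray; it is not the engine's real signatures:

```csharp
using System;
using System.Collections.Concurrent;

// Illustrative skeleton of the dispatch flow for a contiguous binary op.
public static class PipelineSketch
{
    static readonly ConcurrentDictionary<(string Op, TypeCode Type, string Path),
        Action<double[], double[], double[]>> Cache = new();

    public static double[] Add(double[] lhs, double[] rhs)
    {
        // 1-2) dtype resolution (both double here) and the scalar/scalar fast path
        if (lhs.Length == 1 && rhs.Length == 1) return new[] { lhs[0] + rhs[0] };

        // 3-4) shapes already match in this sketch; allocate a clean contiguous output
        var output = new double[lhs.Length];

        // 5-6) classify the execution path and build the kernel key from it
        var key = ("add", TypeCode.Double, "SimdFull");

        // 7) get-or-generate the kernel from the cache
        var kernel = Cache.GetOrAdd(key, _ => (a, b, dst) =>
        {
            for (int i = 0; i < dst.Length; i++) dst[i] = a[i] + b[i];
        });

        // 8-9) execute (a null kernel would mean: fall back or throw)
        kernel(lhs, rhs, output);
        return output;
    }
}
```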
## 3) Dtype rules are explicit and front-loaded

- Binary ops use `np._FindCommonType(lhs, rhs)` as the baseline promotion.
  - `DefaultEngine.BinaryOp.cs`
- True division on non-float common types is forced to `float64` (`NPTypeCode.Double`).
  - `DefaultEngine.BinaryOp.cs`
- Unary math promotion goes through `ResolveUnaryReturnType` / `GetComputingType`, while selected ops intentionally preserve input type (`Negate`, `Abs`, `LogicalNot`).
  - `DefaultEngine.UnaryOp.cs`
  - `DefaultEngine.ResolveUnaryReturnType.cs`
- Reductions use accumulator type decisions up front (for example `GetAccumulatingType`, std/var double output path in axis kernels).
  - `DefaultEngine.ReductionOp.cs`
  - `Default.Reduction.Var.cs`
  - `Default.Reduction.Std.cs`
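The true-division rule above can be condensed into a small sketch. The `NPTypeCode` enum and `FindCommonType` here are simplified stand-ins, not the repo's (the real promotion lives in `np._FindCommonType`):

```csharp
// Stand-in typecode enum, ordered so that "larger wins" approximates promotion.
public enum NPTypeCode { Boolean, Int32, Int64, Single, Double }

public static class PromotionSketch
{
    static bool IsFloat(NPTypeCode t)
        => t == NPTypeCode.Single || t == NPTypeCode.Double;

    // Illustrative baseline promotion: take the larger of the two codes.
    static NPTypeCode FindCommonType(NPTypeCode a, NPTypeCode b) => a > b ? a : b;

    // True division on a non-float common type is forced to float64.
    public static NPTypeCode ResolveTrueDivide(NPTypeCode lhs, NPTypeCode rhs)
    {
        var common = FindCommonType(lhs, rhs);
        return IsFloat(common) ? common : NPTypeCode.Double; // int / int -> float64
    }
}
```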
## 4) Shape/offset correctness is non-negotiable

- Kernel inputs must include shape-offset-adjusted base pointers for sliced views:
  - `base = Address + shape.offset * dtypesize`
  - `DefaultEngine.BinaryOp.cs`
  - `DefaultEngine.UnaryOp.cs`
  - `DefaultEngine.ReductionOp.cs`
- Output arrays are usually allocated as contiguous clean shapes.
- Broadcast semantics rely on stride-0 dimensions and read-only protection via shape-level flags.
  - `src/NumSharp.Core/View/Shape.cs`
  - `src/NumSharp.Core/View/Shape.Broadcasting.cs`
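A minimal illustration of the base-pointer rule, with the buffer viewed as raw bytes to make the `offset * dtypesize` arithmetic explicit. This is a standalone sketch; `ReadFirstOfView` is a hypothetical helper, not repo code:

```csharp
using System;
using System.Runtime.InteropServices;

public static class BasePointerSketch
{
    // Reads the first element of a "view" that starts shapeOffset elements
    // into the buffer, using byte arithmetic like a kernel base pointer would.
    public static double ReadFirstOfView(double[] buffer, int shapeOffset)
    {
        Span<byte> bytes = MemoryMarshal.AsBytes(buffer.AsSpan());

        // base = Address + shape.offset * dtypesize
        int baseByteOffset = shapeOffset * sizeof(double);
        return MemoryMarshal.Read<double>(bytes.Slice(baseByteOffset));
    }
}
// ReadFirstOfView(new[] { 0.0, 1.0, 2.0, 3.0 }, 2) -> 2.0
```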
## 5) Execution-path model

The core path taxonomy is:

- `SimdFull`: fully contiguous
- `SimdScalarRight` / `SimdScalarLeft`: one operand is a broadcast scalar
- `SimdChunk`: inner dimension contiguous/broadcast
- `General`: arbitrary strides

References:

- `src/NumSharp.Core/Backends/Kernels/StrideDetector.cs`
- `src/NumSharp.Core/Backends/Default/Math/DefaultEngine.BinaryOp.cs`
- `src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.MixedType.cs`

Important current caveats:

- `MixedType` `SimdChunk` currently emits the general loop (`TODO` placeholder), not true chunked SIMD.
  - `ILKernelGenerator.MixedType.cs`
- Comparison `SimdChunk` intentionally falls through to the general path.
  - `ILKernelGenerator.Comparison.cs`
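A toy classifier shows the intent of the taxonomy. The real decision logic lives in `StrideDetector.cs` and is richer than this; here contiguity is passed in rather than derived, and strides are in elements:

```csharp
using System;

public enum ExecPath { SimdFull, SimdScalarRight, SimdScalarLeft, SimdChunk, General }

public static class PathSketch
{
    // A broadcast scalar presents all-zero strides in this sketch.
    public static ExecPath Classify(int[] lhsStrides, int[] rhsStrides,
                                    bool lhsContig, bool rhsContig)
    {
        bool lhsScalar = Array.TrueForAll(lhsStrides, s => s == 0);
        bool rhsScalar = Array.TrueForAll(rhsStrides, s => s == 0);

        if (lhsContig && rhsScalar) return ExecPath.SimdScalarRight;
        if (rhsContig && lhsScalar) return ExecPath.SimdScalarLeft;
        if (lhsContig && rhsContig) return ExecPath.SimdFull;

        // inner dimension contiguous (stride 1) or broadcast (stride 0) on both sides
        if (lhsStrides[^1] <= 1 && rhsStrides[^1] <= 1) return ExecPath.SimdChunk;

        return ExecPath.General;
    }
}
```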
## 6) Kernel-key and cache conventions

- Kernels are cached by keys that encode everything affecting the generated IL (types, op, path, contiguity).
- Caches are `ConcurrentDictionary<key, delegate>`.
- Standard retrieval API: `Get*Kernel` and `TryGet*Kernel`.
- `TryGet*` methods are intentionally catch-all and return `null` to allow graceful fallback.

References:

- `ILKernelGenerator.cs` (exception-handling design notes)
- `ILKernelGenerator.MixedType.cs`
- `ILKernelGenerator.Unary.cs`
- `ILKernelGenerator.Comparison.cs`
- `ILKernelGenerator.Reduction.cs`
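The get-or-generate and catch-all `TryGet*` conventions look roughly like this sketch. `KernelKey` and the delegate shapes are hypothetical, and the "generation" here is a trivial switch rather than IL emission:

```csharp
using System;
using System.Collections.Concurrent;

public static class KernelCacheSketch
{
    // The key encodes everything that affects the generated kernel.
    public record struct KernelKey(string Op, TypeCode Lhs, TypeCode Rhs, string Path);

    static readonly ConcurrentDictionary<KernelKey, Delegate> Cache = new();

    // Intentionally catch-all: generation failures surface as null so the
    // caller can take the fallback path instead of crashing mid-dispatch.
    public static Func<double, double, double> TryGetBinaryKernel(KernelKey key)
    {
        try
        {
            return (Func<double, double, double>)Cache.GetOrAdd(key, k =>
                k.Op switch
                {
                    "add" => (Func<double, double, double>)((a, b) => a + b),
                    "mul" => (Func<double, double, double>)((a, b) => a * b),
                    _ => throw new NotSupportedException(k.Op),
                });
        }
        catch
        {
            return null; // caller falls back or throws an explicit NotSupportedException
        }
    }
}
```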
## 7) SIMD policy and loop shape

- SIMD is only enabled for explicitly supported type/op combinations.
  - `CanUseSimd(NPTypeCode)` excludes `Boolean`, `Char`, `Decimal`.
  - `ILKernelGenerator.cs`
- Mixed-type SIMD requires additional constraints (often same-type for the vectorized path, or no per-element conversion).
  - `ILKernelGenerator.MixedType.cs`
- Typical contiguous loop form:
  - 4x unrolled SIMD block
  - remainder SIMD block
  - scalar tail
  - `ILKernelGenerator.Binary.cs`
  - `ILKernelGenerator.Unary.cs`
  - `ILKernelGenerator.Reduction.cs`
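The three-part contiguous loop shape, written out with `Vector<T>` for illustration. The real kernels emit the equivalent logic as IL; this standalone sketch is only the loop structure:

```csharp
using System;
using System.Numerics;

public static class SimdLoopSketch
{
    public static void Add(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> dst)
    {
        int w = Vector<float>.Count;
        int i = 0;

        // 4x unrolled SIMD block
        for (; i <= dst.Length - 4 * w; i += 4 * w)
        {
            (new Vector<float>(a.Slice(i, w)) + new Vector<float>(b.Slice(i, w))).CopyTo(dst.Slice(i, w));
            (new Vector<float>(a.Slice(i + w, w)) + new Vector<float>(b.Slice(i + w, w))).CopyTo(dst.Slice(i + w, w));
            (new Vector<float>(a.Slice(i + 2 * w, w)) + new Vector<float>(b.Slice(i + 2 * w, w))).CopyTo(dst.Slice(i + 2 * w, w));
            (new Vector<float>(a.Slice(i + 3 * w, w)) + new Vector<float>(b.Slice(i + 3 * w, w))).CopyTo(dst.Slice(i + 3 * w, w));
        }

        // remainder SIMD block (one vector at a time)
        for (; i <= dst.Length - w; i += w)
            (new Vector<float>(a.Slice(i, w)) + new Vector<float>(b.Slice(i, w))).CopyTo(dst.Slice(i, w));

        // scalar tail
        for (; i < dst.Length; i++)
            dst[i] = a[i] + b[i];
    }
}
```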
## 8) Scalar fast paths avoid boxing

- Scalar-scalar ops dispatch through typed delegates with exhaustive `NPTypeCode` switches.
- The pattern is nested type dispatch (lhs -> rhs -> result) rather than object/boxed conversion.

References:

- `DefaultEngine.BinaryOp.cs`
- `DefaultEngine.UnaryOp.cs`
- `DefaultEngine.CompareOp.cs`
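The nested dispatch pattern, sketched with a stand-in typecode enum (`Code`, not the repo's `NPTypeCode`) and only integer cases for brevity: an outer switch on the left typecode and an inner switch on the right typecode select a typed delegate once, so per-call arithmetic never goes through `object`:

```csharp
using System;

public enum Code { Int32, Int64, Double }

public static class ScalarDispatchSketch
{
    public static Func<long, long, long> PickIntegerAdd(Code lhs, Code rhs)
    {
        switch (lhs)
        {
            case Code.Int32:
                switch (rhs)
                {
                    case Code.Int32: return (a, b) => (int)a + (int)b; // int + int
                    case Code.Int64: return (a, b) => a + b;           // widened to long
                }
                break;
            case Code.Int64:
                switch (rhs)
                {
                    case Code.Int32:
                    case Code.Int64: return (a, b) => a + b;
                }
                break;
        }
        // Unsupported combinations are rejected explicitly, never boxed through.
        throw new NotSupportedException($"{lhs} + {rhs}");
    }
}
```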
## 9) Reduction-specific conventions

- Elementwise reductions:
  - empty input returns the op identity (or op-specific behavior at a higher level),
  - scalar short-circuit,
  - contiguous kernel path, strided fallback.
  - `DefaultEngine.ReductionOp.cs`
- Axis reductions:
  - output dims computed by removing the axis,
  - SIMD path usually constrained to an inner-contiguous axis for the fast case,
  - keepdims reshapes handled at engine level after reduction.
  - `DefaultEngine.ReductionOp.cs`
- `var` / `std` axis kernels compute the ddof=0 baseline, then apply the ddof correction in the engine.
  - `Default.Reduction.Var.cs`
  - `Default.Reduction.Std.cs`
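The kernel/engine split for `var`/`std` can be sketched as follows. Standalone and illustrative: the "kernel" computes the ddof=0 population variance, and the "engine" rescales it for the requested ddof:

```csharp
using System;

public static class VarSketch
{
    // Kernel-level baseline: population variance (ddof = 0).
    public static double VarDdof0(ReadOnlySpan<double> xs)
    {
        double mean = 0;
        foreach (var x in xs) mean += x;
        mean /= xs.Length;

        double ss = 0;
        foreach (var x in xs) ss += (x - mean) * (x - mean);
        return ss / xs.Length;
    }

    // Engine-level correction: var_ddof = var_0 * n / (n - ddof).
    public static double Var(ReadOnlySpan<double> xs, int ddof = 0)
        => VarDdof0(xs) * xs.Length / (xs.Length - ddof);

    public static double Std(ReadOnlySpan<double> xs, int ddof = 0)
        => Math.Sqrt(Var(xs, ddof));
}
```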
## 10) NaN-aware behavior uses dedicated logic

- NaN reductions are float/double-specific; non-float types delegate to regular reductions.
- For contiguous float/double inputs, dedicated NaN SIMD helpers are used; the scalar iterator fallback covers everything else.
- keepdims reshaping is handled explicitly after scalar/elementwise NaN reductions.

Reference:

- `src/NumSharp.Core/Backends/Default/Math/Reduction/Default.Reduction.Nan.cs`
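The float-vs-non-float split can be shown with two scalar overloads (a standalone sketch; the repo's SIMD helpers and iterator fallback are not reproduced here): floating-point inputs skip NaNs, while integral dtypes cannot contain NaN and so degenerate to the regular reduction.

```csharp
using System;

public static class NanSketch
{
    // Float/double path: skip NaN elements.
    public static double NanSum(ReadOnlySpan<double> xs)
    {
        double acc = 0;
        foreach (var x in xs)
            if (!double.IsNaN(x)) acc += x;
        return acc;
    }

    // Integral path: NaN cannot occur, so this is just the regular sum.
    public static long NanSum(ReadOnlySpan<long> xs)
    {
        long acc = 0;
        foreach (var x in xs) acc += x;
        return acc;
    }
}
```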
## 11) General-path philosophy

- The general path prioritizes correctness for non-contiguous, sliced, and broadcast layouts.
- Coordinate-based offset computation is acceptable when required by arbitrary strides.
- For complex cases (broadcast + views + type conversion), the correctness path should remain available even when a fast path exists.

Representative references:

- `ILKernelGenerator.MixedType.cs` (`EmitGeneralLoop`)
- `Default.ClipNDArray.cs` (contiguous fast path + general path split)
- `Default.Reduction.CumAdd.cs`
- `Default.Reduction.CumMul.cs`
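Coordinate-based offset computation can be sketched as an odometer walk over an N-D coordinate vector. Standalone and illustrative (strides in elements, not the repo's byte-stride kernels); it handles sliced, transposed, and broadcast (stride-0) layouts at the cost of speed:

```csharp
public static class GeneralPathSketch
{
    // Gathers the elements of a strided view into a fresh contiguous array.
    public static double[] Gather(double[] data, int elemOffset, int[] shape, int[] strides)
    {
        int n = 1;
        foreach (var d in shape) n *= d;

        var result = new double[n];
        var coords = new int[shape.Length];
        for (int i = 0; i < n; i++)
        {
            // offset of the current element from its coordinates + arbitrary strides
            int offset = elemOffset;
            for (int d = 0; d < shape.Length; d++)
                offset += coords[d] * strides[d];
            result[i] = data[offset];

            // odometer increment of the coordinate vector (C order)
            for (int d = shape.Length - 1; d >= 0; d--)
            {
                if (++coords[d] < shape[d]) break;
                coords[d] = 0;
            }
        }
        return result;
    }
}
// A 2x3 C-order buffer viewed transposed (shape {3,2}, strides {1,3})
// gathers as 0 3 1 4 2 5.
```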
## 12) Practical checklist for adding a new core operation

Before merge, verify all of the following:

- NumPy behavior matrix captured first (dtype promotion + edge cases).
- Scalar-scalar behavior implemented and tested.
- Contiguous fast path exists where meaningful.
- Non-contiguous and sliced views work (`shape.offset`, strides).
- Broadcast dimensions (stride=0) are handled correctly.
- Output shape/layout rules match NumPy behavior.
- All supported NumSharp dtypes are either implemented or explicitly rejected.
- Keepdims / axis / negative-axis behavior is explicitly tested.
- Empty-array behavior is explicit (identity / NaN / exception, as appropriate).
- Kernel key includes all generation-sensitive dimensions (types/op/path/flags).
- `TryGet*` fallback behavior is deterministic and test-covered.
- Tests use actual NumPy output as the source of truth.
## 13) Current technical debt markers (worth tracking)

- True chunked SIMD emission for the mixed-type `SimdChunk` path is not implemented yet.
- Comparison `SimdChunk` currently routes to the general kernel.
- Some comments indicate ownership/history items (for example cache-clear ownership) that should be periodically validated against current code.
