Commit 1387d83

Merge pull request #139 from AdaWorldAPI/claude/review-rustynum-optimizations-xHtZI
docs: add rustynum integration plan with compatibility analysis
2 parents e818cc7 + 9c96c13 commit 1387d83

1 file changed

Lines changed: 324 additions & 0 deletions

PLAN-RUSTYNUM-INTEGRATION.md

@@ -0,0 +1,324 @@
# Ladybug-RS × RustyNum Integration Plan

## Pre-Implementation Findings

### Q1: Can they compile into the same binary?

**Yes, with one blocker that requires a fix first.**

| Property | ladybug-rs | rustynum | Compatible? |
|---|---|---|---|
| Edition | 2024 | 2021 | Yes (2024 builds 2021 deps) |
| Rust version | 1.88+ (stable) | nightly (`portable_simd`) | **BLOCKER** |
| Arrow | 57 | 57 | Yes (exact match) |
| DataFusion | 51 | 51 (optional) | Yes |
| Lance | 2.0 | 2.0 (optional) | Yes |
| tokio | 1.49 | 1.x (optional) | Yes |
| rand | 0.9 | 0.8 (oracle only) | Minor conflict |
| SIMD approach | `std::arch` intrinsics | `portable_simd` feature | Different layers |
| Alignment | 64-byte (`repr(align(64))`) | 64-byte (alloc with ALIGNMENT=64) | Yes |
| External deps | arrow/datafusion/lance/rayon | zero (core+blas+mkl+holo) | No conflicts |

**The blocker**: rustynum uses `#![feature(portable_simd)]` in 5 crates (core, rs, blas, mkl, archive). This requires nightly Rust. Ladybug-rs compiles on stable 1.93.

**Fix**: Replace `portable_simd` with `std::arch` intrinsics (same approach ladybug already uses). The rustynum SIMD code already uses explicit AVX-512/AVX2 intrinsics in many hot paths; `portable_simd` is used mainly for convenience types (`f32x16`, `u8x64`). These can be rewritten as `__m512`/`__m256` operations, which work on stable Rust.

**Alternative**: Add `rust-toolchain.toml` to ladybug-rs specifying nightly. Faster to ship but forces all ladybug contributors to nightly.

**Recommendation**: Phase 0 (below) rewrites rustynum to stable `std::arch`. This is a one-time cost that unblocks all downstream users.

### Q2: Does BindSpace + Blackboard borrow-mut work with zero-copy?

**Yes, they operate at different levels and compose naturally.**

| Property | BindSpace | Blackboard |
|---|---|---|
| Purpose | 65K-address cognitive memory | SIMD-aligned computation arena |
| Granularity | Per-fingerprint (2KB each) | Per-named-buffer (arbitrary size) |
| Addressing | `Addr(u16)` → array index (3-5 cycles) | String name → HashMap → raw ptr |
| Ownership | Owns `BindNode` structs (fingerprint + metadata) | Owns raw byte allocations |
| Borrow model | `&mut self` for write, `&self` for read | Split-borrow: multiple `&mut [T]` from `&self` |
| Thread safety | `Send` (single-owner) | `Send` (single-owner, unsafe interior mut) |
| Memory layout | `[u64; 256]` per fingerprint + metadata fields | 64-byte aligned contiguous buffers |

**How they compose (zero-copy path):**

```
BindSpace (owns fingerprints)

├── node.fingerprint: [u64; 256]  ← 2KB, 64-byte aligned (repr(align(64)))

│   For batch BLAS/VML operations:

├── Blackboard::alloc_u8("query", 2048)  ← allocate scratch buffer
├── copy query fingerprint bytes into Blackboard buffer (one memcpy)

├── rustyblas::int8_gemm() / rustymkl::vsexp()  ← operate on Blackboard buffers

└── read results back from Blackboard  ← pointer read, no copy
```

**Key insight**: BindSpace fingerprints are `[u64; 256]` = 2048 bytes = exactly the CogRecord container size in rustynum. The Blackboard's `alloc_u8("containers", N * 2048)` can hold a batch of fingerprints for SIMD bulk operations, then results are read back.

**The split-borrow is critical for GEMM-style ops**: `borrow_3_mut_f32("A", "B", "C")` gives three non-aliasing mutable slices from a shared `&self`, which the borrow checker cannot verify for three named buffers living behind a single struct. This is exactly what's needed for batch Hamming (query in A, corpus chunk in B, distances in C).

**No architectural conflict.** The two systems serve different purposes:
- BindSpace = persistent cognitive addressing (owns the data)
- Blackboard = transient SIMD compute scratch (borrows/copies for computation)

For truly zero-copy batch operations, a thin adapter can provide `&[u8]` views of BindSpace fingerprint ranges directly to rustynum functions that take slices (most of rustyblas level-1 and rustynum-holo phase ops). No Blackboard needed for slice-based APIs.
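
A minimal sketch of such an adapter, assuming a fingerprint type laid out as `[u64; 256]` with 64-byte alignment as described above. The `Fingerprint` struct and the `fingerprints_as_bytes` function are hypothetical names used for illustration; they are not taken from ladybug-rs or rustynum.

```rust
/// Hypothetical stand-in for ladybug's fingerprint storage:
/// 256 × u64 = 2048 bytes, 64-byte aligned.
#[repr(C, align(64))]
pub struct Fingerprint {
    pub data: [u64; 256],
}

/// Size of one fingerprint in bytes.
pub const FP_BYTES: usize = 2048;

// Compile-time check: no padding, so a slice of fingerprints is one
// contiguous run of `len * 2048` bytes.
const _: () = assert!(std::mem::size_of::<Fingerprint>() == FP_BYTES);

/// Borrow a contiguous range of fingerprints as raw bytes without copying.
/// The bytes appear in the machine's native endianness.
pub fn fingerprints_as_bytes(fps: &[Fingerprint]) -> &[u8] {
    // SAFETY: `Fingerprint` is `repr(C)` with no padding (checked above),
    // every byte of a `u64` is a valid `u8`, `u8` has alignment 1, and the
    // returned slice borrows `fps`, so it cannot outlive the data.
    unsafe {
        std::slice::from_raw_parts(fps.as_ptr() as *const u8, std::mem::size_of_val(fps))
    }
}
```

A slice-based kernel can then take `fingerprints_as_bytes(&nodes[a..b])` directly, and the borrow checker keeps BindSpace immutable for as long as the view is alive.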

---

## Phase 0: Stable Rust Port (prerequisite)

**Goal**: Make rustynum compile on stable Rust so it can be a dependency of ladybug-rs.

### 0.1 Replace `portable_simd` with `std::arch` intrinsics

Files requiring changes:
- `rustynum-core/src/lib.rs`: remove `#![feature(portable_simd)]`
- `rustynum-core/src/simd.rs`: rewrite `Simd<f32, 16>` → `__m512` with `_mm512_*`
- `rustynum-rs/src/lib.rs`: remove feature gate
- `rustynum-rs/src/simd_ops.rs`: rewrite portable SIMD ops to `std::arch`
- `rustyblas/src/lib.rs`: remove feature gate
- `rustyblas/src/level1.rs`: sdot/ddot/saxpy etc. already use raw intrinsics in hot paths
- `rustyblas/src/level3.rs`: microkernels already use `_mm512_*` intrinsics
- `rustymkl/src/lib.rs`: remove feature gate
- `rustymkl/src/vml.rs`: vsexp/vsln etc. use `_mm512_*` (already stable-compatible)
- `rustynum-archive/src/lib.rs`: remove feature gate

**Estimate**: The actual SIMD kernels already use `std::arch` intrinsics. `portable_simd` is used for:
1. Type aliases (`f32x16`, `u8x64`): replace with `__m512`, `__m512i`
2. Convenience ops (`.reduce_sum()`): replace with `_mm512_reduce_add_ps()`
3. `Simd::from_slice()`: replace with `_mm512_loadu_ps()`

Most of rustyblas/rustymkl hot paths are already `std::arch`. This is primarily a cleanup.
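
To illustrate the shape of these replacements, here is a hedged sketch of a sum reduction written against stable `std::arch`. It uses the 256-bit AVX2 forms so that it runs on a wide range of CPUs; a 512-bit version would follow the same structure with `__m512`, `_mm512_loadu_ps`, and `_mm512_reduce_add_ps`. The names `sum_f32` and `sum_avx2` are illustrative and are not rustynum functions.

```rust
/// AVX2 kernel: the load plays the role of `Simd::from_slice`, and the
/// final lane sum plays the role of `.reduce_sum()`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    unsafe {
        let mut acc = _mm256_setzero_ps(); // replaces the f32x8 alias
        let chunks = xs.chunks_exact(8);
        let tail = chunks.remainder();
        for c in chunks {
            // Unaligned load of 8 floats, then lane-wise add.
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(c.as_ptr()));
        }
        // Horizontal reduction: spill the 8 lanes and add them up.
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        lanes.iter().sum::<f32>() + tail.iter().sum::<f32>()
    }
}

/// Safe entry point: runtime feature detection with a scalar fallback,
/// so the same binary also runs on CPUs (or targets) without AVX2.
pub fn sum_f32(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified on this CPU just above.
            return unsafe { sum_avx2(xs) };
        }
    }
    xs.iter().sum()
}
```

Keeping the `unsafe` kernel private behind a safe dispatcher is one way to retain the scalar fallback that the testing strategy relies on.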

### 0.2 Add path dependencies to ladybug-rs

```toml
# In ladybug-rs/Cargo.toml [dependencies]
rustynum-core = { path = "../rustynum/rustynum-core", default-features = false, features = ["avx512"], optional = true }
rustyblas = { path = "../rustynum/rustyblas", default-features = false, features = ["avx512"], optional = true }
rustymkl = { path = "../rustynum/rustymkl", default-features = false, features = ["avx512"], optional = true }
rustynum-rs = { path = "../rustynum/rustynum-rs", optional = true }
rustynum-holo = { path = "../rustynum/rustynum-holo", default-features = false, features = ["avx512"], optional = true }

# Feature gate
[features]
rustynum = ["rustynum-core", "rustyblas", "rustymkl", "rustynum-rs", "rustynum-holo"]
# `...` stands for the entries already present in `full`
full = [..., "rustynum"]
```

### 0.3 Verify compilation

```bash
cd ladybug-rs
cargo check --features rustynum   # must pass on stable
cargo test --features rustynum    # must pass
```

---

## Phase 1: Drop-In HDC Acceleration

**Goal**: Replace ladybug's scalar/per-bit loops with rustynum's SIMD-vectorized equivalents.

### 1.1 Bundle acceleration (highest impact)

**Current** (`src/core/vsa.rs:55-93`): Bit-by-bit counting loop, O(N × 16384) with a branch per bit.
```rust
// Current: 16384 iterations × N items × branch per bit
let mut counts = [0u32; FINGERPRINT_U64 * 64];
for item in items {
    for (word_idx, &word) in item.as_raw().iter().enumerate() {
        for bit in 0..64 {
            if (word >> bit) & 1 == 1 {
                counts[word_idx * 64 + bit] += 1;
            }
        }
    }
}
```

**Replace with**: rustynum-rs `CogRecord::bundle()` which uses SIMD ripple-carry majority voting. Expected 17× speedup.

**Implementation**: Add `fn bundle_simd(items: &[Fingerprint]) -> Fingerprint` in `src/core/vsa.rs` gated on `#[cfg(feature = "rustynum")]`, delegating to rustynum's bundle. The existing scalar path remains as fallback.
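
The idea behind ripple-carry majority voting can be sketched in scalar Rust: instead of one counter per bit, keep a few bit-plane words per fingerprint word and add each item with XOR/AND, so 64 counters advance per word operation. This is an illustration of the technique under stated assumptions, not rustynum's implementation: the function names are hypothetical, fingerprints are plain `[u64; 256]` arrays, and a bit is set only on a strict majority (count > N/2), which may differ from ladybug's tie handling.

```rust
pub const WORDS: usize = 256; // 256 × u64 = 16384 bits per fingerprint

/// Reference: per-bit counting, in the style of the current scalar loop.
pub fn bundle_naive(items: &[[u64; WORDS]]) -> [u64; WORDS] {
    let mut counts = vec![0u32; WORDS * 64];
    for item in items {
        for (w, &word) in item.iter().enumerate() {
            for bit in 0..64 {
                counts[w * 64 + bit] += ((word >> bit) & 1) as u32;
            }
        }
    }
    let mut out = [0u64; WORDS];
    for (i, &c) in counts.iter().enumerate() {
        if 2 * c as usize > items.len() {
            out[i / 64] |= 1u64 << (i % 64);
        }
    }
    out
}

/// Bit-sliced version: plane p holds bit p of all 64 per-bit counters
/// of a word, so adding an item is a ripple-carry over the planes.
pub fn bundle_bitsliced(items: &[[u64; WORDS]]) -> [u64; WORDS] {
    let n = items.len();
    let mut out = [0u64; WORDS];
    if n == 0 {
        return out;
    }
    // Number of planes needed to represent counts up to n.
    let planes = (usize::BITS - n.leading_zeros()) as usize;
    let threshold = n / 2;
    let mut counters = vec![[0u64; WORDS]; planes];
    for item in items {
        for w in 0..WORDS {
            let mut carry = item[w];
            for plane in counters.iter_mut() {
                let sum = plane[w] ^ carry; // half-adder sum
                carry &= plane[w];          // half-adder carry
                plane[w] = sum;
                if carry == 0 {
                    break;
                }
            }
        }
    }
    // Bit-sliced comparison `count > threshold`, from the top plane down.
    for w in 0..WORDS {
        let (mut gt, mut eq) = (0u64, !0u64);
        for p in (0..planes).rev() {
            let t = if (threshold >> p) & 1 == 1 { !0u64 } else { 0 };
            let c = counters[p][w];
            gt |= eq & c & !t; // first differing bit: count has 1, threshold has 0
            eq &= !(c ^ t);    // lanes still equal so far
        }
        out[w] = gt;
    }
    out
}
```

The inner loops contain only word-wide XOR/AND/OR, which is what lets a SIMD version process 512 bit positions per instruction; comparing the two functions on random inputs is a direct way to check that the fast path matches the scalar fallback.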

### 1.2 Bind/XOR acceleration

**Current** (`src/core/fingerprint.rs:151-157`): Scalar loop over 256 words.
```rust
for i in 0..FINGERPRINT_U64 {
    result[i] = self.data[i] ^ other.data[i];
}
```

**Replace with**: rustynum-core SIMD XOR (processes 64 bytes per instruction on AVX-512, so 2048 bytes take 32 iterations instead of 256). Expected 8-16× speedup.

### 1.3 Permute acceleration

**Current** (`src/core/fingerprint.rs:167-191`): Bit-by-bit rotation, O(16384) with get_bit/set_bit per position.

**Replace with**: Word-level rotation with carry (32 iterations on AVX-512). Expected 50-100× speedup.
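
Word-level rotation with carry can be sketched in scalar Rust as follows. This is an illustration, not ladybug's or rustynum's code: both function names are hypothetical, and the sketch assumes that bit `j` lives in word `j / 64` at position `j % 64` and that the permutation is a left rotation moving bit `j` to `(j + k) mod 16384`.

```rust
pub const WORDS: usize = 256;
pub const BITS: usize = WORDS * 64; // 16384

/// Reference: move every set bit individually (the O(16384) approach).
pub fn rotate_left_bits(input: &[u64; WORDS], k: usize) -> [u64; WORDS] {
    let mut out = [0u64; WORDS];
    for j in 0..BITS {
        if (input[j / 64] >> (j % 64)) & 1 == 1 {
            let d = (j + k) % BITS;
            out[d / 64] |= 1u64 << (d % 64);
        }
    }
    out
}

/// Word-level version: split k into a whole-word shift and a sub-word
/// shift, then build each output word from two neighbouring input words.
pub fn rotate_left_words(input: &[u64; WORDS], k: usize) -> [u64; WORDS] {
    let k = k % BITS;
    let word_shift = k / 64;
    let bit_shift = (k % 64) as u32;
    let mut out = [0u64; WORDS];
    for i in 0..WORDS {
        // Source word for the low part, and the word before it that
        // supplies the bits carried across the word boundary.
        let cur = input[(i + WORDS - word_shift) % WORDS];
        let prev = input[(i + 2 * WORDS - word_shift - 1) % WORDS];
        out[i] = if bit_shift == 0 {
            cur // avoid a shift by 64, which would overflow
        } else {
            (cur << bit_shift) | (prev >> (64 - bit_shift))
        };
    }
    out
}
```

The loop body is 256 independent word operations, which a 512-bit SIMD version could process eight words at a time, matching the 32-iteration figure above.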

### 1.4 Popcount acceleration

**Current** (`src/core/fingerprint.rs:110-112`): `iter().map(|x| x.count_ones()).sum()`, which is good but not SIMD-vectorized.

**Replace with**: rustynum VPOPCNTDQ path (same as ladybug's `simd.rs` but unified). The ladybug `simd.rs` AVX-512 Hamming is already excellent; the win here is unification, not speedup.

---

## Phase 2: HDR Cascade Pre-Stage

**Goal**: Add rustynum's INT8 quantization and prefilter as an optional cascade stage.

### 2.1 INT8 sketch stage for HDR cascade

**Current** (`src/search/hdr_cascade.rs`): 4-level cascade (1-bit → 4-bit → 8-bit → full popcount), all scalar.

**Add**: INT8 quantized pre-stage using rustyblas `int8_gemm_i32`.

```
New cascade:
  L-1: INT8 batch distance (VNNI vpdpbusd, 64 MACs/instruction)  ← NEW
  L0:  1-bit sketch (existing)
  L1:  4-bit sketch (existing)
  L2:  8-bit sketch (existing)
  L3:  Full popcount (existing)
```
192+
193+
**How**: Quantize fingerprint chunks to i8 vectors, compute batch dot products via VNNI. Candidates below threshold skip to L0. This gives 4× throughput improvement for the initial filtering stage.

### 2.2 Batch Hamming via Blackboard

**Current** (`src/core/simd.rs:197-213`): Per-pair Hamming distance, parallelized with rayon.

**Replace with**: Blackboard-based batch processing. Allocate corpus chunk in Blackboard, run rustynum's `parallel_for_chunks()` with SIMD Hamming. Avoids rayon overhead for small batches and exploits cache locality for large batches.

---

## Phase 3: Statistics & VML

**Goal**: Replace manual math with SIMD-accelerated transcendentals and statistics.

### 3.1 VML for truth value computation

**Target**: `src/nars/truth.rs`, the NARS truth value functions (frequency × confidence).

Currently scalar `f32` operations. Batch truth evaluation across many edges can use rustymkl VML:
- `vsexp()` for exponential decay
- `vsln()` for log-evidence
- `vssqrt()` for confidence intervals

### 3.2 Statistics for temporal search

**Target**: `src/search/temporal.rs` (autocorrelation, cross-correlation, variance).

Replace manual variance/stddev with rustynum-rs statistics (SIMD-accelerated mean, std, variance).

### 3.3 VML for spectroscopy

**Target**: `src/container/spectroscopy/` (frequency analysis).

Replace scalar log/sqrt/sin/cos with rustymkl VML batch operations. 16-wide f32 SIMD instead of one-at-a-time.

---

## Phase 4: Holographic Unification

**Goal**: Connect ladybug's hologram extensions to rustynum-holo's principled implementations.

### 4.1 Phase-space ops for quantum_field.rs

**Target**: `src/extensions/hologram/quantum_field.rs`

Replace ladybug's PhaseTag-based operations with rustynum-holo's phase-space primitives:
- `phase_bind_i8()` / `phase_unbind_i8()`: reversible ADD/SUB mod 256
- `wasserstein_sorted_i8()`: Earth Mover's distance (new capability, not in ladybug)
- `circular_distance_i8()`: wrap-around distance for unsorted vectors
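
The arithmetic behind these primitives is small enough to sketch. The functions below are hypothetical stand-ins written for illustration; the real rustynum-holo signatures are not shown in this plan and may differ (for example in element type or in-place operation). The sketch assumes each phase is one byte on a circle of 256 positions.

```rust
/// Bind: element-wise addition mod 256.
pub fn phase_bind(a: &[u8], b: &[u8]) -> Vec<u8> {
    a.iter().zip(b).map(|(x, y)| x.wrapping_add(*y)).collect()
}

/// Unbind: element-wise subtraction mod 256, the exact inverse of bind.
pub fn phase_unbind(bound: &[u8], b: &[u8]) -> Vec<u8> {
    bound.iter().zip(b).map(|(x, y)| x.wrapping_sub(*y)).collect()
}

/// Wrap-around distance: per element, the shorter way around the
/// 256-position circle, summed over the vector.
pub fn circular_distance(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| {
            let d = x.wrapping_sub(*y) as u32; // 0..=255 going one way
            d.min(256 - d)                     // or the other way if shorter
        })
        .sum()
}
```

Because wrapping addition is a bijection on bytes, `phase_unbind(&phase_bind(a, b), b)` recovers `a` exactly, which is the reversibility property listed above.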

### 4.2 Carrier waveform for embedding encoding

Ladybug's fingerprint→embedding pipeline can use rustynum-holo carrier encoding:
- `carrier_encode()`: frequency-domain concept encoding with VNNI acceleration
- `carrier_decode()`: demodulation via dot product
- Fibonacci spacing avoids harmonic interference, enables ~16-item bundling

### 4.3 Focus gating for scent extraction

Replace ladybug's 5-byte "flavor" extraction (`src/core/scent.rs`) with rustynum-holo's principled focus-of-attention:
- 3D geometric attention (8×8×32 volume)
- 48-bit masks for non-overlapping concept allocation
- `focus_xor()`, `focus_read()`: gated operations

### 4.4 Gabor wavelets for hologram extensions

Rustynum-holo's Gabor wavelet system subsumes phase+carrier+focus+5D projection into spatially-localized frequency encoding. This is the eventual target for ladybug's hologram extension modules.

### 4.5 Organic X-Trans for write pipeline

Ladybug's separate write → clean → learn pipeline can be replaced by rustynum-oracle's organic model where write=clean=learn in one pass, using X-Trans Fibonacci sampling.

---

## Phase 5: Foundations (LAPACK/FFT/GEMM)

**Goal**: Use rustymkl for any dense linear algebra ladybug needs.

### 5.1 LAPACK QR for orthogonalization

Any Gram-Schmidt operations in learning paths can use `sgeqrf`/`dgeqrf` from rustymkl.

### 5.2 FFT for spectroscopy

`fft_f32`/`fft_f64` from rustymkl replaces any DFT needs in spectroscopy or frequency analysis.

### 5.3 GEMM for dense matrix operations

If ladybug ever needs dense matrix multiply (e.g., batch embedding transforms), rustyblas `sgemm` delivers 115 GFLOPS at 1024×1024.

---

## Implementation Order & Dependencies

```
Phase 0.1 (stable port)          ← PREREQUISITE for everything

Phase 0.2 (Cargo.toml wiring)

    ├── Phase 1.1 (bundle)        ← highest user-visible impact
    ├── Phase 1.2 (bind/xor)      ← simple, high frequency
    ├── Phase 1.3 (permute)       ← moderate impact
    └── Phase 1.4 (popcount)      ← unification, not speedup

    ├── Phase 2.1 (INT8 cascade)  ← search throughput
    ├── Phase 2.2 (batch hamming) ← memory efficiency

    ├── Phase 3.1 (VML truth)     ← NARS acceleration
    ├── Phase 3.2 (temporal stats) ← search quality
    └── Phase 3.3 (VML spectro)   ← analysis speed

    ├── Phase 4.1-4.5 (holographic) ← deep integration
    └── Phase 5.1-5.3 (foundations) ← as-needed
```

## Testing Strategy

Each phase must:
1. Run all existing ladybug-rs tests (`cargo test`) with no regressions
2. Add `#[cfg(feature = "rustynum")]` + `#[cfg(not(feature = "rustynum"))]` dual paths
3. Add comparative benchmarks (existing vs rustynum) in `benches/`
4. Verify SIMD correctness: scalar fallback must produce identical results

## Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| `portable_simd` removal breaks rustynum tests | Medium | High | Run full rustynum test suite after port |
| Fingerprint size mismatch (16K-bit fingerprint vs 2048-byte CogRecord) | Low | Medium | Adapt at boundary: FP = 2 × CogRecord containers |
| Nightly-only users of rustynum break | Low | Low | Keep nightly feature gate as optional |
| Arrow version drift | Low | Medium | Both at 57 now; pin together |
| rand 0.8 vs 0.9 conflict | Low | Low | Update rustynum-oracle to rand 0.9 |
