| 1 | +# Ladybug-RS × RustyNum Integration Plan |
| 2 | + |
| 3 | +## Pre-Implementation Findings |
| 4 | + |
| 5 | +### Q1: Can they compile into the same binary? |
| 6 | + |
| 7 | +**Yes, with one blocker that requires a fix first.** |
| 8 | + |
| 9 | +| Property | ladybug-rs | rustynum | Compatible? | |
| 10 | +|---|---|---|---| |
| 11 | +| Edition | 2024 | 2021 | Yes (2024 builds 2021 deps) | |
| 12 | +| Rust version | 1.88+ (stable) | nightly (`portable_simd`) | **BLOCKER** | |
| 13 | +| Arrow | 57 | 57 | Yes (exact match) | |
| 14 | +| DataFusion | 51 | 51 (optional) | Yes | |
| 15 | +| Lance | 2.0 | 2.0 (optional) | Yes | |
| 16 | +| tokio | 1.49 | 1.x (optional) | Yes | |
| 17 | +| rand | 0.9 | 0.8 (oracle only) | Minor conflict | |
| 18 | +| SIMD approach | `std::arch` intrinsics | `portable_simd` feature | Different layers | |
| 19 | +| Alignment | 64-byte (`repr(align(64))`) | 64-byte (alloc with ALIGNMENT=64) | Yes | |
| 20 | +| External deps | arrow/datafusion/lance/rayon | zero (core+blas+mkl+holo) | No conflicts | |
| 21 | + |
| 22 | +**The blocker**: rustynum uses `#![feature(portable_simd)]` in 5 crates (core, rs, blas, mkl, archive). This requires nightly Rust. Ladybug-rs compiles on stable 1.93. |
| 23 | + |
| 24 | +**Fix**: Replace `portable_simd` with `std::arch` intrinsics (the approach ladybug already uses). The rustynum SIMD code already uses explicit AVX-512/AVX2 intrinsics in many hot paths; `portable_simd` serves mainly for convenience types (`f32x16`, `u8x64`). These can be rewritten as `__m512`/`__m256` operations, which work on stable Rust. Note that the `_mm512_*` intrinsics were stabilized in Rust 1.89, so the stable floor for this path is 1.89 rather than the 1.88 listed in the table. |
| 25 | + |
| 26 | +**Alternative**: Add `rust-toolchain.toml` to ladybug-rs specifying nightly. Faster to ship but forces all ladybug contributors to nightly. |
| 27 | + |
| 28 | +**Recommendation**: Phase 0 (below) rewrites rustynum to stable `std::arch`. This is a one-time cost that unblocks all downstream users. |
| 29 | + |
| 30 | +### Q2: Does BindSpace + Blackboard borrow-mut work with zero-copy? |
| 31 | + |
| 32 | +**Yes, they operate at different levels and compose naturally.** |
| 33 | + |
| 34 | +| Property | BindSpace | Blackboard | |
| 35 | +|---|---|---| |
| 36 | +| Purpose | 65K-address cognitive memory | SIMD-aligned computation arena | |
| 37 | +| Granularity | Per-fingerprint (2KB each) | Per-named-buffer (arbitrary size) | |
| 38 | +| Addressing | `Addr(u16)` → array index (3-5 cycles) | String name → HashMap → raw ptr | |
| 39 | +| Ownership | Owns `BindNode` structs (fingerprint + metadata) | Owns raw byte allocations | |
| 40 | +| Borrow model | `&mut self` for write, `&self` for read | Split-borrow: multiple `&mut [T]` from `&self` | |
| 41 | +| Thread safety | `Send` (single-owner) | `Send` (single-owner, unsafe interior mut) | |
| 42 | +| Memory layout | `[u64; 256]` per fingerprint + metadata fields | 64-byte aligned contiguous buffers | |
| 43 | + |
| 44 | +**How they compose (one copy in, zero-copy out):** |
| 45 | + |
| 46 | +``` |
| 47 | +BindSpace (owns fingerprints) |
| 48 | + │ |
| 49 | + ├── node.fingerprint: [u64; 256] ← 2KB, 64-byte aligned (repr(align(64))) |
| 50 | + │ |
| 51 | + │ For batch BLAS/VML operations: |
| 52 | + │ |
| 53 | + ├── Blackboard::alloc_u8("query", 2048) ← allocate scratch buffer |
| 54 | + ├── copy query fingerprint bytes into Blackboard buffer (one memcpy) |
| 55 | + │ |
| 56 | + ├── rustyblas::int8_gemm() / rustymkl::vsexp() ← operate on Blackboard buffers |
| 57 | + │ |
| 58 | + └── read results back from Blackboard ← pointer read, no copy |
| 59 | +``` |
| 60 | + |
| 61 | +**Key insight**: BindSpace fingerprints are `[u64; 256]` = 2048 bytes = exactly the CogRecord container size in rustynum. The Blackboard's `alloc_u8("containers", N * 2048)` can hold a batch of fingerprints for SIMD bulk operations, then results are read back. |
| 62 | + |
| 63 | +**The split-borrow is critical for GEMM-style ops**: `borrow_3_mut_f32("A", "B", "C")` returns three non-aliasing mutable slices from a shared `&self`. Safe Rust cannot express this through three `&mut self` lookups on a single struct, because each returned slice would keep the whole struct mutably borrowed. This is exactly what batch Hamming needs (query in A, corpus chunk in B, distances in C). |
| 64 | + |
| 65 | +**No architectural conflict.** The two systems serve different purposes: |
| 66 | +- BindSpace = persistent cognitive addressing (owns the data) |
| 67 | +- Blackboard = transient SIMD compute scratch (borrows/copies for computation) |
| 68 | + |
| 69 | +For truly zero-copy batch operations, a thin adapter can provide `&[u8]` views of BindSpace fingerprint ranges directly to rustynum functions that take slices (most of rustyblas level-1 and rustynum-holo phase ops). No Blackboard needed for slice-based APIs. |
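
A minimal sketch of such a view adapter, assuming the fingerprints sit contiguously in memory as `[u64; 256]` values. The name `fingerprints_as_bytes` is hypothetical and belongs to neither crate, and the bytes come out in native endianness.

```rust
/// Words per fingerprint, as stated above: [u64; 256] = 2048 bytes.
const FINGERPRINT_U64: usize = 256;

/// Reinterpret a contiguous run of fingerprints as bytes without copying.
/// Hypothetical helper for illustration; not an existing ladybug-rs API.
pub fn fingerprints_as_bytes(fps: &[[u64; FINGERPRINT_U64]]) -> &[u8] {
    let len = std::mem::size_of_val(fps);
    // SAFETY: arrays of u64 contain no padding, u8 has alignment 1, and the
    // returned slice borrows `fps`, so it cannot outlive the underlying data.
    unsafe { std::slice::from_raw_parts(fps.as_ptr() as *const u8, len) }
}
```

Because the returned slice shares the lifetime of the input, the borrow checker still prevents BindSpace writes while a rustynum call holds the view.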
| 70 | + |
| 71 | +--- |
| 72 | + |
| 73 | +## Phase 0 — Stable Rust Port (prerequisite) |
| 74 | + |
| 75 | +**Goal**: Make rustynum compile on stable Rust so it can be a dependency of ladybug-rs. |
| 76 | + |
| 77 | +### 0.1 Replace `portable_simd` with `std::arch` intrinsics |
| 78 | + |
| 79 | +Files requiring changes: |
| 80 | +- `rustynum-core/src/lib.rs` — remove `#![feature(portable_simd)]` |
| 81 | +- `rustynum-core/src/simd.rs` — rewrite `Simd<f32, 16>` → `__m512` with `_mm512_*` |
| 82 | +- `rustynum-rs/src/lib.rs` — remove feature gate |
| 83 | +- `rustynum-rs/src/simd_ops.rs` — rewrite portable SIMD ops to `std::arch` |
| 84 | +- `rustyblas/src/lib.rs` — remove feature gate |
| 85 | +- `rustyblas/src/level1.rs` — sdot/ddot/saxpy etc. already use raw intrinsics in hot paths |
| 86 | +- `rustyblas/src/level3.rs` — microkernels already use `_mm512_*` intrinsics |
| 87 | +- `rustymkl/src/lib.rs` — remove feature gate |
| 88 | +- `rustymkl/src/vml.rs` — vsexp/vsln etc. use `_mm512_*` (already stable-compatible) |
| 89 | +- `rustynum-archive/src/lib.rs` — remove feature gate |
| 90 | + |
| 91 | +**Estimate**: The actual SIMD kernels already use `std::arch` intrinsics. `portable_simd` is used for: |
| 92 | +1. Type aliases (`f32x16`, `u8x64`) — replace with `__m512`, `__m512i` |
| 93 | +2. Convenience ops (`.reduce_sum()`) — replace with `_mm512_reduce_add_ps()` |
| 94 | +3. `Simd::from_slice()` — replace with `_mm512_loadu_ps()` |
| 95 | + |
| 96 | +Most of rustyblas/rustymkl hot paths are already `std::arch`. This is primarily a cleanup. |
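
To make the three replacements concrete, here is a hedged sketch of a sum reduction written against `std::arch` only. It uses the 256-bit AVX2 analogues (`_mm256_loadu_ps`, `_mm256_add_ps`) of the `_mm512_*` calls listed above so that it builds on older stable toolchains; the function names are illustrative and are not rustynum APIs.

```rust
/// Scalar reference and fallback path.
pub fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    unsafe {
        let mut acc = _mm256_setzero_ps();
        let chunks = xs.chunks_exact(8);
        let tail = chunks.remainder();
        for c in chunks {
            // Stands in for `Simd::<f32, 8>::from_slice(c)`.
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(c.as_ptr()));
        }
        // Stands in for `.reduce_sum()`: spill the lanes and add them up.
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        lanes.iter().sum::<f32>() + tail.iter().sum::<f32>()
    }
}

/// Dispatch at runtime; falls back to scalar when AVX2 is unavailable.
pub fn sum_f32(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: the AVX2 requirement was checked on the line above.
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_scalar(xs)
}
```

The SIMD path adds in a different order than the scalar loop, so results on arbitrary floats can differ in the last bits; tests comparing the two paths should use a tolerance or exactly representable inputs.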
| 97 | + |
| 98 | +### 0.2 Add path dependencies to ladybug-rs |
| 99 | + |
| 100 | +```toml |
| 101 | +# In ladybug-rs/Cargo.toml [dependencies] |
| 102 | +rustynum-core = { path = "../rustynum/rustynum-core", default-features = false, features = ["avx512"], optional = true } |
| 103 | +rustyblas = { path = "../rustynum/rustyblas", default-features = false, features = ["avx512"], optional = true } |
| 104 | +rustymkl = { path = "../rustynum/rustymkl", default-features = false, features = ["avx512"], optional = true } |
| 105 | +rustynum-rs = { path = "../rustynum/rustynum-rs", optional = true } |
| 106 | +rustynum-holo = { path = "../rustynum/rustynum-holo", default-features = false, features = ["avx512"], optional = true } |
| 107 | + |
| 108 | +# Feature gate |
| 109 | +[features] |
| 110 | +rustynum = ["rustynum-core", "rustyblas", "rustymkl", "rustynum-rs", "rustynum-holo"] |
| 111 | +full = [..., "rustynum"] |
| 112 | +``` |
| 113 | + |
| 114 | +### 0.3 Verify compilation |
| 115 | + |
| 116 | +```bash |
| 117 | +cd ladybug-rs |
| 118 | +cargo check --features rustynum # must pass on stable |
| 119 | +cargo test --features rustynum # must pass |
| 120 | +``` |
| 121 | + |
| 122 | +--- |
| 123 | + |
| 124 | +## Phase 1 — Drop-In HDC Acceleration |
| 125 | + |
| 126 | +**Goal**: Replace ladybug's scalar/per-bit loops with rustynum's SIMD-vectorized equivalents. |
| 127 | + |
| 128 | +### 1.1 Bundle acceleration (highest impact) |
| 129 | + |
| 130 | +**Current** (`src/core/vsa.rs:55-93`): Bit-by-bit counting loop — O(N × 16384) with branch per bit. |
| 131 | +```rust |
| 132 | +// Current: 16384 iterations × N items × branch per bit |
| 133 | +let mut counts = [0u32; FINGERPRINT_U64 * 64]; |
| 134 | +for item in items { |
| 135 | + for (word_idx, &word) in item.as_raw().iter().enumerate() { |
| 136 | + for bit in 0..64 { |
| 137 | + if (word >> bit) & 1 == 1 { |
| 138 | + counts[word_idx * 64 + bit] += 1; |
| 139 | + } |
| 140 | + } |
| 141 | + } |
| 142 | +} |
| 143 | +``` |
| 144 | + |
| 145 | +**Replace with**: rustynum-rs `CogRecord::bundle()` which uses SIMD ripple-carry majority voting. Expected 17× speedup. |
| 146 | + |
| 147 | +**Implementation**: Add `fn bundle_simd(items: &[Fingerprint]) -> Fingerprint` in `src/core/vsa.rs` gated on `#[cfg(feature = "rustynum")]`, delegating to rustynum's bundle. The existing scalar path remains as fallback. |
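
The idea behind ripple-carry majority voting can be sketched on plain `u64` words: keep the per-bit counters as bit planes and add each item with a ripple-carry adder that handles 64 positions at once. This is an illustrative model, not rustynum's implementation. The strict-majority rule (ties resolve to 0) and the 16-plane limit are assumptions, since the source excerpt stops before the thresholding step.

```rust
const WORDS: usize = 256; // one fingerprint = [u64; 256]
const PLANES: usize = 16; // counter width; supports fewer than 65,536 items

/// Reference: per-bit counting, mirroring the current scalar loop.
pub fn bundle_scalar(items: &[[u64; WORDS]]) -> [u64; WORDS] {
    let mut counts = vec![0u32; WORDS * 64];
    for item in items {
        for (w, &word) in item.iter().enumerate() {
            for bit in 0..64 {
                counts[w * 64 + bit] += ((word >> bit) & 1) as u32;
            }
        }
    }
    let mut out = [0u64; WORDS];
    for (i, &c) in counts.iter().enumerate() {
        if c as usize > items.len() / 2 {
            out[i / 64] |= 1u64 << (i % 64);
        }
    }
    out
}

/// Bit-sliced version: planes[k] holds bit k of all 64 counters in a word.
pub fn bundle_bitsliced(items: &[[u64; WORDS]]) -> [u64; WORDS] {
    assert!(items.len() < (1 << PLANES));
    let threshold = items.len() / 2; // output bit is set iff count > threshold
    let mut out = [0u64; WORDS];
    for w in 0..WORDS {
        let mut planes = [0u64; PLANES];
        for item in items {
            // Ripple-carry add of a 1-bit value into 64 counters at once.
            let mut carry = item[w];
            for p in planes.iter_mut() {
                let sum = *p ^ carry;
                carry &= *p;
                *p = sum;
                if carry == 0 {
                    break;
                }
            }
        }
        // Bit-sliced "count > threshold", scanning from the top plane down.
        let (mut gt, mut eq) = (0u64, !0u64);
        for k in (0..PLANES).rev() {
            let t = if (threshold >> k) & 1 == 1 { !0u64 } else { 0 };
            gt |= eq & planes[k] & !t;
            eq &= !(planes[k] ^ t);
        }
        out[w] = gt;
    }
    out
}
```

The inner loops contain no per-bit branches, which is what makes the same scheme map onto 512-bit registers.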
| 148 | + |
| 149 | +### 1.2 Bind/XOR acceleration |
| 150 | + |
| 151 | +**Current** (`src/core/fingerprint.rs:151-157`): Scalar loop over 256 words. |
| 152 | +```rust |
| 153 | +for i in 0..FINGERPRINT_U64 { |
| 154 | + result[i] = self.data[i] ^ other.data[i]; |
| 155 | +} |
| 156 | +``` |
| 157 | + |
| 158 | +**Replace with**: rustynum-core SIMD XOR. AVX-512 processes 64 bytes (8 words) per instruction, so the 2048-byte fingerprint takes 32 iterations instead of 256. Expected 8-16× speedup. |
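
A portable stand-in for the blocked XOR, written in plain stable Rust. It walks the fingerprint in 8-word blocks (one 512-bit register's worth) and returns the block count to make the arithmetic visible: 256 words / 8 words per block = 32 blocks. The function name is illustrative, and whether the compiler vectorizes the inner loop depends on target flags.

```rust
const FINGERPRINT_U64: usize = 256;
const LANES: usize = 8; // 8 × u64 = 64 bytes, the width of one 512-bit register

/// XOR two fingerprints block by block; also returns the number of blocks.
pub fn xor_blocked(
    a: &[u64; FINGERPRINT_U64],
    b: &[u64; FINGERPRINT_U64],
) -> ([u64; FINGERPRINT_U64], usize) {
    let mut out = [0u64; FINGERPRINT_U64];
    let mut blocks = 0;
    for ((o, x), y) in out
        .chunks_exact_mut(LANES)
        .zip(a.chunks_exact(LANES))
        .zip(b.chunks_exact(LANES))
    {
        // One iteration here corresponds to one 512-bit XOR instruction.
        for i in 0..LANES {
            o[i] = x[i] ^ y[i];
        }
        blocks += 1;
    }
    (out, blocks)
}
```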
| 159 | + |
| 160 | +### 1.3 Permute acceleration |
| 161 | + |
| 162 | +**Current** (`src/core/fingerprint.rs:167-191`): Bit-by-bit rotation — O(16384) with get_bit/set_bit per position. |
| 163 | + |
| 164 | +**Replace with**: Word-level rotation with carry (32 iterations on AVX-512). Expected 50-100× speedup. |
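
A scalar sketch of word-level rotation with carry, checked against a bit-by-bit reference. It assumes bit `i` lives in word `i / 64` at position `i % 64` and that permutation is a left rotation; ladybug's actual bit order and direction may differ. This version still loops over 256 words, and the 32-iteration figure applies once 8 words are processed per instruction.

```rust
const WORDS: usize = 256;
const BITS: usize = WORDS * 64; // 16,384

/// Rotate the 16,384-bit vector left by `k` bits using whole-word shifts.
pub fn rotate_left_words(data: &[u64; WORDS], k: usize) -> [u64; WORDS] {
    let k = k % BITS;
    let (word_shift, bit_shift) = (k / 64, (k % 64) as u32);
    let mut out = [0u64; WORDS];
    for j in 0..WORDS {
        let dst = (j + word_shift) % WORDS;
        // Low part stays in the destination word.
        out[dst] |= data[j] << bit_shift;
        if bit_shift != 0 {
            // Bits shifted out the top carry into the next word.
            out[(dst + 1) % WORDS] |= data[j] >> (64 - bit_shift);
        }
    }
    out
}

/// Reference: move every set bit individually.
pub fn rotate_left_bitwise(data: &[u64; WORDS], k: usize) -> [u64; WORDS] {
    let mut out = [0u64; WORDS];
    for i in 0..BITS {
        if (data[i / 64] >> (i % 64)) & 1 == 1 {
            let j = (i + k) % BITS;
            out[j / 64] |= 1u64 << (j % 64);
        }
    }
    out
}
```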
| 165 | + |
| 166 | +### 1.4 Popcount acceleration |
| 167 | + |
| 168 | +**Current** (`src/core/fingerprint.rs:110-112`): `iter().map(|x| x.count_ones()).sum()` — good but not SIMD-vectorized. |
| 169 | + |
| 170 | +**Replace with**: rustynum VPOPCNTDQ path (same as ladybug's `simd.rs` but unified). The ladybug `simd.rs` AVX-512 Hamming is already excellent — the win here is unification, not speedup. |
| 171 | + |
| 172 | +--- |
| 173 | + |
| 174 | +## Phase 2 — HDR Cascade Pre-Stage |
| 175 | + |
| 176 | +**Goal**: Add rustynum's INT8 quantization and prefilter as an optional cascade stage. |
| 177 | + |
| 178 | +### 2.1 INT8 sketch stage for HDR cascade |
| 179 | + |
| 180 | +**Current** (`src/search/hdr_cascade.rs`): 4-level cascade (1-bit → 4-bit → 8-bit → full popcount), all scalar. |
| 181 | + |
| 182 | +**Add**: INT8 quantized pre-stage using rustyblas `int8_gemm_i32`. |
| 183 | + |
| 184 | +``` |
| 185 | +New cascade: |
| 186 | + L-1: INT8 batch distance (VNNI vpdpbusd, 64 MACs/instruction) ← NEW |
| 187 | + L0: 1-bit sketch (existing) |
| 188 | + L1: 4-bit sketch (existing) |
| 189 | + L2: 8-bit sketch (existing) |
| 190 | + L3: Full popcount (existing) |
| 191 | +``` |
| 192 | + |
| 193 | +**How**: Quantize fingerprint chunks to i8 vectors and compute batch dot products via VNNI. Candidates whose INT8 distance falls below the threshold proceed to L0; the rest are dropped before any sketch work. Expected gain: roughly 4× throughput for the initial filtering stage. |
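
For testing a VNNI path, a scalar model of one 512-bit `vpdpbusd` is useful as ground truth. The sketch below follows the instruction's documented semantics: unsigned bytes from the first operand times signed bytes from the second, four products summed into each of 16 `i32` lanes (64 MACs in total). The function name is illustrative, and how fingerprints get quantized to i8 is not specified here.

```rust
/// Scalar reference for a 512-bit `vpdpbusd`:
/// acc[lane] += sum over j in 0..4 of a[4*lane + j] (u8) * b[4*lane + j] (i8).
pub fn vpdpbusd_ref(acc: &mut [i32; 16], a: &[u8; 64], b: &[i8; 64]) {
    for lane in 0..16 {
        let mut sum = acc[lane];
        for j in 0..4 {
            let idx = lane * 4 + j;
            // Zero-extend the u8 operand, sign-extend the i8 operand;
            // the non-saturating form wraps on overflow.
            sum = sum.wrapping_add(a[idx] as i32 * b[idx] as i32);
        }
        acc[lane] = sum;
    }
}
```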
| 194 | + |
| 195 | +### 2.2 Batch Hamming via Blackboard |
| 196 | + |
| 197 | +**Current** (`src/core/simd.rs:197-213`): Per-pair Hamming distance, parallelized with rayon. |
| 198 | + |
| 199 | +**Replace with**: Blackboard-based batch processing. Allocate the corpus chunk in the Blackboard, then run rustynum's `parallel_for_chunks()` with SIMD Hamming. This avoids rayon overhead for small batches and exploits cache locality for large batches. |
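
The data layout this implies can be sketched without the Blackboard itself: one query, one contiguous corpus buffer, one output buffer. The scalar kernel below is only a model of that layout, with a hypothetical function name; the real path would hand the same three buffers to the SIMD kernel.

```rust
const FINGERPRINT_U64: usize = 256;

/// Hamming distance from `query` to every fingerprint packed in `corpus`.
/// `corpus` holds `distances.len()` fingerprints back to back.
pub fn batch_hamming(
    query: &[u64; FINGERPRINT_U64],
    corpus: &[u64],
    distances: &mut [u32],
) {
    assert_eq!(corpus.len(), distances.len() * FINGERPRINT_U64);
    for (fp, d) in corpus.chunks_exact(FINGERPRINT_U64).zip(distances.iter_mut()) {
        // XOR then popcount, word by word, over one 2048-byte fingerprint.
        *d = fp
            .iter()
            .zip(query.iter())
            .map(|(a, b)| (a ^ b).count_ones())
            .sum();
    }
}
```

Keeping the corpus contiguous means each fingerprint is read once, in order, which is where the cache-locality benefit for large batches comes from.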
| 200 | + |
| 201 | +--- |
| 202 | + |
| 203 | +## Phase 3 — Statistics & VML |
| 204 | + |
| 205 | +**Goal**: Replace manual math with SIMD-accelerated transcendentals and statistics. |
| 206 | + |
| 207 | +### 3.1 VML for truth value computation |
| 208 | + |
| 209 | +**Target**: `src/nars/truth.rs` — NARS truth value functions (frequency × confidence). |
| 210 | + |
| 211 | +Currently scalar `f32` operations. Batch truth evaluation across many edges can use rustymkl VML: |
| 212 | +- `vsexp()` for exponential decay |
| 213 | +- `vsln()` for log-evidence |
| 214 | +- `vssqrt()` for confidence intervals |
| 215 | + |
| 216 | +### 3.2 Statistics for temporal search |
| 217 | + |
| 218 | +**Target**: `src/search/temporal.rs` — autocorrelation, cross-correlation, variance. |
| 219 | + |
| 220 | +Replace manual variance/stddev with rustynum-rs statistics (SIMD-accelerated mean, std, variance). |
| 221 | + |
| 222 | +### 3.3 VML for spectroscopy |
| 223 | + |
| 224 | +**Target**: `src/container/spectroscopy/` — frequency analysis. |
| 225 | + |
| 226 | +Replace scalar log/sqrt/sin/cos with rustymkl VML batch operations. 16-wide f32 SIMD instead of one-at-a-time. |
| 227 | + |
| 228 | +--- |
| 229 | + |
| 230 | +## Phase 4 — Holographic Unification |
| 231 | + |
| 232 | +**Goal**: Connect ladybug's hologram extensions to rustynum-holo's principled implementations. |
| 233 | + |
| 234 | +### 4.1 Phase-space ops for quantum_field.rs |
| 235 | + |
| 236 | +**Target**: `src/extensions/hologram/quantum_field.rs` |
| 237 | + |
| 238 | +Replace ladybug's PhaseTag-based operations with rustynum-holo's phase-space primitives: |
| 239 | +- `phase_bind_i8()` / `phase_unbind_i8()` — reversible ADD/SUB mod 256 |
| 240 | +- `wasserstein_sorted_i8()` — Earth Mover's distance (new capability, not in ladybug) |
| 241 | +- `circular_distance_i8()` — wrap-around distance for unsorted vectors |
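
The semantics described for the first and third primitives can be modelled in scalar Rust. This is a sketch under assumptions: the rustynum-holo signatures are not shown in this plan, so the `_sketch` names, the slice-based signatures, and the summed `u32` distance are illustrative only.

```rust
/// Bind: element-wise addition mod 256 (wrapping i8 add).
pub fn phase_bind_sketch(a: &[i8], key: &[i8]) -> Vec<i8> {
    a.iter().zip(key).map(|(&x, &k)| x.wrapping_add(k)).collect()
}

/// Unbind: element-wise subtraction mod 256; exactly inverts bind.
pub fn phase_unbind_sketch(bound: &[i8], key: &[i8]) -> Vec<i8> {
    bound.iter().zip(key).map(|(&x, &k)| x.wrapping_sub(k)).collect()
}

/// Wrap-around distance on the 256-point circle, summed over elements.
pub fn circular_distance_sketch(a: &[i8], b: &[i8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = (x as u8).wrapping_sub(y as u8);
            // Take the shorter way around the circle.
            u32::from(d.min(d.wrapping_neg()))
        })
        .sum()
}
```

Wrapping arithmetic is what makes bind reversible: no information is lost at the ±128 boundary, so unbinding with the same key always recovers the input.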
| 242 | + |
| 243 | +### 4.2 Carrier waveform for embedding encoding |
| 244 | + |
| 245 | +Ladybug's fingerprint→embedding pipeline can use rustynum-holo carrier encoding: |
| 246 | +- `carrier_encode()` — frequency-domain concept encoding with VNNI acceleration |
| 247 | +- `carrier_decode()` — demodulation via dot product |
| 248 | +- Fibonacci spacing avoids harmonic interference, enables ~16-item bundling |
| 249 | + |
| 250 | +### 4.3 Focus gating for scent extraction |
| 251 | + |
| 252 | +Replace ladybug's 5-byte "flavor" extraction (`src/core/scent.rs`) with rustynum-holo's principled focus-of-attention: |
| 253 | +- 3D geometric attention (8×8×32 volume) |
| 254 | +- 48-bit masks for non-overlapping concept allocation |
| 255 | +- `focus_xor()`, `focus_read()` — gated operations |
| 256 | + |
| 257 | +### 4.4 Gabor wavelets for hologram extensions |
| 258 | + |
| 259 | +Rustynum-holo's Gabor wavelet system subsumes phase+carrier+focus+5D projection into spatially-localized frequency encoding. This is the eventual target for ladybug's hologram extension modules. |
| 260 | + |
| 261 | +### 4.5 Organic X-Trans for write pipeline |
| 262 | + |
| 263 | +Ladybug's separate write → clean → learn pipeline can be replaced by rustynum-oracle's organic model where write=clean=learn in one pass, using X-Trans Fibonacci sampling. |
| 264 | + |
| 265 | +--- |
| 266 | + |
| 267 | +## Phase 5 — Foundations (LAPACK/FFT/GEMM) |
| 268 | + |
| 269 | +**Goal**: Use rustymkl for any dense linear algebra ladybug needs. |
| 270 | + |
| 271 | +### 5.1 LAPACK QR for orthogonalization |
| 272 | + |
| 273 | +Any Gram-Schmidt operations in learning paths can use `sgeqrf`/`dgeqrf` from rustymkl. |
| 274 | + |
| 275 | +### 5.2 FFT for spectroscopy |
| 276 | + |
| 277 | +`fft_f32`/`fft_f64` from rustymkl replaces any DFT needs in spectroscopy or frequency analysis. |
| 278 | + |
| 279 | +### 5.3 GEMM for dense matrix operations |
| 280 | + |
| 281 | +If ladybug ever needs dense matrix multiply (e.g., batch embedding transforms), rustyblas `sgemm` delivers 115 GFLOPS at 1024×1024. |
| 282 | + |
| 283 | +--- |
| 284 | + |
| 285 | +## Implementation Order & Dependencies |
| 286 | + |
| 287 | +``` |
| 288 | +Phase 0.1 (stable port) ← PREREQUISITE for everything |
| 289 | + │ |
| 290 | +Phase 0.2 (Cargo.toml wiring) |
| 291 | + │ |
| 292 | + ├── Phase 1.1 (bundle) ← highest user-visible impact |
| 293 | + ├── Phase 1.2 (bind/xor) ← simple, high frequency |
| 294 | + ├── Phase 1.3 (permute) ← moderate impact |
| 295 | + └── Phase 1.4 (popcount) ← unification, not speedup |
| 296 | + │ |
| 297 | + ├── Phase 2.1 (INT8 cascade) ← search throughput |
| 298 | + ├── Phase 2.2 (batch hamming) ← memory efficiency |
| 299 | + │ |
| 300 | + ├── Phase 3.1 (VML truth) ← NARS acceleration |
| 301 | + ├── Phase 3.2 (temporal stats) ← search quality |
| 302 | + └── Phase 3.3 (VML spectro) ← analysis speed |
| 303 | + │ |
| 304 | + ├── Phase 4.1-4.5 (holographic) ← deep integration |
| 305 | + └── Phase 5.1-5.3 (foundations) ← as-needed |
| 306 | +``` |
| 307 | + |
| 308 | +## Testing Strategy |
| 309 | + |
| 310 | +Each phase must: |
| 311 | +1. Run all existing ladybug-rs tests (`cargo test`) — no regressions |
| 312 | +2. Add `#[cfg(feature = "rustynum")]` + `#[cfg(not(feature = "rustynum"))]` dual paths |
| 313 | +3. Add comparative benchmarks (existing vs rustynum) in `benches/` |
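| 314 | +4. Verify SIMD correctness against the scalar fallback: bit-identical results for bitwise and integer ops (bundle, bind, permute, popcount), and agreement within a stated tolerance for floating-point VML and statistics |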
| 315 | + |
| 316 | +## Risk Assessment |
| 317 | + |
| 318 | +| Risk | Likelihood | Impact | Mitigation | |
| 319 | +|---|---|---|---| |
| 320 | +| `portable_simd` removal breaks rustynum tests | Medium | High | Run full rustynum test suite after port | |
| 321 | +| Fingerprint size mismatch (16,384-bit fingerprint vs 2048-byte CogRecord container) | Low | Medium | Verify at the boundary: Q2 puts one fingerprint at exactly one 2048-byte container (16,384 bits = 2048 bytes), not two | |
| 322 | +| Nightly-only users of rustynum break | Low | Low | Keep nightly feature gate as optional | |
| 323 | +| Arrow version drift | Low | Medium | Both at 57 now; pin together | |
| 324 | +| rand 0.8 vs 0.9 conflict | Low | Low | Update rustynum-oracle to rand 0.9 | |