Yes, with one blocker that requires a fix first.
| Property | ladybug-rs | rustynum | Compatible? |
|---|---|---|---|
| Edition | 2024 | 2021 | Yes (2024 builds 2021 deps) |
| Rust version | 1.88+ (stable) | nightly (portable_simd) | BLOCKER |
| Arrow | 57 | 57 | Yes (exact match) |
| DataFusion | 51 | 51 (optional) | Yes |
| Lance | 2.0 | 2.0 (optional) | Yes |
| tokio | 1.49 | 1.x (optional) | Yes |
| rand | 0.9 | 0.8 (oracle only) | Minor conflict |
| SIMD approach | std::arch intrinsics | portable_simd feature | Different layers |
| Alignment | 64-byte (repr(align(64))) | 64-byte (alloc with ALIGNMENT=64) | Yes |
| External deps | arrow/datafusion/lance/rayon | zero (core+blas+mkl+holo) | No conflicts |
The blocker: rustynum uses #![feature(portable_simd)] in 5 crates (core, rs, blas, mkl, archive). This requires nightly Rust. Ladybug-rs compiles on stable 1.93.
Fix: Replace portable_simd with std::arch intrinsics (same approach ladybug already uses). The rustynum SIMD code already uses explicit AVX-512/AVX2 intrinsics in many hot paths; portable_simd is used mainly for convenience types (f32x16, u8x64). These can be rewritten as __m512/__m256 operations, which work on stable Rust (the AVX-512 intrinsics in std::arch are stable from Rust 1.89).
Alternative: Add rust-toolchain.toml to ladybug-rs specifying nightly. Faster to ship but forces all ladybug contributors to nightly.
Recommendation: Phase 0 (below) rewrites rustynum to stable std::arch. This is a one-time cost that unblocks all downstream users.
Yes, they operate at different levels and compose naturally.
| Property | BindSpace | Blackboard |
|---|---|---|
| Purpose | 65K-address cognitive memory | SIMD-aligned computation arena |
| Granularity | Per-fingerprint (2KB each) | Per-named-buffer (arbitrary size) |
| Addressing | Addr(u16) → array index (3-5 cycles) | String name → HashMap → raw ptr |
| Ownership | Owns BindNode structs (fingerprint + metadata) | Owns raw byte allocations |
| Borrow model | &mut self for write, &self for read | Split-borrow: multiple &mut [T] from &self |
| Thread safety | Send (single-owner) | Send (single-owner, unsafe interior mut) |
| Memory layout | [u64; 256] per fingerprint + metadata fields | 64-byte aligned contiguous buffers |
How they compose (zero-copy path):

```
BindSpace (owns fingerprints)
  │
  ├── node.fingerprint: [u64; 256]   ← 2KB, 64-byte aligned (repr(align(64)))
  │
  │   For batch BLAS/VML operations:
  │
  ├── Blackboard::alloc_u8("query", 2048)   ← allocate scratch buffer
  ├── copy query fingerprint bytes into Blackboard buffer (one memcpy)
  │
  ├── rustyblas::int8_gemm() / rustymkl::vsexp()   ← operate on Blackboard buffers
  │
  └── read results back from Blackboard   ← pointer read, no copy
```

Key insight: BindSpace fingerprints are [u64; 256] = 2048 bytes = exactly the CogRecord container size in rustynum. The Blackboard's alloc_u8("containers", N * 2048) can hold a batch of fingerprints for SIMD bulk operations, then results are read back.
The split-borrow is critical for GEMM-style ops: borrow_3_mut_f32("A", "B", "C") gives three non-aliasing mutable slices from a shared &self, which successive safe get_mut lookups on a single map cannot do, since each lookup mutably borrows the whole map. This is exactly what's needed for batch Hamming (query in A, corpus chunk in B, distances in C).
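
To make the split-borrow idea concrete, here is a minimal sketch of a named-buffer arena that hands out three disjoint mutable slices. `Arena`, `alloc_f32`, and `borrow_3_mut` are hypothetical names for illustration, not rustynum's Blackboard API. The sketch also takes `&mut self` and stays in safe Rust by walking the map once, whereas the text describes a `&self` API built on unsafe interior mutability.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a named-buffer arena (not rustynum's Blackboard).
pub struct Arena {
    bufs: HashMap<String, Vec<f32>>,
}

impl Arena {
    pub fn new() -> Self {
        Arena { bufs: HashMap::new() }
    }

    /// Allocate a zero-filled buffer under `name`.
    pub fn alloc_f32(&mut self, name: &str, len: usize) {
        self.bufs.insert(name.to_string(), vec![0.0; len]);
    }

    /// Return three mutable slices for three distinct names, or `None`
    /// if a name is repeated or missing.
    pub fn borrow_3_mut(
        &mut self,
        a: &str,
        b: &str,
        c: &str,
    ) -> Option<(&mut [f32], &mut [f32], &mut [f32])> {
        if a == b || b == c || a == c {
            return None;
        }
        let (mut ra, mut rb, mut rc) = (None, None, None);
        // iter_mut yields every entry exactly once, so the slices cannot alias.
        for (name, buf) in self.bufs.iter_mut() {
            if name.as_str() == a {
                ra = Some(buf.as_mut_slice());
            } else if name.as_str() == b {
                rb = Some(buf.as_mut_slice());
            } else if name.as_str() == c {
                rc = Some(buf.as_mut_slice());
            }
        }
        Some((ra?, rb?, rc?))
    }
}
```

The single pass costs O(number of buffers) per borrow, which is acceptable for a sketch; a production arena would more likely resolve names to raw pointers once and validate disjointness there.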
No architectural conflict. The two systems serve different purposes:
- BindSpace = persistent cognitive addressing (owns the data)
- Blackboard = transient SIMD compute scratch (borrows/copies for computation)
For truly zero-copy batch operations, a thin adapter can provide &[u8] views of BindSpace fingerprint ranges directly to rustynum functions that take slices (most of rustyblas level-1 and rustynum-holo phase ops). No Blackboard needed for slice-based APIs.
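
A thin adapter of this kind could look roughly like the following sketch. The function names are hypothetical, and the batch variant assumes the fingerprints sit in one contiguous array, which may not hold if BindNode interleaves metadata with each fingerprint.

```rust
pub const FINGERPRINT_U64: usize = 256;

/// View one fingerprint's 256 words as 2048 bytes without copying.
pub fn fingerprint_bytes(words: &[u64; FINGERPRINT_U64]) -> &[u8] {
    // SAFETY: u8 has alignment 1 and no invalid bit patterns, the pointer is
    // valid for size_of_val(words) bytes, and the result borrows `words`.
    unsafe {
        std::slice::from_raw_parts(words.as_ptr() as *const u8, std::mem::size_of_val(words))
    }
}

/// View a contiguous run of fingerprints as one flat byte slice.
/// Assumption: the fingerprints are stored back to back with no metadata between them.
pub fn fingerprint_range_bytes(fps: &[[u64; FINGERPRINT_U64]]) -> &[u8] {
    // SAFETY: same reasoning as above; arrays of u64 contain no padding.
    unsafe {
        std::slice::from_raw_parts(fps.as_ptr() as *const u8, std::mem::size_of_val(fps))
    }
}
```

Byte order inside each word follows the host's endianness, so both sides of the boundary need to agree on how bit positions map to bytes.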
Goal: Make rustynum compile on stable Rust so it can be a dependency of ladybug-rs.
Files requiring changes:
- `rustynum-core/src/lib.rs`: remove `#![feature(portable_simd)]`
- `rustynum-core/src/simd.rs`: rewrite `Simd<f32, 16>` → `__m512` with `_mm512_*`
- `rustynum-rs/src/lib.rs`: remove feature gate
- `rustynum-rs/src/simd_ops.rs`: rewrite portable SIMD ops to `std::arch`
- `rustyblas/src/lib.rs`: remove feature gate
- `rustyblas/src/level1.rs`: sdot/ddot/saxpy etc. already use raw intrinsics in hot paths
- `rustyblas/src/level3.rs`: microkernels already use `_mm512_*` intrinsics
- `rustymkl/src/lib.rs`: remove feature gate
- `rustymkl/src/vml.rs`: vsexp/vsln etc. use `_mm512_*` (already stable-compatible)
- `rustynum-archive/src/lib.rs`: remove feature gate
Estimate: The actual SIMD kernels already use std::arch intrinsics. portable_simd is used for:
- Type aliases (`f32x16`, `u8x64`): replace with `__m512`, `__m512i`
- Convenience ops (`.reduce_sum()`): replace with `_mm512_reduce_add_ps()`
- `Simd::from_slice()`: replace with `_mm512_loadu_ps()`
Most of rustyblas/rustymkl hot paths are already std::arch. This is primarily a cleanup.
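
The shape of such a rewrite can be sketched on a small reduction. The example below is an assumption-laden illustration, not rustynum code: it uses 256-bit AVX intrinsics instead of the AVX-512 ones listed above so that it runs on more machines, detects the feature at runtime, and falls back to a scalar sum elsewhere. The name `sum_f32` is hypothetical.

```rust
/// Sum a slice of f32, using AVX when the CPU supports it.
pub fn sum_f32(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was verified at runtime just above.
            return unsafe { sum_f32_avx(xs) };
        }
    }
    xs.iter().sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn sum_f32_avx(xs: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let chunks = xs.chunks_exact(8);
    let tail = chunks.remainder();
    unsafe {
        // Plays the role of a zeroed f32x8 accumulator.
        let mut acc = _mm256_setzero_ps();
        for chunk in chunks {
            // Plays the role of Simd::from_slice: unaligned 8-lane load.
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(chunk.as_ptr()));
        }
        // Plays the role of reduce_sum: spill the lanes and add them up.
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        lanes.iter().sum::<f32>() + tail.iter().sum::<f32>()
    }
}
```

Because the vector path adds in a different order than the scalar path, results can differ in the last bits for general inputs; comparisons against a scalar reference need a tolerance unless the data is exactly representable.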
```toml
# In ladybug-rs/Cargo.toml
[dependencies]
rustynum-core = { path = "../rustynum/rustynum-core", default-features = false, features = ["avx512"], optional = true }
rustyblas = { path = "../rustynum/rustyblas", default-features = false, features = ["avx512"], optional = true }
rustymkl = { path = "../rustynum/rustymkl", default-features = false, features = ["avx512"], optional = true }
rustynum-rs = { path = "../rustynum/rustynum-rs", optional = true }
rustynum-holo = { path = "../rustynum/rustynum-holo", default-features = false, features = ["avx512"], optional = true }

# Feature gate
[features]
rustynum = ["rustynum-core", "rustyblas", "rustymkl", "rustynum-rs", "rustynum-holo"]
full = [..., "rustynum"]
```

```bash
cd ladybug-rs
cargo check --features rustynum  # must pass on stable
cargo test --features rustynum   # must pass
```

Goal: Replace ladybug's scalar/per-bit loops with rustynum's SIMD-vectorized equivalents.
Current (src/core/vsa.rs:55-93): Bit-by-bit counting loop — O(N × 16384) with branch per bit.
```rust
// Current: 16384 iterations × N items × branch per bit
let mut counts = [0u32; FINGERPRINT_U64 * 64];
for item in items {
    for (word_idx, &word) in item.as_raw().iter().enumerate() {
        for bit in 0..64 {
            if (word >> bit) & 1 == 1 {
                counts[word_idx * 64 + bit] += 1;
            }
        }
    }
}
```

Replace with: rustynum-rs CogRecord::bundle() which uses SIMD ripple-carry majority voting. Expected 17× speedup.
Implementation: Add fn bundle_simd(items: &[Fingerprint]) -> Fingerprint in src/core/vsa.rs gated on #[cfg(feature = "rustynum")], delegating to rustynum's bundle. The existing scalar path remains as fallback.
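
For readers unfamiliar with ripple-carry majority voting, the following sketch shows the idea on plain `u64` words: each word carries 64 independent counters stored as bit planes, and adding an item is a ripple-carry increment across the planes. This is an illustration under assumptions, not rustynum's implementation: `bundle_bitsliced` is a hypothetical name, and ties are resolved to 0 (strict majority), which may differ from ladybug's rule.

```rust
pub const FINGERPRINT_U64: usize = 256;
pub type Words = [u64; FINGERPRINT_U64];

/// Lane-wise `count >= threshold`, where `planes[k]` holds bit k of 64 counters.
fn ge_threshold(planes: &[u64], threshold: usize) -> u64 {
    let mut greater = 0u64; // lanes already known to exceed the threshold
    let mut equal = !0u64; // lanes still matching the threshold's high bits
    for k in (0..planes.len()).rev() {
        if (threshold >> k) & 1 == 0 {
            greater |= equal & planes[k];
            equal &= !planes[k];
        } else {
            equal &= planes[k];
        }
    }
    greater | equal
}

/// Majority bundle: output bit is 1 where more than half of the items have a 1.
pub fn bundle_bitsliced(items: &[Words]) -> Words {
    let mut out = [0u64; FINGERPRINT_U64];
    let n = items.len();
    if n == 0 {
        return out;
    }
    // Enough planes to count up to n without overflow.
    let plane_count = (usize::BITS - n.leading_zeros()) as usize;
    let threshold = n / 2 + 1;
    let mut planes = vec![0u64; plane_count];
    for w in 0..FINGERPRINT_U64 {
        planes.fill(0);
        for item in items {
            // Add 1 to every counter whose input bit is set.
            let mut carry = item[w];
            for plane in planes.iter_mut() {
                if carry == 0 {
                    break;
                }
                let next = *plane & carry;
                *plane ^= carry;
                carry = next;
            }
        }
        out[w] = ge_threshold(&planes, threshold);
    }
    out
}
```

The inner loop touches about log2(N) words per item instead of 64 branches, and the same plane operations map directly onto 512-bit registers.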
Current (src/core/fingerprint.rs:151-157): Scalar loop over 256 words.
```rust
for i in 0..FINGERPRINT_U64 {
    result[i] = self.data[i] ^ other.data[i];
}
```

Replace with: rustynum-core SIMD XOR (processes 64 bytes per instruction on AVX-512, so 2048 bytes take 32 iterations instead of 256). Expected 8-16× speedup.
Current (src/core/fingerprint.rs:167-191): Bit-by-bit rotation — O(16384) with get_bit/set_bit per position.
Replace with: Word-level rotation with carry (32 iterations on AVX-512). Expected 50-100× speedup.
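
A word-level rotation with carry can be sketched in scalar form as follows. The bit numbering is an assumption (bit p lives in word p / 64 at position p % 64, and rotation moves bit p to p + k modulo 16384); ladybug's permute may use a different direction or ordering. `rotate_bits_left` is a hypothetical name.

```rust
pub const FINGERPRINT_U64: usize = 256;

/// Rotate a 16384-bit vector left by `k` bits using whole-word moves.
pub fn rotate_bits_left(src: &[u64; FINGERPRINT_U64], k: usize) -> [u64; FINGERPRINT_U64] {
    let k = k % (FINGERPRINT_U64 * 64);
    let word_shift = k / 64;
    let bit_shift = (k % 64) as u32;
    let mut out = [0u64; FINGERPRINT_U64];
    for i in 0..FINGERPRINT_U64 {
        // Source word that lands in slot i, and its lower neighbour whose
        // top bits carry into slot i.
        let cur = src[(i + FINGERPRINT_U64 - word_shift) % FINGERPRINT_U64];
        let prev = src[(i + FINGERPRINT_U64 - word_shift - 1) % FINGERPRINT_U64];
        out[i] = if bit_shift == 0 {
            cur
        } else {
            (cur << bit_shift) | (prev >> (64 - bit_shift))
        };
    }
    out
}
```

This is 256 scalar iterations; the 32-iteration figure in the text corresponds to handling eight words per AVX-512 register.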
Current (src/core/fingerprint.rs:110-112): iter().map(|x| x.count_ones()).sum() — good but not SIMD-vectorized.
Replace with: rustynum VPOPCNTDQ path (same as ladybug's simd.rs but unified). The ladybug simd.rs AVX-512 Hamming is already excellent — the win here is unification, not speedup.
Goal: Add rustynum's INT8 quantization and prefilter as an optional cascade stage.
Current (src/search/hdr_cascade.rs): 4-level cascade (1-bit → 4-bit → 8-bit → full popcount), all scalar.
Add: INT8 quantized pre-stage using rustyblas int8_gemm_i32.
New cascade:

```
L-1: INT8 batch distance (VNNI vpdpbusd, 64 MACs/instruction)  ← NEW
L0:  1-bit sketch (existing)
L1:  4-bit sketch (existing)
L2:  8-bit sketch (existing)
L3:  Full popcount (existing)
```

How: Quantize fingerprint chunks to i8 vectors, compute batch dot products via VNNI. Candidates below threshold skip to L0. This gives 4× throughput improvement for the initial filtering stage.
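
The text does not specify the quantization, so the following is only one possible scheme, shown to illustrate why an INT8 dot product can stand in for a Hamming distance: mapping each bit to +1 or -1 gives dot(a, b) = N - 2 * hamming(a, b) for N-bit vectors. All function names are hypothetical, and a real VNNI path would work on packed data rather than one i8 per bit.

```rust
/// Expand packed bits into bipolar i8 values: bit 1 -> +1, bit 0 -> -1.
pub fn to_bipolar_i8(words: &[u64]) -> Vec<i8> {
    let mut out = Vec::with_capacity(words.len() * 64);
    for &word in words {
        for bit in 0..64 {
            out.push(if (word >> bit) & 1 == 1 { 1 } else { -1 });
        }
    }
    out
}

/// Dot product of two i8 vectors with i32 accumulation (what a VNNI
/// instruction computes 64 lanes at a time).
pub fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// Reference Hamming distance on the packed words.
pub fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

Under this mapping, a large dot product means a small Hamming distance, so a threshold on one converts directly into a threshold on the other.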
Current (src/core/simd.rs:197-213): Per-pair Hamming distance, parallelized with rayon.
Replace with: Blackboard-based batch processing. Allocate corpus chunk in Blackboard, run rustynum's parallel_for_chunks() with SIMD Hamming. Avoids rayon overhead for small batches and exploits cache locality for large batches.
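
The batch layout can be sketched independently of the Blackboard: one query, a flat corpus of back-to-back 2048-byte fingerprints, and an output slice of distances. The names are hypothetical and the inner loop is scalar; it marks where a SIMD Hamming kernel would slot in.

```rust
pub const FINGERPRINT_BYTES: usize = 2048;

/// Scalar Hamming distance over two equal-length byte slices.
fn hamming_bytes(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Fill `distances[i]` with the distance between `query` and corpus row i.
/// `corpus` holds `distances.len()` fingerprints stored contiguously.
pub fn batch_hamming(query: &[u8], corpus: &[u8], distances: &mut [u32]) {
    assert_eq!(query.len(), FINGERPRINT_BYTES);
    assert_eq!(corpus.len(), distances.len() * FINGERPRINT_BYTES);
    for (row, out) in corpus.chunks_exact(FINGERPRINT_BYTES).zip(distances.iter_mut()) {
        *out = hamming_bytes(query, row);
    }
}
```

Keeping the corpus contiguous is what lets a chunked runner split the work by row range without any per-pair allocation.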
Goal: Replace manual math with SIMD-accelerated transcendentals and statistics.
Target: src/nars/truth.rs — NARS truth value functions (frequency × confidence).
Currently scalar f32 operations. Batch truth evaluation across many edges can use rustymkl VML:
- `vsexp()` for exponential decay
- `vsln()` for log-evidence
- `vssqrt()` for confidence intervals
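
Batch evaluation mostly means restructuring per-edge scalar math into passes over contiguous slices. The sketch below assumes a simple decay model, confidence multiplied by exp(-rate × age), which is an illustration and not necessarily ladybug's truth function. `decay_confidence` is a hypothetical name, and the middle pass uses scalar `exp` where a vectorized exponential would be substituted.

```rust
/// Apply exponential decay to a batch of confidence values in place.
/// Assumed model: conf[i] *= exp(-rate * age[i]).
pub fn decay_confidence(conf: &mut [f32], age: &[f32], rate: f32) {
    assert_eq!(conf.len(), age.len());
    // Pass 1: build the exponents in one contiguous scratch buffer.
    let mut scratch: Vec<f32> = age.iter().map(|&t| -rate * t).collect();
    // Pass 2: exponentiate the whole buffer; this is the loop a batch
    // exp routine would replace.
    for x in scratch.iter_mut() {
        *x = x.exp();
    }
    // Pass 3: scale the confidences.
    for (c, w) in conf.iter_mut().zip(&scratch) {
        *c *= *w;
    }
}
```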
Target: src/search/temporal.rs — autocorrelation, cross-correlation, variance.
Replace manual variance/stddev with rustynum-rs statistics (SIMD-accelerated mean, std, variance).
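
A scalar reference is useful when swapping in an accelerated statistics routine, since it pins down which definition is being compared. The sketch below assumes population variance (dividing by n); whether ladybug or rustynum-rs use the population or sample form is not stated here. `mean_variance` is a hypothetical name.

```rust
/// Two-pass mean and population variance; returns None for empty input.
pub fn mean_variance(xs: &[f32]) -> Option<(f32, f32)> {
    if xs.is_empty() {
        return None;
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|&x| (x - mean) * (x - mean)).sum::<f32>() / n;
    Some((mean, var))
}
```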
Target: src/container/spectroscopy/ — frequency analysis.
Replace scalar log/sqrt/sin/cos with rustymkl VML batch operations. 16-wide f32 SIMD instead of one-at-a-time.
Goal: Connect ladybug's hologram extensions to rustynum-holo's principled implementations.
Target: src/extensions/hologram/quantum_field.rs
Replace ladybug's PhaseTag-based operations with rustynum-holo's phase-space primitives:
- `phase_bind_i8()` / `phase_unbind_i8()`: reversible ADD/SUB mod 256
- `wasserstein_sorted_i8()`: Earth Mover's distance (new capability, not in ladybug)
- `circular_distance_i8()`: wrap-around distance for unsorted vectors
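
The arithmetic behind these primitives is small enough to sketch directly. The functions below are hypothetical stand-ins, not rustynum-holo's signatures: binding is element-wise wrapping addition, unbinding is wrapping subtraction, and the circular distance takes the shorter way around the 256-step circle and sums over elements (the summation is an assumption).

```rust
/// Element-wise addition mod 256.
pub fn bind_mod256(a: &[i8], b: &[i8]) -> Vec<i8> {
    a.iter().zip(b).map(|(&x, &y)| x.wrapping_add(y)).collect()
}

/// Element-wise subtraction mod 256; inverse of `bind_mod256` for the same key.
pub fn unbind_mod256(bound: &[i8], b: &[i8]) -> Vec<i8> {
    bound.iter().zip(b).map(|(&x, &y)| x.wrapping_sub(y)).collect()
}

/// Sum of per-element wrap-around distances on a circle of 256 steps.
pub fn circular_distance(a: &[i8], b: &[i8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = (x as u8).wrapping_sub(y as u8);
            // Distance is the shorter of going forward or backward.
            d.min(d.wrapping_neg()) as u32
        })
        .sum()
}
```

Because addition mod 256 never loses information, unbinding with the same key recovers the original exactly, unlike saturating arithmetic.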
Ladybug's fingerprint→embedding pipeline can use rustynum-holo carrier encoding:
- `carrier_encode()`: frequency-domain concept encoding with VNNI acceleration
- `carrier_decode()`: demodulation via dot product
- Fibonacci spacing avoids harmonic interference, enables ~16-item bundling
Replace ladybug's 5-byte "flavor" extraction (src/core/scent.rs) with rustynum-holo's principled focus-of-attention:
- 3D geometric attention (8×8×32 volume)
- 48-bit masks for non-overlapping concept allocation
- `focus_xor()`, `focus_read()`: gated operations
Rustynum-holo's Gabor wavelet system subsumes phase+carrier+focus+5D projection into spatially-localized frequency encoding. This is the eventual target for ladybug's hologram extension modules.
Ladybug's separate write → clean → learn pipeline can be replaced by rustynum-oracle's organic model where write=clean=learn in one pass, using X-Trans Fibonacci sampling.
Goal: Use rustymkl for any dense linear algebra ladybug needs.
Any Gram-Schmidt operations in learning paths can use sgeqrf/dgeqrf from rustymkl.
fft_f32/fft_f64 from rustymkl replaces any DFT needs in spectroscopy or frequency analysis.
If ladybug ever needs dense matrix multiply (e.g., batch embedding transforms), rustyblas sgemm delivers 115 GFLOPS at 1024×1024.
```
Phase 0.1 (stable port)            ← PREREQUISITE for everything
    │
Phase 0.2 (Cargo.toml wiring)
    │
    ├── Phase 1.1 (bundle)         ← highest user-visible impact
    ├── Phase 1.2 (bind/xor)       ← simple, high frequency
    ├── Phase 1.3 (permute)        ← moderate impact
    └── Phase 1.4 (popcount)       ← unification, not speedup
         │
         ├── Phase 2.1 (INT8 cascade)    ← search throughput
         ├── Phase 2.2 (batch hamming)   ← memory efficiency
         │
         ├── Phase 3.1 (VML truth)       ← NARS acceleration
         ├── Phase 3.2 (temporal stats)  ← search quality
         └── Phase 3.3 (VML spectro)     ← analysis speed
              │
              ├── Phase 4.1-4.5 (holographic)  ← deep integration
              └── Phase 5.1-5.3 (foundations)  ← as-needed
```

Each phase must:
- Run all existing ladybug-rs tests (`cargo test`): no regressions
- Add `#[cfg(feature = "rustynum")]` + `#[cfg(not(feature = "rustynum"))]` dual paths
- Add comparative benchmarks (existing vs rustynum) in `benches/`
- Verify SIMD correctness: scalar fallback must produce identical results
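
The dual-path requirement can be sketched on the XOR bind. In this illustration the gated body is a local stand-in that processes eight words per step; in an actual integration it would delegate to rustynum, whose call is deliberately not shown here. `bind` and `bind_scalar` are hypothetical names.

```rust
pub const FINGERPRINT_U64: usize = 256;
pub type Words = [u64; FINGERPRINT_U64];

/// Scalar reference path; always compiled so tests can compare against it.
pub fn bind_scalar(a: &Words, b: &Words) -> Words {
    let mut out = [0u64; FINGERPRINT_U64];
    for i in 0..FINGERPRINT_U64 {
        out[i] = a[i] ^ b[i];
    }
    out
}

/// Accelerated path, compiled only with the `rustynum` feature.
/// Stand-in body: eight words per step, mirroring one 512-bit register.
#[cfg(feature = "rustynum")]
pub fn bind(a: &Words, b: &Words) -> Words {
    let mut out = [0u64; FINGERPRINT_U64];
    for ((o, x), y) in out
        .chunks_exact_mut(8)
        .zip(a.chunks_exact(8))
        .zip(b.chunks_exact(8))
    {
        for lane in 0..8 {
            o[lane] = x[lane] ^ y[lane];
        }
    }
    out
}

/// Fallback path, compiled when the feature is off.
#[cfg(not(feature = "rustynum"))]
pub fn bind(a: &Words, b: &Words) -> Words {
    bind_scalar(a, b)
}
```

Keeping `bind_scalar` public regardless of features lets the same equality test run under both `cargo test` and `cargo test --features rustynum`.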
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| portable_simd removal breaks rustynum tests | Medium | High | Run full rustynum test suite after port |
| Fingerprint size mismatch (16K vs 2048-byte CogRecord) | Low | Medium | Adapt at boundary: FP = 2 × CogRecord containers |
| Nightly-only users of rustynum break | Low | Low | Keep nightly feature gate as optional |
| Arrow version drift | Low | Medium | Both at 57 now; pin together |
| rand 0.8 vs 0.9 conflict | Low | Low | Update rustynum-oracle to rand 0.9 |