Commit d80f827

🥂 v0.7 gpu-scaffold: omnimcode-gpu crate with wgpu, 4.04x on RX 580 via Vulkan
The user's primary GPU is an AMD Radeon RX 580 (Polaris/gfx803). Official ROCm dropped Polaris at version 4.0 and Ollama "gets fussy about it"; the wgpu/Vulkan path avoids that pain entirely.

## Architecture

- New omnimcode-gpu crate:
  - ComputeBackend trait: one method (matmul) for v0.7
  - Matrix: row-major f32 boundary type
  - CpuBackend: naive triple-loop, always-available ground truth
  - WgpuBackend (feature wgpu): Vulkan/Metal/DX12/OpenGL compute
  - pick_backend(): feature + OMC_GPU_BACKEND env-aware
- Naive WGSL matmul kernel (16x16 workgroup, no tiling)

## Measured on AMD RX 580 (RADV POLARIS10 / Vulkan)

| size (m x k x n) | cpu ms | wgpu ms | speedup | parity |
|---|---|---|---|---|
| 64x64x64 | 0.052 | 0.228 | 0.23x | OK |
| 128x128x128 | 0.281 | 0.340 | 0.83x | OK |
| 256x256x256 | 1.966 | 0.880 | 2.24x | OK |
| 512x512x512 | 14.503 | 4.273 | 3.39x | OK |
| 1024x1024x1024 | 115.516 | 28.577 | 4.04x | OK |

Crossover falls between 128x128 (0.83x) and 256x256 (2.24x). At 1024x1024, GPU is 4.04x faster than the naive CPU baseline. Parity passes at every size.

## Why wgpu instead of ROCm

- Official ROCm dropped Polaris at 4.0; unofficial Polaris builds are fragile.
- wgpu via Vulkan works out of the box on the open-source RADV driver with no SDK install.
- The ComputeBackend trait is ready for ROCm/CUDA/Metal plug-ins when running on supported hardware. None of those are in v0.7 because Polaris (the user's target) doesn't benefit from them.

## Tests

11/11 GPU tests pass, including wgpu kernel parity check on the user's actual GPU (max diff < 1e-4).

## What's NOT in v0.7

- Prometheus integration (v0.8 candidate: route tape_matmul through this backend when shapes exceed CPU crossover)
- Backward pass on GPU
- Tiled / shared-memory kernels (untuned scaffold)
- f16/bf16
- ROCm/CUDA/Metal backends (trait ready, impls deferred)

## Files

- omnimcode-gpu/{Cargo.toml, README.md, src/{lib,cpu,wgpu_backend}.rs, shaders/matmul.wgsl, examples/bench_matmul.rs}
- Cargo.toml: workspace member added

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 3f9957c commit d80f827

15 files changed

Lines changed: 2220 additions & 24 deletions

CHANGELOG.md

Lines changed: 74 additions & 0 deletions
@@ -13,6 +13,7 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.

| Tag | Date | One-line |
|---|---|---|
| [v0.7-gpu-scaffold](#v07-gpu-scaffold--2026-05-17) | 2026-05-17 | GPU compute scaffold: `omnimcode-gpu` crate with wgpu (Vulkan) backend, ROCm/CUDA stubs. **4.04× speedup verified on the user's AMD RX 580** via Vulkan (no ROCm pain). |
| [v0.6-fibtier-memory](#v06-fibtier-memory--2026-05-17) | 2026-05-17 | Fibtier-bounded eviction for memory: cap the index at fibonacci-tier capacity (default 232), evicted entries still recoverable by hash. Memory now safe for arbitrarily long agent sessions. |
| [v0.5-substrate-memory](#v05-substrate-memory--2026-05-17) | 2026-05-17 | Substrate-keyed conversation memory: `omc_memory_store` / `recall` / `list` / `stats` MCP tools + filesystem-backed persistence. **Hits the 10× target**: measured 10.61× LLM context-budget reduction on a 20-turn agent task. |
| [v0.4-substrate-context](#v04-substrate-context--2026-05-17) | 2026-05-17 | Symbolic compression end-to-end: `omc_compress_context` / `omc_decompress` tools + `format=codec` thumbnails + directory ingest. Measured 1.85×–2.81× LLM context budget reduction. |
@@ -29,6 +30,79 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.

---

## [v0.7-gpu-scaffold] - 2026-05-17

**GPU compute scaffold for Prometheus: `omnimcode-gpu` crate with wgpu (Vulkan) backend, ROCm/CUDA stubs, 4.04× speedup verified end-to-end on the user's AMD RX 580 via Vulkan.**

The Polaris-friendly path. The user's primary target is an AMD RX 580 (gfx803), which official ROCm dropped at version 4.0 and which Ollama struggles with. wgpu via Vulkan works out of the box on the same hardware with the open-source RADV driver: no ROCm install, and none of the fragility of unofficial ROCm builds.

### What changed

- **New `omnimcode-gpu` crate**:
  - `ComputeBackend` trait: one method (`matmul`) for v0.7, open for extension
  - `Matrix`: row-major f32 tensor, the boundary type
  - `CpuBackend`: naive triple-loop, always available, ground-truth parity reference
  - `WgpuBackend` (feature `wgpu`): Vulkan / Metal / DX12 / OpenGL compute
  - `pick_backend()`: chooses at runtime based on compiled-in features plus the `OMC_GPU_BACKEND` env override
- **Matmul kernel** in WGSL: 16×16 workgroup, one thread per output cell, no tiling (the scaffold's job is to be honest, not tuned)
- **Bench example** (`examples/bench_matmul.rs`): CPU vs GPU wall-clock + parity check across sizes
- **Workspace integration**: `omnimcode-gpu` added to root Cargo.toml workspace members

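The changelog names the crate's pieces but does not show their signatures. The following is a minimal sketch of how a one-method backend trait, a row-major matrix type, and a naive CPU reference could fit together. The field names, the `BackendError` enum, the `Matrix::new` validation, and the exact `matmul` signature are all assumptions made for illustration, not the crate's actual API.

```rust
/// Errors a backend might report (hypothetical type, not taken from the crate).
#[derive(Debug, PartialEq)]
pub enum BackendError {
    BadLength { expected: usize, got: usize },
    ShapeMismatch { left_cols: usize, right_rows: usize },
}

/// Row-major f32 matrix: element (r, c) lives at data[r * cols + c].
#[derive(Debug, Clone, PartialEq)]
pub struct Matrix {
    pub rows: usize,
    pub cols: usize,
    pub data: Vec<f32>,
}

impl Matrix {
    /// Assumed constructor: rejects a buffer whose length is not rows * cols.
    pub fn new(rows: usize, cols: usize, data: Vec<f32>) -> Result<Self, BackendError> {
        if data.len() != rows * cols {
            return Err(BackendError::BadLength { expected: rows * cols, got: data.len() });
        }
        Ok(Self { rows, cols, data })
    }
}

/// One method for now, matching the "one method (matmul)" description.
pub trait ComputeBackend {
    fn matmul(&self, a: &Matrix, b: &Matrix) -> Result<Matrix, BackendError>;
}

/// Naive triple-loop reference: one accumulator per output cell.
pub struct CpuBackend;

impl ComputeBackend for CpuBackend {
    fn matmul(&self, a: &Matrix, b: &Matrix) -> Result<Matrix, BackendError> {
        if a.cols != b.rows {
            return Err(BackendError::ShapeMismatch { left_cols: a.cols, right_rows: b.rows });
        }
        let (m, k, n) = (a.rows, a.cols, b.cols);
        let mut out = vec![0.0f32; m * n];
        for i in 0..m {
            for j in 0..n {
                let mut acc = 0.0f32;
                for p in 0..k {
                    acc += a.data[i * k + p] * b.data[p * n + j];
                }
                out[i * n + j] = acc;
            }
        }
        Ok(Matrix { rows: m, cols: n, data: out })
    }
}
```

Under these assumptions, a GPU backend would implement the same trait, so callers can hold a `Box<dyn ComputeBackend>` and stay unaware of which device runs the multiply.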
### Measured on the target hardware (AMD RX 580 / RADV Vulkan)

```
adapter: AMD Radeon RX 580 Series (RADV POLARIS10) — Vulkan

size (m x k x n)       cpu ms    wgpu ms   speedup   parity
64x64x64                0.052      0.228     0.23x   OK
128x128x128             0.281      0.340     0.83x   OK
256x256x256             1.966      0.880     2.24x   OK
512x512x512            14.503      4.273     3.39x   OK
1024x1024x1024        115.516     28.577     4.04x   OK
```

The crossover falls between 128×128 (0.83×, GPU still slower) and 256×256 (2.24×). By 1024×1024, GPU is **4.04× faster than the naive CPU baseline**. Parity passes at every size (GPU output matches CPU output within f32 rounding).

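The speedup column is simply `cpu ms / wgpu ms`, and the crossover is the first measured size where that ratio exceeds 1. A small sketch that recomputes both from the rows above; the `BenchRow` type and function names are illustrative and not part of the bench harness.

```rust
/// One row of the bench table: square edge length (m = k = n) and times in ms.
pub struct BenchRow {
    pub size: usize,
    pub cpu_ms: f64,
    pub wgpu_ms: f64,
}

impl BenchRow {
    /// Speedup as reported in the table: CPU time divided by GPU time.
    pub fn speedup(&self) -> f64 {
        self.cpu_ms / self.wgpu_ms
    }
}

/// Values copied from the RX 580 measurements in this changelog entry.
pub const RX580_ROWS: [BenchRow; 5] = [
    BenchRow { size: 64, cpu_ms: 0.052, wgpu_ms: 0.228 },
    BenchRow { size: 128, cpu_ms: 0.281, wgpu_ms: 0.340 },
    BenchRow { size: 256, cpu_ms: 1.966, wgpu_ms: 0.880 },
    BenchRow { size: 512, cpu_ms: 14.503, wgpu_ms: 4.273 },
    BenchRow { size: 1024, cpu_ms: 115.516, wgpu_ms: 28.577 },
];

/// Smallest measured size at which the GPU beat the CPU baseline, if any.
/// Assumes `rows` is sorted by ascending size.
pub fn first_gpu_win(rows: &[BenchRow]) -> Option<usize> {
    rows.iter().find(|r| r.speedup() > 1.0).map(|r| r.size)
}
```

Calling `first_gpu_win(&RX580_ROWS)` on this data picks out the 256 row, which is why the true crossover can only be bracketed between 128 and 256 without finer-grained measurements.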
### Why wgpu over ROCm

The honest situation for the user's hardware:

- **Official ROCm dropped Polaris (gfx803) at version 4.0.** Newer ROCm releases don't ship kernels for this GPU.
- **Unofficial Polaris ROCm builds exist**, but they're community-maintained and fragile. "Ollama gets fussy about it" was the user's verbatim experience, which matches the broader pattern.
- **Vulkan compute works out of the box** on the same hardware via the open-source RADV driver. The Mesa-driven Vulkan path is stable and well-tested.

So wgpu is the default. The `ComputeBackend` trait is ready for ROCm/CUDA backends to plug in when running on supported hardware, but no SDK install was attempted on this machine.

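To illustrate "wgpu is the default, with an env override", here is a hedged sketch of the selection logic. Only the `OMC_GPU_BACKEND` variable name and the `wgpu` feature name come from this entry; the accepted values (`cpu`), the `BackendKind` enum, and the fallback rules are assumptions, and the real `pick_backend()` may behave differently.

```rust
use std::env;

/// Which backend to construct (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BackendKind {
    Cpu,
    Wgpu,
}

/// Pure selection logic, kept apart from the environment read so it is easy to test.
/// `requested` is the value of OMC_GPU_BACKEND if set; `wgpu_compiled` says whether
/// the `wgpu` feature was built in.
pub fn choose_backend(requested: Option<&str>, wgpu_compiled: bool) -> BackendKind {
    match requested.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
        // Assumed value: "cpu" forces the CPU reference even when wgpu is available.
        Some("cpu") => BackendKind::Cpu,
        // Anything else (unset, "wgpu", unknown) prefers wgpu when it was compiled in.
        _ if wgpu_compiled => BackendKind::Wgpu,
        // Without the feature, the CPU backend is the only option.
        _ => BackendKind::Cpu,
    }
}

/// Reads the environment and the compile-time feature flag, then delegates.
pub fn pick_backend_kind() -> BackendKind {
    let requested = env::var("OMC_GPU_BACKEND").ok();
    choose_backend(requested.as_deref(), cfg!(feature = "wgpu"))
}
```

Splitting the pure decision from the environment read keeps tests deterministic, since they never have to mutate process-wide environment variables.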
### Tests

11/11 GPU tests pass, including the wgpu kernel parity check on the user's actual GPU:

- `cpu_matmul_*`: basic, identity, shape-mismatch
- `wgpu_matmul_basic_2x3_3x2`: small-shape parity
- `wgpu_matmul_matches_cpu_8x8`: larger parity, max diff < 1e-4
- `wgpu_shape_mismatch_errors`: error handling
- `matrix_new_*` / `max_abs_diff_*` / `pick_backend_returns_cpu_when_env_forces`: utilities

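A parity check of this kind compares two results element by element and bounds the largest absolute difference. The sketch below shows the idea without a GPU: a second CPU multiply that sums in reverse order stands in for "another backend" whose rounding may differ. The helper names, the sample inputs, and the stand-in itself are assumptions; only the 8×8 shape and the 1e-4 tolerance come from the test list above.

```rust
/// Largest element-wise absolute difference, or None when the lengths differ.
pub fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    Some(
        a.iter()
            .zip(b)
            .map(|(x, y)| (x - y).abs())
            .fold(0.0f32, f32::max),
    )
}

/// Row-major matmul; `reverse_k` flips the summation order so the f32
/// rounding can differ slightly from the forward-order reference.
pub fn matmul_ordered(
    a: &[f32],
    b: &[f32],
    m: usize,
    k: usize,
    n: usize,
    reverse_k: bool,
) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for step in 0..k {
                let p = if reverse_k { k - 1 - step } else { step };
                acc += a[i * k + p] * b[p * n + j];
            }
            out[i * n + j] = acc;
        }
    }
    out
}

/// Deterministic sample data so the check is reproducible.
pub fn sample(len: usize, modulus: usize, scale: f32, offset: f32) -> Vec<f32> {
    (0..len).map(|i| (i % modulus) as f32 * scale + offset).collect()
}

/// Max difference between the two summation orders on an 8x8x8 problem.
pub fn parity_8x8() -> f32 {
    let (m, k, n) = (8, 8, 8);
    let a = sample(m * k, 7, 0.1, 0.0);
    let b = sample(k * n, 5, 0.3, -0.5);
    let reference = matmul_ordered(&a, &b, m, k, n, false);
    let candidate = matmul_ordered(&a, &b, m, k, n, true);
    max_abs_diff(&reference, &candidate).expect("both outputs are m * n long")
}
```

An absolute tolerance works here because the inputs are small; for larger magnitudes or larger `k`, a relative tolerance would likely be the safer choice.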
### What's NOT in v0.7

- **Prometheus integration.** The tape ops in `examples/lib/prometheus.omc` still run in pure OMC. v0.8 candidate: route `tape_matmul` through this backend when shapes exceed the CPU-crossover threshold.
- **Backward pass on GPU.** Only forward matmul. Backward requires the autotape to live on GPU too.
- **Tiled / shared-memory kernels.** The wgpu shader is naive. Tuned kernels would extract more from the hardware.
- **f16 / bfloat16.** f32 only.
- **ROCm / CUDA / Metal backends.** Trait is ready; implementations are deferred until supported hardware is available.

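The v0.8 routing idea can be sketched as a shape-based predicate. Everything here is hypothetical: the `Route` enum, the `CROSSOVER_EDGE` constant (set to 256, the smallest measured size where the GPU won), and the choice to compare multiply-add counts so non-square shapes are covered. The measurements in this entry only cover square shapes, so the threshold would need re-benchmarking before real use.

```rust
/// Where a matmul should run (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    Cpu,
    Gpu,
}

/// Assumed threshold: the smallest square edge at which wgpu beat the CPU
/// in the v0.7 measurements. Not a tuned value.
pub const CROSSOVER_EDGE: usize = 256;

/// Route by total multiply-add count (m * k * n) rather than by a single
/// dimension, so tall or wide shapes are compared on equal footing.
pub fn route_matmul(m: usize, k: usize, n: usize, gpu_available: bool) -> Route {
    let work = m.saturating_mul(k).saturating_mul(n);
    let threshold = CROSSOVER_EDGE.pow(3);
    if gpu_available && work >= threshold {
        Route::Gpu
    } else {
        Route::Cpu
    }
}
```

With this rule, a 1024×1024×1024 multiply goes to the GPU while 64×64×64 and 128×128×128 stay on the CPU, matching the direction of the measured speedups.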
### Files

- `omnimcode-gpu/Cargo.toml`: crate manifest, wgpu as optional feature
- `omnimcode-gpu/src/lib.rs`: trait + Matrix + pick_backend
- `omnimcode-gpu/src/cpu.rs`: CPU backend
- `omnimcode-gpu/src/wgpu_backend.rs`: wgpu backend
- `omnimcode-gpu/shaders/matmul.wgsl`: compute kernel
- `omnimcode-gpu/examples/bench_matmul.rs`: bench harness
- `omnimcode-gpu/README.md`: usage + measured speedups
- `Cargo.toml`: workspace member added

---

## [v0.6-fibtier-memory] - 2026-05-17

**Fibtier-bounded eviction for `MemoryStore`: memory growth is now safe for arbitrarily long agent sessions, and evicted entries remain recoverable by hash.**
