Commit c859f6f

security(occworld-candle): int32-checkpoint crash + degenerate-input guards + ADR-179 (closes Milestone #9) (ruvnet#1101)
* fix(occworld-candle): security review fixes: int32 checkpoint crash + predict input validation

  Beyond-SOTA security + correctness review of wifi-densepose-occworld-candle
  (Milestone #9, crate 4/4, the last ungated crate). Findings fixed:

  1. HIGH (MEASURED): checkpoint-load crash on any int32 tensor. model.rs
     mapped safetensors I32 -> candle DType::I64 and passed the raw int32 byte
     buffer (4 bytes/elem) to Tensor::from_raw_buffer(.., I64, ..). Candle
     derives elem_count = data.len() / dtype.size(), so the I64 path halved the
     count while keeping the original shape -> a tensor whose shape claims 2x
     its storage. Reading it PANICS (slice OOB: "range end index 6 out of range
     for slice of length 3") on any checkpoint containing an int32 tensor.
     Fixed: I32 -> DType::I32, I16 -> DType::I16 (both first-class candle
     dtypes). Reproduced on old code; pinned in tests/checkpoint_loading.rs.
  2. LOW (MEASURED): predict() lacked frame/batch validation at the input
     boundary. f_in > num_frames*2 over-indexed the temporal embedding (cryptic
     candle "gather" error); zero frame/batch fed a zero-element tensor in. Now
     rejected with a clear ShapeMismatch. Pinned in tests/input_validation.rs.
  3. LOW (MEASURED): divide-by-zero panic in the public VQCodebook::encode on a
     rank-0 / empty-last-dim tensor (last == 0). Now fails closed with a clear
     error. Pinned in vqvae.rs unit tests.

  Dimensions confirmed clean with evidence: panic surface (no
  unwrap/expect/panic in prod paths), NaN-state-poisoning (N/A: stateless
  engine, u8 input), unbounded-alloc/shape-data mismatch (defended upstream by
  safetensors::validate), secrets (none). unsafe_code = forbid.

  Validation (MEASURED, Windows): crate 31/31 pass; workspace 0 failed (lone
  desktop api_integration "Access is denied" file-lock flake passes 21/21 in
  isolation); Python proof VERDICT PASS, hash f8e76f21…446f7a unchanged.

  Warrants ADR slot 179 (parent to author).

  Co-Authored-By: claude-flow <ruv@ruv.net>

* docs(adr): ADR-179: occworld-candle checkpoint-load hardening (closes Milestone #9)

  Records the HIGH int32-checkpoint crash fix (I32→I64 dtype-widening →
  slice-OOB panic on load = DoS) + 2 LOW degenerate-input fixes from 5e77f47e5.
  Stateless engine (NaN-poisoning N/A), unsafe forbidden, safetensors
  validate() defends malloc upstream. occworld 31/31. Final ungated crate;
  Milestone #9 complete.

  Co-Authored-By: claude-flow <ruv@ruv.net>
1 parent 10c813f commit c859f6f

7 files changed

Lines changed: 471 additions & 1 deletion

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
+# ADR-179: `wifi-densepose-occworld-candle` Checkpoint-Load Hardening
+
+| Field | Value |
+|-------|-------|
+| **Status** | Accepted: 1 HIGH + 2 LOW bugs fixed + pinned (MEASURED on Windows) |
+| **Date** | 2026-06-15 |
+| **Deciders** | ruv |
+| **Codename** | **OCCWORLD-DTYPE** |
+| **Reviews** | `wifi-densepose-occworld-candle` (Candle occupancy-world model) |
+| **Milestone** | #9 (ungated-crate security sweep), crate 4 of 4. **CLOSES the milestone** |
+
+## Context
+
+`wifi-densepose-occworld-candle` is a Candle-based occupancy-world model
+(VQ-VAE + transformer over occupancy tokens). The real risk surface for an ML
+crate is degenerate-input / malformed-weights handling: a `#[forbid(unsafe_code)]`
+crate can still **panic** (a DoS, and under WASM an abort) when a tensor op hits an
+inconsistent shape. The crate **builds and tests on Windows**, so all findings are
+MEASURED.
+
+## Decision
+
+Fix the three reachable bugs, each pinned by a fails-on-old test; attest the rest
+clean with evidence.
+
+### Findings fixed (all MEASURED)
+
+| # | Severity | Location | Issue | Fix |
+|---|----------|----------|-------|-----|
+| 1 | **HIGH** | `model.rs:95` (`Dtype::I32 => Some(DType::I64)`) | **Crash on any int32-tensor checkpoint.** An I32 byte buffer (4 B/elem) is handed to `from_raw_buffer(.., I64, shape, ..)`; candle derives `elem_count = data.len()/8`, **halving** the count while keeping the original shape → a tensor that claims 2× its storage. Reading it **panics** with a slice-OOB (`range end index 6 out of range for slice of length 3`) inside candle-core. A checkpoint with any int32 tensor (index/buffer tensors are common in PyTorch exports) → **DoS on load**. | Map `I32 → DType::I32`, `I16 → DType::I16` (both first-class candle dtypes). Pinned by `int32_tensor_loads_with_consistent_shape_and_values` (panics on old, passes on new). |
+| 2 | LOW | `inference.rs::predict` | Frame/batch dims weren't validated (only H/W/D were): `f_in > num_frames*2` over-indexes the temporal embedding → a cryptic candle `InvalidIndex` *error* (not a panic; candle bounds-checks); zero frame/batch feeds a zero-element tensor. | Boundary guard rejects zero / over-capacity frame+batch with a clear `ShapeMismatch`. 5 pins. |
+| 3 | LOW | `vqvae.rs:141` (`z.elem_count() / last`) | **Divide-by-zero panic** in public `VQCodebook::encode` on a rank-0 / empty-last-dim tensor (`last == 0`). | Fail-closed guard returns a clear error. Pinned by `encode_rejects_scalar_without_panicking`. |
+
+The HIGH finding is the notable one: the crate's own dtype mapping **defeated**
+the upstream `safetensors::validate()` byte-length guarantee by misdeclaring the
+dtype. This was the one place malformed/widened weights could reach a panicking candle op.
+
+### Dimensions confirmed clean (with evidence)
+
+- **Panic surface**: grep for `unwrap()/expect()/panic!/unreachable!` across `src/` →
+  **zero in production paths**; all ops use `?`/`map_err`; the `last().unwrap_or(&0)`
+  is now guarded. `as` casts operate only on config-bounded/internal values.
+- **NaN-state-poisoning (the named class): N/A.** The engine is **stateless between
+  `predict` calls** (no persistent world-model buffer to latch into), and input is
+  `u8` class indices (non-finite input structurally impossible). NaN weights flow to
+  `argmax` (deterministic, bounded to a valid class index): no panic, no persistence.
+- **Unbounded alloc / shape-data mismatch from malformed weights**: defended upstream
+  by `safetensors::validate()` (overflow-checked `nelements*dtype.size()` vs declared
+  byte range + contiguous-offset + buffer-length checks), rejected before reaching
+  candle. Finding #1 was the one place the crate defeated that guarantee.
+- **Model/path loading**: `load`/`load_safetensors` check `path.exists()` → typed
+  `CheckpointNotFound`; corrupt bytes → `CheckpointParse` (pinned). No path-traversal
+  surface (caller-supplied path, opened read-only, never joined with untrusted segments).
+- **Secrets**: grep clean (only `token_h`/`token_w` config fields match `token`).
+- **Determinism**: the crate's central honesty claim, verified by the pre-existing
+  `tests/predict_honesty.rs` (3 tests, still pass).
+- `unsafe_code = "forbid"` in the manifest.
+
+## Validation
+
+- `cargo test -p wifi-densepose-occworld-candle --no-default-features` → **31/31**
+  (lib 17, checkpoint_loading 4, input_validation 5, predict_honesty 3, doctests 2),
+  0 failed.
+- `cargo test --workspace --no-default-features` → 0 failed across every crate (a lone
+  `wifi-densepose-desktop --test api_integration` "Access is denied (os error 5)" was a
+  Windows file-lock/AV flake; re-ran isolated 21/21, unrelated).
+- `python archive/v1/data/proof/verify.py` → **VERDICT: PASS**, hash `f8e76f21…46f7a`
+  unchanged (occworld off the signal proof path).
+
+## Consequences
+
+### Positive
+- A checkpoint-load DoS (the int32 dtype-widening panic) and two degenerate-input
+  bugs (a divide-by-zero panic and a cryptic index error) are closed in the
+  world-model crate, each pinned. **Milestone #9 (all 4 ungated crates) is complete.**
+
+### Negative / Neutral
+- None. Guards reject only malformed/degenerate inputs.
+
+## Links
+- ADR-176 / ADR-177 / ADR-178: sibling Milestone-#9 reviews (ruview-swarm, nvsim, desktop)
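
The byte-length guarantee that the ADR credits to `safetensors::validate()` can be illustrated with a small stand-alone sketch. This is not the safetensors implementation: `expected_byte_len` and `tensor_is_consistent` are hypothetical names, and only the overflow-checked `nelements * dtype_size` comparison against the declared byte range is shown (the offset-contiguity and buffer-length checks are omitted).

```rust
/// Hypothetical helper: bytes a tensor of `shape` needs at `dtype_size` bytes
/// per element, or `None` if any multiplication overflows `usize`.
/// A rank-0 shape has an empty product, i.e. exactly one element.
fn expected_byte_len(shape: &[usize], dtype_size: usize) -> Option<usize> {
    shape
        .iter()
        .try_fold(1usize, |acc, &d| acc.checked_mul(d))?
        .checked_mul(dtype_size)
}

/// Hypothetical check: the declared `[start, end)` byte range must be exactly
/// as long as the shape and dtype require. Overflow or a reversed range fails.
fn tensor_is_consistent(shape: &[usize], dtype_size: usize, start: usize, end: usize) -> bool {
    match (expected_byte_len(shape, dtype_size), end.checked_sub(start)) {
        (Some(need), Some(have)) => need == have,
        _ => false,
    }
}
```

Under these assumptions, a `[2, 3]` tensor declared as I32 (4 bytes per element) with a 24-byte range passes, while the same 24 bytes declared as an 8-byte dtype would be rejected. Finding #1 slipped past this kind of check because the file declared I32 correctly and the widening happened afterwards, inside the crate.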

v2/crates/wifi-densepose-occworld-candle/src/inference.rs

Lines changed: 25 additions & 0 deletions
@@ -206,6 +206,27 @@ impl OccWorldCandle {
             )));
         }
 
+        // Validate the externally-supplied frame and batch counts at this
+        // system boundary. The temporal positional embedding has only
+        // `num_frames * 2` rows, so a larger `f_in` would over-index the
+        // embedding table deep inside the transformer and surface as a cryptic
+        // "gather" index error; a zero frame/batch count would feed a
+        // zero-element tensor into the reshape/conv pipeline. Reject both here
+        // with a clear, domain-level error instead.
+        if f_in == 0 || b == 0 {
+            return Err(OccWorldError::ShapeMismatch(format!(
+                "past_occupancy must have non-zero batch and frame dims, got \
+                 batch={b}, frames={f_in}"
+            )));
+        }
+        if f_in > cfg.num_frames * 2 {
+            return Err(OccWorldError::ShapeMismatch(format!(
+                "past_occupancy frame count {f_in} exceeds the temporal embedding \
+                 capacity ({} = num_frames*2)",
+                cfg.num_frames * 2
+            )));
+        }
+
         // ── Step 1: VQVAE encode each past frame ──────────────────────────
         // Flatten batch*frames: (B, F, H, W, D) → (B*F, H, W, D)
         let occ_flat = past_occupancy
@@ -455,4 +476,8 @@ mod tests {
             "expected CheckpointNotFound, got {result:?}"
         );
     }
+
+    // The `predict` input-validation boundary guards (zero/over-capacity frame
+    // counts) live in `tests/input_validation.rs` so they exercise only the
+    // public API and keep this file under the 500-line limit.
 }
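
The boundary guard added to `predict` can be exercised in isolation. The sketch below is a stand-alone approximation, not the crate's code: `SketchError` and `check_frames` are hypothetical names, and only the variant name `ShapeMismatch` and the `num_frames * 2` capacity rule are taken from the diff above.

```rust
/// Hypothetical stand-in for the crate's error type; only the variant name
/// `ShapeMismatch` comes from the diff.
#[derive(Debug, PartialEq)]
enum SketchError {
    ShapeMismatch(String),
}

/// Reject zero-sized or over-capacity inputs before any tensor work happens.
/// `num_frames` mirrors `cfg.num_frames`; the temporal embedding is assumed
/// to hold `num_frames * 2` rows, as stated in the comment in the diff.
fn check_frames(b: usize, f_in: usize, num_frames: usize) -> Result<(), SketchError> {
    if f_in == 0 || b == 0 {
        return Err(SketchError::ShapeMismatch(format!(
            "non-zero batch and frame dims required, got batch={b}, frames={f_in}"
        )));
    }
    let capacity = num_frames * 2;
    if f_in > capacity {
        return Err(SketchError::ShapeMismatch(format!(
            "frame count {f_in} exceeds temporal embedding capacity {capacity}"
        )));
    }
    Ok(())
}
```

With `num_frames = 2`, `check_frames(1, 4, 2)` is accepted (4 frames fit the 4-row embedding), while `check_frames(1, 5, 2)`, `check_frames(0, 4, 2)` and `check_frames(1, 0, 2)` all return `ShapeMismatch`. Checking at the boundary keeps the failure message in domain terms instead of surfacing an index error from deep inside the transformer.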

v2/crates/wifi-densepose-occworld-candle/src/model.rs

Lines changed: 14 additions & 1 deletion
@@ -92,8 +92,21 @@ fn safetensor_dtype_to_candle(dt: safetensors::Dtype) -> Option<candle_core::DTy
         Dtype::F64 => Some(DType::F64),
         Dtype::F16 => Some(DType::F16),
         Dtype::BF16 => Some(DType::BF16),
-        Dtype::I32 => Some(DType::I64), // widen for Candle compatibility
+        // I32 MUST map to DType::I32, not I64. `Tensor::from_raw_buffer`
+        // derives its element count from `data.len() / dtype.size_in_bytes()`;
+        // handing an int32 byte buffer (4 bytes/elem) to the I64 path
+        // (8 bytes/elem) halves the element count while keeping the original
+        // shape, producing a tensor whose declared shape claims twice as many
+        // elements as its storage holds. That silent shape/storage mismatch
+        // panics (slice OOB) the moment the tensor is read: a crash on any
+        // checkpoint containing an int32 tensor. See
+        // `tests/checkpoint_loading.rs::int32_tensor_loads_with_consistent_shape_and_values`.
+        Dtype::I32 => Some(DType::I32),
         Dtype::I64 => Some(DType::I64),
+        // I16 is also a first-class Candle dtype (2 bytes/elem); map it
+        // directly rather than rejecting it, for the same byte-size-correctness
+        // reason as I32 above.
+        Dtype::I16 => Some(DType::I16),
         Dtype::U8 => Some(DType::U8),
         Dtype::U32 => Some(DType::U32),
         _ => None,
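
The arithmetic behind the old mapping can be checked without candle. A minimal sketch, assuming only what the comment in the diff states (the element count is the byte length divided by the dtype size); `storage_elems` and `shape_elems` are hypothetical helper names, not candle APIs.

```rust
/// Elements a raw buffer holds when read at `dtype_size` bytes per element.
fn storage_elems(byte_len: usize, dtype_size: usize) -> usize {
    byte_len / dtype_size
}

/// Elements the declared shape claims (product of all dimensions).
fn shape_elems(shape: &[usize]) -> usize {
    shape.iter().product()
}

/// True when the storage and the declared shape agree on the element count.
fn shape_matches_storage(shape: &[usize], byte_len: usize, dtype_size: usize) -> bool {
    storage_elems(byte_len, dtype_size) == shape_elems(shape)
}
```

For the `[2, 3]` int32 tensor used in the regression test, the buffer is 24 bytes. Read at 4 bytes per element it holds 6 elements, matching the shape; read at 8 bytes per element it holds 3, so the shape claims twice what the storage contains. That is the "index 6 out of range for slice of length 3" condition described in the commit message.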

v2/crates/wifi-densepose-occworld-candle/src/vqvae.rs

Lines changed: 26 additions & 0 deletions
@@ -137,6 +137,17 @@ impl VQCodebook {
         let orig_shape = z.shape().clone();
         let orig_dims = orig_shape.dims().to_vec();
         let last = *orig_shape.dims().last().unwrap_or(&0);
+        // Guard the divide below: a scalar (rank-0) or empty-last-dim tensor
+        // would make `last == 0` and panic on the `elem_count() / last`
+        // division. `encode` is a `pub fn` on a `pub struct`, so this is a
+        // reachable public boundary; fail closed with a clear error instead.
+        if last == 0 {
+            return Err(candle_core::Error::Msg(format!(
+                "VQCodebook::encode expects a tensor with a non-zero last dim of \
+                 size embed_dim={}, got shape {orig_dims:?}",
+                self.embed_dim
+            )));
+        }
         // Flatten to (N, embed_dim)
         let n = z.elem_count() / last;
         let z_flat = z.reshape((n, last))?; // (N, D)
@@ -339,6 +350,21 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn encode_rejects_scalar_without_panicking() {
+        // A rank-0 (scalar) tensor has an empty dims list → `last == 0`.
+        // Before the guard this divided by zero and panicked; now it returns
+        // a clean error. `encode` is public, so this is a reachable boundary.
+        let device = Device::Cpu;
+        let codebook = VQCodebook::dummy(4, 8, &device).unwrap();
+        let scalar = Tensor::from_vec(vec![1.0f32], (), &device).unwrap();
+        let result = codebook.encode(&scalar);
+        assert!(
+            result.is_err(),
+            "scalar input must error, not panic; got {result:?}"
+        );
+    }
+
     #[test]
     fn test_fold_unfold_roundtrip() -> candle_core::Result<()> {
         let device = Device::Cpu;
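
The divide-by-zero guard in `encode` can also be expressed with `usize::checked_div`, which returns `None` for a zero divisor instead of panicking. This is an alternative formulation for illustration, not what the commit does (the commit uses an explicit `if last == 0` check); `flattened_rows` is a hypothetical helper that works on a plain dimension slice instead of a candle tensor.

```rust
/// Number of rows `N` when flattening `shape` to `(N, last)`, or an error
/// message if the last dimension is missing (rank 0) or zero.
fn flattened_rows(shape: &[usize]) -> Result<usize, String> {
    // Mirrors `dims().last().unwrap_or(&0)` from the diff: rank 0 yields 0.
    let last = shape.last().copied().unwrap_or(0);
    // An empty shape has an element count of 1 (a scalar).
    let elem_count: usize = shape.iter().product();
    elem_count
        .checked_div(last)
        .ok_or_else(|| format!("expected a non-zero last dim, got shape {shape:?}"))
}
```

A `[2, 3, 8]` input flattens to 6 rows of width 8, while both the rank-0 shape `[]` and a shape with an empty last dimension such as `[4, 0]` return an error. The explicit `if` in the commit has the advantage of producing the message before any arithmetic is attempted, which keeps the error text close to the condition it describes.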
v2/crates/wifi-densepose-occworld-candle/tests/checkpoint_loading.rs

Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
+//! Checkpoint-loading robustness tests for `crate::model::load_safetensors`.
+//!
+//! Security review (Milestone #9, crate 4/4). These tests pin the behaviour of
+//! the SafeTensors weight-loading path against malformed / degenerate
+//! checkpoints: the only externally-controlled file-input surface in the crate.
+//!
+//! The headline regression is the **int32 dtype-widening byte-size bug**
+//! (`security/occworld-candle` finding #1): `model.rs` mapped
+//! `safetensors::Dtype::I32` → `candle_core::DType::I64` and then handed the
+//! raw *int32* byte buffer (4 bytes/elem) to `Tensor::from_raw_buffer(.., I64,
+//! shape, ..)`. Candle's `from_raw_buffer` computes `elem_count =
+//! data.len() / 8`, producing a tensor whose declared shape claims twice as
+//! many elements as the backing storage actually holds: a silent
+//! shape/storage inconsistency on attacker-supplied checkpoints.
+//!
+//! `build_safetensors` hand-assembles the binary container
+//! (`<u64 LE header_len><JSON header><raw data>`) so the test states exactly
+//! what bytes reach the loader, independent of the `safetensors` writer API.
+
+use candle_core::Device;
+use wifi_densepose_occworld_candle::model::load_safetensors;
+
+/// Hand-build a single-tensor SafeTensors buffer.
+///
+/// `dtype` is the safetensors dtype string (e.g. `"I32"`, `"F32"`).
+/// `shape` is the declared shape. `data` is the raw little-endian tensor bytes;
+/// the caller is responsible for making `data.len()` consistent with
+/// `shape × dtype_size` (safetensors itself validates this, so an inconsistent
+/// pair is rejected before reaching the candle conversion).
+fn build_safetensors(name: &str, dtype: &str, shape: &[usize], data: &[u8]) -> Vec<u8> {
+    let shape_json: Vec<String> = shape.iter().map(|d| d.to_string()).collect();
+    let header = format!(
+        "{{\"{name}\":{{\"dtype\":\"{dtype}\",\"shape\":[{}],\"data_offsets\":[0,{}]}}}}",
+        shape_json.join(","),
+        data.len()
+    );
+    let header_bytes = header.into_bytes();
+    let mut buf = Vec::new();
+    buf.extend_from_slice(&(header_bytes.len() as u64).to_le_bytes());
+    buf.extend_from_slice(&header_bytes);
+    buf.extend_from_slice(data);
+    buf
+}
+
+fn write_temp(bytes: &[u8], stem: &str) -> std::path::PathBuf {
+    let mut p = std::env::temp_dir();
+    p.push(format!(
+        "occworld_ckpt_{stem}_{}_{}.safetensors",
+        std::process::id(),
+        // nanosecond-ish disambiguator so parallel tests never collide
+        std::time::SystemTime::now()
+            .duration_since(std::time::UNIX_EPOCH)
+            .map(|d| d.as_nanos())
+            .unwrap_or(0)
+    ));
+    std::fs::write(&p, bytes).expect("write temp checkpoint");
+    p
+}
+
+/// REGRESSION (finding #1): an int32 tensor in a checkpoint must load into a
+/// tensor whose element count matches its declared shape.
+///
+/// On the OLD code (`I32 -> DType::I64`) the 6-element int32 tensor below was
+/// handed to `from_raw_buffer(.., I64, [2,3], ..)`, which derived
+/// `elem_count = 24 bytes / 8 = 3` and built a 3-element storage carrying a
+/// shape claiming 6 elements; reading it panicked with a slice-OOB
+/// (`range end index 6 out of range for slice of length 3`). On the FIXED code
+/// (`I32 -> DType::I32`) the tensor round-trips: dtype I32, 6 elements,
+/// values `[1,2,3,4,5,6]`.
+#[test]
+fn int32_tensor_loads_with_consistent_shape_and_values() {
+    let device = Device::Cpu;
+    let shape = [2usize, 3];
+    let vals: [i32; 6] = [1, 2, 3, 4, 5, 6];
+    let mut data = Vec::with_capacity(24);
+    for v in vals {
+        data.extend_from_slice(&v.to_le_bytes());
+    }
+    let bytes = build_safetensors("quantize.embedding.weight", "I32", &shape, &data);
+    let path = write_temp(&bytes, "i32");
+
+    let map = load_safetensors(&path, &device).expect("int32 checkpoint must load");
+    let t = map
+        .get("quantize.embedding.weight")
+        .expect("mapped key present");
+
+    // The declared shape's element count MUST equal the storage's element
+    // count. On the old code these disagreed (6 vs 3).
+    assert_eq!(
+        t.dims(),
+        &[2, 3],
+        "int32 tensor must preserve its declared shape"
+    );
+    assert_eq!(
+        t.elem_count(),
+        6,
+        "element count must match shape (storage/shape consistency)"
+    );
+
+    // The dtype must be I32: the int32 byte buffer is interpreted as int32,
+    // not reinterpreted as half as many int64 lanes.
+    assert_eq!(
+        t.dtype(),
+        candle_core::DType::I32,
+        "int32 checkpoint tensor must load as DType::I32"
+    );
+
+    // And the values must be exactly recovered (no reinterpretation of two
+    // int32 lanes as one int64). This is the strongest proof the dtype is
+    // handled correctly end-to-end.
+    let flat = t.flatten_all().expect("flatten");
+    let got: Vec<i32> = flat.to_vec1::<i32>().expect("to_vec i32");
+    assert_eq!(
+        got,
+        vec![1i32, 2, 3, 4, 5, 6],
+        "int32 values must be recovered exactly"
+    );
+
+    let _ = std::fs::remove_file(&path);
+}
+
+/// A well-formed F32 tensor must round-trip unchanged (control case: proves
+/// the fix does not regress the common float path).
+#[test]
+fn f32_tensor_round_trips() {
+    let device = Device::Cpu;
+    let shape = [4usize];
+    let vals: [f32; 4] = [0.5, -1.0, 2.25, 3.0];
+    let mut data = Vec::with_capacity(16);
+    for v in vals {
+        data.extend_from_slice(&v.to_le_bytes());
+    }
+    let bytes = build_safetensors("post_quant_conv.bias", "F32", &shape, &data);
+    let path = write_temp(&bytes, "f32");
+
+    let map = load_safetensors(&path, &device).expect("f32 checkpoint must load");
+    let t = map.get("post_quant_conv.bias").expect("key present");
+    assert_eq!(t.dims(), &[4]);
+    let got: Vec<f32> = t.to_vec1::<f32>().expect("to_vec f32");
+    assert_eq!(got, vec![0.5, -1.0, 2.25, 3.0]);
+
+    let _ = std::fs::remove_file(&path);
+}
+
145+
/// A truncated / corrupt header must produce a parse error, never a panic.
146+
/// (Defense-in-depth: the loader is fed an untrusted file.)
147+
#[test]
148+
fn corrupt_checkpoint_errors_cleanly() {
149+
let device = Device::Cpu;
150+
// Garbage that is not a valid SafeTensors container.
151+
let bytes = vec![0xFFu8; 32];
152+
let path = write_temp(&bytes, "corrupt");
153+
154+
let result = load_safetensors(&path, &device);
155+
assert!(
156+
result.is_err(),
157+
"corrupt checkpoint must error, got Ok: {result:?}"
158+
);
159+
160+
let _ = std::fs::remove_file(&path);
161+
}
162+
163+
/// An int64 tensor must still load correctly (proves the fix narrows only the
164+
/// I32 mapping and leaves the genuine I64 path intact).
165+
#[test]
166+
fn int64_tensor_round_trips() {
167+
let device = Device::Cpu;
168+
let shape = [3usize];
169+
let vals: [i64; 3] = [10, -20, 30];
170+
let mut data = Vec::with_capacity(24);
171+
for v in vals {
172+
data.extend_from_slice(&v.to_le_bytes());
173+
}
174+
let bytes = build_safetensors("transformer.output_head.bias", "I64", &shape, &data);
175+
let path = write_temp(&bytes, "i64");
176+
177+
let map = load_safetensors(&path, &device).expect("i64 checkpoint must load");
178+
let t = map.get("transformer.output_head.bias").expect("key present");
179+
assert_eq!(t.dims(), &[3]);
180+
assert_eq!(t.elem_count(), 3);
181+
let got: Vec<i64> = t.to_vec1::<i64>().expect("to_vec i64");
182+
assert_eq!(got, vec![10, -20, 30]);
183+
184+
let _ = std::fs::remove_file(&path);
185+
}
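
The container layout that `build_safetensors` writes (`<u64 LE header_len><JSON header><raw data>`) can be read back with a few bounds-checked slices. The sketch below is a hypothetical inverse for illustration only: `split_container` is not part of the crate or of the `safetensors` API, and it does not parse or validate the JSON header.

```rust
/// Split a SafeTensors-style buffer into (JSON header text, raw data bytes).
/// Returns `None` on truncated input, an oversized header length, or a
/// non-UTF-8 header, instead of panicking.
fn split_container(buf: &[u8]) -> Option<(&str, &[u8])> {
    // First 8 bytes: little-endian u64 header length.
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(buf.get(..8)?);
    let header_len = u64::from_le_bytes(len_bytes);

    // Reject lengths that run past the end of the buffer before casting,
    // so the addition below cannot overflow.
    if header_len > (buf.len() - 8) as u64 {
        return None;
    }
    let header_end = 8 + header_len as usize;

    let header = std::str::from_utf8(&buf[8..header_end]).ok()?;
    Some((header, &buf[header_end..]))
}
```

Fed the 32 bytes of `0xFF` used by `corrupt_checkpoint_errors_cleanly`, this sketch reads a header length of `u64::MAX`, sees that it exceeds the remaining 24 bytes, and returns `None`. The test above expects the same kind of clean failure from the real loader.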
