feat: Add SAM 3D Body — monocular 3D body mesh on Apple Silicon by shihwesley · Pull Request #922 · Blaizzy/mlx-vlm

shihwesley · 2026-04-04T18:53:46Z

SAM 3D Body — monocular 3D body mesh on Apple Silicon via MLX

Adds SAM 3D Body (Meta Research) as a new model in mlx-vlm, running entirely on Apple Silicon via MLX. Produces 18,439-vertex body meshes + 70 3D keypoints from a single RGB image.

Architecture

Backbone: DINOv3 ViT-H+ (32 blocks, 1280 dim, SwiGLU FFN, RoPE)
Decoder: 6-layer cross-attention transformer with LaPE, iterative refinement
Body model: MHR (Meta Human Representation) — FK, LBS, pose correctives, blend shapes

PyTorch Parity (verified)

9 bugs fixed to achieve numerical match with the PyTorch/CUDA reference:

Body model (3 critical):

Removed parameter_limits from inference (JIT model skips it, was causing ~17mm FK errors)
Fixed scale formula: exp(dof × ln(2)) not 1 + dof
Fixed pose correctives input: 750D 6D rotation features, not 889D raw joint DOFs
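To make the scale fix concrete, here is a minimal sketch comparing the two formulas. The function names are hypothetical and not taken from the PR; only the formulas `exp(dof × ln(2))` and `1 + dof` come from the list above.

```python
import math


def scale_exponential(dof: float) -> float:
    """Scale as stated in the fix: exp(dof * ln(2)), which equals 2 ** dof."""
    return math.exp(dof * math.log(2.0))


def scale_linear(dof: float) -> float:
    """The formula the fix replaced: 1 + dof."""
    return 1.0 + dof


# The two agree at dof = 0 and dof = 1 and diverge elsewhere, which is
# one way such a bug can stay hidden on near-neutral inputs.
for dof in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(dof, scale_exponential(dof), scale_linear(dof))
```

The exponential form is always positive, while the linear form reaches zero at `dof = -1`.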

Pipeline (6 convention fixes):

CLIFF scale: bbox_width * 1.25 matching GetBBoxCenterScale(padding=1.25)
Default focal length: sqrt(H²+W²) (image diagonal)
Pelvis normalization: average of left/right hip keypoints
Perspective projection: uses cam_int focal when available
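The first three conventions above can be sketched in plain Python. This is an illustration under assumptions: the helper names are hypothetical, and only the formulas (bbox width times 1.25, image diagonal as default focal length, pelvis as the mean of the two hips) come from the list.

```python
import math


def cliff_bbox_scale(bbox_width: float, padding: float = 1.25) -> float:
    """Padded bbox scale: bbox_width * 1.25, per the CLIFF scale fix."""
    return bbox_width * padding


def default_focal_length(height: int, width: int) -> float:
    """Fallback focal length in pixels: the image diagonal sqrt(H^2 + W^2)."""
    return math.sqrt(height**2 + width**2)


def pelvis_from_hips(left_hip, right_hip):
    """Pelvis position as the average of the left and right hip keypoints."""
    return tuple((l + r) / 2 for l, r in zip(left_hip, right_hip))


print(cliff_bbox_scale(200.0))
print(default_focal_length(1080, 1920))
print(pelvis_from_hips((0.0, 0.0, 0.0), (2.0, 4.0, 6.0)))
```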

Result: Body model matches PyTorch JIT within 0.0001mm across all 18,439 vertices.

Benchmark (M3 Max, 4s clip @ 60fps)

Metric	PyTorch/MPS	MLX	Speedup
Inference median	667ms	488ms	1.37x
Inf + mesh render	882ms	701ms	1.26x
Model load	9.7s	5.0s	2.0x
FPS (inference)	1.5	2.1	—
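The speedup column reads as the PyTorch/MPS time divided by the MLX time. A quick check on the two inference rows, using only the millisecond figures from the table (the `speedup` helper is hypothetical):

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Ratio of baseline time to candidate time; above 1 means the candidate is faster."""
    return baseline_ms / candidate_ms


inference = round(speedup(667, 488), 2)    # "Inference median" row
with_render = round(speedup(882, 701), 2)  # "Inf + mesh render" row
print(inference, with_render)
```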

Files added

mlx_vlm/models/sam3d_body/
├── backbone.py          # DINOv3 ViT-H+ with SwiGLU, RoPE, LayerScale
├── batch_prep.py        # Image cropping, normalization, CLIFF conditioning
├── camera.py            # Ray map computation for camera conditioning
├── config.py            # Model configuration
├── convert_weights.py   # PyTorch checkpoint → MLX safetensors
├── decoder.py           # Cross-attention transformer decoder
├── generate.py          # Single-image prediction API
├── language.py          # mlx-vlm integration stub
├── layers.py            # SwiGLU FFN, LayerNorm32
├── mhr_body.py          # MHR body model (FK, LBS, blend shapes, pose correctives)
├── mhr_head.py          # MHR head (pose decoding, body model interface)
├── mhr_utils.py         # Rotation math (euler, quaternion, 6D)
├── model.py             # Full SAM3DBody model with iterative decoding
├── prompt_encoder.py    # Point/box prompt encoding
├── rope.py              # 2D Rotary Position Embeddings
├── transformer.py       # Decoder transformer layers
└── vision.py            # mlx-vlm integration stub

🤖 Generated with Claude Code

Port of Meta's SAM 3D Body to MLX for Apple Silicon inference. Takes a single RGB image + person bbox, outputs SMPL-compatible body mesh vertices (18,439), 70 3D keypoints, and camera params.

Architecture: DINOv3 ViT-H+ backbone → ray-conditioned features → transformer decoder → MHR parametric body model.

Includes:
- Full model (backbone, decoder, MHR head, prompt encoder)
- Weight converter (PyTorch checkpoint → safetensors)
- Predictor API (generate.py) matching SAM 3/3.1 pattern
- mlx-vlm exports (Model, ModelConfig, VisionModel, LanguageModel)

Known limitations:
- scatter_add uses numpy round-trip (no native MLX op)
- Hand model inference not yet supported (body only)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

shihwesley · 2026-04-05T04:21:10Z

Results

Target (PyTorch/MPS — full mesh render)

Current (MLX on Apple Silicon — mesh overlay)

Body model (3 critical):
- Remove parameter_limits from inference (JIT model skips it)
- Fix scale: exp(dof * ln(2)) not 1 + dof
- Fix pose correctives: 750D 6D rotation features not 889D raw DOFs

Pipeline (6 convention fixes):
- CLIFF scale: bbox_width * 1.25 (not max(w,h) * 1.2)
- Default focal: sqrt(H^2+W^2) (image diagonal)
- Pelvis norm: average kp3d[9,10] (not joint_coords[0])
- Projection: use cam_int focal when available

After fix: body model matches PyTorch JIT within 0.0001mm.
Benchmark: MLX 1.37x faster than PyTorch/MPS on M3 Max.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Follows the sam3_1 README structure. Covers architecture (DINOv3 backbone, transformer decoder, MHR body model with FK/skinning), quick start, CLI usage, weight conversion, benchmarks on M3 Max 36GB, and file layout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Point to facebookresearch/sam-3d-body instead of placeholder URL. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

shihwesley · 2026-04-06T00:51:26Z

Updated results (post bug fixes)

After fixing 3 body model bugs (parameter_limits at inference, scale formula, pose correctives input), vertex error dropped from ~17mm to <0.001mm vs PyTorch.

Inference: ~490ms/frame on M3 Max 36GB.

Adds previously-untracked estimator.py so the high-level SAM3DBodyEstimator API is actually part of the PR, not just the author's local working tree. mhr_head now returns pred_shape alongside pred_model_params in its forward output, and the estimator surfaces pred_pose (model_params [:136]) and pred_shape as numpy arrays on the returned dict. These are the parametric body outputs downstream rigging, animation, and inverse-kinematics code needs — matching the parity surface of the reference PyTorch implementation. Verified end-to-end via a synthetic-image smoke test (estimator loads, predicts, deterministic, all shapes match expected).

Delete 18 duplicate model files and replace sam3d_mlx with a thin sys.modules facade aliasing mlx_vlm.models.sam3d_body.* under the sam3d_mlx.* namespace. Every existing callsite across tests, scripts, backend, and benchmarks resolves through the facade unchanged. SAM3DBodyEstimator.__module__ is now mlx_vlm.models.sam3d_body.estimator with no wrapping layer, so class identity, isinstance, and pickling all work across the boundary.

The baseball-specific glue stays local:
- __init__.py: the sys.modules facade
- __main__.py: CLI (python -m sam3d_mlx)
- sam31_detector.py: SAM 3.1 pitcher detector wrapper
- video.py: per-pitch video I/O and skeleton rendering

Model code now lives upstream in mlx-vlm on branch feat/sam-3d-body (PR Blaizzy/mlx-vlm#922). Pinned via a new requirements.txt using a git+ URL. For local dev, install editable:

pip install -e /path/to/mlx-vlm --break-system-packages

Swap the git pin for a released version after PR 922 merges.

Verification: tests/test_e2e.py still at 12/13. The 2 failures are the pre-existing stale-key assertions around pred_pose/pred_shape, unchanged from the pre-decoupling baseline and not introduced here.

Also adds tests/test_pr_smoke.py, a minimal inference smoke test that imports directly from mlx_vlm.models.sam3d_body to validate the upstream side of the split independently of the facade.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
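For readers unfamiliar with the pattern, a sys.modules facade can be sketched as follows. This is not the code from the commit: `upstream_pkg`, `legacy_pkg`, `Estimator`, and `install_facade` are hypothetical stand-ins, and the upstream package is simulated with `types.ModuleType` so the example is self-contained.

```python
import importlib
import sys
import types

# Simulated "upstream" package with one submodule (hypothetical names).
upstream = types.ModuleType("upstream_pkg")
estimator_mod = types.ModuleType("upstream_pkg.estimator")


class Estimator:
    """Stand-in for a class that lives in the upstream package."""


Estimator.__module__ = "upstream_pkg.estimator"
estimator_mod.Estimator = Estimator
upstream.estimator = estimator_mod
sys.modules["upstream_pkg"] = upstream
sys.modules["upstream_pkg.estimator"] = estimator_mod


def install_facade(alias: str, target: str) -> None:
    """Register every loaded module of `target` under the `alias` namespace."""
    prefix = target + "."
    for name, module in list(sys.modules.items()):
        if name == target or name.startswith(prefix):
            sys.modules[alias + name[len(target):]] = module


install_facade("legacy_pkg", "upstream_pkg")

# Old import paths now resolve to the very same module objects.
legacy_mod = importlib.import_module("legacy_pkg.estimator")
LegacyEstimator = legacy_mod.Estimator
print(LegacyEstimator is Estimator, LegacyEstimator.__module__)
```

Because the alias points at the same module object rather than a wrapper, the class keeps its original `__module__`, which is what lets identity and `isinstance` checks hold across both namespaces.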

blueridanus

Cool! Works on my machine. Some thoughts:

the hand pass is not implemented - left for a future PR? worth a mention
mesh overlay for the single image inference would be useful as well (like you have for video)

blueridanus · 2026-04-17T20:29:04Z

+SAM 3D Body weights come as a PyTorch `.ckpt` plus a TorchScript JIT `.pt` file for the MHR body model. The converter handles both:
+
+```bash
+python -m sam3d_mlx.convert_weights \


All those one-liners in the README won't run as-is, i.e. you'll want python -m mlx_vlm.sam3d_body.convert_weights etc. instead.

Done in a2da361 and 2deac90 — all python -m sam3d_mlx... CLI snippets rewritten to python -m mlx_vlm.models.sam3d_body....

blueridanus · 2026-04-18T05:45:46Z

+├── batch_prep.py          # Crop, resize, ImageNet normalize, CLIFF conditioning
+├── estimator.py           # SAM3DBodyEstimator — preprocessing + inference + OBJ export
+├── generate.py            # SAM3DPredictor — mlx-vlm from_pretrained/predict API
+├── video.py               # Video pipeline with skeleton overlay rendering


-> video.py
This hasn't been committed.

@shihwesley this is indeed missing

Committed in a2da361.

Shipped in a2da361, thanks for flagging.

Blaizzy · 2026-04-18T12:12:58Z

+SAM 3D Body weights come as a PyTorch `.ckpt` plus a TorchScript JIT `.pt` file for the MHR body model. The converter handles both:
+
+```bash
+python -m sam3d_mlx.convert_weights \


Suggested change

python -m sam3d_mlx.convert_weights \

python -m mlx_vlm.sam3d_body.convert_weights \

Applied in 2deac90.

Blaizzy · 2026-04-18T13:27:20Z

(Prince) TODO: remove in favour of sanitize at model level

Addressed in 2deac90 — Model.sanitize() now owns all renames with a backward-compat canary keyed on backbone.encoder.cls_token / character_torch.*. convert_weights.py keeps its rename logic for existing pipelines; its docstring points at sanitize() as the canonical naming source.

Blaizzy

Overall LGTM, just missing the video file reported by @blueridanus, then we can merge.

@blueridanus

Addresses the two blocking review comments on PR Blaizzy#922:
* Commits the previously uncommitted video.py referenced in the README (@blueridanus, @Blaizzy).
* Rewrites CLI snippets from `python -m sam3d_mlx...` to the proper `python -m mlx_vlm.models.sam3d_body...` namespace.

Adds overlay.py with single-image parity to the video pipeline:
* draw_skeleton_overlay: pure OpenCV, no extra deps
* render_mesh_overlay: photorealistic mesh via pyrender + trimesh (lazy imports so the module loads without them)
* load_faces: reads head_pose.faces from safetensors with a cached faces.npy alongside the weights

README also gets a Future Work section documenting the deferred hand-pose refinement pass (see Blaizzy's feedback on hand module status).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses Prince's TODO in convert_weights.py ("remove in favour of sanitize at model level"). Key naming is now owned by Model.sanitize(), which mlx_vlm.utils.load() calls automatically, so weights can be loaded from HF Hub without any bespoke converter step.

sanitize() detects raw PyTorch checkpoint format via a canary (`backbone.encoder.cls_token` or `character_torch.*`) and, when present, runs the full rename pipeline via _remap_raw_pytorch_keys():
* Fused QKV split: backbone.encoder.blocks.N.attn.qkv -> q_proj/k_proj/v_proj
* Backbone prefix rewrite: backbone.encoder.* -> backbone.*
* Conv2d layout: (O,I,H,W) -> (O,H,W,I) for patch_embed, mask_downscaling, ray_cond_emb.conv
* MHR JIT prefix mapping: character_torch.* -> mhr.character.*, etc.

When the canary is absent (already-sanitized safetensors on disk), the remap pipeline is skipped entirely, so existing /tmp/sam3d-mlx-weights/ layouts load unchanged. Verified on a 50-key sample from an existing converted checkpoint: 3 hand/skip-filter drops, 0 renames.

convert_weights.py is kept for now for backward compatibility; its docstring adds a pointer to Model.sanitize() as the canonical key-naming source. README's convert command switches to the new namespace to match the other -m invocations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
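A simplified sketch of the canary-gated rename described in this commit message. It is not the actual Model.sanitize(): the function names here are hypothetical, only the two prefix rewrites are modeled, and the QKV split, Conv2d transposes, and hand-weight filtering are deliberately left out. The key strings are the ones quoted above.

```python
CANARY_KEY = "backbone.encoder.cls_token"
CANARY_PREFIX = "character_torch."

# Prefix rewrites quoted in the commit message (a subset; the rest is omitted).
PREFIX_RENAMES = (
    ("backbone.encoder.", "backbone."),
    ("character_torch.", "mhr.character."),
)


def looks_like_raw_checkpoint(keys) -> bool:
    """True when any key matches the canary for an unconverted checkpoint."""
    return any(k == CANARY_KEY or k.startswith(CANARY_PREFIX) for k in keys)


def remap_key(key: str) -> str:
    """Apply the first matching prefix rewrite, or return the key unchanged."""
    for old, new in PREFIX_RENAMES:
        if key.startswith(old):
            return new + key[len(old):]
    return key


def sanitize_sketch(weights: dict) -> dict:
    """Rename keys only when the canary says the checkpoint is raw."""
    if not looks_like_raw_checkpoint(weights):
        return dict(weights)  # already converted: pass through untouched
    return {remap_key(k): v for k, v in weights.items()}


raw = {"backbone.encoder.cls_token": 1, "character_torch.skeleton": 2}
print(sanitize_sketch(raw))
```

Gating on a canary keeps the function idempotent: running it on already-converted weights changes nothing.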

shihwesley · 2026-04-22T21:56:43Z

Thanks for the review @blueridanus @Blaizzy — update pushed (rebased onto the latest merge from main).

New commits:

a2da361 — feat(sam3d_body): commit video.py and add single-image overlay helper
2deac90 — refactor(sam3d_body): move weight renames into Model.sanitize()

Addressed

Missing video.py (@blueridanus, @Blaizzy). Now committed at mlx_vlm/models/sam3d_body/video.py — the module the README references.
README namespace (@blueridanus; @Blaizzy's suggested change). All python -m sam3d_mlx... invocations rewritten to python -m mlx_vlm.models.sam3d_body.... --help verified resolving for both .video and .convert_weights.
sanitize() at model level (@Blaizzy's TODO). Model.sanitize() now owns all key renaming: fused-QKV split, backbone.encoder.* prefix rewrite, Conv2d (O,I,H,W)→(O,H,W,I) for patch_embed/mask_downscaling/ray_cond_emb, and MHR JIT prefix remap (character_torch.* → mhr.character.*, etc.). A canary check keyed on backbone.encoder.cls_token / character_torch.* means already-converted safetensors pass through untouched — verified on an existing checkpoint (50-key sample: 3 filter drops, 0 renames). convert_weights.py keeps its rename logic for backward compat; its docstring now points at sanitize() as the canonical naming source.
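The two tensor-level steps named above can be illustrated on nested lists in pure Python. This is a sketch under assumptions: it takes the fused QKV weight to be stacked as Q, K, V along the first axis (a common layout, not confirmed by this thread), and the helper names are hypothetical. Array code would express the layout change as a transpose with axes (0, 2, 3, 1).

```python
def split_fused_qkv(rows):
    """Split a fused (3*D, ...) weight into Q, K, V blocks along the first axis."""
    if len(rows) % 3 != 0:
        raise ValueError("fused QKV weight must have a row count divisible by 3")
    d = len(rows) // 3
    return rows[:d], rows[d:2 * d], rows[2 * d:]


def oihw_to_ohwi(w):
    """Reorder a conv weight from (O, I, H, W) to (O, H, W, I)."""
    n_out, n_in = len(w), len(w[0])
    height, width = len(w[0][0]), len(w[0][0][0])
    return [
        [
            [[w[o][i][y][x] for i in range(n_in)] for x in range(width)]
            for y in range(height)
        ]
        for o in range(n_out)
    ]


q, k, v = split_fused_qkv([[1], [2], [3], [4], [5], [6]])
conv = [[[[1, 2, 3]], [[4, 5, 6]]]]  # shape (O=1, I=2, H=1, W=3)
print(q, k, v)
print(oihw_to_ohwi(conv))
```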

Added

Single-image mesh overlay (@blueridanus). New overlay.py with draw_skeleton_overlay (pure OpenCV, no extra deps) and render_mesh_overlay (pyrender + trimesh, lazy-imported so the module stays importable without them). Example added to the README's new Single-Image Overlay section.
Hand pass deferral note (@blueridanus). New Future Work section explicitly calls out that decoder_hand.* / head_pose_hand.* weights are filtered out of this port and hand refinement is a follow-up PR. The full port plan is drafted; that PR will open once this one lands.
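The lazy-import approach mentioned for render_mesh_overlay can be sketched like this. The helper and function names are hypothetical and the body is a placeholder; the point is only that optional dependencies are imported inside the function that needs them, so importing the module never fails.

```python
import importlib


def require_optional(module_name: str, feature: str):
    """Import an optional dependency on demand, with a readable error if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} needs the optional package '{module_name}'; "
            "install it to use this feature"
        ) from exc


def render_mesh_overlay_sketch(image, vertices, faces):
    """Placeholder renderer: heavy imports happen here, not at module import time."""
    trimesh = require_optional("trimesh", "mesh overlay")
    pyrender = require_optional("pyrender", "mesh overlay")
    # Real rendering would go here; this sketch only shows the import pattern.
    return trimesh, pyrender, image, vertices, faces


# The helper itself works with any importable module name.
print(require_optional("math", "demo").sqrt(4))
```

Callers that only need the OpenCV skeleton overlay never trigger the pyrender or trimesh imports.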

Let me know if anything wants tweaking.

Blaizzy

LGTM, thanks @shihwesley!

Wesley Shih and others added 3 commits April 5, 2026 11:20

docs: fix SAM 3D Body upstream repo link

bb37ddb

Point to facebookresearch/sam-3d-body instead of placeholder URL. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

blueridanus reviewed Apr 17, 2026

blueridanus suggested changes Apr 18, 2026

Merge branch 'main' into feat/sam-3d-body

ea55d74

Blaizzy reviewed Apr 18, 2026

Blaizzy requested changes Apr 18, 2026

Wesley Shih and others added 2 commits April 22, 2026 16:56

shihwesley requested review from Blaizzy and blueridanus May 5, 2026 15:47

Merge branch 'main' into feat/sam-3d-body

3361226

Blaizzy approved these changes May 6, 2026

style: run pre-commit on sam3d_body

feaa0d1

Blaizzy merged commit cbbc56f into Blaizzy:main May 6, 2026
1 check passed

