feat: Add SAM 3D Body — monocular 3D body mesh on Apple Silicon#922
Port of Meta's SAM 3D Body to MLX for Apple Silicon inference. Takes a single RGB image + person bbox, outputs SMPL-compatible body mesh vertices (18,439), 70 3D keypoints, and camera params.

Architecture: DINOv3 ViT-H+ backbone → ray-conditioned features → transformer decoder → MHR parametric body model.

Includes:
- Full model (backbone, decoder, MHR head, prompt encoder)
- Weight converter (PyTorch checkpoint → safetensors)
- Predictor API (generate.py) matching SAM 3/3.1 pattern
- mlx-vlm exports (Model, ModelConfig, VisionModel, LanguageModel)

Known limitations:
- scatter_add uses numpy round-trip (no native MLX op)
- Hand model inference not yet supported (body only)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
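The scatter_add limitation above means the accumulation step leaves MLX, runs in NumPy, and comes back. A minimal sketch of the NumPy half only, assuming an unbuffered `np.add.at` accumulation; the helper name `scatter_add_rows` and the shapes are hypothetical, and the conversion to and from MLX arrays is left out so the block runs with NumPy alone.

```python
import numpy as np


def scatter_add_rows(values: np.ndarray, index: np.ndarray, num_rows: int) -> np.ndarray:
    """Sum rows of `values` into the output rows selected by `index`.

    Hypothetical helper: it only illustrates the NumPy side of a round-trip.
    """
    out = np.zeros((num_rows,) + values.shape[1:], dtype=values.dtype)
    # np.add.at is unbuffered, so repeated indices accumulate instead of overwriting.
    np.add.at(out, index, values)
    return out


values = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
index = np.array([0, 2, 0])  # rows 0 and 2 of `values` both land in output row 0
result = scatter_add_rows(values, index, num_rows=3)
```

Plain fancy-index assignment (`out[index] += values`) would keep only one contribution per repeated index, which is why the unbuffered form is used here.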
Body model (3 critical):
- Remove parameter_limits from inference (JIT model skips it)
- Fix scale: exp(dof * ln(2)) not 1 + dof
- Fix pose correctives: 750D 6D rotation features not 889D raw DOFs

Pipeline (6 convention fixes):
- CLIFF scale: bbox_width * 1.25 (not max(w,h) * 1.2)
- Default focal: sqrt(H^2+W^2) (image diagonal)
- Pelvis norm: average kp3d[9,10] (not joint_coords[0])
- Projection: use cam_int focal when available

After fix: body model matches PyTorch JIT within 0.0001mm.
Benchmark: MLX 1.37x faster than PyTorch/MPS on M3 Max.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
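The scale, CLIFF, and focal conventions listed above reduce to a few lines of arithmetic. The sketch below is an illustration only: the function names are hypothetical and are not taken from the PR's code.

```python
import math


def dof_to_scale(dof: float) -> float:
    """Scale fix: exp(dof * ln 2), i.e. 2 ** dof, rather than 1 + dof."""
    return math.exp(dof * math.log(2.0))


def cliff_scale(bbox_width: float) -> float:
    """CLIFF scale: bbox width padded by 1.25 (not max(w, h) * 1.2)."""
    return bbox_width * 1.25


def default_focal(height: int, width: int) -> float:
    """Default focal length in pixels: the image diagonal, sqrt(H^2 + W^2)."""
    return math.hypot(height, width)
```

The two scale formulas agree at dof = 0 and dof = 1 and diverge elsewhere. At dof = -1, for example, the exponential form gives a scale of 0.5, while `1 + dof` collapses to 0.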
Follows the sam3_1 README structure. Covers architecture (DINOv3 backbone, transformer decoder, MHR body model with FK/skinning), quick start, CLI usage, weight conversion, benchmarks on M3 Max 36GB, and file layout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Point to facebookresearch/sam-3d-body instead of placeholder URL. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds previously-untracked estimator.py so the high-level SAM3DBodyEstimator API is actually part of the PR, not just the author's local working tree. mhr_head now returns pred_shape alongside pred_model_params in its forward output, and the estimator surfaces pred_pose (model_params [:136]) and pred_shape as numpy arrays on the returned dict. These are the parametric body outputs downstream rigging, animation, and inverse-kinematics code needs — matching the parity surface of the reference PyTorch implementation. Verified end-to-end via a synthetic-image smoke test (estimator loads, predicts, deterministic, all shapes match expected).
Delete 18 duplicate model files and replace sam3d_mlx with a thin sys.modules facade aliasing mlx_vlm.models.sam3d_body.* under the sam3d_mlx.* namespace. Every existing callsite across tests, scripts, backend, and benchmarks resolves through the facade unchanged: SAM3DBodyEstimator.__module__ is now mlx_vlm.models.sam3d_body.estimator with no wrapping layer, so class identity, isinstance, and pickling all work across the boundary.

The baseball-specific glue stays local:
  __init__.py: the sys.modules facade
  __main__.py: CLI (python -m sam3d_mlx)
  sam31_detector.py: SAM 3.1 pitcher detector wrapper
  video.py: per-pitch video I/O and skeleton rendering

Model code now lives upstream in mlx-vlm on branch feat/sam-3d-body (PR Blaizzy/mlx-vlm#922). Pinned via a new requirements.txt using a git+ URL. For local dev, install editable:

  pip install -e /path/to/mlx-vlm --break-system-packages

Swap the git pin for a released version after PR 922 merges.

Verification: tests/test_e2e.py still at 12/13; the 2 failures are the pre-existing stale-key assertions around pred_pose/pred_shape, unchanged from the pre-decoupling baseline and not introduced here.

Also adds tests/test_pr_smoke.py, a minimal inference smoke test that imports directly from mlx_vlm.models.sam3d_body to validate the upstream side of the split independently of the facade.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
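A minimal sketch of the sys.modules facade idea described above, using stand-in modules so it runs without mlx-vlm installed. The names `upstream_pkg`, `facade_pkg`, and `alias_namespace` are hypothetical placeholders (for `mlx_vlm.models.sam3d_body`, `sam3d_mlx`, and whatever the real `__init__.py` does); the actual facade may be written differently.

```python
import importlib
import sys
import types

# Stand-in for the upstream package and one of its submodules.
upstream = types.ModuleType("upstream_pkg")
upstream.__path__ = []  # mark it as a package
estimator = types.ModuleType("upstream_pkg.estimator")


class Estimator:
    pass


# Pretend the class was defined inside the upstream submodule.
Estimator.__module__ = "upstream_pkg.estimator"
estimator.Estimator = Estimator
upstream.estimator = estimator
sys.modules["upstream_pkg"] = upstream
sys.modules["upstream_pkg.estimator"] = estimator


def alias_namespace(src: str, dst: str) -> None:
    """Register every loaded module under `src` a second time under `dst`."""
    for name, module in list(sys.modules.items()):
        if name == src or name.startswith(src + "."):
            sys.modules[dst + name[len(src):]] = module


alias_namespace("upstream_pkg", "facade_pkg")

# Imports through the facade name resolve to the very same module objects.
facade_estimator = importlib.import_module("facade_pkg.estimator")
```

Because both names map to one module object, there is only one class object, which is what keeps `isinstance` checks and `__module__`-based pickling consistent across the two namespaces.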
blueridanus left a comment
Cool! Works on my machine. Some thoughts:
- the hand pass is not implemented - left for a future PR? worth a mention
- mesh overlay for the single image inference would be useful as well (like you have for video)
    SAM 3D Body weights come as a PyTorch `.ckpt` plus a TorchScript JIT `.pt` file for the MHR body model. The converter handles both:

    ```bash
    python -m sam3d_mlx.convert_weights \
None of those one-liners in the README will run as-is, e.g. you'll want python -m mlx_vlm.sam3d_body.convert_weights etc. instead.
Done in a2da361 and 2deac90 — all python -m sam3d_mlx... CLI snippets rewritten to python -m mlx_vlm.models.sam3d_body....
    ├── batch_prep.py   # Crop, resize, ImageNet normalize, CLIFF conditioning
    ├── estimator.py    # SAM3DBodyEstimator — preprocessing + inference + OBJ export
    ├── generate.py     # SAM3DPredictor — mlx-vlm from_pretrained/predict API
    ├── video.py        # Video pipeline with skeleton overlay rendering
video.py hasn't been committed.
Committed in a2da361.
Shipped in a2da361, thanks for flagging.
Suggested change:

    - python -m sam3d_mlx.convert_weights \
    + python -m mlx_vlm.sam3d_body.convert_weights \
Applied in 2deac90.
(Prince) TODO: remove in favour of sanitize at model level
Addressed in 2deac90 — Model.sanitize() now owns all renames with a backward-compat canary keyed on backbone.encoder.cls_token / character_torch.*. convert_weights.py keeps its rename logic for existing pipelines; its docstring points at sanitize() as the canonical naming source.
Blaizzy left a comment
Overall LGTM, just missing the video file reported by @blueridanus, then we can merge.
Addresses the two blocking review comments on PR Blaizzy#922:
* Commits the previously uncommitted video.py referenced in the README (@blueridanus, @Blaizzy).
* Rewrites CLI snippets from `python -m sam3d_mlx...` to the proper `python -m mlx_vlm.models.sam3d_body...` namespace.

Adds overlay.py with single-image parity to the video pipeline:
* draw_skeleton_overlay: pure OpenCV, no extra deps
* render_mesh_overlay: photorealistic mesh via pyrender + trimesh (lazy imports so the module loads without them)
* load_faces: reads head_pose.faces from safetensors with a cached faces.npy alongside the weights

README also gets a Future Work section documenting the deferred hand-pose refinement pass (see Blaizzy's feedback on hand module status).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
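The lazy-import note above (optional pyrender and trimesh, loaded only when mesh rendering is requested) can be sketched roughly as follows. This is an assumption about the general pattern, not the contents of overlay.py: `require_optional` and `render_mesh_overlay_stub` are hypothetical names, and no rendering is performed.

```python
import importlib


def require_optional(module_name: str, feature: str):
    """Import an optional dependency on first use, with a clear error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} needs the optional package '{module_name}'"
        ) from exc


def render_mesh_overlay_stub():
    """Hypothetical stand-in showing where the lazy imports would sit."""
    # Nothing heavy is imported until this function is actually called,
    # so importing the enclosing module works without either package.
    trimesh = require_optional("trimesh", "render_mesh_overlay")
    pyrender = require_optional("pyrender", "render_mesh_overlay")
    return trimesh, pyrender  # a real renderer would build and draw a scene here
```

With this layout, a skeleton-only code path that depends on OpenCV alone never touches the optional packages.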
Addresses Prince's TODO in convert_weights.py ("remove in favour of
sanitize at model level"). Key naming is now owned by Model.sanitize(),
which mlx_vlm.utils.load() calls automatically — so weights can be loaded
from HF Hub without any bespoke converter step.
sanitize() detects raw PyTorch checkpoint format via a canary
(`backbone.encoder.cls_token` or `character_torch.*`) and, when present,
runs the full rename pipeline via _remap_raw_pytorch_keys():
* Fused QKV split: backbone.encoder.blocks.N.attn.qkv -> q_proj/k_proj/v_proj
* Backbone prefix rewrite: backbone.encoder.* -> backbone.*
* Conv2d layout: (O,I,H,W) -> (O,H,W,I) for patch_embed, mask_downscaling,
ray_cond_emb.conv
* MHR JIT prefix mapping: character_torch.* -> mhr.character.*, etc.
When the canary is absent (already-sanitized safetensors on disk), the
remap pipeline is skipped entirely — existing /tmp/sam3d-mlx-weights/
layouts load unchanged. Verified on a 50-key sample from an existing
converted checkpoint: 3 hand/skip-filter drops, 0 renames.
convert_weights.py is kept for now for backward compatibility; its docstring
adds a pointer to Model.sanitize() as the canonical key-naming source.
README's convert command switches to the new namespace to match the other
-m invocations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
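The canary check and rename pipeline described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's `Model.sanitize()`: it uses NumPy arrays in place of MLX arrays, the names `is_raw_pytorch` and `sanitize_sketch` are hypothetical, the fused QKV weight is assumed to stack q, k, v along axis 0 in that order, only `patch_embed` is handled for the Conv2d layout, and the MHR prefix mapping is omitted.

```python
import numpy as np

CANARY_KEY = "backbone.encoder.cls_token"
CANARY_PREFIX = "character_torch."


def is_raw_pytorch(weights: dict) -> bool:
    """True if the keys still use the raw PyTorch checkpoint naming."""
    return CANARY_KEY in weights or any(k.startswith(CANARY_PREFIX) for k in weights)


def sanitize_sketch(weights: dict) -> dict:
    if not is_raw_pytorch(weights):
        return dict(weights)  # already converted: skip the remap entirely
    out = {}
    for key, value in weights.items():
        # Backbone prefix rewrite: backbone.encoder.* -> backbone.*
        if key.startswith("backbone.encoder."):
            key = "backbone." + key[len("backbone.encoder."):]
        if ".attn.qkv." in key:
            # Fused QKV split (assumed q, k, v order along axis 0).
            q, k, v = np.split(value, 3, axis=0)
            for name, part in (("q_proj", q), ("k_proj", k), ("v_proj", v)):
                out[key.replace(".attn.qkv.", f".attn.{name}.")] = part
        elif "patch_embed" in key and value.ndim == 4:
            # Conv2d layout: (O, I, H, W) -> (O, H, W, I)
            out[key] = value.transpose(0, 2, 3, 1)
        else:
            out[key] = value
    return out
```

Keying the whole remap on a canary makes the function safe to call on weights that were already converted, which matches the commit's claim that existing converted layouts load unchanged.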
Thanks for the review @blueridanus @Blaizzy — update pushed (rebased onto the latest merge from main). New commits:
Addressed
Added
Let me know if anything wants tweaking.





SAM 3D Body — monocular 3D body mesh on Apple Silicon via MLX
Adds SAM 3D Body (Meta Research) as a new model in mlx-vlm, running entirely on Apple Silicon via MLX. Produces 18,439-vertex body meshes + 70 3D keypoints from a single RGB image.
Architecture
PyTorch Parity (verified)
9 bugs fixed to achieve numerical match with the PyTorch/CUDA reference:
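Body model (3 critical):
- Remove parameter_limits from inference (JIT model skips it, was causing ~17mm FK errors)
- Fix scale: exp(dof × ln(2)) not 1 + dof
- Fix pose correctives: 750D 6D rotation features not 889D raw DOFs

Pipeline (6 convention fixes):
- CLIFF scale: bbox_width * 1.25 matching GetBBoxCenterScale(padding=1.25)
- Default focal: sqrt(H²+W²) (image diagonal)
- Pelvis norm: average kp3d[9,10] (not joint_coords[0])
- Projection: use cam_int focal when available

Result: Body model matches PyTorch JIT within 0.0001mm across all 18,439 vertices.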
Benchmark (M3 Max, 4s clip @ 60fps)
Files added
🤖 Generated with Claude Code