feat: Add SAM 3D Body — monocular 3D body mesh on Apple Silicon #922

Merged: Blaizzy merged 10 commits into Blaizzy:main from shihwesley:feat/sam-3d-body on May 6, 2026
@shihwesley shihwesley commented Apr 4, 2026

SAM 3D Body — monocular 3D body mesh on Apple Silicon via MLX

Adds SAM 3D Body (Meta Research) as a new model in mlx-vlm, running entirely on Apple Silicon via MLX. Produces 18,439-vertex body meshes + 70 3D keypoints from a single RGB image.

Architecture

  • Backbone: DINOv3 ViT-H+ (32 blocks, 1280 dim, SwiGLU FFN, RoPE)
  • Decoder: 6-layer cross-attention transformer with LaPE, iterative refinement
  • Body model: MHR (Meta Human Representation) — FK, LBS, pose correctives, blend shapes

PyTorch Parity (verified)

9 bugs fixed to achieve numerical match with the PyTorch/CUDA reference:

Body model (3 critical):

  • Removed parameter_limits from inference (the JIT model skips it; applying it was causing ~17mm FK errors)
  • Fixed scale formula: exp(dof × ln(2)) not 1 + dof
  • Fixed pose correctives input: 750D 6D rotation features, not 889D raw joint DOFs
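The scale fix can be illustrated with a small sketch. It assumes the scale DOF is a single scalar; the function names are illustrative and not taken from the MHR code. Since exp(dof × ln 2) equals 2^dof, the two formulas agree only at dof = 0 and dof = 1.

```python
import math


def scale_fixed(dof: float) -> float:
    """Scale as described in the fix: exp(dof * ln 2), i.e. 2 ** dof."""
    return math.exp(dof * math.log(2.0))


def scale_old(dof: float) -> float:
    """The earlier formula that the fix replaces: 1 + dof."""
    return 1.0 + dof


# Compare the two formulas over a few DOF values.
for dof in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"dof={dof:+.1f}  fixed={scale_fixed(dof):.4f}  old={scale_old(dof):.4f}")
```

The exponential form stays strictly positive for any DOF value, whereas 1 + dof reaches zero at dof = -1 and goes negative below it.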

Pipeline (6 convention fixes):

  • CLIFF scale: bbox_width * 1.25 matching GetBBoxCenterScale(padding=1.25)
  • Default focal length: sqrt(H²+W²) (image diagonal)
  • Pelvis normalization: average of left/right hip keypoints
  • Perspective projection: uses cam_int focal when available
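The conventions listed above can be sketched in plain Python. This is a minimal illustration, not the code in batch_prep.py or camera.py: the function names are hypothetical, the hip indices 9 and 10 come from the commit message for this fix, and the 3x3 layout assumed for cam_int (focal length at row 0, column 0) is an assumption.

```python
import math


def default_focal_length(height: int, width: int) -> float:
    """Default focal length in pixels: the image diagonal sqrt(H^2 + W^2)."""
    return math.hypot(height, width)


def pick_focal(height: int, width: int, cam_int=None) -> float:
    """Use the intrinsics' focal length when given, else the default.

    Assumes cam_int is a 3x3 nested list with fx at [0][0].
    """
    if cam_int is not None:
        return float(cam_int[0][0])
    return default_focal_length(height, width)


def cliff_scale(bbox_width: float, padding: float = 1.25) -> float:
    """Crop scale for CLIFF conditioning: bbox width times the padding."""
    return bbox_width * padding


def pelvis_center(kp3d, hip_a: int = 9, hip_b: int = 10):
    """Pelvis as the midpoint of the two hip keypoints."""
    return tuple((a + b) / 2.0 for a, b in zip(kp3d[hip_a], kp3d[hip_b]))


def project(point, focal: float, cx: float, cy: float):
    """Pinhole projection of a camera-space point (x, y, z) to pixels."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)
```

A caller would typically subtract pelvis_center(kp3d) from every keypoint before comparing poses, and pass pick_focal(...) as the focal argument of project.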

Result: Body model matches PyTorch JIT within 0.0001mm across all 18,439 vertices.

Benchmark (M3 Max, 4s clip @ 60fps)

| Metric | PyTorch/MPS | MLX | Speedup |
|---|---|---|---|
| Inference median | 667ms | 488ms | 1.37x |
| Inf + mesh render | 882ms | 701ms | 1.26x |
| Model load | 9.7s | 5.0s | 2.0x |
| FPS (inference) | 1.5 | 2.1 | |

Files added

mlx_vlm/models/sam3d_body/
├── backbone.py          # DINOv3 ViT-H+ with SwiGLU, RoPE, LayerScale
├── batch_prep.py        # Image cropping, normalization, CLIFF conditioning
├── camera.py            # Ray map computation for camera conditioning
├── config.py            # Model configuration
├── convert_weights.py   # PyTorch checkpoint → MLX safetensors
├── decoder.py           # Cross-attention transformer decoder
├── generate.py          # Single-image prediction API
├── language.py          # mlx-vlm integration stub
├── layers.py            # SwiGLU FFN, LayerNorm32
├── mhr_body.py          # MHR body model (FK, LBS, blend shapes, pose correctives)
├── mhr_head.py          # MHR head (pose decoding, body model interface)
├── mhr_utils.py         # Rotation math (euler, quaternion, 6D)
├── model.py             # Full SAM3DBody model with iterative decoding
├── prompt_encoder.py    # Point/box prompt encoding
├── rope.py              # 2D Rotary Position Embeddings
├── transformer.py       # Decoder transformer layers
└── vision.py            # mlx-vlm integration stub

🤖 Generated with Claude Code

Port of Meta's SAM 3D Body to MLX for Apple Silicon inference.
Takes a single RGB image + person bbox, outputs SMPL-compatible
body mesh vertices (18,439), 70 3D keypoints, and camera params.

Architecture: DINOv3 ViT-H+ backbone → ray-conditioned features →
transformer decoder → MHR parametric body model.

Includes:
- Full model (backbone, decoder, MHR head, prompt encoder)
- Weight converter (PyTorch checkpoint → safetensors)
- Predictor API (generate.py) matching SAM 3/3.1 pattern
- mlx-vlm exports (Model, ModelConfig, VisionModel, LanguageModel)

Known limitations:
- scatter_add uses numpy round-trip (no native MLX op)
- Hand model inference not yet supported (body only)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

shihwesley commented Apr 5, 2026

Results

Target (PyTorch/MPS — full mesh render)

[image: target_pytorch_mesh]

Current (MLX on Apple Silicon — mesh overlay)

[images: current_mlx_mesh, current_mlx_skeleton]

Wesley Shih and others added 3 commits April 5, 2026 11:20
Body model (3 critical):
- Remove parameter_limits from inference (JIT model skips it)
- Fix scale: exp(dof * ln(2)) not 1 + dof
- Fix pose correctives: 750D 6D rotation features not 889D raw DOFs
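As background for the 6D rotation features mentioned above: at 6 values per rotation, 750 features correspond to 125 rotations. The sketch below uses the common convention of keeping the first two columns of each rotation matrix and recovering the third by Gram-Schmidt; whether MHR flattens columns in exactly this order is an assumption, and all function names are illustrative.

```python
import math


def rotation_z(angle: float):
    """3x3 rotation matrix (row-major nested lists) about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotmat_to_6d(R):
    """Flatten the first two columns of R into a 6-vector (assumed order)."""
    return [R[i][0] for i in range(3)] + [R[i][1] for i in range(3)]


def sixd_to_rotmat(v):
    """Rebuild a rotation matrix from 6D features via Gram-Schmidt."""
    a, b = v[:3], v[3:]
    norm_a = math.sqrt(sum(x * x for x in a))
    e1 = [x / norm_a for x in a]
    # Remove the component of b along e1, then normalize.
    dot = sum(x * y for x, y in zip(e1, b))
    u = [y - dot * x for x, y in zip(e1, b)]
    norm_u = math.sqrt(sum(x * x for x in u))
    e2 = [x / norm_u for x in u]
    # Third axis is the cross product e1 x e2.
    e3 = [
        e1[1] * e2[2] - e1[2] * e2[1],
        e1[2] * e2[0] - e1[0] * e2[2],
        e1[0] * e2[1] - e1[1] * e2[0],
    ]
    return [[e1[i], e2[i], e3[i]] for i in range(3)]


# 125 rotations at 6 values each give the 750-dimensional input.
features = []
for j in range(125):
    features.extend(rotmat_to_6d(rotation_z(0.01 * j)))
print(len(features))
```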

Pipeline (6 convention fixes):
- CLIFF scale: bbox_width * 1.25 (not max(w,h) * 1.2)
- Default focal: sqrt(H^2+W^2) (image diagonal)
- Pelvis norm: average kp3d[9,10] (not joint_coords[0])
- Projection: use cam_int focal when available

After fix: body model matches PyTorch JIT within 0.0001mm.
Benchmark: MLX 1.37x faster than PyTorch/MPS on M3 Max.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Follows the sam3_1 README structure. Covers architecture (DINOv3 backbone,
transformer decoder, MHR body model with FK/skinning), quick start,
CLI usage, weight conversion, benchmarks on M3 Max 36GB, and file layout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Point to facebookresearch/sam-3d-body instead of placeholder URL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

shihwesley commented Apr 6, 2026

Updated results (post bug fixes)

After fixing 3 body model bugs (parameter_limits at inference, scale formula, pose correctives input), vertex error dropped from ~17mm to <0.001mm vs PyTorch.

[images: skeleton, mesh]

Inference: ~490ms/frame on M3 Max 36GB.

Adds previously-untracked estimator.py so the high-level
SAM3DBodyEstimator API is actually part of the PR, not just the
author's local working tree.

mhr_head now returns pred_shape alongside pred_model_params in its
forward output, and the estimator surfaces pred_pose (model_params
[:136]) and pred_shape as numpy arrays on the returned dict. These
are the parametric body outputs downstream rigging, animation, and
inverse-kinematics code needs — matching the parity surface of the
reference PyTorch implementation.

Verified end-to-end via a synthetic-image smoke test (estimator
loads, predicts, deterministic, all shapes match expected).
shihwesley pushed a commit to shihwesley/SamPlaysBaseball that referenced this pull request Apr 9, 2026
Delete 18 duplicate model files and replace sam3d_mlx with a thin
sys.modules facade aliasing mlx_vlm.models.sam3d_body.* under the
sam3d_mlx.* namespace. Every existing callsite across tests, scripts,
backend, and benchmarks resolves through the facade unchanged —
SAM3DBodyEstimator.__module__ is now mlx_vlm.models.sam3d_body.estimator
with no wrapping layer, so class identity, isinstance, and pickling
all work across the boundary.
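The facade idea described above can be sketched with the standard library alone. The module names below (upstream_estimator, legacy_estimator) are hypothetical stand-ins with flat names to keep the example short; the real facade aliases dotted submodule names, which also requires registering the parent packages.

```python
import importlib
import pickle
import sys
import types

# Stand-in for the upstream module; "upstream_estimator" is a made-up name.
upstream = types.ModuleType("upstream_estimator")
upstream.Estimator = type("Estimator", (), {"__module__": "upstream_estimator"})
sys.modules["upstream_estimator"] = upstream

# The facade: register the very same module object under the legacy name.
sys.modules["legacy_estimator"] = sys.modules["upstream_estimator"]

legacy = importlib.import_module("legacy_estimator")

# Both names resolve to one module object, so class identity is preserved.
assert legacy is upstream
assert legacy.Estimator is upstream.Estimator
assert isinstance(legacy.Estimator(), upstream.Estimator)

# Pickle records the class by its real module path, not by the alias.
restored = pickle.loads(pickle.dumps(legacy.Estimator()))
assert type(restored) is upstream.Estimator
```

Because no wrapper class or re-export is involved, anything that checks `__module__` sees the upstream path regardless of which name was imported.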

The baseball-specific glue stays local:
  __init__.py        — the sys.modules facade
  __main__.py        — CLI (python -m sam3d_mlx)
  sam31_detector.py  — SAM 3.1 pitcher detector wrapper
  video.py           — per-pitch video I/O and skeleton rendering

Model code now lives upstream in mlx-vlm on branch feat/sam-3d-body
(PR Blaizzy/mlx-vlm#922). Pinned via a new requirements.txt using a
git+ URL. For local dev, install editable:

    pip install -e /path/to/mlx-vlm --break-system-packages

Swap the git pin for a released version after PR 922 merges.

Verification: tests/test_e2e.py still at 12/13 — the 2 failures are
the pre-existing stale-key assertions around pred_pose/pred_shape,
unchanged from the pre-decoupling baseline and not introduced here.

Also adds tests/test_pr_smoke.py — minimal inference smoke test that
imports directly from mlx_vlm.models.sam3d_body to validate the
upstream side of the split independently of the facade.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@blueridanus blueridanus left a comment

Cool! Works on my machine. Some thoughts:

  • the hand pass is not implemented - left for a future PR? worth a mention
  • mesh overlay for the single image inference would be useful as well (like you have for video)

Comment thread mlx_vlm/models/sam3d_body/README.md Outdated
SAM 3D Body weights come as a PyTorch `.ckpt` plus a TorchScript JIT `.pt` file for the MHR body model. The converter handles both:

```bash
python -m sam3d_mlx.convert_weights \
```

All those one-liners in the README won't run as-is, i.e. you'll want python -m mlx_vlm.sam3d_body.convert_weights etc. instead.

Contributor Author

Done in a2da361 and 2deac90 — all python -m sam3d_mlx... CLI snippets rewritten to python -m mlx_vlm.models.sam3d_body....

├── batch_prep.py # Crop, resize, ImageNet normalize, CLIFF conditioning
├── estimator.py # SAM3DBodyEstimator — preprocessing + inference + OBJ export
├── generate.py # SAM3DPredictor — mlx-vlm from_pretrained/predict API
├── video.py # Video pipeline with skeleton overlay rendering

-> video.py
This hasn't been committed.

Owner

@shihwesley this is indeed missing

Contributor Author

Committed in a2da361.

Contributor Author

Shipped in a2da361, thanks for flagging.

Comment thread mlx_vlm/models/sam3d_body/README.md Outdated
SAM 3D Body weights come as a PyTorch `.ckpt` plus a TorchScript JIT `.pt` file for the MHR body model. The converter handles both:

```bash
python -m sam3d_mlx.convert_weights \
```
Owner

@Blaizzy Blaizzy Apr 18, 2026

Suggested change
- python -m sam3d_mlx.convert_weights \
+ python -m mlx_vlm.sam3d_body.convert_weights \

Contributor Author

Applied in 2deac90.

Owner

(Prince) TODO: remove in favour of sanitize at model level

Contributor Author

Addressed in 2deac90: Model.sanitize() now owns all renames, with a backward-compat canary keyed on backbone.encoder.cls_token / character_torch.*. convert_weights.py keeps its rename logic for existing pipelines; its docstring points at sanitize() as the canonical naming source.

Owner

@Blaizzy Blaizzy left a comment

Overall LGTM, just missing the video file reported by @blueridanus, then we can merge

Wesley Shih and others added 2 commits April 22, 2026 16:56
Addresses the two blocking review comments on PR Blaizzy#922:
  * Commits the previously uncommitted video.py referenced in the README
    (@blueridanus, @Blaizzy).
  * Rewrites CLI snippets from `python -m sam3d_mlx...` to the proper
    `python -m mlx_vlm.models.sam3d_body...` namespace.

Adds overlay.py with single-image parity to the video pipeline:
  * draw_skeleton_overlay — pure OpenCV, no extra deps
  * render_mesh_overlay   — photorealistic mesh via pyrender + trimesh
                            (lazy imports so the module loads without them)
  * load_faces            — reads head_pose.faces from safetensors with
                            a cached faces.npy alongside the weights
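The lazy-import pattern mentioned for render_mesh_overlay can be sketched as follows. This is a generic illustration, not the code in overlay.py: the helper name require and the error wording are hypothetical.

```python
import importlib


def require(module_name: str, feature: str):
    """Import an optional dependency on first use, with a clear error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} needs the optional package '{module_name}'; "
            "install it to use this feature."
        ) from exc


def mesh_backends():
    """Resolve the heavy rendering dependencies only when actually needed.

    Importing the module that defines this function never touches
    pyrender or trimesh; only calling it does.
    """
    return require("pyrender", "Mesh overlay"), require("trimesh", "Mesh overlay")
```

With this layout, a skeleton-only code path can import the module on a machine without the rendering packages, and a missing dependency surfaces as a readable error at call time.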

README also gets a Future Work section documenting the deferred hand-pose
refinement pass (see Blaizzy's feedback on hand module status).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses Prince's TODO in convert_weights.py ("remove in favour of
sanitize at model level"). Key naming is now owned by Model.sanitize(),
which mlx_vlm.utils.load() calls automatically — so weights can be loaded
from HF Hub without any bespoke converter step.

sanitize() detects raw PyTorch checkpoint format via a canary
(`backbone.encoder.cls_token` or `character_torch.*`) and, when present,
runs the full rename pipeline via _remap_raw_pytorch_keys():
  * Fused QKV split: backbone.encoder.blocks.N.attn.qkv -> q_proj/k_proj/v_proj
  * Backbone prefix rewrite: backbone.encoder.* -> backbone.*
  * Conv2d layout: (O,I,H,W) -> (O,H,W,I) for patch_embed, mask_downscaling,
    ray_cond_emb.conv
  * MHR JIT prefix mapping: character_torch.* -> mhr.character.*, etc.

When the canary is absent (already-sanitized safetensors on disk), the
remap pipeline is skipped entirely — existing /tmp/sam3d-mlx-weights/
layouts load unchanged. Verified on a 50-key sample from an existing
converted checkpoint: 3 hand/skip-filter drops, 0 renames.
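The canary-gated rename pipeline described above can be sketched in pure Python, with nested lists standing in for arrays. This is an illustration under assumptions, not the body of Model.sanitize(): the function names, the exact output key spellings, and the equal three-way row split of the fused QKV tensor are all assumed for the example.

```python
def looks_raw(weights: dict) -> bool:
    """Canary: raw PyTorch checkpoints carry these key patterns."""
    return "backbone.encoder.cls_token" in weights or any(
        name.startswith("character_torch.") for name in weights
    )


def split_rows(rows, parts: int = 3):
    """Split a fused tensor into equal chunks along the first axis."""
    n = len(rows) // parts
    return [rows[i * n:(i + 1) * n] for i in range(parts)]


def oihw_to_ohwi(w):
    """Conv2d weight layout change: (O, I, H, W) -> (O, H, W, I)."""
    n_in, n_h, n_w = len(w[0]), len(w[0][0]), len(w[0][0][0])
    return [
        [[[w[o][i][h][x] for i in range(n_in)] for x in range(n_w)]
         for h in range(n_h)]
        for o in range(len(w))
    ]


def sanitize_sketch(weights: dict) -> dict:
    """Rename raw keys; pass already-converted weights through unchanged."""
    if not looks_raw(weights):
        return dict(weights)
    out = {}
    for key, value in weights.items():
        if key.startswith("character_torch."):
            key = "mhr.character." + key[len("character_torch."):]
        elif key.startswith("backbone.encoder."):
            key = "backbone." + key[len("backbone.encoder."):]
        if ".attn.qkv." in key:
            # Fused QKV -> separate q_proj / k_proj / v_proj entries.
            for name, part in zip(("q_proj", "k_proj", "v_proj"), split_rows(value)):
                out[key.replace(".attn.qkv.", f".attn.{name}.")] = part
        else:
            out[key] = value
    return out
```

Running sanitize_sketch on its own output is a no-op, because the renamed keys no longer match the canary; that is the property that lets previously converted checkpoints load unchanged.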

convert_weights.py is kept for now for backward compatibility; its docstring
adds a pointer to Model.sanitize() as the canonical key-naming source.
README's convert command switches to the new namespace to match the other
-m invocations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shihwesley
Contributor Author

Thanks for the review @blueridanus @Blaizzy — update pushed (rebased onto the latest merge from main).

New commits:

  • a2da361 — feat(sam3d_body): commit video.py and add single-image overlay helper
  • 2deac90 — refactor(sam3d_body): move weight renames into Model.sanitize()

Addressed

  1. Missing video.py (@blueridanus, @Blaizzy). Now committed at mlx_vlm/models/sam3d_body/video.py — the module the README references.

  2. README namespace (@blueridanus; @Blaizzy's suggested change). All python -m sam3d_mlx... invocations were rewritten to python -m mlx_vlm.models.sam3d_body.... Running each with --help confirmed that both .video and .convert_weights resolve.

  3. sanitize() at model level (@Blaizzy's TODO). Model.sanitize() now owns all key renaming: fused-QKV split, backbone.encoder.* prefix rewrite, Conv2d (O,I,H,W)→(O,H,W,I) for patch_embed/mask_downscaling/ray_cond_emb, and MHR JIT prefix remap (character_torch.* → mhr.character.*, etc.). A canary check keyed on backbone.encoder.cls_token / character_torch.* means already-converted safetensors pass through untouched; this was verified on an existing checkpoint (50-key sample: 3 filter drops, 0 renames). convert_weights.py keeps its rename logic for backward compat; its docstring now points at sanitize() as the canonical naming source.

Added

  1. Single-image mesh overlay (@blueridanus). New overlay.py with draw_skeleton_overlay (pure OpenCV, no extra deps) and render_mesh_overlay (pyrender + trimesh, lazy-imported so the module stays importable without them). Example added to the README's new Single-Image Overlay section.

  2. Hand pass deferral note (@blueridanus). New Future Work section explicitly calls out that decoder_hand.* / head_pose_hand.* weights are filtered out of this port and hand refinement is a follow-up PR. The full port plan is drafted; that PR will open once this one lands.

Let me know if anything wants tweaking.

@shihwesley shihwesley requested review from Blaizzy and blueridanus May 5, 2026 15:47
Owner

@Blaizzy Blaizzy left a comment

LGTM, thanks @shihwesley!

@Blaizzy Blaizzy merged commit cbbc56f into Blaizzy:main May 6, 2026
1 check passed
3 participants