Skip to content

Add V-JEPA 2 (Meta FAIR) distributed training test case#1035

Merged
KeitaW merged 18 commits into
mainfrom
feat/vjepa2-distributed-training
May 20, 2026
Merged

Add V-JEPA 2 (Meta FAIR) distributed training test case#1035
KeitaW merged 18 commits into
mainfrom
feat/vjepa2-distributed-training

Conversation

@paragao
Copy link
Copy Markdown
Contributor

@paragao paragao commented Mar 23, 2026

Summary

  • Add V-JEPA 2 (Meta FAIR) ViT-g/16 1B-parameter self-supervised video model as a new PyTorch distributed training test case
  • Includes Slurm (Pyxis/Enroot) and Kubernetes (PyTorchJob) deployment manifests
  • Benchmarked on 8x p5en.48xlarge (64x NVIDIA H200 GPUs)

What is V-JEPA 2?

V-JEPA 2 is Meta FAIR's self-supervised video model that learns visual representations by predicting masked video patches. It achieves state-of-the-art on motion understanding and human action anticipation benchmarks. The ViT-g/16 variant has 1.03B encoder parameters.

Files Added

3.test_cases/pytorch/vjepa2/
├── vjepa2.Dockerfile                      # NVIDIA PyTorch 25.03 base (CUDA 13, Python 3.11)
├── README.md                              # Full walkthrough with benchmark results
├── slurm/
│   ├── benchmark_training.sbatch          # 200-iter benchmark (8 nodes)
│   ├── launch_training.sbatch             # Full 800-epoch pre-training
│   └── download_dataset.sbatch            # SSv2 dataset preparation
├── kubernetes/
│   └── vjepa2-benchmark.yaml              # PyTorchJob for EKS clusters
├── configs/
│   ├── benchmark-vitg-8nodes.yaml         # Quick benchmark config
│   └── pretrain-vitg-256px-16f.yaml       # Full pre-training config
└── scripts/
    ├── run_train.py                       # Thin srun-compatible launcher
    ├── generate_synthetic_dataset.py      # Synthetic video generator
    ├── prepare_ssv2.py                    # SSv2 CSV preparation
    ├── parse_benchmark.py                 # Log parser for throughput/MFU
    └── test_decord.py                     # Verify decord video loading

Key Technical Details

Launch pattern: V-JEPA 2 uses srun directly (not srun + torchrun). The run_train.py launcher calls app.vjepa.train.main() directly, which reads SLURM_LOCALID/SLURM_NTASKS/SLURM_PROCID for distributed setup. This avoids a bug in app/main.py where its subprocess launcher passes world_size=1 regardless of SLURM configuration.

Dataset: Supports both Something-Something v2 (SSv2) real data and synthetic generated videos for benchmarking.

Benchmark Results (8x p5en.48xlarge, 64x H200)

Metric Value
Global batch size 1,536
Precision BF16
Peak GPU memory ~32.9 GB / 143 GB

Testing

Validated on ParallelCluster with 8x p5en.48xlarge nodes running Slurm + Pyxis/Enroot with EFA networking. Job ran 200 iterations to completion with all 64 ranks correctly initialized via NCCL over EFA.

paragao added 2 commits March 23, 2026 12:24
Add V-JEPA 2 (Meta FAIR) ViT-g/16 1B-param self-supervised video model as
a new PyTorch test case with Slurm and Kubernetes support.

Includes:
- Dockerfile based on nvcr.io/nvidia/pytorch:25.03-py3 (CUDA 13 + Python 3.11)
- Slurm sbatch scripts for benchmark (200 iters) and full pre-training (800 epochs)
- Kubernetes PyTorchJob manifest for EKS clusters
- Thin srun-compatible launcher (run_train.py) that calls app.vjepa.train.main()
  directly, avoiding the subprocess world_size=1 bug in app/main.py
- Synthetic dataset generator for benchmarking without SSv2 download
- SSv2 dataset preparation scripts and decord verification
- YAML configs for ViT-g/16 with DDP, BF16, and activation checkpointing
…ining

Add V-JEPA 2.1 (Meta FAIR) ViT-g/16 1B-param benchmark alongside the existing
V-JEPA 2 test case. V-JEPA 2.1 introduces Dense Predictive Loss, Deep
Self-Supervision (4 intermediate layers), doubled predictor depth (24 vs 12),
and image+video co-training with 50/50 rank split.

Includes:
- Dockerfile and Enroot container setup (shared base with V-JEPA 2)
- Slurm sbatch scripts with /workspace code overlay for latest vjepa2 repo
- Kubernetes PyTorchJob manifest for EKS clusters
- Synthetic image generator for co-training benchmarks
- run_train.py launcher using app.scaffold.main() for dynamic dispatch
- YAML configs with img_data, img_mask, and rank_ratio settings

Key discovery: the container must have the latest vjepa2 repo code (post
March 2026) for app/vjepa_2_1/ to be available. The sbatch scripts mount
updated code at /workspace to overlay the container's stale PYTHONPATH.
@paragao paragao force-pushed the feat/vjepa2-distributed-training branch from 11b8971 to 92abb8c Compare March 23, 2026 12:27
Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 1/3 — Structure & Repository Hygiene

Thanks for this thorough contribution, Paulo! The utility scripts and READMEs are excellent quality. I have some structural and reproducibility findings below.

Significant code duplication between vjepa2/ and vjepa2.1/

These two directories share a large amount of identical code:

  • scripts/generate_synthetic_dataset.py — identical (same git blob 74f922445)
  • scripts/parse_benchmark.py — identical (same git blob 957b9efdf)
  • scripts/prepare_ssv2.py — identical (same git blob 633288d17)
  • scripts/test_decord.py — identical (same git blob 4881d1647)
  • scripts/run_train.py — nearly identical (V-JEPA 2.1 adds 4 lines)
  • Dockerfiles — nearly identical structure
  • Slurm sbatch scripts — same structure, differing only in paths/config references

The repo convention says to "extend the existing test case — add platform-specific subdirectories, parameterize scripts for additional models, or add configuration variants — rather than creating a parallel directory tree with duplicated Dockerfiles, training scripts, and utilities."

I'd suggest consolidating into a single vjepa2/ directory that supports both V-JEPA 2 and 2.1 via different configs. The run_train.py launcher already dispatches based on the app field in the config (vjepa vs vjepa_2_1), so both versions can share the same launcher, scripts, Dockerfile, and sbatch templates. The V-JEPA 2.1 additions (image co-training, synthetic image generator) would simply add to the existing directory.

Missing license headers on README and config files

Both README.md files and all 4 configs/*.yaml files are missing license headers. The Slurm scripts, Python files, K8s manifests, and Dockerfiles all have them, so this is just an oversight. I'd suggest adding the standard header as a YAML comment in configs and HTML comment in READMEs.

Comment thread 3.test_cases/pytorch/vjepa2/vjepa2.Dockerfile Outdated
Comment thread 3.test_cases/pytorch/vjepa2/vjepa2.Dockerfile Outdated
Comment thread 3.test_cases/pytorch/vjepa2/kubernetes/vjepa2-benchmark.yaml Outdated
Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 2/3 — Deployment Pipeline

Comment thread 3.test_cases/pytorch/vjepa2/kubernetes/vjepa2-benchmark.yaml Outdated
Comment thread 3.test_cases/pytorch/vjepa2/vjepa2.Dockerfile Outdated
Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 3/3 — Documentation Consistency


Things That Look Great

  • Comprehensive utility scripts: The synthetic data generators (video and image), SSv2 CSV preparer, benchmark log parser, and decord test script form a complete toolkit that makes this test case truly self-contained.
  • Excellent README documentation: Both READMEs walk through every step from dataset prep to result parsing, with clear architecture notes explaining the srun direct launch pattern and why app/main.py doesn't work with SLURM.
  • Smart launch pattern: Using app.scaffold.main() to dispatch based on the config's app field is elegant and avoids the world_size=1 bug in app/main.py.
  • Proper license headers on most files: Scripts, Dockerfiles, sbatch files, and K8s manifests all have the standard copyright header.
  • HyperPod auto-resume detection: The if [ -d "/opt/sagemaker_cluster" ] pattern in sbatch scripts correctly detects HyperPod clusters and enables auto-resume.
  • Both Slurm and Kubernetes deployment paths: Providing PyTorchJob manifests alongside Slurm scripts makes this accessible to EKS-based clusters too.
  • Well-structured config separation: Benchmark configs (200 iterations, no checkpointing) vs. full pre-training configs (800+ epochs, regular checkpoints) give users clear starting points for different use cases.
  • V-JEPA 2.1 comparison table: The feature comparison table in the V-JEPA 2.1 README clearly explains what changed between versions.

Comment thread 3.test_cases/pytorch/vjepa2/README.md Outdated
Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments

paragao added 13 commits March 23, 2026 16:40
The Dockerfile-based container (pytorch:25.03-py3) ships NCCL 2.25 and
an older aws-ofi-nccl plugin that are incompatible with B200 EFA
networking.  The B200 scripts use a NeMo container with NCCL 2.29+ and
a matching OFI/EFA/libfabric stack instead, with V-JEPA dependencies
installed to shared storage and added to PYTHONPATH at runtime.
Benchmarking on B200 revealed that 5,000 synthetic samples caused frequent
data loader re-initialization between epochs, inflating V-JEPA 2.1 iteration
times by up to 4x (15,300ms vs 4,075ms with 50K samples). V-JEPA 2 was less
affected but still improved from 1,637ms to 1,457ms.

Changes:
- Default synthetic video count: 5,000 -> 50,000 (both V-JEPA 2 and 2.1)
- Default synthetic image count: 5,000 -> 50,000 (V-JEPA 2.1)
- Add OpenCV (cv2) fallback for video generation in environments without ffmpeg
- Add dataset sizing guidance to benchmark configs and READMEs
Add rank-selective nsys profiling infrastructure:
- nsys_wrapper.sh: profiles only rank 0 via SLURM_PROCID check
- nsys_profile_b200.sbatch: configurable via NSYS_PROFILE_DIR and CONFIG
  env vars to save each optimization phase to a separate folder
- Document profiling workflow in both READMEs
…n checkpointing

Provide an optimized benchmark config for B200 GPUs:
- compile_model: true for fused kernels (~20% GPU speedup)
- use_activation_checkpointing: false (trades ~95 GB vs ~33 GB memory)
- num_workers: 20 for higher data prefetch
Tested at 1,125 ms/iter vs 1,457 ms baseline (23% improvement).
BF16 has the same dynamic range as FP32, so GradScaler's loss scaling
is pure overhead. Monkey-patch GradScaler to enabled=False in both
run_train.py launchers when meta.dtype is bfloat16, eliminating the
scale/unscale/step/update cycle per iteration.
…higher throughput

Replace DDP with FSDP (SHARD_GRAD_OP / ZeRO-2) for the encoder and
target_encoder in V-JEPA 2.1, sharding gradients and optimizer states
across ranks. This saves ~15 GB/GPU, enabling activation checkpointing
to be disabled on B200 GPUs for higher throughput. The predictor remains
DDP-wrapped (small model, needs find_unused_parameters).
…_train.py

- Fix compile_model placement: move from meta: to model: section where
  upstream train.py actually reads it (was silently never enabled)
- Add env-var-driven optimizations to run_train.py: fused AdamW, TF32,
  compile mode override, gradient_as_bucket_view, prefetch_factor
- Add B200 optimization sweep sbatch (Phase A-D with nsys profiling)
- Add nsys profiling sbatch scripts for H200 (vjepa2 and vjepa2.1)
- Fix container-workdir from /vjepa2 to /workspace in benchmark sbatch
- Add .gitignore to exclude benchmarks/ and profiling/ from repo
…u_type flag

The 6*N*D FLOP formula overestimates training FLOPs by ~2x for JEPA architectures
because the context encoder only processes visible tokens (~15% of the sequence)
while the target encoder runs forward-only (no backward pass).

Replace with samples/sec as the primary throughput metric. Add --gpu_type flag
(h200/b200) with correct BF16 peak specs (989.4 / 2250 TFLOPS). Fix V-JEPA 2.1
script title and update README parse examples to use new flag.
…ctness, pinned versions

- Remove ,eth from NCCL_SOCKET_IFNAME exclusion list for correct TCP bootstrap
- Add missing MIT-0 license headers to .gitignore, README.md, and config YAMLs
- Change set -ex to set -euo pipefail in all sbatch and shell scripts
- Pin EFA_INSTALLER_VERSION to 1.47.0 in both Dockerfiles (was 'latest')
- Replace :latest image tags with :vjepa2 and :vjepa2.1 in K8s manifests
- Use yaml.SafeLoader instead of yaml.FullLoader in run_train.py and run_train_fsdp.py
…lone

- Pin all pip packages to tested versions from cluster container freeze
- Pin vjepa2 git clone to commit 204698b4 (latest as of Mar 23, 2026)
- Fix stale repository URL from aws-samples to awslabs in both READMEs
@paragao
Copy link
Copy Markdown
Contributor Author

paragao commented Apr 15, 2026

@KeitaW please check that everything has been properly addressed so we can merge this.

Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-Review Batch 1/3 — Acknowledgments + Structure & Repository Hygiene

Thanks Paulo for working through the previous round of feedback. Most of the smaller items from the first pass are now resolved — pinned EFA, pinned vjepa2 commit, pinned pip packages, license headers added, eth removed from NCCL_SOCKET_IFNAME, awslabs URL fix, :latest removed from K8s images, set -euo pipefail adopted, and yaml.SafeLoader swapped in. Nice cleanup pass.

The main item still outstanding is the structural one: vjepa2/ and vjepa2.1/ are still two parallel trees with mostly-identical Dockerfiles, scripts, and sbatch files. Several new files added in the recent commits (FSDP launcher, nsys wrappers, B200 sbatches, sweep configs) duplicate that surface again. I'd really like to land that consolidation before merge — it's the single biggest maintenance lever for this contribution.

What's been addressed since the last review

  • EFA pinned to 1.47.0 (meets CI minimum)
  • vjepa2 git clone pinned to commit 204698b4...
  • All pip packages pinned to tested versions
  • License headers added to READMEs, configs, and .gitignore
  • NCCL_SOCKET_IFNAME no longer excludes eth* (now ^docker,lo,veth everywhere)
  • set -euo pipefail adopted across sbatch scripts
  • K8s :latest replaced with :vjepa2 / :vjepa2.1 tags
  • Stale aws-samples URL replaced with awslabs in both READMEs
  • yaml.SafeLoader instead of yaml.FullLoader in launchers

Code duplication between vjepa2/ and vjepa2.1/ is still the headline issue

Looking at the current state, the duplication has actually grown since the last review. The two trees now share, in addition to the originally-flagged scripts, near-identical copies of:

  • vjepa2.Dockerfile and vjepa2_1.Dockerfile — the only meaningful difference is one extra pinned package (Pillow==12.0.0); the vjepa2_1.Dockerfile even comments at the top "V-JEPA 2.1 reuses the same container as V-JEPA 2, since both training apps live in the same facebookresearch/vjepa2 repository"
  • scripts/nsys_wrapper.sh — identical
  • slurm/benchmark_training_b200.sbatch, slurm/nsys_profile.sbatch, slurm/nsys_profile_b200.sbatch, slurm/optim_sweep_b200.sbatch — same skeleton, only paths and config refs differ
  • README sections on prerequisites, container build, parsing, and profiling

The vjepa2.1/README.md even reaches across the boundary several times — ## 2. Datasets says "See the V-JEPA 2 test case for SSv2 download" and the benchmark-results section points at ../vjepa2/benchmarks/vjepa2-benchmark-results.md. Cross-references like that are a strong tell that the directories want to be one.

run_train.py already dispatches via app.scaffold.main(params["app"], ...) based on the YAML's app field, so a single launcher (and Dockerfile, and sbatch templates) can serve both app: vjepa and app: vjepa_2_1 configs. I'd suggest restructuring as:

3.test_cases/pytorch/vjepa2/
├── README.md                    # covers both 2 and 2.1
├── vjepa2.Dockerfile            # single file, includes Pillow
├── configs/
│   ├── vjepa2/...               # app: vjepa configs
│   └── vjepa2.1/...             # app: vjepa_2_1 configs
├── scripts/
│   ├── run_train.py             # current scaffold-based launcher
│   ├── generate_synthetic_dataset.py
│   ├── generate_synthetic_images.py
│   ├── prepare_ssv2.py
│   ├── parse_benchmark.py
│   ├── nsys_wrapper.sh
│   └── test_decord.py
└── slurm/                       # one set of sbatches, parameterized via env vars

This is the same pattern called out in CONTRIBUTING.md ("extend the existing test case … rather than creating a parallel directory tree with duplicated Dockerfiles, training scripts, and utilities"). Happy to help if there's a structural reason this is harder than it looks.

Reconsider whether run_train_fsdp.py should ship at all

The FSDP path adds a 718-line bespoke training loop (vjepa2.1/scripts/run_train_fsdp.py), plus a config and sbatch script. The README itself says:

Benchmark finding: FSDP was found to be ~2x slower than baseline DDP in benchmarks (iter ~8,200 ms vs ~4,075 ms on video ranks). … The baseline DDP config is the recommended configuration for V-JEPA 2.1.

If the recommendation is "don't use this", I'd lean toward dropping the FSDP variant from the PR and reintroducing it later if/when it becomes useful. Carrying ~900+ lines of "for reference, but slower" code in awsome-distributed-training increases the maintenance surface (NCCL/PyTorch/EFA upgrades will need to keep it green) without users having a reason to run it. If you want to keep the finding documented, a short paragraph in the README is probably enough.

optim_sweep_b200.sbatch reads as developer experimentation, not a reference workflow

vjepa2.1/slurm/optim_sweep_b200.sbatch (+222 lines) and vjepa2/slurm/optim_sweep_b200.sbatch (+171 lines) iterate over phase configs and tweak env vars to sweep optimizations. They look like the harness you used to produce the benchmark numbers, not something a downstream user would run. The repo convention favors test cases that demonstrate a single, reproducible workflow; sweeps usually live in a personal branch or a README appendix.

I'd suggest either dropping these (along with the phase1/phase4 configs), or — if you want to keep the sweep methodology in-tree — moving it into a tools/ subdirectory with a README that frames it as an experimentation example, not a recommended config.

Comment thread 3.test_cases/pytorch/vjepa2.1/vjepa2_1.Dockerfile
Comment thread 3.test_cases/pytorch/vjepa2.1/slurm/optim_sweep_b200.sbatch Outdated
Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-Review Batch 2/3 — Deployment Pipeline (K8s / Slurm)

Three concrete issues introduced or surfaced by the most recent commits:

  1. The new set -euo pipefail interacts badly with bare ${PYTHONPATH} in the B200 sbatches — set -u will cause the script to exit before srun if PYTHONPATH is unset on the user's environment (the common case). Fix is the standard ${PYTHONPATH:-} guard, suggested inline on each affected file.
  2. The apt install libnccl-dev … || true line in both Dockerfiles silently swallows failures and doesn't pin a version, while CI requires NCCL >= 2.28 and the base nvcr.io/nvidia/pytorch:25.03-py3 ships NCCL 2.25 (per your own commit message c0fca12). Worth either dropping the || true so failures surface, or making it explicit that NCCL is provided by the base image.
  3. Minor: the K8s image tags vjepa2:vjepa2 / vjepa2:vjepa2.1 use the variant name as the tag, which is unusual. A version-flavored tag (e.g. mirroring the upstream commit pin or the base image date) sets a better template for users.

Comment thread 3.test_cases/pytorch/vjepa2.1/slurm/benchmark_training_b200.sbatch Outdated
Comment thread 3.test_cases/pytorch/vjepa2.1/slurm/benchmark_training_b200_fsdp.sbatch Outdated
Comment thread 3.test_cases/pytorch/vjepa2/slurm/benchmark_training_b200.sbatch Outdated
Comment thread 3.test_cases/pytorch/vjepa2/slurm/benchmark_training_b200_optimized.sbatch Outdated
Comment thread 3.test_cases/pytorch/vjepa2/vjepa2.Dockerfile Outdated
Comment thread 3.test_cases/pytorch/vjepa2.1/vjepa2_1.Dockerfile Outdated
Comment thread 3.test_cases/pytorch/vjepa2/kubernetes/vjepa2-benchmark.yaml Outdated
Comment thread 3.test_cases/pytorch/vjepa2.1/kubernetes/vjepa2-1-benchmark.yaml Outdated
Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-Review Batch 3/3 — Documentation Consistency + Things That Look Great

One cross-cutting documentation item, plus inline picks below:

B200 setup is documented only inside the sbatch comments

The B200 sbatches (vjepa2/slurm/benchmark_training_b200.sbatch, _optimized.sbatch, vjepa2.1/slurm/benchmark_training_b200.sbatch, _fsdp.sbatch) require a separate NeMo container, a cloned vjepa2_code/ directory on FSx, and a pip install --target /fsx/.../vjepa_deps step. None of this is mentioned in either README, so a user going directly through Build ContainerRun Benchmark with B200 instances will hit unexplained CONTAINER_IMAGE paths. I'd suggest adding a short B200 setup subsection to the README, or at least pointing to the sbatch's header comment from the README.


Things That Look Great

  • Genuine engagement with the previous review — all the small/medium items are visibly fixed, with commit messages that explain the why (e.g. 9a946379 calls out the cluster container freeze as the source for pinned versions).
  • The B200 vs H200 split is well-motivated — keeping the H200 Dockerfile path simple while running B200 against a NeMo container with a known-compatible NCCL/EFA stack is a pragmatic choice and the commit message c0fca12 explains it clearly.
  • Optimization knobs in run_train.py are tasteful — env-var-gated monkey-patches (VJEPA_TF32, VJEPA_FUSED_OPTIMIZER, VJEPA_GRAD_BUCKET_VIEW, VJEPA_PREFETCH_FACTOR) keep the upstream code unmodified while making perf experiments easy to compose.
  • GradScaler-for-BF16 fix with the subclass approach (so Apex's GradScaler still inherits cleanly) is a nice touch and documented inline.
  • .gitignore for benchmarks/ and profiling/ keeps generated artifacts out of the repo while still letting you keep the directory referenced from the README.
  • Honest documentation of negative results — calling out that FSDP was 2× slower and torch.compile + activation checkpointing was 55% slower is exactly the kind of empirical guidance that makes this repo useful (even if I think the FSDP code itself probably doesn't need to ship — see Batch 1).
  • Synthetic-dataset sizing guidance in the README (50K minimum, with the rationale about dataloader re-init) is the kind of subtle benchmarking pitfall that's exactly worth documenting.

Suggested priority before merge

  1. Consolidate vjepa2/ and vjepa2.1/ into a single test case (Batch 1) — the headline blocker, and the rationale for several other findings.
  2. Fix the set -u × ${PYTHONPATH} bug in the four B200 sbatches (Batch 2) — small but currently breaks the scripts on common environments.
  3. Decide what to do with run_train_fsdp.py and optim_sweep_b200.sbatch (Batch 1) — drop or relocate.
  4. Address the smaller documentation/Dockerfile items at your discretion.

Comment thread 3.test_cases/pytorch/vjepa2.1/README.md
Comment thread 3.test_cases/pytorch/vjepa2/README.md Outdated
Comment thread 3.test_cases/pytorch/vjepa2.1/README.md Outdated
KeitaW and others added 2 commits April 28, 2026 13:53
… to experimental

- Fix ${PYTHONPATH:-} in all B200 sbatch scripts to prevent unbound variable errors
- Deduplicate shared scripts (5 files) in vjepa2.1/ with symlinks to vjepa2/
- Parameterize parse_benchmark.py with --model_name flag for both V-JEPA 2/2.1
- Remove optim_sweep_b200.sbatch (developer tooling with hardcoded internal paths)
- Move FSDP code to experimental/ subdirectories (documented as ~2x slower than DDP)
- Add explanatory comments to Dockerfile || true for libnccl-dev
- Add B200 GPU setup prerequisites to both READMEs
- Fix dead references to gitignored benchmarks/ directory
@paragao
Copy link
Copy Markdown
Contributor Author

paragao commented May 1, 2026

@KeitaW addressed comments. I decided to maintain both vjepa2 and vjepa2.1 folders, as this is how they are tracked on the original forked repo. Those are two different approaches to world models, where vjepa2 only uses images and vjepa2.1 combines videos and images. All duplication removed.

Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 1/4 — Structure & Repository Hygiene (round 3)

Thanks for the round-2 follow-up commit (5ae3381a). The PYTHONPATH unbound-variable bug is fixed across the B200 sbatches, the developer-only optim_sweep_b200.sbatch is gone, the FSDP code has moved to experimental/, B200 prerequisites are now in both READMEs, and five utility scripts in vjepa2.1/scripts/ are now symlinks rather than copies. That's solid cleanup.

Two items I'd like to land before merge:

  1. Drop the experimental/ FSDP path entirely — relocating it didn't address the round-2 concern. If it's documented as slower than DDP, it's not a recommended workflow and shouldn't ship in awsome-distributed. Defer to a follow-up PR if/when FSDP becomes the recommendation. (Detailed in Batch 3.)
  2. Collapse the vjepa2/ and vjepa2.1/ directory split — symlinks for 5 scripts help, but the Dockerfiles, sbatches, configs, and READMEs are still parallel, and the 2.1 README now links across to the 2 README for several core sections.

A few smaller items (NCCL pin under the CI floor, K8s tag template, duplicate ## 7. heading, orphan phase configs, stale "see the V-JEPA 2 README for benchmark results" pointer) are easy fixes I'd like to land in the same pass.


Resolution Status of Prior Findings

# Finding (from round 2) Status
1 Code duplication between vjepa2/ and vjepa2.1/ 🟡 Partial — 5 scripts symlinked in 5ae3381a; Dockerfiles, sbatches, configs, READMEs still parallel
2 run_train_fsdp.py (718 lines) probably shouldn't ship ❌ Open — moved to scripts/experimental/ but kept in-tree; round-2 ask was to drop it, not to relocate it. Please remove and defer to a follow-up PR (see Batch 3).
3 optim_sweep_b200.sbatch is developer tooling ✅ Resolved — removed in 5ae3381a
4 set -euo pipefail + bare ${PYTHONPATH} ✅ Resolved — ${PYTHONPATH:-} guard added in all B200 sbatches
5 NCCL availability silent-failed and not pinned to CI floor 🟡 Partial — comment added but base image still ships NCCL 2.25 (below CI floor 2.28). See Batch 2.
6 K8s image tag vjepa2:vjepa2.1 is awkward ❌ Open — still present. See Batch 2.
7 Per-GPU video batch size reported as both "24" and "48" ❌ Open — still inconsistent. See Batch 4.
8 Reference to gitignored benchmark file is dead 🟡 Partial — direct path removed but README now points readers to a section in vjepa2/README.md that doesn't contain results. See Batch 4.
9 B200 path documented only inside sbatch comments ✅ Resolved — Section 7 in vjepa2/README.md, Section 6 in vjepa2.1/README.md

Legend: ✅ Resolved · 🟡 Partial · ❌ Open · ➖ N/A (scope changed)


Cross-cutting: duplication — symlinks cover ~5 files, but the parallel-tree problem is structurally still here

Files: 3.test_cases/pytorch/vjepa2/ and 3.test_cases/pytorch/vjepa2.1/

The symlink pass in 5ae3381a is a real improvement — generate_synthetic_dataset.py, nsys_wrapper.sh, parse_benchmark.py, prepare_ssv2.py, and test_decord.py are no longer duplicated. But the structural duplication that round 2 flagged is still mostly intact:

  • Dockerfiles: vjepa2.Dockerfile and vjepa2_1.Dockerfile differ by one line (Pillow==12.0.0) and a comment header. The vjepa2_1.Dockerfile even says in its own header "V-JEPA 2.1 reuses the same container as V-JEPA 2, since both training apps live in the same facebookresearch/vjepa2 repository" — which reads as a strong signal it shouldn't be a separate file.
  • Slurm scripts: vjepa2/slurm/ has 7 sbatches and vjepa2.1/slurm/ has 5 (+1 in experimental/). The B200 variants are near-identical skeletons differing only in paths, job names, and a CODE_MOUNT=/workspace overlay for 2.1.
  • Configs: 3 configs in vjepa2/configs/ and 5 in vjepa2.1/configs/ (including two phase configs flagged inline).
  • READMEs: Both files repeat the prerequisite list, build-container steps, and benchmarking workflow; the 2.1 README also reaches across the boundary to ../vjepa2/README.md#2-dataset-something-something-v2-ssv2, ../vjepa2/README.md#7-b200-gpu-setup, and ../vjepa2/README.md#10-data-loading-optimization-torchcodec.

I'd really like to land the consolidation suggested in round 2 before merge — a single vjepa2/ test case with the app field in YAML configs selecting between vjepa and vjepa_2_1 (which run_train.py already supports via app.scaffold.main()). The current state asks future readers — and future maintainers when EFA / NCCL / PyTorch versions bump — to keep two near-mirrored trees in sync, and the cross-README links make it impossible to treat either directory as self-contained. The repo convention in CONTRIBUTING.md is explicit on this: extend the existing test case rather than mirror it.

If you'd like a smaller intermediate step, even just collapsing the two Dockerfiles into one (and dropping vjepa2_1.Dockerfile entirely) would remove one of the highest-risk drift points. Happy to pair on this if useful.


## Shared Scripts

Several utility scripts (`generate_synthetic_dataset.py`, `nsys_wrapper.sh`, `prepare_ssv2.py`, `test_decord.py`) are shared with the [V-JEPA 2 test case](../vjepa2/) and symlinked from `../vjepa2/scripts/`. Ensure both directories are present when cloning.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Symlinks for cross-directory script sharing have rough edges

Symlinks deduplicate the bytes but introduce three caveats worth knowing about:

  1. Windows clones: Git on Windows requires core.symlinks=true (off by default) — otherwise the symlinks land as 50-byte text files containing the target path. Anyone running git clone on a Windows workstation and copying the test case to a cluster will silently get broken scripts.
  2. Self-contained convention: The README already calls out "Ensure both directories are present when cloning" on this line. That instruction is the tell — it means the 2.1 test case can no longer be used without the sibling directory, which conflicts with the repo's "each test case must be self-contained" convention.
  3. Container COPY: If any future Dockerfile does COPY scripts/ /vjepa2/scripts/, the symlinks will be resolved before COPY in some BuildKit modes and not in others — easy to hit a "works on my machine" bug.

I'd suggest treating the symlinks as a temporary measure that the structural consolidation (see batch body) replaces cleanly. If consolidation isn't on the table for this PR, at minimum I'd surface the Windows-symlinks caveat in the README so it doesn't bite a reader.

@@ -0,0 +1,168 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orphan phase configs in vjepa2.1/configs/

optim_sweep_b200.sbatch was the script that consumed phase1 and phase4 configs, and it was removed in 5ae3381a. The configs are still there, but nothing in slurm/, scripts/, or either README references them as runnable workflows. They're effectively documentation-only snapshots of mid-sweep tuning state at this point.

I'd suggest either:

  • Removing them along with the sweep script (they're recoverable from git history if you want to revisit), or
  • Folding them into the README as an "optimization journey" appendix with a sentence each explaining what they were testing.

Carrying named YAML files in configs/ implies they're consumable configurations, which they're not after the sweep harness left. Same applies to benchmark-vitg-8nodes-phase4.yaml.

Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 2/4 — Deployment Pipeline (K8s / Slurm) (round 3)

Two items carried over from round 2 — the NCCL version mismatch with the CI floor, and the awkward K8s image tag template. Inline.

# Install NCCL dev headers for EFA plugin compilation.
# The base NVIDIA image already provides libnccl.so; this is best-effort
# to get headers and will not fail the build if unavailable.
RUN apt-get update && apt-get install -y libnccl-dev && rm -rf /var/lib/apt/lists/* || true
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NCCL pin still gaps the CI floor (>= 2.28)

The follow-up commit added a helpful comment to this line, which is good. But the underlying issue from round 2 is unchanged: the base image (nvcr.io/nvidia/pytorch:25.03-py3) ships NCCL 2.25, and the repo's CI version check requires NCCL >= 2.28. Your own commit message in c0fca12 calls this out — "the standard container ships NCCL 2.25 ... B200 scripts use a NeMo container with NCCL >= 2.29 instead" — and vjepa2/README.md:205 repeats the same statement for users.

So the H200 path is shipping a container that doesn't satisfy the repo's NCCL floor. Two ways to fix:

  1. Bump the base image to one that ships NCCL 2.28+ (e.g., a later nvcr.io/nvidia/pytorch:25.xx-py3 tag), or
  2. Rebuild NCCL from source in the Dockerfile to the pinned version (the pattern most other GPU Dockerfiles in this repo use — see micro-benchmarks/nccl-tests/nccl-tests.Dockerfile).

The current apt-get install libnccl-dev || true won't fix it — the base image's libnccl.so is still 2.25 regardless of what dev headers are installed alongside it.

Same change needed in vjepa2.1/vjepa2_1.Dockerfile:25.

containers:
- name: vjepa2
# Replace with your ECR image URI
image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/vjepa2:vjepa2
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

K8s image tag pattern vjepa2:vjepa2 is still confusing

Carrying over from round 2. Using the variant name as the tag with the same string as the repo (vjepa2:vjepa2) is unusual — most readers will expect the variant or a version to be the tag, not duplicated. Since the Dockerfile pins VJEPA2_COMMIT=204698b4..., a much clearer template would be a commit-derived or base-image-derived tag:

Suggested change
image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/vjepa2:vjepa2
image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/vjepa2:204698b4

Or a date tag mirroring the base image (vjepa2:25.03-py3). Same fix in vjepa2.1/kubernetes/vjepa2-1-benchmark.yaml:41 (current vjepa2:vjepa2.1vjepa2:204698b4 or similar). Minor, but it sets the right copy-paste template for users who'll likely tag with :latest or :dev otherwise.

Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 3/4 — Training Code Quality (round 3)

One item — the experimental FSDP path. Inline.

@@ -0,0 +1,718 @@
#!/usr/bin/env python3
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please drop the experimental/ FSDP path from this PR and defer to a follow-up

Relocating to experimental/ reads as a compromise on round 2's ask, but I'd rather we just delete the experimental code from this PR entirely. The experimental/README.md itself says the FSDP launcher is "preserved for reference and future experimentation only" and that DDP is the production recommendation — which is the same argument for not landing it in awsome-distributed at all. As written, this is ~900 LOC (script + sbatch + config + README) that the repo has to keep building across NCCL/PyTorch/EFA bumps in exchange for a path that's documented as 2× slower.

Concretely, I'd like to see in the next push:

  1. rm -r 3.test_cases/pytorch/vjepa2.1/scripts/experimental/
  2. rm -r 3.test_cases/pytorch/vjepa2.1/configs/experimental/
  3. rm -r 3.test_cases/pytorch/vjepa2.1/slurm/experimental/
  4. In vjepa2.1/README.md, drop the ## 7. FSDP Experimental Config (B200) section (which also resolves the duplicate-## 7. heading in Batch 4) — if you want the finding documented, a one-paragraph note saying "FSDP was benchmarked at ~2× slower than DDP for V-JEPA 2.1 ViT-g/16" under Architecture Notes is plenty.
  5. When FSDP becomes the recommended path (or is at parity with DDP), open a follow-up PR that brings just the launcher + a config + a sbatch script back, framed as a supported workflow rather than a "for reference" snapshot.

This is the standard "experimental code belongs in a feature branch, not in main" position — production reference architectures should reflect what users should actually run, not the search history that produced them.

Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 4/4 — Documentation Consistency (round 3)

Three doc items inline below.


Things That Look Great

  • Genuine round-2 follow-through: 5ae3381a's commit message is unusually specific about what was changed and why (PYTHONPATH unbound-variable, dedup-via-symlinks, FSDP→experimental, optim_sweep removal, B200 prereqs documented). That's the easy-to-review style I appreciate.
  • B200 prerequisites in the README are now in the right place — Section 7 of vjepa2/README.md reads cleanly and Section 6 of vjepa2.1/README.md correctly points back at it. Round 2 had this material trapped inside sbatch header comments.
  • Experimental subdirectory pattern is a tidy way to keep the FSDP finding accessible — even though I'm asking for it to be removed from this PR (see Batch 3), the framing in scripts/experimental/README.md is short, direct, and honest about which path is recommended.
  • Diagnostic env-var hooks in run_train.py (VJEPA_LOG_SDPA, plus the existing VJEPA_TF32, VJEPA_FUSED_OPTIMIZER, VJEPA_GRAD_BUCKET_VIEW, VJEPA_PREFETCH_FACTOR) compose nicely — perf experimentation is one-line in sbatch instead of patches to the launcher.
  • TorchCodec investigation results in the V-JEPA 2 README Section 10 are well-written — "data_time reduction" table with concrete numbers and an explanation of why (seek_mode="approximate", cycling sampler, no epoch boundary stalls) is exactly the kind of empirical content this repo wants.
  • PYTHONPATH guard everywhere it mattered: the ${PYTHONPATH:-} fix made it into all four affected B200 sbatches, not just the one I'd cited — appreciate the breadth of the fix.

> For production training, implement FSDP-aware checkpointing with `StateDictType.FULL_STATE_DICT`
> or `SHARDED_STATE_DICT`.

## 7. Parse Results
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate ## 7. heading

The numbered sections currently read ## 6.## 7. FSDP Experimental Config (B200) (line 154) → ## 7. Parse Results (this line) → ## 8. Profiling with nsys. Two ## 7.s break GitHub's auto-generated anchors (the second one gets #7-parse-results-1, which any external link to "Parse Results" will miss) and the rendered TOC.

Suggested change
## 7. Parse Results
## 8. Parse Results

Then the existing ## 8. Profiling with nsys becomes ## 9.. Quick fix; surfacing because both READMEs are referenced from the top-level test case index.

Note: if you take the Batch 3 ask and remove the FSDP section entirely, this conflict resolves itself — the ## 7. Parse Results heading on this line is already correct in that case.

tail -f logs/vjepa21_benchmark/<JOB_ID>.out
```

The benchmark runs 200 iterations across 8 nodes (64 GPUs). With `rank_ratio=0.5`, ranks 0-31 process images (batch size 72 per GPU) and ranks 32-63 process video (batch size 24 per GPU).
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per-GPU video batch size is still reported two ways (carry-over from round 2)

This line says:

...ranks 32-63 process video (batch size 24 per GPU).

But line 207 (Architecture Notes / "Image/Video co-training rank split") says:

Ranks 32-63 (32 GPUs): process video, batch size 48/GPU = 1,536 videos/step

These are the same two numbers from round 2, and the explanation is the same — this line is the YAML-configured batch_size, line 207 is the effective post-auto-adjust per-rank batch when rank_ratio=0.5 halves the video-rank count. Right now a reader will hit one number, then another, and reasonably conclude the README contradicts itself.

The cleanest fix is one of:

  • Drop the "batch size 24 per GPU" parenthetical here and let the Architecture Notes section be the authoritative source for batch math; or
  • Add the "(effective 48/GPU after auto-doubling — see Architecture Notes)" qualifier inline so the two numbers reconcile in-place.

Comment thread 3.test_cases/pytorch/vjepa2.1/README.md Outdated

## Benchmark Results

See the V-JEPA 2 README for detailed benchmark results comparing H200 and B200 performance for both V-JEPA 2 and V-JEPA 2.1. Results are generated locally using `parse_benchmark.py` after running benchmark jobs.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"See the V-JEPA 2 README for detailed benchmark results" points at a section that doesn't exist

The corresponding section in vjepa2/README.md (around line 360–362) doesn't actually contain benchmark numbers — it just says results are generated locally:

Benchmark results comparing H200 and B200 performance are generated locally in the benchmarks/ directory after running the benchmark sbatch scripts.

So this README now points to the V-JEPA 2 README, which then points the reader at a gitignored benchmarks/ directory that won't exist on a fresh clone. The PR description has a small benchmark table (8x p5en, 64x H200, BF16, peak ~32.9 GB/GPU); if you'd like in-tree results, surfacing that table in vjepa2/README.md directly — even as a small "headline numbers" block — would close the loop. Otherwise I'd just drop the cross-reference rather than send readers chasing a link.

…x Dockerfiles

- Remove experimental/ FSDP path (scripts, configs, slurm) from vjepa2.1
  FSDP was benchmarked at ~2x slower than DDP; removed per reviewer request
- Remove orphan phase configs (phase1, phase4) no longer used by any workflow
- Fix Dockerfiles: remove || true on libnccl-dev install, add B200 note
- Fix K8s image tags: use pinned commit hash (204698b4) instead of confusing
  repo:variant naming
- Add mkdir -p for log directories in all sbatch scripts for consistency
- Add directory structure note explaining vjepa2/ and vjepa2.1/ separation
  mirrors upstream facebookresearch/vjepa2 app structure
- Fix README inconsistencies: remove benchmarks/ dir references (not committed),
  fix section numbering, fix batch size claim, update nsys examples
@KeitaW KeitaW merged commit 1faaf6a into main May 20, 2026
5 checks passed
@KeitaW KeitaW deleted the feat/vjepa2-distributed-training branch May 20, 2026 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants