fix(diffusion): honor local checkpoint dirs and task-derived model id in WAN inference by huvunvidia · Pull Request #4408 · NVIDIA-NeMo/Megatron-Bridge

huvunvidia · 2026-06-17T21:34:30Z

What

Fixes two independent failures when running examples/models/wan/inference_wan.py --task t2v-1.3B on an air-gapped node (no outbound internet).

Bug 1 — Missing `easydict` dependency

inference_wan.py imports easydict unconditionally at module load, but the package is absent from the container (it was a transitive dep dropped when the CVE-bearing codecs av/imageio/imageio-ffmpeg were removed). The script crashes at startup with ModuleNotFoundError: No module named 'easydict'.

Fix: add easydict to scripts/install_diffusion_deps.sh (alongside the codecs) and document inference prerequisites in the WAN README.

Bug 2 — Hardcoded 14B model id / ignored `--t5_checkpoint_dir` & `--vae_checkpoint_dir`

FlowInferencePipeline accepted t5_checkpoint_dir / vae_checkpoint_dir but never used them — all from_pretrained calls resolved the hardcoded Wan-AI/Wan2.1-T2V-14B-Diffusers hub id. Offline runs hit the HF hub and crashed with LocalEntryNotFoundError, even with the correct local files supplied. inference_wan.py also hardcoded the 14B id regardless of --task.

Fix:

flow_inference_pipeline.py: resolve the text encoder, tokenizer, VAE, and scheduler from the provided local dirs, falling back to model_id. (The scheduler from_pretrained in generate() also used model_id and would have failed offline — it is now covered too.)
inference_wan.py: derive model_id from --task via a TASK_TO_MODEL_ID map instead of hardcoding 14B.
README: document --t5_checkpoint_dir / --vae_checkpoint_dir and add an offline-inference example.

Notes

easydict is intentionally not added to pyproject.toml as a required dependency, consistent with how the CVE-bearing codecs are handled (example/test-time install only). Happy to make it an optional extra in a separate dependency PR if preferred.

Testing

ruff check and ruff format --check pass on the changed Python files.
Behavioral fix verified by code inspection against the offline reproduction in the bug report; full offline inference run requires a GPU node with the WAN checkpoints.

🤖 Generated with Claude Code

… in WAN inference WAN inference failed on air-gapped nodes in two ways: - easydict was imported unconditionally in inference_wan.py but absent from the container; add it to scripts/install_diffusion_deps.sh and document the inference prerequisites in the README. - FlowInferencePipeline ignored the t5_checkpoint_dir / vae_checkpoint_dir params and always resolved the hardcoded 14B hub id, so offline runs hit the HF hub and crashed. Resolve text encoder, tokenizer, VAE, and scheduler from the provided local dirs (falling back to model_id), and derive model_id from --task instead of hardcoding 14B. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu <huvu@nvidia.com>

copy-pr-bot · 2026-06-17T21:34:34Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

claude · 2026-06-17T21:38:57Z

Review

Clean bug-fix PR -- both fixes are correct and the README additions are helpful. Two small items:

1. Docstring inaccuracy (pre-existing, good to fix here)

flow_inference_pipeline.py lines 97-100: the docstrings for t5_checkpoint_dir and vae_checkpoint_dir say 'falls back to checkpoint_dir if None' but the actual fallback (lines 125-126) is model_id. Now that this PR wires up these parameters, worth updating the docstrings to say 'falls back to model_id if None'.

2. Test helper missing scheduler_source

The _make_pipeline helper in tests/unit_tests/diffusion/model/wan/flow_matching/test_flow_inference_pipeline.py (line 52) sets model_id but not the new scheduler_source attribute. The existing generate tests still pass because FlowMatchEulerDiscreteScheduler.from_pretrained is monkeypatched to ignore its args, but any future test that exercises the scheduler path without that monkeypatch will hit AttributeError. Suggest adding scheduler_source next to the existing model_id default in the helper.

Suggested test cases

No perf tests impacted.

Generated with Claude Code

huvunvidia · 2026-06-17T22:08:52Z

/ok to test 353214e

…ion in WAN inference The generate() loop now reads self.scheduler_source; add it to the test helper that bypasses __init__ so the existing TestGenerate cases pass, and add TestComponentDirResolution to verify __init__ resolves the text encoder, tokenizer, VAE, and scheduler from --t5_checkpoint_dir / --vae_checkpoint_dir with fallback to model_id. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu <huvu@nvidia.com>

…ion tests Trim the inference README to the easydict prerequisite and the basic example. Remove the added TestComponentDirResolution cases; keep only the scheduler_source default in the test helper so the existing TestGenerate cases pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu <huvu@nvidia.com>

huvunvidia · 2026-06-25T17:29:28Z

Fixed the Launch_Unit_Tests_Diffusion failure: generate() now reads self.scheduler_source, which the TestGenerate helper (it bypasses __init__) didn't set — added it to the helper.

Note: the Launch_Unit_Tests_Core failure (test_gemma4_vl_modeling.py::TestScatterModalityFeatures::test_forward_scatters_sequence_parallel_decoder_input — fake_scatter() got an unexpected keyword argument 'group') is unrelated to this PR. It is in the Gemma-4 VL modeling tests and touches no WAN/diffusion code changed here; it appears pre-existing on main.

/ok to test 0ce84dc

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu <huvu@nvidia.com>

huvunvidia · 2026-06-25T17:31:37Z

/ok to test 546f0ea

huvunvidia · 2026-06-25T20:04:31Z

Confirmed manually: with this PR, WAN t2v-1.3B inference runs fully offline (HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1) — easydict imports (Bug 1) and the text encoder / tokenizer / VAE / scheduler all load from the local --t5_checkpoint_dir / --vae_checkpoint_dir instead of the hardcoded 14B hub id (Bug 2). Re-triggering CI on the latest commit.

/ok to test 546f0ea

huvunvidia · 2026-06-26T19:41:52Z

Merged latest main (branch was 62 commits behind). The prior functional-test failures were all exit-127 No such file or directory for the launch scripts — the CI matrix is generated from the branch head while tests run in the container built from the PR-merged-with-main commit, and main had reorganized tests/functional_tests/launch_scripts/ (e.g. L0_Launch_training.sh split into _pretrain/_finetune/etc., several L0_* retiered to L1_*/L2_*). The merge realigns them. No conflicts; main never touched the WAN files in this PR.

/ok to test d935ded

huvunvidia · 2026-06-26T19:51:28Z

Merged latest main again (1 commit behind; it updated .github/workflows/cicd-main.yml). No conflicts, no overlap with the WAN files in this PR.

/ok to test 3469f8f

huvunvidia · 2026-06-26T20:38:51Z

/ok to test 3469f8f

copy-pr-bot Bot temporarily deployed to public June 17, 2026 22:09 Inactive

copy-pr-bot Bot temporarily deployed to test June 17, 2026 22:09 Inactive

copy-pr-bot Bot temporarily deployed to public June 17, 2026 22:19 Inactive

copy-pr-bot Bot temporarily deployed to public June 17, 2026 23:04 Inactive

yaoyu-33 added area:diffusion DFM module bug Something isn't working needs-more-tests Requires additional L0 and L1 test coverage before merge needs-review PR is ready for code review and waiting on a reviewer labels Jun 22, 2026

huvunvidia and others added 2 commits June 25, 2026 10:17

chore(diffusion): trim install_diffusion_deps.sh comment

546f0ea

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu <huvu@nvidia.com>

copy-pr-bot Bot temporarily deployed to public June 25, 2026 17:32 Inactive

copy-pr-bot Bot temporarily deployed to test June 25, 2026 17:33 Inactive

copy-pr-bot Bot temporarily deployed to public June 25, 2026 18:06 Inactive

copy-pr-bot Bot temporarily deployed to public June 25, 2026 18:07 Inactive

Merge remote-tracking branch 'origin/main' into huvu/wan_inference_nvbug

d935ded

Merge remote-tracking branch 'origin/main' into huvu/wan_inference_nvbug

3469f8f

copy-pr-bot Bot temporarily deployed to public June 26, 2026 20:39 Inactive

copy-pr-bot Bot temporarily deployed to test June 26, 2026 20:39 Inactive

copy-pr-bot Bot temporarily deployed to public June 26, 2026 20:57 Inactive

copy-pr-bot Bot temporarily deployed to public June 26, 2026 21:22 Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(diffusion): honor local checkpoint dirs and task-derived model id in WAN inference#4408

fix(diffusion): honor local checkpoint dirs and task-derived model id in WAN inference#4408
huvunvidia wants to merge 6 commits into
mainfrom
huvu/wan_inference_nvbug

huvunvidia commented Jun 17, 2026

Uh oh!

copy-pr-bot Bot commented Jun 17, 2026

Uh oh!

claude Bot commented Jun 17, 2026

Uh oh!

huvunvidia commented Jun 17, 2026

Uh oh!

huvunvidia commented Jun 25, 2026

Uh oh!

huvunvidia commented Jun 25, 2026

Uh oh!

huvunvidia commented Jun 25, 2026

Uh oh!

huvunvidia commented Jun 26, 2026

Uh oh!

huvunvidia commented Jun 26, 2026

Uh oh!

huvunvidia commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

huvunvidia commented Jun 17, 2026

What

Bug 1 — Missing easydict dependency

Bug 2 — Hardcoded 14B model id / ignored --t5_checkpoint_dir & --vae_checkpoint_dir

Notes

Testing

Uh oh!

copy-pr-bot Bot commented Jun 17, 2026

Uh oh!

claude Bot commented Jun 17, 2026

Review

Uh oh!

huvunvidia commented Jun 17, 2026

Uh oh!

huvunvidia commented Jun 25, 2026

Uh oh!

huvunvidia commented Jun 25, 2026

Uh oh!

huvunvidia commented Jun 25, 2026

Uh oh!

huvunvidia commented Jun 26, 2026

Uh oh!

huvunvidia commented Jun 26, 2026

Uh oh!

huvunvidia commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Bug 1 — Missing `easydict` dependency

Bug 2 — Hardcoded 14B model id / ignored `--t5_checkpoint_dir` & `--vae_checkpoint_dir`