Skip to content

fix(diffusion): honor local checkpoint dirs and task-derived model id in WAN inference#4408

Open
huvunvidia wants to merge 6 commits into
mainfrom
huvu/wan_inference_nvbug
Open

fix(diffusion): honor local checkpoint dirs and task-derived model id in WAN inference#4408
huvunvidia wants to merge 6 commits into
mainfrom
huvu/wan_inference_nvbug

Conversation

@huvunvidia

Copy link
Copy Markdown
Contributor

What

Fixes two independent failures when running examples/models/wan/inference_wan.py --task t2v-1.3B on an air-gapped node (no outbound internet).

Bug 1 — Missing easydict dependency

inference_wan.py imports easydict unconditionally at module load, but the package is absent from the container (it was a transitive dep dropped when the CVE-bearing codecs av/imageio/imageio-ffmpeg were removed). The script crashes at startup with ModuleNotFoundError: No module named 'easydict'.

Fix: add easydict to scripts/install_diffusion_deps.sh (alongside the codecs) and document inference prerequisites in the WAN README.

Bug 2 — Hardcoded 14B model id / ignored --t5_checkpoint_dir & --vae_checkpoint_dir

FlowInferencePipeline accepted t5_checkpoint_dir / vae_checkpoint_dir but never used them — all from_pretrained calls resolved the hardcoded Wan-AI/Wan2.1-T2V-14B-Diffusers hub id. Offline runs hit the HF hub and crashed with LocalEntryNotFoundError, even with the correct local files supplied. inference_wan.py also hardcoded the 14B id regardless of --task.

Fix:

  • flow_inference_pipeline.py: resolve the text encoder, tokenizer, VAE, and scheduler from the provided local dirs, falling back to model_id. (The scheduler from_pretrained in generate() also used model_id and would have failed offline — it is now covered too.)
  • inference_wan.py: derive model_id from --task via a TASK_TO_MODEL_ID map instead of hardcoding 14B.
  • README: document --t5_checkpoint_dir / --vae_checkpoint_dir and add an offline-inference example.

Notes

  • easydict is intentionally not added to pyproject.toml as a required dependency, consistent with how the CVE-bearing codecs are handled (example/test-time install only). Happy to make it an optional extra in a separate dependency PR if preferred.

Testing

  • ruff check and ruff format --check pass on the changed Python files.
  • Behavioral fix verified by code inspection against the offline reproduction in the bug report; full offline inference run requires a GPU node with the WAN checkpoints.

🤖 Generated with Claude Code

… in WAN inference

WAN inference failed on air-gapped nodes in two ways:

- easydict was imported unconditionally in inference_wan.py but absent from
  the container; add it to scripts/install_diffusion_deps.sh and document the
  inference prerequisites in the README.
- FlowInferencePipeline ignored the t5_checkpoint_dir / vae_checkpoint_dir
  params and always resolved the hardcoded 14B hub id, so offline runs hit the
  HF hub and crashed. Resolve text encoder, tokenizer, VAE, and scheduler from
  the provided local dirs (falling back to model_id), and derive model_id from
  --task instead of hardcoding 14B.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Huy Vu <huvu@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@claude

claude Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Review

Clean bug-fix PR -- both fixes are correct and the README additions are helpful. Two small items:

1. Docstring inaccuracy (pre-existing, good to fix here)

flow_inference_pipeline.py lines 97-100: the docstrings for t5_checkpoint_dir and vae_checkpoint_dir say 'falls back to checkpoint_dir if None' but the actual fallback (lines 125-126) is model_id. Now that this PR wires up these parameters, worth updating the docstrings to say 'falls back to model_id if None'.

2. Test helper missing scheduler_source

The _make_pipeline helper in tests/unit_tests/diffusion/model/wan/flow_matching/test_flow_inference_pipeline.py (line 52) sets model_id but not the new scheduler_source attribute. The existing generate tests still pass because FlowMatchEulerDiscreteScheduler.from_pretrained is monkeypatched to ignore its args, but any future test that exercises the scheduler path without that monkeypatch will hit AttributeError. Suggest adding scheduler_source next to the existing model_id default in the helper.

Suggested test cases

No perf tests impacted.

Generated with Claude Code

@huvunvidia

Copy link
Copy Markdown
Contributor Author

/ok to test 353214e

@yaoyu-33 yaoyu-33 added area:diffusion DFM module bug Something isn't working needs-more-tests Requires additional L0 and L1 test coverage before merge needs-review PR is ready for code review and waiting on a reviewer labels Jun 22, 2026
huvunvidia and others added 2 commits June 25, 2026 10:17
…ion in WAN inference

The generate() loop now reads self.scheduler_source; add it to the test helper
that bypasses __init__ so the existing TestGenerate cases pass, and add
TestComponentDirResolution to verify __init__ resolves the text encoder,
tokenizer, VAE, and scheduler from --t5_checkpoint_dir / --vae_checkpoint_dir
with fallback to model_id.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Huy Vu <huvu@nvidia.com>
…ion tests

Trim the inference README to the easydict prerequisite and the basic example.
Remove the added TestComponentDirResolution cases; keep only the scheduler_source
default in the test helper so the existing TestGenerate cases pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Huy Vu <huvu@nvidia.com>
@huvunvidia

Copy link
Copy Markdown
Contributor Author

Fixed the Launch_Unit_Tests_Diffusion failure: generate() now reads self.scheduler_source, which the TestGenerate helper (it bypasses __init__) didn't set — added it to the helper.

Note: the Launch_Unit_Tests_Core failure (test_gemma4_vl_modeling.py::TestScatterModalityFeatures::test_forward_scatters_sequence_parallel_decoder_inputfake_scatter() got an unexpected keyword argument 'group') is unrelated to this PR. It is in the Gemma-4 VL modeling tests and touches no WAN/diffusion code changed here; it appears pre-existing on main.

/ok to test 0ce84dc

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Huy Vu <huvu@nvidia.com>
@huvunvidia

Copy link
Copy Markdown
Contributor Author

/ok to test 546f0ea

@huvunvidia

Copy link
Copy Markdown
Contributor Author

Confirmed manually: with this PR, WAN t2v-1.3B inference runs fully offline (HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1) — easydict imports (Bug 1) and the text encoder / tokenizer / VAE / scheduler all load from the local --t5_checkpoint_dir / --vae_checkpoint_dir instead of the hardcoded 14B hub id (Bug 2). Re-triggering CI on the latest commit.

/ok to test 546f0ea

@huvunvidia

Copy link
Copy Markdown
Contributor Author

Merged latest main (branch was 62 commits behind). The prior functional-test failures were all exit-127 No such file or directory for the launch scripts — the CI matrix is generated from the branch head while tests run in the container built from the PR-merged-with-main commit, and main had reorganized tests/functional_tests/launch_scripts/ (e.g. L0_Launch_training.sh split into _pretrain/_finetune/etc., several L0_* retiered to L1_*/L2_*). The merge realigns them. No conflicts; main never touched the WAN files in this PR.

/ok to test d935ded

@huvunvidia

Copy link
Copy Markdown
Contributor Author

Merged latest main again (1 commit behind; it updated .github/workflows/cicd-main.yml). No conflicts, no overlap with the WAN files in this PR.

/ok to test 3469f8f

@huvunvidia

Copy link
Copy Markdown
Contributor Author

/ok to test 3469f8f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:diffusion DFM module bug Something isn't working needs-more-tests Requires additional L0 and L1 test coverage before merge needs-review PR is ready for code review and waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants