[bugfix]: fix SP deadlock in negative prompt encoding during training#1178
Conversation
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a critical deadlock encountered during distributed training with sequence parallelism. The previous implementation of negative prompt encoding centralized the work on a single rank, so the other ranks waited indefinitely on a collective operation that never completed. The updated approach decentralizes the encoding: each rank processes its negative prompt independently, which resolves the deadlock and keeps distributed training running smoothly.
Code Review
This pull request effectively resolves a deadlock issue in distributed training with sequence parallelism by refactoring the negative prompt encoding. The previous approach of encoding on a single rank and broadcasting the results is replaced by independent encoding on each rank. This change simplifies the code and eliminates complex collective communication. The implementation is solid, but I have one suggestion to improve robustness by using the configured precision for the text encoder instead of a hardcoded value.
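For illustration only, a minimal sketch of the kind of change that suggestion points at, assuming the pipeline config exposes a text-encoder precision string; the field name `text_encoder_precisions` and the `PRECISION_TO_TYPE` map are assumptions here, not confirmed repo APIs:

```python
import torch

# Hypothetical precision-string -> dtype map; the repo may already provide one.
PRECISION_TO_TYPE = {
    "fp32": torch.float32,
    "fp16": torch.float16,
    "bf16": torch.bfloat16,
}

def resolve_text_encoder_dtype(pipeline_config) -> torch.dtype:
    """Pick the text-encoder dtype from the pipeline config instead of hardcoding bf16."""
    precisions = getattr(pipeline_config, "text_encoder_precisions", None) or ["bf16"]
    return PRECISION_TO_TYPE.get(precisions[0], torch.bfloat16)
```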
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 PR merge requirements: Wonderful, this rule succeeded.
/merge
Addresses review feedback on hao-ai-lab#1178: the previous fix loaded the negative prompt encoder via transformers' T5EncoderModel, but Wan's text_encoder is UMT5 (per-layer relative position bias, not shared). Loading UMT5 weights into a T5 architecture silently produces wrong embeddings for the negative prompt and diverges the training-time CFG from inference. Switch to TextEncoderLoader so the encoder class is resolved from pipeline_config (UMT5EncoderModel for Wan) and the postprocess_text function is reused instead of imported by name. This keeps the fix to the original SP deadlock (every rank encodes independently; no full WanPipeline construction, no NCCL collectives) while staying inside the existing prompt-encoding abstraction. text_encoder_cpu_offload is forced off for this short-lived load to avoid initializing an FSDP device mesh, which would re-introduce collectives.
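Roughly, the shape of that change might look like the sketch below. `TextEncoderLoader` and `text_encoder_cpu_offload` come from the commit message, but the import path, the `load()` signature, and the way the flag is overridden are assumptions, not the actual implementation:

```python
import dataclasses

def load_negative_prompt_encoder(model_path: str, fastvideo_args):
    """Load only the text encoder on the local rank for negative prompt encoding.

    CPU offload is forced off so this short-lived load never initializes an FSDP
    device mesh, which would reintroduce the collectives the fix is avoiding.
    """
    # Assumed import path; the real module layout may differ.
    from fastvideo.models.loader.component_loader import TextEncoderLoader

    # Assumes fastvideo_args is a dataclass carrying text_encoder_cpu_offload.
    local_args = dataclasses.replace(fastvideo_args, text_encoder_cpu_offload=False)

    loader = TextEncoderLoader()
    # The loader resolves the encoder class from pipeline_config
    # (UMT5EncoderModel for Wan), so no T5/UMT5 mismatch is possible.
    return loader.load(model_path, local_args)
```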
@SolitaryThinker thanks — both points were correct, sorry for the lazy first pass. Pushed a follow-up commit (61f6b39) that addresses them:
The original deadlock fix is preserved: every rank loads its own encoder and encodes the negative prompt independently, with no full WanPipeline construction and no NCCL collectives.
Force-pushed 61f6b39 to c9cf85e.
Per-rank negative prompt encoding was duplicated inline in WanModel and will be needed for any future single-encoder model. Move it to fastvideo/train/utils/negative_prompt.py and have WanModel call into it. Hunyuan keeps its bespoke dual-encoder path for now. Also moves the in-method imports introduced by the deadlock fix out to the top of the helper module.
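As a sketch of what the extracted helper could look like (the module path is from the commit message; the signature, the tokenizer call, and the reuse of `postprocess_text` are illustrative assumptions):

```python
# fastvideo/train/utils/negative_prompt.py (illustrative sketch, not the actual file)
import torch


def encode_negative_prompt(negative_prompt: str, tokenizer, text_encoder,
                           postprocess_text, device: torch.device):
    """Encode the negative prompt on the local rank only: no broadcast, no collectives."""
    tokens = tokenizer(
        negative_prompt,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    ).to(device)

    with torch.no_grad():
        outputs = text_encoder(
            input_ids=tokens.input_ids,
            attention_mask=tokens.attention_mask,
        )

    # Reuse the pipeline's own postprocess step so training-time CFG embeddings
    # match what inference produces.
    return postprocess_text(outputs, tokens.attention_mask)
```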
…ncoding
The ensure_negative_conditioning rewrite in 88958db (this PR's first commit, Apr 21) predates encode_negative_prompt from hao-ai-lab#1178. Now that hao-ai-lab#1178 is merged, revert the in-method rank-0 + broadcast path to call the helper instead. This restores hao-ai-lab#1178's per-rank text-encoder load — the rank-0 + broadcast pattern removed here is the exact path hao-ai-lab#1178's docstring identifies as deadlocking on FSDP collectives during text-encoder load. The _prompt_pipeline_cls ClassVar hook is removed: the helper reads pipeline_config.text_encoder_configs directly, and LongCat's text encoder works through its own pipeline_config without any further override.
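To make the two patterns concrete, here is a hedged before/after sketch; everything beyond the idea of calling a per-rank encode helper is illustrative:

```python
import torch
import torch.distributed as dist

# Removed pattern: rank 0 encodes, the others block on a broadcast. If the rank-0
# text-encoder load itself issues FSDP collectives that expect every rank, the
# non-zero ranks are still sitting in broadcast() and the job deadlocks.
def negative_embeds_rank0_broadcast(rank, negative_prompt, encode_fn, embed_shape, device):
    if rank == 0:
        neg_embeds = encode_fn(negative_prompt)
    else:
        neg_embeds = torch.empty(embed_shape, device=device)
    dist.broadcast(neg_embeds, src=0)
    return neg_embeds

# Restored #1178 pattern: every rank encodes independently, no collectives at all.
def negative_embeds_per_rank(negative_prompt, encode_fn):
    return encode_fn(negative_prompt)
```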
Summary
With sp_size > 1, ensure_negative_conditioning() created an inference pipeline on rank 0 only, but post_init() called warmup_sequence_parallel_communication(), which does an all-to-all on the SP group; the other ranks never participated, causing a hang.
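A toy sketch of that failure mode: a collective on a process group only completes when every rank in the group calls it, so a rank-0-only pipeline construction whose warmup does an all-to-all on the SP group leaves rank 0 blocked forever (the group setup and shapes below are illustrative, not the repo's actual warmup code):

```python
import torch
import torch.distributed as dist

def buggy_warmup(sp_group, device):
    # The deadlock shape: only rank 0 reaches the collective. all_to_all_single()
    # blocks until every rank in sp_group participates, so rank 0 waits forever
    # while the other ranks never enter the call.
    if dist.get_rank() == 0:
        inp = torch.zeros(dist.get_world_size(sp_group), device=device)
        out = torch.empty_like(inp)
        dist.all_to_all_single(out, inp, group=sp_group)  # blocks forever
```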
Ran training with sp_size=4 on 4 and 8 GPUs. No hangs, no quality degradation.