
Normalize negative_prompt in LongCatAudioDiT pipeline for CFG symmetry #13525

Open

Ricardo-M-L wants to merge 1 commit into huggingface:main from Ricardo-M-L:fix-longcat-audio-dit-negative-prompt-normalize

Conversation

@Ricardo-M-L (Contributor)

What this PR does

In `LongCatAudioDiTPipeline.__call__`, positive prompts are normalized via `_normalize_text` before being encoded:

```python
normalized_prompts = [_normalize_text(text) for text in prompt]
...
prompt_embeds, prompt_embeds_len = self.encode_prompt(normalized_prompts, device)
```

`_normalize_text` lowercases the string, strips ASCII and curly quote characters (`"“”‘’`), and collapses whitespace. The upstream reference implementation meituan-longcat/LongCat-AudioDiT applies the same `normalize_text` pass to every piece of text it feeds to the UMT5 text encoder.
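For concreteness, that behavior amounts to something like the following (a hypothetical reconstruction of `_normalize_text`, not the pipeline's actual code):

```python
import re

_QUOTES = re.compile(r'["\u201c\u201d\u2018\u2019]')  # " “ ” ‘ ’
_WHITESPACE = re.compile(r"\s+")

def _normalize_text(text: str) -> str:
    text = text.lower()           # lowercase
    text = _QUOTES.sub("", text)  # strip ASCII and curly quote glyphs
    return _WHITESPACE.sub(" ", text).strip()  # collapse whitespace runs
```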

However, when a user passes a `negative_prompt`, the pipeline currently encodes it raw:

```python
negative_prompt_embeds, negative_prompt_embeds_len = self.encode_prompt(negative_prompt, device)
```

Why this is a bug

The model was trained with normalized text on both branches, so a raw negative prompt lands off-distribution relative to the normalized positive prompt. In classifier-free guidance

```python
noise = neg + guidance_scale * (pos - neg)
```

the `(pos - neg)` delta no longer represents a clean "target − avoid" direction; it partly absorbs normalization noise (case, quote glyphs, whitespace), which weakens or distorts guidance rather than steering generation away from what the user described.
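Spelled out as a generic denoising step (a sketch; the model call signature is assumed, not the pipeline's actual code):

```python
import torch

# Generic CFG step: the denoiser is evaluated under both text conditions
# and the outputs are recombined along the (pos - neg) direction.
def cfg_step(model, latents, t, pos_embeds, neg_embeds, guidance_scale):
    noise_pos = model(latents, t, encoder_hidden_states=pos_embeds)
    noise_neg = model(latents, t, encoder_hidden_states=neg_embeds)
    # If neg_embeds encodes un-normalized text, (noise_pos - noise_neg)
    # mixes "avoid this content" with "undo the normalization mismatch".
    return noise_neg + guidance_scale * (noise_pos - noise_neg)
```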

Examples that currently fail to behave symmetrically (a short demonstration follows the list):

  • `negative_prompt="Loud Noise"` encodes differently from the positive path's `"loud noise"`
  • `negative_prompt='"speech"'` keeps the quote glyphs that the positive path would strip
  • `negative_prompt="multiple   speakers"` keeps the extra whitespace that the positive path would collapse

The empty/None branch is unaffected — it already uses a zero tensor to match the reference model.

Fix

Apply the same _normalize_text pass to user-supplied negative prompts before encoding, restoring symmetry with the positive branch and with the reference pipeline.

```diff
 if negative_prompt is None or (isinstance(negative_prompt, str) and negative_prompt == ""):
     negative_prompt_embeds = torch.zeros(...)
     negative_prompt_embeds_len = torch.tensor([1] * batch_size, device=device)
 else:
-    negative_prompt_embeds, negative_prompt_embeds_len = self.encode_prompt(negative_prompt, device)
+    normalized_negative_prompts = [_normalize_text(text) for text in negative_prompt]
+    negative_prompt_embeds, negative_prompt_embeds_len = self.encode_prompt(
+        normalized_negative_prompts, device
+    )
```
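With the fix applied, an end-user call along these lines behaves symmetrically (a sketch using standard diffusers conventions; the import path, checkpoint id, and call signature are assumptions, not confirmed by this PR):

```python
import torch
from diffusers import LongCatAudioDiTPipeline  # import path assumed

# Checkpoint id borrowed from the upstream repo name; the real hub id may differ.
pipe = LongCatAudioDiTPipeline.from_pretrained(
    "meituan-longcat/LongCat-AudioDiT", torch_dtype=torch.float16
).to("cuda")

out = pipe(
    prompt="gentle rain on a tin roof",
    negative_prompt="Loud Noise",  # now normalized to "loud noise" before encoding
    guidance_scale=4.0,
)
```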

Before submitting

  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? N/A — small consistency fix.
  • Did you make sure to update the documentation with your changes? N/A — no public API change.
  • Did you write any new necessary tests? N/A — mirrors the existing positive-prompt path.

Who can review?

@yiyixuxu @sayakpaul

github-actions bot added the `pipelines` and `size/S` (PR with diff < 50 LOC) labels on Apr 21, 2026