
Add RAE Diffusion Transformer inference/preliminary training pipelines#13231

Open
plugyawn wants to merge 18 commits into huggingface:main from plugyawn:rae-dit-training

Conversation

@plugyawn

@plugyawn plugyawn commented Mar 9, 2026

What does this PR do?

This PR adds support for Diffusion Transformers with Representation Autoencoders in Diffusers.

It implements the Stage-2 side of the RAE setup:

  • RAEDiT2DModel
  • RAEDiTPipeline
  • checkpoint conversion for published upstream Stage-2 checkpoints
  • API docs
  • a small examples/research_projects/rae_dit/ training scaffold

This addresses #13225.

Reference implementation: byteriper's repository

Validation

Inference parity with the official implementation is high. For matched class label / initial latent noise / schedule, I measured:

  • max_abs_error=0.00001717
  • mean_abs_error=0.00000122

Qualitative parity artifacts used during validation:

  • same published Stage-2 checkpoint
  • same class label
  • same initial latent noise
  • same 25-step shifted Euler schedule

Inference is also slightly faster in the current Diffusers port on a 40GB A100:

| Precision | CFG | Steps | Diffusers sec/img | Upstream sec/img | Diffusers img/s | Delta  |
|-----------|-----|-------|-------------------|------------------|-----------------|--------|
| bf16      | 1.0 | 25    | 0.817             | 0.913            | 1.225           | +11.8% |
| bf16      | 4.0 | 25    | 0.852             | 0.931            | 1.174           | +9.3%  |
| bf16      | 1.0 | 50    | 1.568             | 1.761            | 0.638           | +12.3% |
| bf16      | 4.0 | 50    | 1.649             | 1.853            | 0.606           | +12.4% |

Notes

  • This PR intentionally does not add upstream autoguidance / guidance-model support.
  • The training script is a research-project scaffold under examples/research_projects, not a claim of full upstream training parity.
  • AutoencoderRAE.from_pretrained() is used for the Stage-1 component so the packaged RAEDiTPipeline.from_pretrained(...) path works with published RAE checkpoints.

Before submitting

@plugyawn plugyawn changed the title Add Stage-2 RAE DiT support with pipeline, conversion, and training tooling RAE DiT inference, checkpoint conversion, and preliminary training tooling Mar 9, 2026
@plugyawn plugyawn changed the title RAE DiT inference, checkpoint conversion, and preliminary training tooling Add RAE Diffusion Transformer inference/preliminary training pipelines Mar 9, 2026
@plugyawn plugyawn marked this pull request as draft March 9, 2026 05:46
@plugyawn plugyawn marked this pull request as ready for review March 9, 2026 05:51
@plugyawn
Author

plugyawn commented Mar 9, 2026

@kashif @sayakpaul would be great if you could review. Please note the no_init_weights() fix (details in the PR body); if you prefer, that could be a separate PR, but considering diffusers is supposed to be an extension to torch, I guess it makes sense?

@sayakpaul
Member

Thanks for the PR. To keep the scope manageable, could we break it down into separate PRs?

For example,

there is also a change to no_init_weights(). Specifically, it makes Diffusers' skip-weight-init behave more like plain PyTorch. Currently, when no_init_weights() is active, the patched torch.nn.init.* functions stop returning the tensor they were called on (for reference, PyTorch's versions do return it). Most models never notice this, but the RAE-DiT implementation does rely on the return value during construction, which can make otherwise valid checkpoints fail to load through the standard from_pretrained() path.

could be a separate PR.
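The behavioral difference at issue can be sketched with plain-Python stand-ins (these names and the list stand-in for a tensor are illustrative, not the actual diffusers patch):

```python
# Two stand-in stubs for a patched torch.nn.init.* function under a
# skip-weight-init context. Names are illustrative, not diffusers API.

def stub_swallow(*args, **kwargs):
    """Current behavior: skip the init and implicitly return None."""
    pass

def stub_preserve_contract(*args, **kwargs):
    """Proposed behavior: skip the init but return the tensor that was
    passed in, matching PyTorch's torch.nn.init.* return contract."""
    return args[0] if args else None

weight = [0.0, 0.0]  # stands in for a tensor
assert stub_swallow(weight) is None              # chaining on this breaks
assert stub_preserve_contract(weight) is weight  # chaining still works
```

Code that only relies on the in-place side effect works with either stub; code that chains on the return value only survives the second.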

Member

@sayakpaul sayakpaul left a comment


Thanks!

I left some initial comments, let me know if they make sense.

Comment on lines +13 to +16
- `examples/dreambooth/train_dreambooth_flux.py`
for the flow-matching training loop structure, checkpoint resume flow, and `accelerate.save_state(...)` hooks.
- `examples/flux-control/train_control_flux.py`
for the transformer-only save layout and SD3-style flow-matching timestep weighting helpers.
Member


Doesn't belong here.

Comment thread src/diffusers/models/modeling_utils.py Outdated
Comment on lines +218 to +221
# Preserve the `torch.nn.init.*` return contract so third-party model
# constructors that chain on the returned tensor still work under
# `no_init_weights()`.
return args[0] if len(args) > 0 else None
Member


Can you provide an example?

super().test_effective_gradient_checkpointing(loss_tolerance=1e-4)

@unittest.skip(
    "RAEDiT initializes the output head to zeros, so cosine-based layerwise casting checks are uninformative."
)
Member


I don't think this is the case? We can always skip layerwise casting for certain layer or layer groups here:

_skip_layerwise_casting_patterns = None

model.final_layer.linear.bias.data.normal_(mean=0.0, std=0.02)


class RAEDiT2DModelTests(ModelTesterMixin, unittest.TestCase):
Member


Test should use the newly added model tester mixins. You can find an example in #13046

Comment on lines +48 to +49
if shift is None:
shift = torch.zeros_like(scale)
Member


This is a small function, so it's probably fine to inline it at the call sites?

We also probably don't need _repeat_to_length().

Comment on lines +466 to +470
if self.use_pos_embed:
pos_embed = get_2d_sincos_pos_embed(
self.pos_embed.shape[-1], int(sqrt(self.pos_embed.shape[1])), output_type="pt"
)
self.pos_embed.data.copy_(pos_embed.float().unsqueeze(0))
Member


Can we use how #13046 initialized the position embeddings?

Author


Yeah, that makes sense, will do that.

)
return hidden_states

def _run_block(
Member


We don't need this. Let's instead follow this pattern:

for index_block, block in enumerate(self.transformer_blocks):


return class_labels

def _prepare_latents(
Member


It should be called prepare_latents() similar to other pipelines.

Comment on lines +247 to +252
if output_type == "pt":
output = images
else:
output = images.cpu().permute(0, 2, 3, 1).float().numpy()
if output_type == "pil":
output = self.numpy_to_pil(output)
Member


We should use an image processor instead here. See:

image = self.image_processor.postprocess(image, output_type=output_type)

if not return_dict:
return (output,)

return ImagePipelineOutput(images=output)
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's give this pipeline a separate output class: RAEDiTPipelineOutput.

@sayakpaul sayakpaul requested review from dg845 and kashif March 9, 2026 11:33
@plugyawn
Author

plugyawn commented Mar 10, 2026

@sayakpaul, from what I understand the RAE checkpoint -> DiT checkpoint -> generation pipeline necessarily requires the no_init_weights() change (otherwise the semantics become a bit muddled, imo).

Would it make more sense to open a PR for handling no_init_weights() behavior before this one?

@sayakpaul
Member

Could you explain why that's needed? I am still not sure about that, actually. Please provide specific examples that fail without the init change.

@plugyawn
Author

plugyawn commented Mar 10, 2026

Not sure how to link files, but it seems to be related to changes introduced in #13046.

A specific example,

  • AutoencoderRAE consturcts DinoV2WithRegistersModel.
  • ModelMixin.from_pretrained() does this construction under no_init_weights( ) first, before low_cpu_mem_usage kicks in (modelling_utils.py, around line 1300)
  • AutoencoderRAE constructs Dinov2WithRegistersModel(config) in _build_encoder:84, and
    ModelMixin.from_pretrained() always does that construction under no_init_weights() first, even
    before low_cpu_mem_usage matters; see modeling_utils.py:1270. In current transformers, DINOv2-
    with-registers has init code like this in modeling_dinov2_with_registers.py:464:
  module.weight.data = nn.init.trunc_normal_(
      module.weight.data.to(torch.float32), mean=0.0, std=self.config.initializer_range
  ).to(module.weight.dtype)

Under today’s no_init_weights(), nn.init.trunc_normal_ is replaced with a stub that just passes
and returns None, so that becomes None.to(...) and fails with an AttributeError: 'NoneType' object has no attribute 'to'.

Codex has a better summary, I think:

failing example: AutoencoderRAE builds Dinov2WithRegistersModel(config) in its encoder path, and ModelMixin.from_pretrained() always instantiates models under no_init_weights() first. In current transformers, DINOv2’s init_weights() assigns the return value of nn.init.trunc_normal_(...) and then calls .to(...) on it. With the current no_init_weights() stub, that return value becomes None, so construction fails with AttributeError: 'NoneType' object has no attribute 'to'. The proposed change keeps skip-init behavior intact, but restores the normal PyTorch return contract so these constructors remain compatible.
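The failure mode can be reproduced without torch at all; in this sketch, FakeTensor, real_trunc_normal_, and skip_init_stub are dependency-free stand-ins for the real tensor, init function, and stub:

```python
# Minimal, torch-free reproduction of the AttributeError described above.
class FakeTensor:
    """Stands in for a torch.Tensor; only .to() matters here."""
    def to(self, dtype):
        return self

def real_trunc_normal_(tensor, mean=0.0, std=1.0):
    # Real torch.nn.init.trunc_normal_ mutates in place AND returns the tensor.
    return tensor

def skip_init_stub(*args, **kwargs):
    # The current no_init_weights() replacement: does nothing, returns None.
    pass

weight = FakeTensor()
# DINOv2-style init chains .to(...) on the return value:
assert real_trunc_normal_(weight).to("float32") is weight  # fine

try:
    skip_init_stub(weight).to("float32")  # None.to(...) blows up
except AttributeError as exc:
    assert "'NoneType' object has no attribute 'to'" in str(exc)
```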

Re: #13046, note test_models_autoencoder_rae.py:45; the unit tests seem a little off to me, and I'm not sure they're aligned with the real encoder path:

# ---------------------------------------------------------------------------
# Tiny test encoder for fast unit tests (no transformers dependency)
# ---------------------------------------------------------------------------


class _TinyTestEncoderModule(torch.nn.Module):
    """Minimal encoder that mimics the patch-token interface without any HF model."""

    def __init__(self, hidden_size: int = 16, patch_size: int = 8, **kwargs):
        super().__init__()
        self.patch_size = patch_size
        self.hidden_size = hidden_size

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        pooled = F.avg_pool2d(images.mean(dim=1, keepdim=True), kernel_size=self.patch_size, stride=self.patch_size)
        tokens = pooled.flatten(2).transpose(1, 2).contiguous()
        return tokens.repeat(1, 1, self.hidden_size)


def _tiny_test_encoder_forward(model, images):
    return model(images)


def _build_tiny_test_encoder(encoder_type, hidden_size, patch_size, num_hidden_layers):
    return _TinyTestEncoderModule(hidden_size=hidden_size, patch_size=patch_size)


# Monkey-patch the dispatch tables so "tiny_test" is recognised by AutoencoderRAE
_ENCODER_FORWARD_FNS["tiny_test"] = _tiny_test_encoder_forward
_original_build_encoder = _build_encoder


def _patched_build_encoder(encoder_type, hidden_size, patch_size, num_hidden_layers):
    if encoder_type == "tiny_test":
        return _build_tiny_test_encoder(encoder_type, hidden_size, patch_size, num_hidden_layers)
    return _original_build_encoder(encoder_type, hidden_size, patch_size, num_hidden_layers)


_rae_module._build_encoder = _patched_build_encoder

I'm new to diffusers idioms, but I was confused about why this appeared to be a problem only now, and asked GPT:

no_init_weights() only becomes a problem when all of these are true at once:

  • a diffusers ModelMixin.from_pretrained() call is constructing the model
  • that model’s __init__() instantiates another model internally
  • that internal model uses torch.nn.init.* and also relies on its return value

RAE is unusual because it does exactly that. Inside autoencoder_rae.py, the AutoencoderRAE constructor directly builds a transformers vision backbone:

  • Dinov2WithRegistersModel:98
  • SiglipVisionModel:111
  • ViTMAEModel:124

That is not how most other diffusers integrations are structured. Most of the repo does one of these instead:

  • native diffusers models in src/diffusers/models, whose init code only relies on side effects
  • pipelines that accept transformers models as separate top-level components, rather than constructing them inside a ModelMixin

So other work usually does not run a transformers constructor inside diffusers’ patched no_init_weights() context.

@sayakpaul
Member

Not sure how to link files

Yes, we can link files and I think it's better this way. For example, it's much better to refer to specific lines like

def get_parameter_device(parameter: torch.nn.Module) -> torch.device:

instead of plain text.

Overall, I don't think the explanation you provided in the above comment is that helpful. We need some specific (preferably very minimal) code snippet, with and without that change, to better understand what's happening and why.

For this kind of PR, it's expected that contributors will take some time to understand the library code.

@plugyawn
Author

Hi @sayakpaul! My bad, I'll update the PR today.

@sayakpaul
Member

@plugyawn do you want to take another crack?

@plugyawn
Author

Yess, I'm reading through the diffusers codebase in more detail (I had used it quite a bit but not dived as deep as I had for transformers, hehe), and waiting for some compute to come through (I ran out, should be back by this week).

Thank you so much for waiting! I'll update the PR by this week.

@plugyawn
Author

plugyawn commented Mar 30, 2026

@sayakpaul, I removed the old no_init_weights() change; as you said, there is a workaround for now.
The nested-init issue is now handled locally in AutoencoderRAE.

The diff is pretty different from the last round, so if you have time, it'd be great if you could review it again! Sorry for the wait!

@sayakpaul
Member

@kashif possible to do a round of review?

@plugyawn
Author

plugyawn commented Mar 30, 2026

Also, I'm pretty certain the merged RAE implementation has some faulty testing.
I'm not sure how it works with the no_init_weights() problem unfixed; the tests also use a fake encoder. Might have gone under the radar?

@plugyawn
Author

plugyawn commented Apr 2, 2026

@kashif, @sayakpaul would you prefer a separate PR (or here) to fix the testing issues on the RAE?

@kashif
Contributor

kashif commented Apr 2, 2026

yes please, separate!

@plugyawn
Author

plugyawn commented Apr 2, 2026

I'll take this as a yes for a separate PR!

@plugyawn plugyawn requested a review from sayakpaul April 8, 2026 16:42
@sayakpaul
Member

@kashif could you do an initial review please?

@plugyawn
Author

@kashif @sayakpaul is there a timeline on this?

@plugyawn
Author

plugyawn commented Apr 19, 2026

@sayakpaul, just a small bump. Have a free week, would love to get this to a mergeable state.

@sayakpaul
Member

@claude could you do a review here?

@github-actions
Contributor

github-actions bot commented Apr 19, 2026

Claude finished @sayakpaul's task in 2m 37s


Code Review: RAE Diffusion Transformer Pipeline

  • Read review guidelines (AGENTS.md, models.md, modular.md, parity-testing)
  • Read all changed files under src/diffusers/
  • Review model code (transformer_rae_dit.py)
  • Review pipeline code (pipeline_rae_dit.py, pipeline_output.py, __init__.py)
  • Review autoencoder changes (autoencoder_rae.py)
  • Review init/export changes
  • Post comprehensive review

Overall this is a solid contribution with good inference parity numbers. The code is clean and readable. Below are the issues I found, grouped by severity.


Must-fix

1. Attention does not follow the diffusers attention pattern

RAEDiTAttention (transformer_rae_dit.py:154-199) calls F.scaled_dot_product_attention directly. Per models.md, all new models must use dispatch_attention_fn via a processor class and inherit AttentionModuleMixin. Every other recent transformer in the repo (Flux, Wan, Bria, LTX, PRX, etc.) follows this pattern. This is required for attention backend switching (e.g. xformers, flash-attn, ring attention for sequence parallelism).

The fix is to:

  • Create a RAEDiTAttnProcessor class with _attention_backend / _parallel_config attributes and move the forward logic there, using dispatch_attention_fn
  • Make RAEDiTAttention inherit from (nn.Module, AttentionModuleMixin), set _default_processor_cls = RAEDiTAttnProcessor, and delegate to self.processor(...) in forward()

See transformer_flux.py:75-139 (processor) and transformer_flux.py:275-325 (attention class) for the canonical reference.
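The delegation shape being requested can be sketched with pure-Python stand-ins; the RAEDiTAttnProcessor/RAEDiTAttention names come from the review itself, but the mixin, the dispatch call, and the compute() placeholder below are illustrative only, not diffusers code:

```python
class RAEDiTAttnProcessor:
    """Holds the attention math; in real code this is where
    dispatch_attention_fn would be called so backends can be swapped."""
    _attention_backend = None
    _parallel_config = None

    def __call__(self, attn, hidden_states):
        # Real code: project to q/k/v, then dispatch to the active backend.
        return attn.compute(hidden_states)


class RAEDiTAttention:
    """Stands in for an nn.Module + AttentionModuleMixin subclass."""
    _default_processor_cls = RAEDiTAttnProcessor

    def __init__(self):
        self.processor = self._default_processor_cls()

    def compute(self, hidden_states):
        return hidden_states  # placeholder for the SDPA output

    def forward(self, hidden_states):
        # Delegate to the processor; swapping self.processor swaps backends.
        return self.processor(self, hidden_states)


attn = RAEDiTAttention()
assert attn.forward([1, 2, 3]) == [1, 2, 3]
```

The point of the indirection is that the module owns the weights while the processor owns the attention computation, so the computation can be replaced per-instance without touching the module.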


2. Missing _no_split_modules on RAEDiT2DModel

RAEDiT2DModel (transformer_rae_dit.py:312) is missing _no_split_modules. Every other transformer model in the repo declares this for correct device placement with accelerate. Should be:

_no_split_modules = ["RAEDiTBlock"]

3. unpatchify uses torch.einsum — potential torch.compile graph break

transformer_rae_dit.py:470:

hidden_states = torch.einsum("nhwpqc->nchpwq", hidden_states)

Per models.md: "Avoid graph breaks for torch.compile compatibility." In practice einsum is compile-safe, and this exact pattern is already used by other models (DiT, SD3, PixArt), so it is acceptable here; models.md's stricter warning concerns NumPy operations in forward implementations, and string-based einsum is pure PyTorch. This is a soft pass, flagged only for awareness.


Should-fix

4. wo_shift path allocates unnecessary zero tensors

In RAEDiTBlock.forward() (transformer_rae_dit.py:264-267):

if shift_msa is None:
    shift_msa = torch.zeros_like(scale_msa)
if shift_mlp is None:
    shift_mlp = torch.zeros_like(scale_mlp)

When wo_shift=True, this allocates two zero tensors every forward pass just to add zero. The subsequent modulation (norm * (1 + scale) + shift) with shift=0 is mathematically just norm * (1 + scale). Consider branching the modulation logic instead:

norm_hidden_states = self.norm1(hidden_states)
if shift_msa is not None:
    norm_hidden_states = norm_hidden_states * (1 + scale_msa) + shift_msa
else:
    norm_hidden_states = norm_hidden_states * (1 + scale_msa)

5. GaussianFourierEmbedding.W naming convention

transformer_rae_dit.py:76: The parameter is named self.W (uppercase). While this matches the reference implementation, diffusers convention uses snake_case for parameters/attributes. Consider renaming to self.weight or self.fourier_weight.

6. _prepare_timesteps in pipeline is boilerplate that could use the scheduler directly

pipeline_rae_dit.py:117-133: The _prepare_timesteps method duplicates timestep preparation logic. Since FlowMatchEulerDiscreteScheduler already provides timesteps as tensors from set_timesteps(), the conversion logic for MPS/NPU is only needed if raw floats are passed. This works but is unnecessarily complex — the timestep from self.scheduler.timesteps is already a tensor on the correct device.

7. Missing copyright header in transformer_rae_dit.py

The new file transformer_rae_dit.py has no Apache 2.0 license header, unlike pipeline_output.py and autoencoder_rae.py. All new source files in diffusers should include the standard header.

8. pos_embed registered as persistent=False but used for sincos embeddings

transformer_rae_dit.py:403:

self.register_buffer("pos_embed", pos_embed.unsqueeze(0).float(), persistent=False)

Non-persistent buffers are not saved with state_dict(). If this is intentional (regenerated from config on load), that's fine. But if users save/load with save_pretrained/from_pretrained, the buffer will be regenerated correctly only if the code re-runs __init__. This seems correct here since the pos embed is deterministic from config — just confirming the intent.


Nits / Minor

9. Pipeline __init__ has unused _guidance_scale attribute

pipeline_rae_dit.py:55: self._guidance_scale = 1.0 is set in __init__ but immediately overwritten in __call__ at line 191. This is a common pattern in diffusers pipelines, so it's fine — just noting it's redundant initialization.

10. _expand_conditioning_tokens is a module-level helper

This was flagged in a previous review. While sayakpaul suggested inlining small helpers, this function is used in 3 places (RAEDiTBlock.forward for MSA and MLP expansion, RAEDiTFinalLayer.forward), so keeping it as a helper is reasonable given its 20+ lines. The current approach is acceptable.

11. _swap_swiglu_projection_halves accesses .data directly

transformer_rae_dit.py:60-68: Directly manipulating .data bypasses autograd. Since this is called only during initialize_weights() (construction time), not during forward, it's safe. Just noting the pattern.

12. pipeline_rae_dit.py:232 — timestep normalization coupling

timestep_input = timestep_input / self.scheduler.config.num_train_timesteps

This hardcodes the assumption that the scheduler provides timesteps in [0, num_train_timesteps] range and the model expects [0, 1]. If a different scheduler is used, this will silently produce wrong results. Consider adding a comment or assertion documenting this coupling.


Init / Export changes

The lazy import changes in __init__.py, models/__init__.py, pipelines/__init__.py, transformers/__init__.py, and dummy_pt_objects.py all look correct and follow alphabetical ordering. The _import_structure entries match the TYPE_CHECKING imports.

Autoencoder changes (autoencoder_rae.py)

The _preserve_init_return_tensors() context manager is a clean localized solution to the no_init_weights() issue discussed in the PR comments. It wraps torch.nn.init.* functions to always return the tensor, avoiding the NoneType error when constructing transformers models under diffusers' weight-init suppression. The approach is sound — it doesn't modify global behavior permanently and restores original functions in the finally block.
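A minimal sketch of what such a localized context manager might look like (the name, the wrapped-function list, and the SimpleNamespace demo are assumptions for illustration; the PR's actual helper may differ):

```python
import contextlib
import types


@contextlib.contextmanager
def preserve_init_return_tensors(init_module, fn_names):
    """Temporarily wrap the named init functions so they always return
    their first argument, then restore the originals on exit.
    Illustrative sketch only, not the PR's exact implementation."""
    originals = {name: getattr(init_module, name) for name in fn_names}

    def make_wrapper(fn):
        def wrapper(tensor, *args, **kwargs):
            fn(tensor, *args, **kwargs)  # keep skip-init side effects
            return tensor                # restore the return contract
        return wrapper

    try:
        for name, fn in originals.items():
            setattr(init_module, name, make_wrapper(fn))
        yield
    finally:
        for name, fn in originals.items():
            setattr(init_module, name, fn)


# Demo with a fake "torch.nn.init" whose stub returns None:
fake_init = types.SimpleNamespace(trunc_normal_=lambda t, *a, **k: None)
t = object()
assert fake_init.trunc_normal_(t) is None       # broken contract
with preserve_init_return_tensors(fake_init, ["trunc_normal_"]):
    assert fake_init.trunc_normal_(t) is t      # contract restored
assert fake_init.trunc_normal_(t) is None       # originals restored
```

The try/finally structure is what makes the fix safe: even if construction raises inside the with-block, the original functions are put back.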

Summary

The main blocker is issue #1 (attention pattern). This is a hard requirement per models.md and is needed for the model to work with diffusers' attention backend infrastructure. Issue #2 (_no_split_modules) is also important for correct multi-GPU behavior. The remaining items are improvements that could be addressed incrementally.


@plugyawn
Author

Good bot.

@github-actions github-actions bot added documentation Improvements or additions to documentation models tests utils pipelines examples size/L PR with diff > 200 LOC labels Apr 19, 2026
@github-actions github-actions bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 20, 2026