[Video Generation] Add VAE encoder support to AutoencoderKLLTXVideo#3829

Open: goyaladitya05 wants to merge 5 commits into openvinotoolkit:master from goyaladitya05:feature/ltx-vae-encoder-support

@goyaladitya05 goyaladitya05 commented May 8, 2026

This PR implements AutoencoderKLLTXVideo::encode(), enabling Image-to-Video workflows where a conditioning frame is encoded into latent space and passed to the diffusion pipeline.

This is phase 1 of Image-to-Video support in the LTX Video Generation pipeline.

Changes

  • The VAE encoder now compiles and runs. compile() and reshape() previously left the encoder path as // TODO: for img2video; both now handle it.
  • encode(video, generator) - takes a [B, C, F, H, W] video tensor, runs the encoder model, and returns a normalized latent ready for the diffusion transformer. Handles both model output variants:
    • latent_parameters - samples z from the predicted distribution
    • latent_sample - uses the output directly (no sampling needed)
  • Added 7 tests which cover construction, error messages, output shape, determinism with the same seed, and variation across seeds. Tests that require the encoder skip if vae_encoder is not present in the test model.
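The two encoder output variants and the normalization step described above can be illustrated with a small standard-library sketch. It assumes the common diagonal-Gaussian convention, z = mean + exp(0.5 * logvar) * eps, and a per-channel (x - mean) * scaling_factor / std normalization as the inverse of the decode-side denormalization. The PR's actual formulas are not shown in this thread, so `sample_latent`, `normalize_latents`, `denormalize_latents`, and every parameter below are hypothetical names used for illustration only.

```python
import math
import random


def sample_latent(mean, logvar, seed):
    """Draw z = mean + exp(0.5 * logvar) * eps, with eps ~ N(0, 1).

    Assumption: this mirrors the latent_parameters variant; the seeded
    random.Random is a stand-in for the generator argument of encode().
    """
    rng = random.Random(seed)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, logvar)]


def normalize_latents(latent, batch, channels, ch_mean, ch_std, scaling_factor=1.0):
    """Apply (x - mean[c]) * scaling_factor / std[c] on a flat [B, C, spatial] buffer."""
    spatial = len(latent) // (batch * channels)
    out = list(latent)
    for b in range(batch):
        for c in range(channels):
            start = (b * channels + c) * spatial  # first element of channel c in batch b
            for i in range(start, start + spatial):
                out[i] = (out[i] - ch_mean[c]) * scaling_factor / ch_std[c]
    return out


def denormalize_latents(latent, batch, channels, ch_mean, ch_std, scaling_factor=1.0):
    """Inverse of normalize_latents: x * std[c] / scaling_factor + mean[c]."""
    spatial = len(latent) // (batch * channels)
    out = list(latent)
    for b in range(batch):
        for c in range(channels):
            start = (b * channels + c) * spatial
            for i in range(start, start + spatial):
                out[i] = out[i] * ch_std[c] / scaling_factor + ch_mean[c]
    return out
```

In this sketch the latent_sample variant would skip `sample_latent` and pass the encoder output straight to `normalize_latents`. Seeding the sampler is what makes the "same seed, same latent" property testable.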

Testing

Encode-decode round trip on an image (512×512, CPU):

| Step | Value |
|---|---|
| Input | [1, 3, 1, 512, 512] |
| Latent | [1, 128, 1, 16, 16] (32× spatial compression) |
| Output | [1, 1, 512, 512, 3] |
[Image: compare_overture (2)]
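The shape arithmetic behind the table can be checked with a short sketch. The 32× spatial ratio and 128 latent channels come from the numbers reported above; the temporal ratio of 8 and the (F - 1) // ratio + 1 frame rule are assumptions not stated in this PR, and they do not affect the single-frame case. `latent_shape` is a hypothetical helper, not part of the library.

```python
def latent_shape(video_shape, latent_channels=128, spatial_ratio=32, temporal_ratio=8):
    """Map a [B, C, F, H, W] video shape to the expected latent shape.

    temporal_ratio is an assumption; for F == 1 the result is 1 frame
    regardless of its value.
    """
    b, _c, f, h, w = video_shape
    if h % spatial_ratio or w % spatial_ratio:
        raise ValueError(f"H and W must be multiples of {spatial_ratio}, got {h}x{w}")
    latent_frames = (f - 1) // temporal_ratio + 1
    return [b, latent_channels, latent_frames, h // spatial_ratio, w // spatial_ratio]
```

For the 512×512 input above, 512 / 32 = 16 in each spatial dimension, which matches the reported latent. Note that the decoded output in the table is laid out as [B, F, H, W, C], unlike the channel-first input.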

Checklist:

  • This PR follows GenAI Contributing guidelines.
  • Tests have been updated or added to cover the new code.
  • This PR fully addresses the ticket.
  • I have made corresponding changes to the documentation. No changes were necessary.

Copilot AI review requested due to automatic review settings May 8, 2026 16:34
@github-actions github-actions Bot added the labels category: Python API (Python API for GenAI), category: CPP API (Changes in GenAI C++ public headers), category: GGUF (GGUF file reader), category: video generation May 8, 2026

Copilot AI left a comment

Pull request overview

This PR adds VAE encoder support to the LTX video autoencoder, enabling encoding of a conditioning frame/video into latent space for upcoming Image-to-Video workflows.

Changes:

  • Implemented encoder compilation/reshape and added AutoencoderKLLTXVideo::encode() with support for latent_parameters sampling and latent_sample passthrough + normalization.
  • Exposed encode() in Python bindings.
  • Added Python tests covering encoder construction, error paths, output shape, and seed determinism/variation.
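The "skip when the encoder is missing" behaviour of these tests can be sketched with the standard library. The `vae_encoder` directory name comes from the PR description; the helper name `require_vae_encoder` and the use of `unittest.SkipTest` are assumptions chosen to keep the example self-contained, and the PR's tests may use a different mechanism.

```python
import unittest
from pathlib import Path


def require_vae_encoder(model_dir):
    """Return the encoder directory, or skip the calling test if it is absent.

    Hypothetical helper: it only checks that <model_dir>/vae_encoder exists.
    """
    encoder_dir = Path(model_dir) / "vae_encoder"
    if not encoder_dir.is_dir():
        raise unittest.SkipTest(f"vae_encoder not found in {model_dir}")
    return encoder_dir
```

Calling such a helper from one shared fixture, rather than repeating the check in each test, keeps the skip condition in a single place.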

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| tests/python_tests/test_video_generation.py | Adds encoder-focused tests (construction, error messages, shape, determinism/seed variation). |
| src/python/py_video_generation_models.cpp | Exposes AutoencoderKLLTXVideo.encode() via pybind11 with GIL release. |
| src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp | Implements encoder compile/reshape and encode() including sampling + latent normalization. |
| src/cpp/include/openvino/genai/video_generation/autoencoder_kl_ltx_video.hpp | Adds public encode() declaration and stores encoder output name in class state. |

Comment thread src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp
Comment thread src/python/py_video_generation_models.cpp Outdated
Comment thread tests/python_tests/test_video_generation.py
Comment thread src/cpp/include/openvino/genai/video_generation/autoencoder_kl_ltx_video.hpp Outdated
@goyaladitya05 goyaladitya05 marked this pull request as ready for review May 8, 2026 17:49
Copilot AI review requested due to automatic review settings May 8, 2026 17:49
@goyaladitya05 goyaladitya05 moved this from Todo to In progress in LTX Video Image-to-Video Support May 8, 2026
@goyaladitya05 goyaladitya05 moved this from In progress to Pull Requests in LTX Video Image-to-Video Support May 8, 2026
@goyaladitya05 goyaladitya05 moved this from Pull Requests to In progress in LTX Video Image-to-Video Support May 8, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp
Comment thread src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp Outdated
Copilot AI review requested due to automatic review settings May 14, 2026 10:03

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment on lines +290 to +295
// inverse of denormalize_latents used in the decode path
const ov::Shape shape = latent.get_shape();
OPENVINO_ASSERT(shape.size() == 5, "Encoder output expected to be [B, C, F, H, W]");
OPENVINO_ASSERT(latent.get_element_type() == ov::element::f32,
"Latent normalization requires f32, got ", latent.get_element_type());
const size_t B = shape[0], C = shape[1], spatial = shape[2] * shape[3] * shape[4];

@goyaladitya05 goyaladitya05 May 14, 2026

Interesting. This should also affect image generation when models are exported in full fp16 and bf16 using GPU. I'll verify this once.

@goyaladitya05 goyaladitya05 force-pushed the feature/ltx-vae-encoder-support branch 2 times, most recently from 43ffd94 to 5dab8e8 May 14, 2026 16:05
@goyaladitya05 goyaladitya05 requested a review from Copilot May 14, 2026 16:05

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp Outdated
Comment thread tests/python_tests/test_video_generation.py Outdated
@goyaladitya05

cc @sgonorov @likholat

@goyaladitya05 goyaladitya05 changed the title from "[Video Generation] Add VAE encoder support to AutoencoderKLLTXVideo" to "[Video Generation] Add VAE encoder support for Video Generation" May 14, 2026
@goyaladitya05 goyaladitya05 changed the title from "[Video Generation] Add VAE encoder support for Video Generation" to "[Video Generation] Add VAE encoder support to AutoencoderKLLTXVideo" May 14, 2026
Copilot AI review requested due to automatic review settings May 15, 2026 19:46

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp:295

  • encode() asserts the latent is f32 and normalizes via latent.data() in-place. On GPU (or with inference-precision hints) the encoder output/latent may be f16/bf16, which will currently fail at runtime. To make encode() robust across devices, consider either converting the sampled/copied latent to f32 before normalization, or performing normalization in a dtype-generic way similar to denormalize_latents() in video_generation/ltx_pipeline.hpp.
    OPENVINO_ASSERT(shape.size() == 5, "Encoder output expected to be [B, C, F, H, W]");
    OPENVINO_ASSERT(latent.get_element_type() == ov::element::f32,
        "Latent normalization requires f32, got ", latent.get_element_type());
    const size_t B = shape[0], C = shape[1], spatial = shape[2] * shape[3] * shape[4];

Comment thread src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp
Comment thread tests/python_tests/test_video_generation.py
Comment thread src/python/openvino_genai/py_openvino_genai.pyi Outdated
@goyaladitya05 goyaladitya05 force-pushed the feature/ltx-vae-encoder-support branch from 169aaa9 to 995c396 May 15, 2026 20:32

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp:295

  • encode() asserts that the produced latent must be f32 and then normalizes it via latent.data(). On GPU / fp16 or bf16 exports this will currently fail even though inference succeeds, and it blocks Image-to-Video usage on those configurations. Consider making the normalization path dtype-generic (f16/bf16/f32) or explicitly converting the latent to f32 before normalization, similar to the robustness work tracked in #3865.
    const ov::Shape shape = latent.get_shape();
    OPENVINO_ASSERT(shape.size() == 5, "Encoder output expected to be [B, C, F, H, W]");
    OPENVINO_ASSERT(latent.get_element_type() == ov::element::f32,
        "Latent normalization requires f32, got ", latent.get_element_type());
    const size_t B = shape[0], C = shape[1], spatial = shape[2] * shape[3] * shape[4];

Comment thread src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp
Comment thread src/cpp/include/openvino/genai/video_generation/autoencoder_kl_ltx_video.hpp Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp:295

  • encode() hard-requires the latent tensor to be f32 for normalization (latent.get_element_type() == f32). This makes encode() unusable for fp16/bf16 IR exports or GPU inference-precision hints, even though the decode path already supports f16/bf16 (denormalize_latents in ltx_pipeline.hpp is dtype-generic). Consider normalizing in a dtype-generic way (e.g., via OpenVINO ops/broadcast like denormalize_latents does) or explicitly converting the latent to f32 before normalization and documenting the output dtype.
    OPENVINO_ASSERT(shape.size() == 5, "Encoder output expected to be [B, C, F, H, W]");
    OPENVINO_ASSERT(latent.get_element_type() == ov::element::f32,
        "Latent normalization requires f32, got ", latent.get_element_type());
    const size_t B = shape[0], C = shape[1], spatial = shape[2] * shape[3] * shape[4];
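One of the two options raised in this comment, widening the latent before normalization, could look roughly like the sketch below. It decodes a raw little-endian buffer of f16, bf16, or f32 values into Python floats, which stand in for the f32 working precision. The function name `widen_latent` and the string dtype tags are hypothetical; this is not the OpenVINO tensor API, only an illustration of the conversion itself.

```python
import struct


def widen_latent(raw, dtype):
    """Decode a little-endian f16, bf16, or f32 buffer into Python floats."""
    if dtype == "f32":
        return [v[0] for v in struct.iter_unpack("<f", raw)]
    if dtype == "f16":
        # struct's "e" format is IEEE 754 half precision.
        return [v[0] for v in struct.iter_unpack("<e", raw)]
    if dtype == "bf16":
        # bf16 is the upper 16 bits of an f32: shift left by 16 and reinterpret.
        return [struct.unpack("<f", struct.pack("<I", bits[0] << 16))[0]
                for bits in struct.iter_unpack("<H", raw)]
    raise ValueError(f"unsupported latent dtype: {dtype}")
```

With a conversion step like this in front of the per-channel loop, the f32 assertion would no longer reject f16 or bf16 encoder outputs, at the cost of one extra copy and an output dtype that should be documented.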

Comment thread src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp

@sgonorov sgonorov left a comment

Can also refactor tests a little bit - extract more fixtures for less boilerplate, but overall looks good.
