[Video Generation] Add VAE encoder support to AutoencoderKLLTXVideo #3829
goyaladitya05 wants to merge 5 commits into
Pull request overview
This PR adds VAE encoder support to the LTX video autoencoder, enabling encoding of a conditioning frame/video into latent space for upcoming Image-to-Video workflows.
Changes:
- Implemented encoder compilation/reshape and added `AutoencoderKLLTXVideo::encode()` with support for `latent_parameters` sampling and `latent_sample` passthrough + normalization.
- Exposed `encode()` in Python bindings.
- Added Python tests covering encoder construction, error paths, output shape, and seed determinism/variation.
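For readers unfamiliar with the `latent_parameters` sampling path, the following is a minimal sketch of the usual diagonal-Gaussian draw, assuming the encoder emits a per-element mean and log-variance. The helper name `sample_latent`, the use of `std::mt19937`, and the flat-vector layout are illustrative assumptions and not the PR's actual code. It also shows why a fixed seed gives deterministic output while a different seed gives a different sample, which is what the new tests check.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not from the PR): draws z = mean + exp(0.5 * logvar) * eps,
// with eps ~ N(0, 1), assuming mean and logvar have the same number of elements.
std::vector<float> sample_latent(const std::vector<float>& mean,
                                 const std::vector<float>& logvar,
                                 uint32_t seed) {
    assert(mean.size() == logvar.size());
    std::mt19937 rng(seed);  // the seed fully determines the noise sequence
    std::normal_distribution<float> noise(0.0f, 1.0f);
    std::vector<float> z(mean.size());
    for (size_t i = 0; i < z.size(); ++i) {
        const float stddev = std::exp(0.5f * logvar[i]);
        z[i] = mean[i] + stddev * noise(rng);
    }
    return z;
}
```

Calling `sample_latent(mean, logvar, 42)` twice returns identical vectors, while changing the seed changes the noise and therefore the sample.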
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/python_tests/test_video_generation.py | Adds encoder-focused tests (construction, error messages, shape, determinism/seed variation). |
| src/python/py_video_generation_models.cpp | Exposes AutoencoderKLLTXVideo.encode() via pybind11 with GIL release. |
| src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp | Implements encoder compile/reshape and encode() including sampling + latent normalization. |
| src/cpp/include/openvino/genai/video_generation/autoencoder_kl_ltx_video.hpp | Adds public encode() declaration and stores encoder output name in class state. |
// inverse of denormalize_latents used in the decode path
const ov::Shape shape = latent.get_shape();
OPENVINO_ASSERT(shape.size() == 5, "Encoder output expected to be [B, C, F, H, W]");
OPENVINO_ASSERT(latent.get_element_type() == ov::element::f32,
                "Latent normalization requires f32, got ", latent.get_element_type());
const size_t B = shape[0], C = shape[1], spatial = shape[2] * shape[3] * shape[4];
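To make the `B`, `C`, `spatial` decomposition above concrete, here is a hedged, standalone sketch of per-channel normalization over a contiguous `[B, C, F, H, W]` f32 buffer. It assumes the common convention `x = (x - mean[c]) * scaling_factor / std[c]` as the inverse of denormalization. The function name, the parameter names, and the use of `std::vector` in place of `ov::Tensor` are assumptions for illustration, and the PR's real implementation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical standalone helper (not the PR's code). Normalizes in place:
//   x = (x - mean[c]) * scaling_factor / stddev[c]
// `spatial` is F * H * W, so each (b, c) pair owns one contiguous run of values.
void normalize_latents(std::vector<float>& latent,
                       size_t B, size_t C, size_t spatial,
                       const std::vector<float>& mean,
                       const std::vector<float>& stddev,
                       float scaling_factor) {
    assert(latent.size() == B * C * spatial);
    assert(mean.size() == C && stddev.size() == C);
    for (size_t b = 0; b < B; ++b) {
        for (size_t c = 0; c < C; ++c) {
            float* channel = latent.data() + (b * C + c) * spatial;
            const float scale = scaling_factor / stddev[c];
            for (size_t s = 0; s < spatial; ++s) {
                channel[s] = (channel[s] - mean[c]) * scale;
            }
        }
    }
}
```

Because the layout is channel-major within each batch item, the per-channel statistics only need to be looked up once per `(b, c)` pair.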
Interesting. This should also affect image generation when models are exported in full fp16 and bf16 using GPU. I'll verify this once.
43ffd94 to 5dab8e8
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp:295
- encode() asserts the latent is f32 and normalizes via latent.data() in-place. On GPU (or with inference-precision hints) the encoder output/latent may be f16/bf16, which will currently fail at runtime. To make encode() robust across devices, consider either converting the sampled/copied latent to f32 before normalization, or performing normalization in a dtype-generic way similar to denormalize_latents() in video_generation/ltx_pipeline.hpp.
OPENVINO_ASSERT(shape.size() == 5, "Encoder output expected to be [B, C, F, H, W]");
OPENVINO_ASSERT(latent.get_element_type() == ov::element::f32,
"Latent normalization requires f32, got ", latent.get_element_type());
const size_t B = shape[0], C = shape[1], spatial = shape[2] * shape[3] * shape[4];
169aaa9 to 995c396
995c396 to 4fcfda6
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp:295
- encode() asserts that the produced latent must be f32 and then normalizes it via latent.data(). On GPU / fp16 or bf16 exports this will currently fail even though inference succeeds, and it blocks Image-to-Video usage on those configurations. Consider making the normalization path dtype-generic (f16/bf16/f32) or explicitly converting the latent to f32 before normalization, similar to the robustness work tracked in #3865.
const ov::Shape shape = latent.get_shape();
OPENVINO_ASSERT(shape.size() == 5, "Encoder output expected to be [B, C, F, H, W]");
OPENVINO_ASSERT(latent.get_element_type() == ov::element::f32,
"Latent normalization requires f32, got ", latent.get_element_type());
const size_t B = shape[0], C = shape[1], spatial = shape[2] * shape[3] * shape[4];
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp:295
- `encode()` hard-requires the latent tensor to be `f32` for normalization (`latent.get_element_type() == f32`). This makes `encode()` unusable for fp16/bf16 IR exports or GPU inference-precision hints, even though the decode path already supports f16/bf16 (`denormalize_latents` in `ltx_pipeline.hpp` is dtype-generic). Consider normalizing in a dtype-generic way (e.g., via OpenVINO ops/broadcast like `denormalize_latents` does) or explicitly converting the latent to `f32` before normalization and documenting the output dtype.
OPENVINO_ASSERT(shape.size() == 5, "Encoder output expected to be [B, C, F, H, W]");
OPENVINO_ASSERT(latent.get_element_type() == ov::element::f32,
"Latent normalization requires f32, got ", latent.get_element_type());
const size_t B = shape[0], C = shape[1], spatial = shape[2] * shape[3] * shape[4];
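One of the two options raised in these suppressed comments is to convert the latent to f32 before normalizing. As a rough illustration of that idea for the bf16 case, the sketch below widens raw bf16 bit patterns to f32 in plain C++. The helper names are hypothetical, and in real OpenVINO code one would more likely rely on the library's own `ov::bfloat16` / `ov::float16` types than on manual bit manipulation. This is only meant to show that the conversion is cheap and lossless in the bf16-to-f32 direction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: bf16 is the upper 16 bits of an IEEE-754 f32,
// so widening is a 16-bit left shift followed by a bit-cast.
inline float bf16_bits_to_f32(uint16_t bits) {
    const uint32_t widened = static_cast<uint32_t>(bits) << 16;
    float out;
    std::memcpy(&out, &widened, sizeof(out));
    return out;
}

// Hypothetical helper: copies a bf16 buffer into a new f32 buffer so that
// f32-only code (such as the normalization asserted above) can run on it.
std::vector<float> bf16_buffer_to_f32(const uint16_t* src, size_t count) {
    assert(src != nullptr || count == 0);
    std::vector<float> dst(count);
    for (size_t i = 0; i < count; ++i) {
        dst[i] = bf16_bits_to_f32(src[i]);
    }
    return dst;
}
```

With this approach `encode()` would always return f32, which is the "documenting the output dtype" part of the suggestion; the dtype-generic alternative would keep the input precision instead.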
sgonorov left a comment
Can also refactor tests a little bit - extract more fixtures for less boilerplate, but overall looks good.
This PR implements `AutoencoderKLLTXVideo::encode()`, enabling Image-to-Video workflows where a conditioning frame is encoded into latent space and passed to the diffusion pipeline.
This is phase 1 of Image-to-Video support in the LTX Video Generation pipeline.
Changes
`vae_encoder` is not present in the test model.
Testing
Encode-decode round trip on an image (512×512, CPU):
Checklist: