Commit 82d3b3e

Merge pull request #67 from ibelem/z-image
blog - from-u-net-to-dit-z-image-turbo-runs-in-your-browser
2 parents d1a1cf2 + cbed8fb commit 82d3b3e
---
title: 'From U-Net to DiT: Z-Image Turbo Runs in Your Browser'
description:
  'Z-Image Turbo is a 6B-parameter Scalable Single-Stream Diffusion Transformer (S3-DiT) running entirely
  in the browser via WebGPU. Intel Web Platform Engineering adapted the model through ONNX conversion,
  INT4/FP16 quantization, and operator fusion — achieving a 7x size reduction and up to 7x inference
  speedup for real-time, on-device text-to-image generation on AI PC hardware.'
date: 2026-04-16
authors:
  - name: Jianhui Dai
    link: https://github.com/daijh
  - name: Wanming Lin
    link: https://github.com/honry
  - name: Belem Zhang
    link: https://github.com/ibelem
---

import { TopContent } from '../../../app/_components/authors'

<TopContent lang={props.params.lang} {...metadata} />
Over the past few years, the Intel Web Platform Engineering team has pushed the boundary of what is possible in the browser for generative AI. We were among the first to run Stable Diffusion Turbo and SDXL Turbo fully in-browser using WebGPU and WebNN — no server, no cloud, just your device. Today we are sharing the next chapter: Z-Image-Turbo running natively in the browser via WebGPU on AI PC hardware, a generational leap in model quality, architecture, and capability.

This required solving a new class of problems. Earlier models were U-Net based; Z-Image-Turbo is a Scalable Single-Stream Diffusion Transformer (S3-DiT) — a fundamentally different architecture that demanded a fresh approach to model conversion, quantization, and operator fusion for the web runtime.
## Z-Image-Turbo at a Glance

Z-Image-Turbo is an open-weights text-to-image model built for high-quality, on-device generation on consumer AI hardware — proving that browser-native image generation can deliver modern prompt fidelity and visual quality without a cloud round-trip.

Where earlier pipelines relied on U-Net, a convolutional backbone tuned for local spatial features, Z-Image-Turbo adopts a Scalable Single-Stream Diffusion Transformer (S3-DiT) that processes text and image tokens in a single unified attention stream. The full pipeline chains three components: a **Qwen3-4B text encoder** for prompt understanding, the **S3-DiT backbone** for latent denoising, and a **FLUX VAE decoder** for pixel reconstruction. Because compute shifts from convolutions to Transformer operators — attention and large matrix multiplies — kernel- and graph-level optimization becomes the primary deployment lever.

![Z-Image Turbo architecture overview](/blog/z-image-turbo/pipeline.png)

This design didn't emerge in a vacuum. As the table below shows, the broader ecosystem has moved decisively toward DiT-family architectures. Z-Image-Turbo's S3-DiT follows that trend — and at 6B parameters, it represents the current state of the art among open models optimized for on-device deployment.
| Model | Release | Size (Parameters) | Architecture |
| --- | --- | --- | --- |
| Stable Diffusion 1.5 | 2022 | ~860M | Latent Diffusion (U-Net) |
| Stable Diffusion XL (SDXL) | 2023 | 6.6B | Latent Diffusion (U-Net) |
| Stable Diffusion 3 (Medium/Large) | 2024 | 2B / 8B | Multimodal Diffusion Transformer (MMDiT) |
| FLUX.1 [dev] / [schnell] | 2024 | 12B | Hybrid DiT (Double-Stream + Single-Stream) |
| Qwen-Image | 2025 | 20B | Multimodal Diffusion Transformer (MMDiT) |
| Z-Image-Turbo | 2025 Nov | 6B | Single-Stream Diffusion Transformer (S3-DiT) |

The rest of this post explains what we did to make it viable in the browser.
## Z-Image Turbo in the Browser: Deployment and Optimization

Deploying Z-Image Turbo in the browser requires adapting the native diffusion transformer to the web under strict constraints on model format, memory footprint, and execution efficiency. In this section, we describe the key deployment and optimization steps that make this adaptation possible.

### Model Conversion and Optimization

Adapting the native transformer-based model to the web requires a series of model preparation steps, including format conversion, memory reduction, and execution-oriented optimization.

#### Step 1: ONNX Conversion

We first convert the native transformer-based model into ONNX format so it can be executed by ONNX Runtime Web using the WebGPU execution provider. Compared to U-Net architectures, transformer models require special handling to preserve the unified token sequence and attention structure during export.
#### Step 2: Size Reduction via Quantization

Running a modern diffusion transformer in the browser requires aggressive model compression to fit within the following key constraints of web runtimes:

- **ONNX Runtime Web (Wasm):** limits model size to 4 GB per session
- **Chrome:** limits the GPU process sandbox's access to physical memory on Windows

To meet these constraints without sacrificing image quality, we apply a layered quantization strategy that combines aggressive weight compression with mixed-precision execution.
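To see why aggressive compression is unavoidable, a quick back-of-the-envelope estimate of raw weight size makes the 4 GB per-session limit concrete. The sketch below uses the 6B parameter count from the table above and deliberately ignores activations, KV caches, and quantization metadata such as scales, so real checkpoints are somewhat larger:

```python
# Rough weight-footprint estimate for a 6B-parameter model at different
# precisions. Activations and quantization metadata are ignored.
PARAMS = 6_000_000_000

def weight_gb(bits_per_weight: float) -> float:
    # bits -> bytes -> gigabytes (decimal GB)
    return PARAMS * bits_per_weight / 8 / 1e9

for fmt, bits in [("fp32", 32), ("fp16", 16), ("int4", 4)]:
    gb = weight_gb(bits)
    verdict = "fits" if gb < 4 else "exceeds"
    print(f"{fmt}: {gb:.1f} GB ({verdict} the 4 GB per-session limit)")
```

Only the INT4 weights (about 3 GB) clear the limit; fp16 alone (12 GB) does not, which is why INT4 weight quantization is paired with fp16 execution rather than replaced by it.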
**INT4 Quantization**

We quantize MatMul weights to INT4 and execute them using the `MatMulNBits` operator. For token embeddings (`embed_tokens`), we apply `GatherBlockQuantized`, which preserves lookup semantics while significantly reducing the weight footprint.
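To illustrate the idea behind `MatMulNBits`-style weights, here is a minimal sketch of symmetric block-wise 4-bit quantization. The block size, rounding, and scale layout are illustrative only; the actual ONNX Runtime kernels use packed 4-bit layouts and differ in detail:

```python
# Minimal sketch of symmetric block-wise INT4 weight quantization.
# Each block of weights shares one float scale; values round into [-8, 7].

def quantize_block(block):
    scale = max(abs(w) for w in block) / 7 or 1.0  # guard all-zero blocks
    q = [max(-8, min(7, round(w / scale))) for w in block]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.55, 0.70, -0.07]
q, scale = quantize_block(weights)
restored = dequantize_block(q, scale)
print(q)
print([round(w, 3) for w in restored])
```

Each weight shrinks from 32 bits to 4 bits plus a shared per-block scale, and the reconstruction error per weight stays within about half a scale step.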
**FP16 Quantization**

The model is converted from float32 to float16 throughout. A small set of operations remains in float32 to prevent intermediate tensors from exceeding the float16 dynamic range, which is critical for maintaining numerical stability in the long attention sequences of S3-DiT.
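To see why a few operations must stay in float32, the following sketch simulates half-precision rounding (a simplification that ignores subnormals and some rounding corner cases) and shows how accumulated intermediates blow past fp16's finite ceiling of 65504:

```python
# Simulate IEEE half-precision (fp16) rounding to show why some ops
# must stay in fp32: fp16 overflows past 65504, a range that long
# sequences of accumulated intermediates can exceed.
import math

def to_fp16(x: float) -> float:
    # Sketch: 11 significant bits, finite values up to 65504.
    if x == 0.0:
        return 0.0
    if abs(x) > 65504:
        return math.copysign(math.inf, x)
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= |m| < 1
    m = round(m * 2048) / 2048    # keep 11 bits of significand
    return math.ldexp(m, e)

print(to_fp16(60000.0))   # still finite
print(to_fp16(70000.0))   # overflows to inf

# Accumulating many moderate values in fp16 overflows even though each
# addend is tiny relative to fp32's range.
acc = 0.0
for _ in range(300):
    acc = to_fp16(acc + 250.0)
print(acc)
```

Keeping the accumulation (and a handful of range-sensitive ops) in float32 avoids exactly this failure mode while the bulk of the compute stays in float16.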
#### Step 3: Operator Fusion

To achieve practical throughput on WebGPU, we apply operator fusion to reduce GPU dispatch overhead and improve memory locality. Executing multiple transformer operations within a single dispatch makes efficient use of hardware-level operator support and delivers substantial end-to-end performance gains.

We fuse the following operator groups for the Z-Image Turbo web deployment:
| Fused Operator | Category | Performance Benefit |
| --- | --- | --- |
| MatMulNBits | INT4 Linear | Reduces weight memory and bandwidth |
| GroupQueryAttention | Attention | Fused QKV dispatch |
| MultiHeadAttention | Attention | Cross-modal fusion efficiency |
| RotaryEmbedding | Position Encoding | Eliminates separate kernel overhead |
| LayerNorm / SimplifiedLayerNorm | Normalization | Reduces memory round-trips |
| GatherBlockQuantized | Embedding | INT4 lookup efficiency |
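As one example of the math these fused kernels cover, here is a minimal pure-Python sketch of rotary position embedding (RoPE) applied to a query vector; in the fused `RotaryEmbedding` path this rotation happens inside the attention dispatch rather than as a separate kernel launch. The vector dimension and frequency base are illustrative:

```python
import math

def rope(vec, pos, base=10000.0):
    # Rotate consecutive (even, odd) pairs by a position-dependent angle;
    # the rotation encodes absolute position into queries and keys.
    out = []
    dim = len(vec)
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

q = [1.0, 0.0, 1.0, 0.0]
print(rope(q, pos=0))   # position 0: no rotation
```

Because the operation is a cheap elementwise rotation, running it as its own GPU pass wastes a dispatch and a memory round-trip, which is precisely what the fusion eliminates.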
#### Summary

In summary, quantization reduces model complexity by **54%** and shrinks model size by **7x**, while operator fusion delivers up to **7x** inference speedup — making real-time, in-browser transformer-based image generation feasible on AI PC hardware.
### End-to-End Inference Pipeline in the Browser

With the optimized model in place, the remaining challenge is executing the full diffusion workflow efficiently inside the browser. In-browser inference requires carefully orchestrating multiple model components under tight constraints on memory movement and GPU dispatch overhead.

The following figure illustrates the end-to-end inference pipeline used to run Z-Image Turbo entirely on-device using WebGPU. The pipeline consists of four main stages: text encoding, iterative denoising, image decoding, and image rendering. The core diffusion process runs inside a tight denoising loop, where the transformer model and scheduler are executed repeatedly across diffusion timesteps.

![Z-Image Turbo end-to-end inference pipeline](/blog/z-image-turbo/z-image-web-pipe.jpg)

Several characteristics are critical for achieving practical performance in the browser:

- The denoising loop forms the performance-critical path and benefits most from the model-level optimizations described earlier.
- WebGPU enables the complete diffusion pipeline to run entirely on-device as a single, end-to-end browser inference workflow.
- I/O binding is used across stages to reduce unnecessary memory copies between model executions.
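The shape of that denoising loop can be sketched as follows. This is a toy Euler-style schedule with a stand-in `denoise` function in place of the S3-DiT session; the step count, scheduler, and names are illustrative, not the demo's actual code:

```python
# Toy sketch of the scheduler-driven denoising loop. `denoise` stands in
# for one S3-DiT forward pass; a real pipeline would run the ONNX session
# on GPU-resident tensors, reusing buffers across steps via I/O binding.

def denoise(latent, t):
    # Stand-in velocity prediction; real output comes from the transformer.
    return list(latent)

def run_denoising_loop(latent, num_steps=4):
    # Timesteps descend from pure noise (t=1.0) toward the clean image (t=0).
    timesteps = [1.0 - i / num_steps for i in range(num_steps)]
    for i, t in enumerate(timesteps):
        t_next = timesteps[i + 1] if i + 1 < len(timesteps) else 0.0
        v = denoise(latent, t)                                  # model forward pass
        latent = [x + vi * (t_next - t) for x, vi in zip(latent, v)]  # Euler step
    return latent

noise = [1.0, -2.0, 0.5]
print(run_denoising_loop(noise))  # each Euler step shrinks the toy latent
```

Every iteration of this loop is one full transformer forward pass, which is why the model-level optimizations above, and keeping the latent on the GPU between steps, dominate end-to-end latency.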
108+
109+
## Hardware Target: Intel Core Ultra Series 3 (Panther Lake)
110+
111+
Our optimized pipeline is validated on Intel Core Ultra Series 3 (Panther Lake) AI PC devices, where the WebGPU backend delivers its strongest results. The integrated GPU architecture and dedicated NPU in these chips align well with the fused-operator dispatch pattern of our pipeline — meaning users on current-generation AI PCs get a genuinely fast, responsive generation experience without leaving the browser.
112+
113+
This represents the convergence of two trends our team has tracked for years: increasingly capable client-side ML hardware, and an increasingly powerful web ML stack. Z-Image Turbo on WebGPU is a demonstration of what is now possible at their intersection.
114+
115+
## Try It Live
116+
117+
The Z-Image Turbo web demo and full open-source implementation are publicly available:
118+
119+
- **Live Demo:** [https://microsoft.github.io/webnn-developer-preview/demos/z-image-turbo/](https://microsoft.github.io/webnn-developer-preview/demos/z-image-turbo/)
120+
121+
![Z-Image Turbo sample output generated in-browser](/blog/z-image-turbo/output.png)
122+
123+
The image above was generated directly in the browser using this demo. No setup required — open the demo in a WebGPU-capable browser on a compatible AI PC and start generating images entirely on-device.
124+
125+
## Conclusion
126+
127+
With Z‑Image Turbo, we demonstrate that state‑of‑the‑art diffusion transformers can run entirely in the browser through fully on‑device inference, without relying on server‑side execution. Enabled by WebGPU‑optimized execution on AI PC, this work bridges the gap between SOTA generative models and practical, private, client‑side web deployment.