<p><em>Figure 1. <strong>Illustration of Code2World.</strong> Given a current GUI observation and an action, Code2World predicts the next screenshot via renderable code generation.</em></p>
<p><em>Figure 2. <strong>Left: Illustration of Data Synthesis.</strong> The high-fidelity <em>AndroidCode</em> dataset is curated via <em>constrained initial synthesis</em> and a <em>visual-feedback revision loop</em>, where synthesized HTML is iteratively refined based on rendered visual discrepancies to ensure strict alignment (SigLIP score > 0.9). <strong>Right: Two-stage Model Optimization.</strong> The pipeline progresses from an SFT cold start to <em>Render-Aware Reinforcement Learning (RARL)</em>. Utilizing Group Relative Policy Optimization (GRPO), the model optimizes dual rewards—visual semantic (R<sub>sem</sub>) and action consistency (R<sub>act</sub>)—derived directly from <em>rendered outcomes</em> to enforce structural and logical fidelity.</em></p>
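<p>As a rough illustration of the dual-reward idea, a combined reward could be computed as below. The caption does not specify the reward definitions or weighting, so <code>siglip_similarity</code>, the exact-match action check, and the equal 0.5/0.5 weights are assumptions for this sketch only:</p>

```python
def render_aware_reward(rendered_img, target_img, pred_action, gt_action,
                        siglip_similarity, w_sem=0.5, w_act=0.5):
    """Sketch of a dual reward combining visual fidelity and action consistency.

    siglip_similarity: a caller-supplied function returning a score in [0, 1]
    for the rendered vs. target screenshot (assumed interface, not the paper's).
    """
    # Visual-semantic reward: how close is the rendered state to the target?
    r_sem = siglip_similarity(rendered_img, target_img)
    # Action-consistency reward: does the predicted action effect match the ground truth?
    r_act = 1.0 if pred_action == gt_action else 0.0
    # Weighted combination (weights are an assumption of this sketch).
    return w_sem * r_sem + w_act * r_act
```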
<p><em>Figure 3. Illustration of the "Propose, Simulate, Select" pipeline for Code2World enhanced GUI agent, exemplified by an AndroidWorld task. <strong>(1) Propose</strong>: The GUI agent generates K candidate actions, with <strong>red</strong> and <strong>green</strong> highlighting hallucinated/irrational reasoning and logically sound reasoning, respectively. <strong>(2) Simulate</strong>: Code2World predicts the execution result of each candidate via renderable code generation. <strong>(3) Select</strong>: By evaluating the rendered future states, the system identifies the potential failure in the original policy and rectifies the decision, ultimately selecting the optimal action that aligns with the user's intent.</em></p>
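<p>The three-stage loop in the caption can be sketched as follows. Here <code>policy</code>, <code>world_model</code>, and <code>evaluate_state</code> are hypothetical stand-ins for the GUI agent, Code2World, and the future-state scorer, and <code>k=3</code> is an arbitrary choice:</p>

```python
def propose_simulate_select(observation, instruction, policy,
                            world_model, evaluate_state, k=3):
    """Pick the best of K candidate actions by simulating each in the world model."""
    # (1) Propose: sample K candidate actions from the GUI agent policy.
    candidates = [policy(observation, instruction) for _ in range(k)]
    # (2) Simulate: Code2World predicts the next screenshot for each candidate
    #     via renderable code generation.
    futures = [world_model(observation, action) for action in candidates]
    # (3) Select: score each rendered future against the user's intent
    #     and keep the highest-scoring action.
    scores = [evaluate_state(future, instruction) for future in futures]
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best]
```

<p>Because selection happens on simulated outcomes rather than on the policy's own reasoning, a hallucinated candidate can be rejected before it is ever executed on the device.</p>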
<p>If you find our project useful, please consider starring our repo and citing our paper as follows:</p>
<pre class="bibtex-block"><code>
@article{zheng2026code2world,
  title={Code2World: A GUI World Model via Renderable Code Generation},
  author={Zheng, Yuhao and Zhong, Li'an and Wang, Yi and Dai, Rui and Liu, Kaikui and Chu, Xiangxiang and Lv, Linyuan and Torr, Philip and Lin, Kevin Qinghong},