Background. Vision-language-action (VLA) models understand language and visuals to execute actions as embodied agents. Recent unified VLAs integrate future-image generation for foresight-driven policies, but they often rely on external experts or keep generation and action prediction as separate processes, limiting synergy. Our Unified Diffusion VLA (UD-VLA) jointly optimizes image generation and action prediction via a synchronous Joint Discrete Denoising Diffusion Process (JD3P), built on unified tokenization and hybrid attention, enabling intrinsic cross-modal synergy, with a two-stage training pipeline for efficiency.
Contributions.
Brief Description. We first construct a unified multimodal space by quantizing multimodal information into discrete tokens. We then formalize a Joint Discrete Denoising Diffusion Process (JD3P) to allow visual generation and action prediction to be intrinsically synergistic. Moreover, we build our model on a pre-trained VLM and conduct two-stage training. Finally, we balance performance and efficiency during inference through several key techniques.
We convert language, vision, and action into discrete tokens and concatenate them into a single multimodal sequence with special markers <BOI>/<EOI> and <BOA>/<EOA>.
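To make the sequence layout concrete, below is a minimal sketch of how such a unified token sequence could be assembled. The special-token IDs, the marker placement around the current frame, and the helper build_sequence are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of the unified token sequence (illustrative token IDs, not the released code).
# Assumed layout: [text][<BOI> current image <EOI>][<BOI> future image <EOI>][<BOA> actions <EOA>]
from typing import List

# Hypothetical special-token IDs placed after the text/vision/action codebooks.
BOI, EOI, BOA, EOA, MASK = 100000, 100001, 100002, 100003, 100004

def build_sequence(text_ids: List[int],
                   current_image_ids: List[int],
                   future_image_ids: List[int],
                   action_ids: List[int]) -> List[int]:
    """Concatenate all modalities into one discrete token sequence with special markers."""
    return (
        text_ids
        + [BOI] + current_image_ids + [EOI]   # conditioning frame
        + [BOI] + future_image_ids + [EOI]    # generation block (future image)
        + [BOA] + action_ids + [EOA]          # acting block
    )

if __name__ == "__main__":
    seq = build_sequence(text_ids=[5, 17, 42],
                         current_image_ids=[7, 7, 7, 7],
                         future_image_ids=[MASK] * 4,   # to be denoised
                         action_ids=[MASK] * 3)         # to be denoised
    print(len(seq), seq)
```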
Text tokens use causal attention, while current-image tokens use bidirectional attention. Future-image tokens form a generation block and action tokens form an acting block; attention is bidirectional within each block, while cross-block attention is causal (the acting block attends to both the input and the generated image, but no attention flows from visual tokens to action tokens).
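As an illustration of this hybrid scheme, the sketch below constructs a boolean attention mask over the four blocks. The block ordering and the assumption that the current image attends to the text prefix are inferred from the description above, not taken from the released code.

```python
import torch

def hybrid_attention_mask(n_text: int, n_cur: int, n_gen: int, n_act: int) -> torch.Tensor:
    """Boolean mask M[i, j] = True if position i may attend to position j.

    Blocks in order: text | current image | future image (generation) | action (acting).
    Text is causal; the other blocks are bidirectional internally; cross-block
    attention only flows from later blocks to earlier ones (never vision -> action).
    """
    L = n_text + n_cur + n_gen + n_act
    mask = torch.zeros(L, L, dtype=torch.bool)

    # Block index ranges.
    text = slice(0, n_text)
    cur  = slice(n_text, n_text + n_cur)
    gen  = slice(n_text + n_cur, n_text + n_cur + n_gen)
    act  = slice(n_text + n_cur + n_gen, L)

    # Text: causal self-attention.
    mask[text, text] = torch.tril(torch.ones(n_text, n_text, dtype=torch.bool))

    # Current image: bidirectional within the block, attends to the text prefix.
    mask[cur, cur] = True
    mask[cur, text] = True

    # Generation block (future image): bidirectional, attends to text + current image.
    mask[gen, gen] = True
    mask[gen, text] = True
    mask[gen, cur] = True

    # Acting block (actions): bidirectional, attends to everything before it.
    mask[act, act] = True
    mask[act, text] = True
    mask[act, cur] = True
    mask[act, gen] = True
    return mask

if __name__ == "__main__":
    print(hybrid_attention_mask(n_text=3, n_cur=2, n_gen=2, n_act=2).int())
```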
Actions and images are generated in parallel within one denoising step. With future-image tokens \(\mathbf{v}_0\) and action tokens \(\mathbf{a}_0\), the joint sequence is the concatenation \(\mathbf{x}_0 = [\mathbf{v}_0;\, \mathbf{a}_0]\).
We add a mask token \(\mathrm{M}\). The noising transition at step \(t\) independently replaces each clean token with \(\mathrm{M}\) with a probability set by the mask schedule, leaving the remaining tokens unchanged.
The denoising distribution factorizes over positions, so every masked token is predicted in parallel:
\[
p_\theta\!\left(\mathbf{x}_0 \mid \mathbf{x}_t\right) = \prod_{i} p_\theta\!\left(x_0^{i} \mid \mathbf{x}_t\right).
\]
We adopt a single-step mask-predict objective, computing cross-entropy only on the masked positions:
\[
\mathcal{L} = -\,\mathbb{E}\Big[\textstyle\sum_{i:\, x_t^{i}=\mathrm{M}} \log p_\theta\!\left(x_0^{i} \mid \mathbf{x}_t\right)\Big].
\]
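A minimal sketch of this masked cross-entropy objective, assuming a [batch, length, vocab] logit layout and placeholder names:

```python
import torch
import torch.nn.functional as F

def mask_predict_loss(logits: torch.Tensor,
                      targets: torch.Tensor,
                      mask_token_id: int,
                      noised_inputs: torch.Tensor) -> torch.Tensor:
    """Cross-entropy computed only on positions that were replaced by the mask token.

    logits:        [B, L, V] predictions for every position of the noised sequence.
    targets:       [B, L]    clean token ids (v_0 and a_0).
    noised_inputs: [B, L]    the noised sequence x_t fed to the model.
    """
    masked = noised_inputs.eq(mask_token_id)                 # [B, L] bool
    return F.cross_entropy(logits[masked], targets[masked])  # mean over masked positions

if __name__ == "__main__":
    B, L, V, MASK_ID = 2, 8, 512, 511
    logits = torch.randn(B, L, V)
    targets = torch.randint(0, V - 1, (B, L))
    x_t = targets.clone()
    x_t[:, ::2] = MASK_ID                                    # mask half the positions
    print(mask_predict_loss(logits, targets, MASK_ID, x_t).item())
```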
We initialize our model from a pretrained VLM backbone (Emu3), which is trained in an autoregressive manner with causal attention.
At inference, the denoising process uses parallel decoding with adaptive masking: all positions of \(\mathbf{v}_T\) and \(\mathbf{a}_T\) are initialized as <MASK>, token distributions for all positions are predicted in parallel, and this is repeated for a small number of iterations.
Prefix KV Cache and Pre-filling Tokens. Employ a prefix KV cache to store the keys and values of the current visual and prompt tokens. Pre-fill <BOI>, <EOI>, and <BOA> at the corresponding positions to guide the denoising of visual and action tokens, reducing latency.
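The toy single-head sketch below illustrates the prefix-caching idea: keys and values of the prompt and current-image tokens are computed once, then reused across denoising iterations. The projection matrices, dimensions, and block contents are placeholders, not the model's actual architecture.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D = 32  # toy head dimension

# Toy projections standing in for one attention head of the backbone.
Wq = torch.randn(D, D)
Wk = torch.randn(D, D)
Wv = torch.randn(D, D)

def kv(x: torch.Tensor):
    """Keys/values for an [L, D] block of embeddings."""
    return x @ Wk, x @ Wv

def attend(x_block: torch.Tensor, k_ctx: torch.Tensor, v_ctx: torch.Tensor) -> torch.Tensor:
    """Attention for a denoised block over cached prefix context plus itself."""
    q = x_block @ Wq
    k_blk, v_blk = kv(x_block)
    k = torch.cat([k_ctx, k_blk], dim=0)
    v = torch.cat([v_ctx, v_blk], dim=0)
    return F.scaled_dot_product_attention(q[None], k[None], v[None])[0]

# 1) Pre-fill: embeddings of prompt + current image (and markers such as <BOI>/<EOI>/<BOA>)
#    are processed once, and their keys/values are cached.
prefix = torch.randn(10, D)                 # hypothetical prefix embeddings
k_cache, v_cache = kv(prefix)               # computed a single time

# 2) Each denoising iteration recomputes only the generation/acting blocks,
#    reusing the cached prefix keys/values instead of re-encoding the prefix.
for step in range(3):
    blocks = torch.randn(6, D)              # current estimates of future-image + action tokens
    out = attend(blocks, k_cache, v_cache)  # [6, D]
    print("iter", step, "output shape", tuple(out.shape))
```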
Confidence-Guided Decoding. Initialize with noise at \(t=T\) and iterate backward to \(t=0\) using the cosine mask schedule \(\rho_t = \cos\!\left(\tfrac{\pi}{2}\cdot \tfrac{T+1-t}{T+1}\right)\). At each step, rank the masked positions by confidence score and update the top \((1-\rho_t)|M_t|\) entries via tempered Gumbel sampling, where \(M_t\) is the set of positions still masked at step \(t\).
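A sketch of this decoding loop under stated assumptions: the confidence score is taken to be the maximum predicted probability, tempered Gumbel sampling is implemented as Gumbel-max over temperature-scaled logits, and the model interface is a placeholder.

```python
import math
import torch

def tempered_gumbel_sample(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Gumbel-max sampling over temperature-scaled logits (assumed form of 'tempered Gumbel')."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-9) + 1e-9)
    return torch.argmax(logits / tau + gumbel, dim=-1)

@torch.no_grad()
def confidence_guided_decode(model, x: torch.Tensor, mask_id: int, T: int = 8, tau: float = 1.0):
    """Iteratively unmask positions, most confident first, following the cosine schedule.

    x:     [L] token ids with generation/acting positions initialized to mask_id.
    model: callable returning per-position logits of shape [L, V] (placeholder interface).
    """
    for t in range(T, -1, -1):
        masked = x.eq(mask_id)
        if not masked.any():
            break
        rho = math.cos(math.pi / 2 * (T + 1 - t) / (T + 1))      # fraction kept masked
        n_update = max(1, round((1 - rho) * int(masked.sum())))  # top-(1-rho)|M_t| entries

        logits = model(x)                                        # [L, V]
        conf = logits.softmax(dim=-1).max(dim=-1).values         # assumed confidence score
        conf = conf.masked_fill(~masked, float("-inf"))          # rank only masked positions
        update_idx = conf.topk(n_update).indices

        x[update_idx] = tempered_gumbel_sample(logits, tau)[update_idx]
    return x

if __name__ == "__main__":
    V, L, MASK_ID = 32, 10, 31
    dummy_model = lambda seq: torch.randn(seq.shape[0], V)       # stand-in for the real network
    x0 = torch.full((L,), MASK_ID)
    print(confidence_guided_decode(dummy_model, x0, MASK_ID))
```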
Decoding Space Mapping. Restrict classification space to each modality's codebook to prevent wrong-modality tokens and stabilize generation. Once <EOA> is predicted at \(i^\star\), set subsequent tokens to <PAD> and exclude them from refinement.
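A short sketch of the codebook restriction; the per-modality vocabulary ranges are illustrative assumptions, and the <EOA>-to-<PAD> handling is omitted:

```python
import torch

def restrict_to_codebook(logits: torch.Tensor, allowed_ids: torch.Tensor) -> torch.Tensor:
    """Mask out logits of tokens outside the current modality's codebook.

    logits:      [L, V] per-position predictions.
    allowed_ids: 1-D tensor of vocabulary ids belonging to this modality.
    """
    blocked = torch.ones(logits.shape[-1], dtype=torch.bool)
    blocked[allowed_ids] = False
    return logits.masked_fill(blocked, float("-inf"))

if __name__ == "__main__":
    V = 100
    vision_ids = torch.arange(0, 80)     # hypothetical vision codebook range
    action_ids = torch.arange(80, 96)    # hypothetical action codebook range
    logits = torch.randn(4, V)
    vis_logits = restrict_to_codebook(logits[:2], vision_ids)   # future-image positions
    act_logits = restrict_to_codebook(logits[2:], action_ids)   # action positions
    print(vis_logits.argmax(-1), act_logits.argmax(-1))
```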
Task 1: Pick up the book and place it in the back compartment of the caddy
Task 2: Put both moka pots on the stove
Task 3: Put both the alphabet soup and the tomato sauce in the basket
Task 4: Put the black bowl in the bottom drawer of the cabinet and close it
Task 5: Put the yellow and white mug in the microwave and close it
Task 1: Lift pink block slider
Task 2: Place in slider
Task 3: Open drawer
Task 4: Rotate red block right
Task 5: Lift red block table
Task 1: Put carrot on plate
Task 2: Put spoon on table
Task 3: Put spoon on tablecloth
Task 4: Put eggplant in basket
Effectiveness of Hybrid Attention Mechanism:
Attention setting | Avg success length |
---|---|
Causal | 4.04 |
Bidirectional | 4.32 |
Hybrid (ours) | 4.64 |
Effectiveness of Future Image Generation:
Setting | Avg success length |
---|---|
No visual generation | 4.21 |
Reconstruct current frames | 4.39 |
Predict future frames (ours) | 4.64 |
Effectiveness of JD3P:
Method | Avg success length | Speed |
---|---|---|
Autoregressive (AR) | 4.18 | 50.2 (x1.0) |
Jacobi (AR) | 4.16 (-0.02) | 101.6 (x2.0) |
Independent diffusion | 4.35 (+0.19) | 144.4 (x2.9) |
JD3P (ours) | 4.64 (+0.46) | 219.3 (x4.3) |
Setup and Tasks. The real-world setup uses a dual-camera configuration. We evaluate three manipulation categories—stacking bowls, putting blocks into a box, and flipping towers—covering varied colors and shapes across three backgrounds. For each category, 200 trajectories are collected at 15 Hz, with both seen and unseen settings for evaluation.
Results and Analysis. Across 30 trials per method, UD-VLA consistently surpasses GR00T N1 and UniVLA, achieving >80% success. On seen tasks, action quantization plus joint denoising yields precise actions. On unseen tasks, future-image prediction strengthens visual generalization and guides correct actions, whereas baselines degrade without explicit future visual reasoning.
The predicted frames follow instructions closely and capture task-level dynamics, remaining well aligned with the ground-truth trajectories. This consistency indicates that the model has internalized a temporal understanding of task logic, which in turn enhances temporal reasoning and enables more coherent action generation.
@article{udvla2025,
title={Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process},
author={Jiayi Chen and Wenxuan Song and Pengxiang Ding and Ziyang Zhou and Han Zhao and Feilong Tang and Donglin Wang and Haoang Li},
year={2025},
url={https://arxiv.org/abs/}
}