Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

Jiayi Chen1*       Wenxuan Song1*       Pengxiang Ding2,3       Ziyang Zhou1       Han Zhao2,3       Feilong Tang1       Zhide Zhong1       Donglin Wang2       Haoang Li1†      
1IRPN Lab, The Hong Kong University of Science and Technology (Guangzhou)
2MiLab, Westlake University    3Zhejiang University   
*Equal Contribution   †Corresponding Author

Introducing UD-VLA

Background.   Vision-language-action (VLA) models are embodied agents that ground language instructions and visual observations into actions. Recent unified VLAs additionally generate future images to obtain foresight-driven policies, but they often rely on external experts or decouple image generation from action prediction, which limits cross-modal synergy. Our Unified Diffusion VLA (UD-VLA) jointly optimizes generation and acting via a synchronous Joint Discrete Denoising Diffusion Process (JD3P), built on unified tokenization and hybrid attention, enabling intrinsic cross-modal benefits, together with a two-stage training pipeline for efficiency.

Contributions.

  • We propose UD-VLA, which tightly couples understanding, generation, and acting in a mutually beneficial manner.
  • We instantiate this design via discrete tokenization, hybrid attention, and the JD3P process as the central mechanism for cross-modal synergy.
  • We design a two-stage training pipeline to activate the image generation capabilities and introduce several test-time techniques that ensure both high performance and efficiency.

1. Pipelines

Brief Description.   We first construct a unified multimodal space by quantizing multimodal information into discrete tokens. We then formalize a Joint Discrete Denoising Diffusion Process (JD3P) to allow visual generation and action prediction to be intrinsically synergistic. Moreover, we build our model on a pre-trained VLM and conduct two-stage training. Finally, we balance performance and efficiency during inference through several key techniques.

Figure 2. The pipeline of UD-VLA.

2. Methods

2.1 Unified Diffusion VLA

We convert language, vision, and action into discrete tokens and concatenate them into a single multimodal sequence with special markers <BOI>/<EOI> and <BOA>/<EOA>.

\[ [\text{text tokens} ; \text{current image tokens} ; \text{future image tokens} ; \text{action tokens}] \]
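For illustration, here is a minimal sketch of assembling such a sequence; the token ids, the exact placement of the markers around the future-image and action blocks, and the helper name are our assumptions, not the released tokenizer:

    import torch

    def build_training_sequence(text_ids, cur_img_ids, fut_img_ids, action_ids,
                                boi_id, eoi_id, boa_id, eoa_id):
        # Concatenate all modalities into one discrete token sequence:
        # [text ; current image ; <BOI> future image <EOI> ; <BOA> action <EOA>].
        parts = [
            text_ids,
            cur_img_ids,
            torch.tensor([boi_id]), fut_img_ids, torch.tensor([eoi_id]),
            torch.tensor([boa_id]), action_ids, torch.tensor([eoa_id]),
        ]
        return torch.cat(parts)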

Text tokens use causal attention and current-image tokens use bidirectional attention. Future-image tokens form a generation block and action tokens an acting block; attention is bidirectional within each block and causal across blocks, so the acting block attends to both the input and the generated future image, while no information flows from actions back to vision.
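For concreteness, the following is a minimal PyTorch sketch of such a block-wise mask (True means the position may attend); the helper name build_hybrid_mask and the segment lengths are our own illustration, not the released implementation:

    import torch

    def build_hybrid_mask(n_text, n_cur, n_fut, n_act):
        # Boolean attention mask over the sequence [text | current image | future image | action].
        # Text is causal within itself; the other three blocks are bidirectional within
        # themselves; cross-block attention is causal (later blocks attend only to earlier
        # ones), so there is no action-to-vision flow.
        L = n_text + n_cur + n_fut + n_act
        mask = torch.zeros(L, L, dtype=torch.bool)

        bounds = [0, n_text, n_text + n_cur, n_text + n_cur + n_fut, L]
        blocks = list(zip(bounds[:-1], bounds[1:]))

        # Within-block attention.
        mask[:n_text, :n_text] = torch.ones(n_text, n_text).tril().bool()
        for s, e in blocks[1:]:
            mask[s:e, s:e] = True

        # Cross-block attention: each block attends to all earlier blocks.
        for i, (s, e) in enumerate(blocks):
            for ps, pe in blocks[:i]:
                mask[s:e, ps:pe] = True
        return mask

    # Example: 4 text, 6 current-image, 6 future-image, 3 action tokens.
    print(build_hybrid_mask(4, 6, 6, 3).shape)  # torch.Size([19, 19])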

[Figure: Hybrid attention mechanism.]

Actions and images are generated in parallel within one denoising step. With future-image tokens \(\mathbf{v}_0\) and action tokens \(\mathbf{a}_0\), the joint sequence is

\[ (\mathbf{v}_0,\ \mathbf{a}_0)=(v_{0,1},\dots,v_{0,L_v},\ a_{0,1},\dots,a_{0,L_a}). \]

We add a mask token \(\mathrm{M}\) to the vocabulary. The noising transition at step \(t\) keeps each token with probability \(1-\beta_t\) and replaces it with \(\mathrm{M}\) with probability \(\beta_t\):

\[ \mathbf{Q}_t\,\mathbf{e}_{t,r}=(1-\beta_t)\,\mathbf{e}_{t,r}+\beta_t\,\mathbf{e}_\mathrm{M}, \]

where \(\mathbf{e}_{t,r}\) is the one-hot vector of the token at position \(r\) and \(\mathbf{e}_\mathrm{M}\) is the one-hot vector of \(\mathrm{M}\).

The denoising factorizes as

\[ p_\theta(\mathbf{v}_{t-1},\mathbf{a}_{t-1}\mid \mathbf{v}_t,\mathbf{a}_t,\mathbf{c}) = p_\theta(\mathbf{v}_{t-1}\mid \mathbf{v}_t,\mathbf{c})\; p_\theta(\mathbf{a}_{t-1}\mid \mathbf{v}_t,\mathbf{a}_t,\mathbf{c}). \]

We adopt a single-step mask-predict objective, computing cross-entropy only on masked positions, with a coefficient \(\beta\) weighting the visual term:

\[ \mathcal{L}_{\text{CE}}(\theta) = - \beta \sum_{j=1}^{L_v} \log p_\theta^{(v)}(v_{0,j}\mid \mathbf{v}_t,\mathbf{c})\, \mathbf{1}\{v_{t,j}=\mathrm{M}\} - \sum_{i=1}^{L_a} \log p_\theta^{(a)}(a_{0,i}\mid \mathbf{v}_t,\mathbf{a}_t,\mathbf{c})\, \mathbf{1}\{a_{t,i}=\mathrm{M}\}. \]
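To make the objective concrete, here is a minimal PyTorch sketch of one training step; the model interface, the helper name jd3p_loss, and the argument names are our assumptions rather than the released code:

    import torch
    import torch.nn.functional as F

    def jd3p_loss(model, tokens, n_prefix, n_fut, mask_id, beta=1.0):
        # tokens: (B, L) full sequence [text + current image | future image | action].
        # Only future-image and action positions are corrupted; the loss is the
        # cross-entropy on masked positions, with beta weighting the visual term.
        B, L = tokens.shape
        device = tokens.device

        # Sample a per-sequence masking ratio and mask the generable positions.
        ratio = torch.rand(B, 1, device=device)
        maskable = torch.zeros(B, L, dtype=torch.bool, device=device)
        maskable[:, n_prefix:] = True
        masked = maskable & (torch.rand(B, L, device=device) < ratio)

        noisy = torch.where(masked, torch.full_like(tokens, mask_id), tokens)
        logits = model(noisy)                                                    # (B, L, vocab)
        ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")   # (B, L)

        vis = masked.clone()
        vis[:, n_prefix + n_fut:] = False        # masked future-image positions
        act = masked & ~vis                      # masked action positions
        loss_v = (ce * vis).sum() / vis.sum().clamp(min=1)
        loss_a = (ce * act).sum() / act.sum().clamp(min=1)
        return beta * loss_v + loss_a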

2.2 Training

We initialize our model from a pretrained VLM backbone (Emu3), which is trained in an autoregressive manner with causal attention.

  1. Stage (i): Post-train on a large-scale video dataset to inject future-image generation capabilities into the VLA, enabling the robot to understand and model future states. The token sequences are constructed as
     \[ [\;\text{text tokens}\;;\;\text{current image tokens}\;;\;\text{future image tokens}]. \]
  2. Stage (ii): Jointly optimize image and action generation under a unified framework on the downstream robot-action dataset, using the full multimodal sequence. Autoregressive decoding is reformulated as a diffusion process following JD3P, with a shift operation so that each position predicts the next token (see the sketch after this list). This lets the model inherit the capacity learned from next-token-prediction pretraining while gaining the advantages of bidirectional context and parallel decoding.
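One common way to realize such a shift, sketched below under our own interpretation (the helper name and model interface are assumptions), is to read the prediction for a masked position i from the logits at position i-1, so the pretrained next-token head keeps its original meaning under parallel denoising:

    import torch

    def predict_masked_positions(model, noisy_tokens):
        # The AR-pretrained backbone emits, at position i, logits for the token at
        # position i + 1. Shifting by one aligns those logits with the masked
        # positions they should fill during denoising.
        logits = model(noisy_tokens)          # (B, L, vocab)
        shifted = logits[:, :-1]              # predictions for positions 1 .. L-1
        return shifted

    # During decoding, the distribution for a masked position i (i >= 1) is
    # shifted[:, i - 1].softmax(-1).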

2.3 Inference

At inference, the denoising process uses parallel decoding with adaptive masking: all positions of \(\mathbf{v}_T\) and \(\mathbf{a}_T\) are initialized as <MASK>, token distributions for all positions are predicted in parallel, and this is repeated for a small number of iterations.

Prefix KV Cache and Pre-filling Tokens. We employ a prefix KV cache to store the keys and values of the prompt and current-image tokens, which are fixed across denoising iterations. We also pre-fill <BOI>, <EOI>, and <BOA> at their corresponding positions to guide the denoising of visual and action tokens, further reducing latency.
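As a minimal illustration of the pre-filling step (the token ids, marker layout, and helper name are assumptions for the sketch; the prefix itself would be encoded once and reused via the backbone's KV cache):

    import torch

    def init_generation_suffix(n_fut, n_act, mask_id, boi_id, eoi_id, boa_id):
        # Suffix layout: <BOI> [future image] <EOI> <BOA> [action].
        # All generable positions start as <MASK>; the structural markers are
        # pre-filled so they never have to be denoised.
        L = 1 + n_fut + 1 + 1 + n_act
        suffix = torch.full((1, L), mask_id, dtype=torch.long)
        suffix[:, 0] = boi_id                  # <BOI>
        suffix[:, 1 + n_fut] = eoi_id          # <EOI>
        suffix[:, 2 + n_fut] = boa_id          # <BOA>
        return suffix

    # The keys/values of the text + current-image prefix are computed once and cached;
    # each denoising iteration then only re-runs the transformer over this suffix.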

Confidence-Guided Decoding. Starting from the fully masked sequence at \(t=T\), we iterate backward to \(t=0\) using the cosine mask schedule \(\rho_t = \cos\!\left(\tfrac{\pi}{2}\cdot \tfrac{T+1-t}{T+1}\right)\) and rank masked positions by their confidence score:

\[ q_{t-1,r}=\max_{\ell}\begin{cases} p_\theta\!\big(\ell \mid \mathbf v_t,\mathbf c\big), & r\in\{1,\ldots,L_v\},\\[6pt] p_\theta\!\big(\ell \mid \mathbf v_t,\mathbf a_t,\mathbf c\big), & r\in\{L_v+1,\ldots,L_v+L_a\}. \end{cases} \]

We then update the top \((1-\rho_t)\,|M_t|\) most confident entries, where \(M_t\) denotes the set of currently masked positions and \(\kappa_t\) is a temperature, via tempered Gumbel sampling:

\[ v_{t-1,j},\ a_{t-1,i} = \arg\max\limits_{y}\Bigl[\tfrac{1}{\kappa_t}\log p_\theta(y \mid \mathbf v_t,\mathbf a_t,\mathbf c) + \eta_y\Bigr], \qquad \eta_y \stackrel{\text{i.i.d.}}{\sim} \mathrm{Gumbel}(0,1). \]
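Putting the cosine schedule, the confidence ranking, and the tempered Gumbel update together, the single-sample loop below is a minimal sketch; the constant temperature, the final-step commit, and all names are our assumptions, not the released implementation:

    import math
    import torch

    @torch.no_grad()
    def confidence_guided_decode(model, seq, is_masked, kappa=1.0, T=8):
        # seq:       (L,) token ids; still-masked positions hold the <MASK> id.
        # is_masked: (L,) bool, True where a token still has to be decoded.
        for t in range(T, 0, -1):
            rho = math.cos(math.pi / 2 * (T + 1 - t) / (T + 1))   # cosine mask schedule
            logits = model(seq.unsqueeze(0)).squeeze(0)           # (L, vocab)
            conf = logits.softmax(-1).max(-1).values              # confidence q_{t-1,r}

            # Tempered Gumbel sampling: argmax_y [(1/kappa) log p(y) + Gumbel(0,1)].
            # (logits/kappa differs from (1/kappa) log p only by a per-position
            #  constant, so the argmax is unchanged.)
            u = torch.rand_like(logits).clamp_(1e-9, 1 - 1e-9)
            cand = (logits / kappa - torch.log(-torch.log(u))).argmax(-1)

            n_mask = int(is_masked.sum())
            if n_mask == 0:
                break
            # Commit the (1 - rho_t) * |M_t| most confident masked positions
            # (all remaining ones at the final step).
            k = n_mask if t == 1 else max(1, int((1 - rho) * n_mask))
            idx = conf.masked_fill(~is_masked, -1.0).topk(k).indices
            seq[idx] = cand[idx]
            is_masked[idx] = False
        return seq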

Decoding Space Mapping. We restrict the classification space at each position to the corresponding modality's codebook, preventing wrong-modality tokens and stabilizing generation. Once <EOA> is predicted at position \(i^\star\), all subsequent action tokens are set to <PAD> and excluded from further refinement.
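The codebook restriction amounts to masking logits outside a modality's id range; the sketch below assumes each codebook occupies a contiguous range of the vocabulary, which is an assumption for illustration:

    import torch

    def restrict_to_codebook(logits, lo, hi):
        # Keep only logits whose ids fall in the modality's codebook range [lo, hi),
        # so e.g. action positions can never emit image tokens.
        out = torch.full_like(logits, float("-inf"))
        out[..., lo:hi] = logits[..., lo:hi]
        return out

    # Usage (sketch): apply with the visual codebook range at future-image positions
    # and the action codebook range at action positions, before softmax / argmax.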

3. Results

3.1 Numerical Comparison on Different Benchmarks

3.1.1 LIBERO Benchmark

  • Task 1: Pick up the book and place it in the back compartment of the caddy
  • Task 2: Put both moka pots on the stove
  • Task 3: Put both the alphabet soup and the tomato sauce in the basket
  • Task 4: Put the black bowl in the bottom drawer of the cabinet and close it
  • Task 5: Put the yellow and white mug in the microwave and close it

[Figure: Results on the LIBERO benchmark.]

3.1.2 CALVIN Benchmark

  • Task 1: Lift pink block slider
  • Task 2: Place in slider
  • Task 3: Open drawer
  • Task 4: Rotate red block right
  • Task 5: Lift red block table

[Figure: Results on the CALVIN benchmark.]

3.1.3 SimplerEnv-WidowX Benchmark

  • Task 1: Put carrot on plate
  • Task 2: Put spoon on table
  • Task 3: Put spoon on tablecloth
  • Task 4: Put eggplant in basket

[Figure: Results on the SimplerEnv-WidowX benchmark.]

3.1.4 In-depth Analysis

Effectiveness of Hybrid Attention Mechanism:

  • Causal denotes causal attention.
  • Bidirectional denotes bidirectional attention.
  • Hybrid (ours) denotes our proposed hybrid attention mechanism.

  Attention setting    Avg. success length
  Causal               4.04
  Bidirectional        4.32
  Hybrid (ours)        4.64

Effectiveness of Future Image Generation:

  • No visual generation denotes action prediction without visual generation.
  • Reconstruct current frames denotes reconstructing the current observation instead of a future frame.
  • Predict future frames (ours) denotes jointly predicting future frames and actions.

  Setting                        Avg. success length
  No visual generation           4.21
  Reconstruct current frames     4.39
  Predict future frames (ours)   4.64

Effectiveness of JD3P:

  • Autoregressive (AR) denotes standard autoregressive decoding.
  • Jacobi (AR) denotes Jacobi decoding of the autoregressive model.
  • Independent diffusion denotes denoising image and action tokens with independent diffusion processes.
  • JD3P (ours) denotes our proposed joint discrete denoising diffusion decoding.

  Method                  Avg. length     Speed
  Autoregressive (AR)     4.18            50.2 (×1.0)
  Jacobi (AR)             4.16 (−0.02)    101.6 (×2.0)
  Independent diffusion   4.35 (+0.19)    144.4 (×2.9)
  JD3P (ours)             4.64 (+0.46)    219.3 (×4.3)

3.2 Real-World Experiment

[Figure: Real-world experiment setup and results.]

Setup and Tasks. The real-world setup uses a dual-camera configuration. We evaluate three manipulation categories—stacking bowls, putting blocks into a box, and flipping towers—covering varied colors and shapes across three backgrounds. For each category, 200 trajectories are collected at 15 Hz, with both seen and unseen settings for evaluation.

Results and Analysis. Across 30 trials per method, UD-VLA consistently surpasses GR00T N1 and UniVLA, achieving >80% success. On seen tasks, action quantization plus joint denoising yields precise actions. On unseen tasks, future-image prediction strengthens visual generalization and guides correct actions, whereas baselines degrade without explicit future visual reasoning.

3.3 Visualization of Generated Future Frames

[Figure: Visualization of generated future frames.]

The predicted frames follow instructions closely and capture task-level dynamics, remaining well aligned with the ground-truth trajectories. This consistency indicates that the model has internalized a temporal understanding of task logic, which in turn enhances temporal reasoning and enables more coherent action generation.

BibTeX

      
    @article{udvla2025,
      title={Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process},
      author={Jiayi Chen and Wenxuan Song and Pengxiang Ding and Ziyang Zhou and Han Zhao and Feilong Tang and Zhide Zhong and Donglin Wang and Haoang Li},
      year={2025},
      url={https://arxiv.org/abs/}
    }