CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding

Wenxuan Song1*       Jiayi Chen1*       Pengxiang Ding2,3†       Yuxin Huang1       Han Zhao2,3       Xinhu Zheng1       Donglin Wang2       Gang Hua4       Haoang Li1‡      
1IRPN Lab, The Hong Kong University of Science and Technology (Guangzhou)
2MiLab, Westlake University    3Zhejiang University    4Amazon.com, Inc.
*Equal Contribution   †Project Leader   ‡Corresponding Author

Abstract

In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities. Despite this progress, their practical deployment is severely constrained by inference-speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks. While recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, its practical benefits are marginal because of the many iterations it requires. To address this, we introduce consistency distillation training, which trains the model to predict multiple correct action tokens in a single iteration, thereby achieving acceleration. In addition, we design mixed-label supervision to mitigate error accumulation during distillation. Although distillation yields an acceptable speedup, we identify certain inefficient iterations as a remaining critical bottleneck. To tackle this, we propose an early-exit decoding strategy that moderately relaxes convergence conditions, further improving average inference efficiency. Experimental results show that the proposed method achieves 4× inference acceleration across different baselines while maintaining high task success rates in both simulated and real-world robot tasks. These experiments validate that our approach provides an efficient and general paradigm for accelerating multimodal decision-making in robotics.

Introduction

  • (1) We propose CEED-VLA, a universal acceleration method that achieves significant inference speedup while maintaining manipulation performance.
  • (2) We introduce a consistency distillation process and mixed-label supervision in the autoregressive loss to retain high-quality actions.
  • (3) We identify inefficient iterations as the bottleneck of Jacobi decoding and propose an early-exit decoding strategy, leading to a 4.1× speedup and an over 4.3× higher decoding frequency.

Method

[Figure] The architecture of CEED-VLA with consistency distillation and early-exit decoding.

Jacobi Trajectory Collection

[Figure] The Jacobi trajectory collection process.

For the target VLA \( P \), we let \( Q_\theta(\cdot \mid \mathbf{x}) \) denote the CEED-VLA model with parameters \( \theta \) initialized from \( P \). To capture the inherent consistency within Jacobi trajectories, we first collect them by prompting \( P \) to predict actions with Jacobi decoding on the robot dataset \( \mathcal{C} \); the collected trajectories form the distillation dataset \( \mathcal{D} \) used in the losses below.
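
To make the collection step concrete, below is a minimal PyTorch sketch of Jacobi decoding. It assumes a Hugging Face-style causal LM whose forward pass returns logits; the function and argument names (e.g., n_action_tokens) are illustrative, not from the released code.

import torch

@torch.no_grad()
def collect_jacobi_trajectory(model, prefix_ids, n_action_tokens, pad_id, max_iters=64):
    # Iterate all action tokens in parallel until a fixed point is reached.
    # Returns the intermediate guesses (the Jacobi trajectory) and the
    # converged fixed point, which form the (Y, Y*) pairs for distillation.
    guess = torch.full((1, n_action_tokens), pad_id, dtype=torch.long,
                       device=prefix_ids.device)
    trajectory = [guess.clone()]
    for _ in range(max_iters):
        # One parallel forward pass re-predicts every action token at once.
        logits = model(torch.cat([prefix_ids, guess], dim=1)).logits
        # Logits at position prefix_len - 1 + i predict the i-th action token.
        new_guess = logits[:, prefix_ids.shape[1] - 1 : -1].argmax(dim=-1)
        if torch.equal(new_guess, guess):  # fixed point reached
            break
        guess = new_guess
        trajectory.append(guess.clone())
    return trajectory, guess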

Consistency Training

The consistency training procedure optimizes two objectives: (i) guiding the model to predict multiple correct tokens simultaneously, and (ii) preventing CEED-VLA from drifting away from the target VLA's distribution, thereby preserving its manipulation skills.

1. Consistency Loss

We design a consistency loss to ensure that any intermediate point \( \mathcal{Y} \) on the Jacobi trajectory \( \mathcal{J} \) is mapped to the fixed point \( \mathcal{Y}^* \), encouraging the model to converge in fewer iterations.

\[ \mathcal{L}_{\text{C}} = \mathbb{E}_{(\mathbf{x}, \mathcal{J}) \sim \mathcal{D}, \mathcal{Y} \sim \mathcal{J}} \left[ \sum_{i=1}^n \text{KL}\left( Q_{\theta^{-}}(\cdot|\mathcal{Y}_i^*, \mathbf{x}) \,\|\, Q_{\theta}(\cdot|\mathcal{Y}_i, \mathbf{x}) \right) \right] \]

where \( \theta^{-} = \text{stopgrad}(\theta) \) and KL denotes the forward KL divergence.
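
As a reference, a PyTorch sketch of this loss is given below. It evaluates the model once on the fixed point \( \mathcal{Y}^* \) with gradients detached (playing the role of \( \theta^- \)) and once on an intermediate point \( \mathcal{Y} \), then accumulates the forward KL over the action-token positions. Tensor layouts and names are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn.functional as F

def consistency_loss(model, x_ids, y_mid, y_star):
    # Q_{theta^-}(. | Y*, x): the target distribution, treated as constant.
    with torch.no_grad():
        tgt = model(torch.cat([x_ids, y_star], dim=1)).logits
        tgt_logp = F.log_softmax(tgt[:, x_ids.shape[1] - 1 : -1], dim=-1)
    # Q_theta(. | Y, x): the prediction from an intermediate Jacobi point.
    out = model(torch.cat([x_ids, y_mid], dim=1)).logits
    logp = F.log_softmax(out[:, x_ids.shape[1] - 1 : -1], dim=-1)
    # KL(Q_{theta^-} || Q_theta) summed over the n action-token positions.
    return F.kl_div(logp, tgt_logp, log_target=True, reduction="batchmean")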

2. Mixed-label AR Supervision

To maintain autoregressive behavior, we retain the AR loss from the teacher model:

\[ \mathcal{L}_{\text{AR}} = \mathbb{E}_{(\mathbf{x}, \mathcal{Y}^{*}) \sim \mathcal{D}} \left[ -\sum_{i=1}^N \log Q_{\theta}\left( \mathcal{Y}^{*}_{i} \mid \mathcal{Y}^{*}_{<i}, \mathbf{x} \right) \right] \]

For outlier labels that deviate substantially from the ground truth (measured by L1 distance), the AR labels are replaced by the corresponding ground-truth values.
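
A sketch of this mixed-label objective is shown below. The outlier test compares the fixed-point labels against ground-truth action tokens with an L1 threshold; the threshold value and the token-space comparison are assumptions, since only the L1 criterion is specified above.

import torch
import torch.nn.functional as F

def mixed_label_ar_loss(model, x_ids, y_star, y_gt, tau=2):
    # Replace fixed-point labels that deviate too far from the ground truth.
    outlier = (y_star - y_gt).abs() > tau  # assumed token-space L1 test
    labels = torch.where(outlier, y_gt, y_star)
    # Teacher-forced next-token prediction over the mixed labels.
    logits = model(torch.cat([x_ids, labels], dim=1)).logits
    pred = logits[:, x_ids.shape[1] - 1 : -1]
    return F.cross_entropy(pred.reshape(-1, pred.shape[-1]), labels.reshape(-1))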

3. Total Loss

\[ \mathcal{L}(\theta) = \mathcal{L}_{\text{C}} + \omega \cdot \mathcal{L}_{\text{AR}} \]

where \( \omega \) is a weighting coefficient balancing consistency and AR supervision.
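
Reusing the two sketches above, a training step would combine the losses as follows; the value of \( \omega \) here is a placeholder, not the paper's setting.

omega = 0.5  # assumed weighting coefficient
# optimizer: any torch.optim optimizer over Q_theta's parameters
loss = consistency_loss(model, x_ids, y_mid, y_star) \
       + omega * mixed_label_ar_loss(model, x_ids, y_star, y_gt)
loss.backward()
optimizer.step()
optimizer.zero_grad()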

Early-Exit Decoding

[Figure] Comparison of four decoding methods.

To address the inefficiency of Jacobi decoding caused by its strict convergence requirement, we propose early-exit decoding. Instead of waiting for full convergence, the model exits at a predefined step and directly outputs the intermediate result. This strategy leverages the observation that late-stage token updates are often minor and have limited impact on task success. By relaxing the stopping condition, early-exit decoding significantly improves both the average and the minimum inference speed while preserving manipulation accuracy, enabling high-frequency robotic control with minimal delay.
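
A minimal sketch of the resulting decoding loop follows; it is the Jacobi iteration from above with the fixed-point check capped at a hard limit, where exit_step is an assumed hyperparameter.

import torch

@torch.no_grad()
def early_exit_decode(model, prefix_ids, n_action_tokens, pad_id, exit_step=4):
    guess = torch.full((1, n_action_tokens), pad_id, dtype=torch.long,
                       device=prefix_ids.device)
    for _ in range(exit_step):  # exit after a predefined number of iterations
        logits = model(torch.cat([prefix_ids, guess], dim=1)).logits
        new_guess = logits[:, prefix_ids.shape[1] - 1 : -1].argmax(dim=-1)
        if torch.equal(new_guess, guess):  # converged before the exit step
            break
        guess = new_guess
    return guess  # the intermediate result is decoded into robot actions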

Experiments

[Figure] Main experimental results in simulation.

Evaluation on CALVIN ABC→D Benchmark

Task 1: rotate_blue_block_right

Task 2: move_slider_right

Task 3: lift_red_block_slider

Task 4: place_in_slider

Task 5: turn_off_lightbulb

Evaluation on LIBERO LONG Benchmark

Task 1: pick up the book and place it in the back compartment of the caddy

Task 2: put both moka pots on the stove

Task 3: put both the alphabet soup and the cream cheese box in the basket

Task 4: put both the alphabet soup and the tomato sauce in the basket

Task 5: put both the cream cheese box and the butter in the basket

Task 6: put the black bowl in the bottom drawer of the cabinet and close it

Task 7: put the white mug on the left plate and put the yellow and white mug on the right plate

Task 8: put the white mug on the plate and put the chocolate pudding to the right of the plate

Task 9: put the yellow and white mug in the microwave and close it

Task 10: turn on the stove and put the moka pot on it

[Figure] Ablation study of CEED-VLA based on OpenVLA.

Evaluation in the Real World

[Figure] The real-world setup.

Task: Fold the towel

The top two videos demonstrate the performance of the LLaVA-VLA baseline in a real-world setting. The robotic arm operates at a relatively low frequency and struggles with dexterous manipulation tasks such as towel folding, often failing to grasp the towel properly or grasping only one edge, which leads to task failure.

The bottom two videos show the results of our proposed CEED-VLA model. Thanks to its faster inference, the robotic arm moves more smoothly and successfully completes the dexterous manipulation task.

Visual generalization: folding a towel of an unseen color.

[Figure] The real-world results.

BibTeX


      @article{song2025ceed,
        title={CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding},
        author={Song, Wenxuan and Chen, Jiayi and Ding, Pengxiang and Huang, Yuxin and Zhao, Han and Wang, Donglin and Li, Haoang},
        journal={arXiv preprint arXiv:2506.13725},
        year={2025}
      }