In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics owing to their impressive multimodal understanding and generalization capabilities. Despite this progress, their practical deployment is severely constrained by inference-speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks. While recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, its practical benefits remain marginal because of the large number of iterations required to converge. To address this, we introduce consistency distillation training that predicts multiple correct action tokens in a single iteration, thereby achieving acceleration. In addition, we design mixed-label supervision to mitigate error accumulation during distillation. Although distillation yields an acceptable speedup, we identify that certain inefficient iterations remain a critical bottleneck. To tackle this, we propose an early-exit decoding strategy that moderately relaxes the convergence conditions, further improving average inference efficiency. Experimental results show that the proposed method achieves 4× inference acceleration across different baselines while maintaining high task success rates in both simulated and real-world robot tasks. These experiments validate that our approach provides an efficient and general paradigm for accelerating multimodal decision-making in robotics.
For the target VLA \( P \), we let \( Q_\theta(\cdot \mid \mathbf{x}) \) denote the CEED-VLA model with parameters \( \theta \) initialized from \( P \). To capture the inherent consistency within Jacobi trajectories, we first collect a trajectory dataset \( \mathcal{D} \) by prompting \( P \) to predict actions with Jacobi decoding on the robot dataset \( \mathcal{C} \).
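A minimal sketch of this trajectory-collection step, assuming the teacher \( P \) is a decoder-only model callable on token ids and returning per-position token logits (all function and tensor names are illustrative):

```python
import torch

@torch.no_grad()
def collect_jacobi_trajectory(P, prompt_ids, n_action_tokens, max_iters=64):
    """Run Jacobi decoding with the teacher P and record every intermediate
    guess; the returned list is one trajectory J. P is assumed to return
    [batch, seq, vocab] logits."""
    # Initialize the n-token action block with a placeholder token.
    guess = torch.zeros((1, n_action_tokens), dtype=torch.long,
                        device=prompt_ids.device)
    trajectory = [guess.clone()]
    start = prompt_ids.shape[1] - 1  # logits here predict the action block

    for _ in range(max_iters):
        # One parallel forward pass over prompt + current guess updates
        # all n positions simultaneously (one Jacobi iteration).
        logits = P(torch.cat([prompt_ids, guess], dim=1))
        new_guess = logits[:, start:start + n_action_tokens, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):  # fixed point reached
            break
        guess = new_guess
        trajectory.append(guess.clone())

    # trajectory[-1] is the fixed point Y* (exact if converged in budget).
    return trajectory
```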
The consistency training procedure optimizes two objectives: (i) guiding the model to predict multiple correct tokens simultaneously, and (ii) preventing CEED-VLA from drifting away from the target VLA distribution, thereby preserving its manipulation skills.
We design a consistency loss to ensure that any intermediate point \( \mathcal{Y} \) on the Jacobi trajectory \( \mathcal{J} \) is mapped to the fixed point \( \mathcal{Y}^* \). This encourages the model to converge efficiently.
\[ \mathcal{L}_{\text{C}} = \mathbb{E}_{(\mathbf{x}, \mathcal{J}) \sim \mathcal{D}, \mathcal{Y} \sim \mathcal{J}} \left[ \sum_{i=1}^n \text{KL}\left( Q_{\theta^{-}}(\cdot|\mathcal{Y}_i^*, \mathbf{x}) \,\|\, Q_{\theta}(\cdot|\mathcal{Y}_i, \mathbf{x}) \right) \right] \]
where \( \theta^{-} = \text{stopgrad}(\theta) \) and KL denotes the forward KL divergence.
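A minimal sketch of \( \mathcal{L}_{\text{C}} \), assuming the student \( Q_\theta \) returns per-position token logits as above; the target side reuses the same parameters under stop-gradient (names are illustrative):

```python
import torch
import torch.nn.functional as F

def consistency_loss(Q, prompt_ids, y_intermediate, y_star):
    """L_C: pull the student's distribution at every position of an
    intermediate Jacobi point Y toward its own (stop-gradient) distribution
    at the fixed point Y*. Q returns [batch, seq, vocab] logits."""
    n = y_star.shape[1]
    start = prompt_ids.shape[1] - 1  # logits that predict the action block

    # Target side: theta^- = stopgrad(theta), evaluated on the fixed point.
    with torch.no_grad():
        tgt_logits = Q(torch.cat([prompt_ids, y_star], dim=1))
        tgt_logp = F.log_softmax(tgt_logits[:, start:start + n, :], dim=-1)

    # Student side: evaluated on the intermediate point Y.
    stu_logits = Q(torch.cat([prompt_ids, y_intermediate], dim=1))
    stu_logp = F.log_softmax(stu_logits[:, start:start + n, :], dim=-1)

    # Forward KL(target || student), summed over the n action positions.
    kl = F.kl_div(stu_logp, tgt_logp, log_target=True, reduction="none")
    return kl.sum(dim=-1).sum(dim=-1).mean()
```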
To maintain autoregressive behavior, we retain an AR loss supervised by the teacher-generated fixed points \( \mathcal{Y}^* \):
\[ \mathcal{L}_{\text{AR}} = \mathbb{E}_{(\mathbf{x}, \mathcal{Y}^{*}) \sim \mathcal{D}} \left[ -\sum_{i=1}^{n} \log Q_{\theta}\left( \mathcal{Y}^{*}_{i} \mid \mathcal{Y}^{*}_{<i}, \mathbf{x} \right) \right] \]
For outliers that deviate substantially from the ground truth (measured by L1 distance), the AR labels are replaced with the corresponding ground-truth values; this mixed-label supervision mitigates error accumulation during distillation. The overall training objective combines the two terms:
\[ \mathcal{L}(\theta) = \mathcal{L}_{\text{C}} + \omega \cdot \mathcal{L}_{\text{AR}} \]
where \( \omega \) is a weighting coefficient balancing consistency and AR supervision.
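The sketch below illustrates the mixed-label AR term and the combined objective; the outlier threshold \( \tau \), the decoded per-token action values, and all tensor names are assumptions of this illustration rather than the exact implementation:

```python
import torch
import torch.nn.functional as F

def mixed_label_ar_loss(Q, prompt_ids, y_star, y_gt,
                        actions_star, actions_gt, tau=0.1):
    """L_AR with mixed-label supervision (illustrative). Teacher-generated
    labels y_star are kept except where the decoded action deviates from the
    ground truth by more than tau in L1 distance; those outlier positions
    fall back to the ground-truth tokens y_gt."""
    outlier = (actions_star - actions_gt).abs() > tau  # [batch, n] bool mask
    labels = torch.where(outlier, y_gt, y_star)        # mixed labels

    logits = Q(torch.cat([prompt_ids, labels], dim=1))
    start, n = prompt_ids.shape[1] - 1, labels.shape[1]
    logp = F.log_softmax(logits[:, start:start + n, :], dim=-1)

    # Standard next-token negative log-likelihood on the mixed labels.
    nll = -logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return nll.sum(dim=-1).mean()

# Combined objective: L(theta) = L_C + omega * L_AR
# loss = consistency_loss(Q, x, y, y_star) + omega * mixed_label_ar_loss(...)
```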
To address the inefficiency of Jacobi decoding caused by its strict convergence requirement, we propose early-exit decoding. Instead of waiting for full convergence, the model exits at a predefined step and directly outputs the intermediate result. This strategy leverages the observation that late-stage token updates are typically minor and have limited impact on task success. By relaxing the stopping condition, early-exit decoding significantly improves both the average and minimum inference speeds while preserving manipulation accuracy, enabling high-frequency robotic control with minimal delay.
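A minimal sketch of early-exit decoding, replacing the exact fixed-point check with a fixed iteration budget (the default budget and all names are illustrative):

```python
import torch

@torch.no_grad()
def early_exit_jacobi_decode(Q, prompt_ids, n_action_tokens, exit_step=4):
    """Jacobi decoding that exits after at most `exit_step` iterations and
    returns the intermediate result, trading late-stage refinement for a
    bounded, predictable latency."""
    guess = torch.zeros((1, n_action_tokens), dtype=torch.long,
                        device=prompt_ids.device)
    start = prompt_ids.shape[1] - 1

    for _ in range(exit_step):
        logits = Q(torch.cat([prompt_ids, guess], dim=1))
        new_guess = logits[:, start:start + n_action_tokens, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):  # converged before the budget
            break
        guess = new_guess

    return guess  # early exit: intermediate result used as the action output
```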
Task 1: rotate_blue_block_right
Task 2: move_slider_right
Task 3: lift_red_block_slider
Task 4: place_in_slider
Task 5: turn_off_lightbulb
Task 1: pick up the book and place it in the back compartment of the caddy
Task 2: put both moka pots on the stove
Task 3: put both the alphabet soup and the cream cheese box in the basket
Task 4: put both the alphabet soup and the tomato sauce in the basket
Task 5: put both the cream cheese box and the butter in the basket
Task 6: put the black bowl in the bottom drawer of the cabinet and close it
Task 7: put the white mug on the left plate and put the yellow and white mug on the right plate
Task 8: put the white mug on the plate and put the chocolate pudding to the right of the plate
Task 9: put the yellow and white mug in the microwave and close it
Task 10: turn on the stove and put the moka pot on it
@article{song2025ceedvla,
title={CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding},
author={Song, Wenxuan and Chen, Jiayi and Ding, Pengxiang and Huang, Yuxin and Zhao, Han and Wang, Donglin and Li, Haoang and others},
journal={arXiv preprint arXiv:2506.13725},
year={2025}
}