RationalVLA: A Dual-system Vision-Language-Action Model for Defective Instructions

Wenxuan Song*      Jiayi Chen*      Wenxue Li      Xu He      Han Zhao      Can Cui      Pengxiang Ding      Shiyan Su      Feilong Tang      Donglin Wang      Xuelian Cheng      Zongyuan Ge      Zhe Liu      Hesheng Wang      Yun-hui Liu      Haoang Li     
1The Hong Kong University of Science and Technology (Guangzhou)      2Westlake University
3Monash University      4Shanghai Jiao Tong University      5The Chinese University of Hong Kong
*Indicates Equal Contribution

Abstract

A fundamental requirement for real-world robotic deployment is the ability to understand and respond to natural language instructions. Existing language-conditioned manipulation tasks typically assume that instructions are perfectly aligned with the environment. This assumption limits robustness and generalization in realistic scenarios where instructions may be ambiguous, irrelevant, or infeasible. To address this problem, we introduce RAtional MAnipulation (RAMA), a new benchmark that challenges models with both unseen executable instructions and defective ones that should be rejected. In RAMA, we construct a dataset with over 14,000 samples, including diverse defective instructions spanning six dimensions: visual, physical, semantic, motion, safety, and out-of-context. We further propose the Rational Vision-Language-Action model (RationalVLA), a dual-system framework for robotic arms that integrates a high-level vision-language model with a low-level manipulation policy through learnable latent-space embeddings. This design enables RationalVLA to reason over instructions, reject infeasible commands, and execute manipulation effectively. Experiments demonstrate that RationalVLA outperforms state-of-the-art baselines on RAMA by a 14.5% higher success rate and a 0.94 higher average task length, while maintaining competitive performance on standard manipulation tasks. Real-world trials further validate its effectiveness and robustness in practical applications.
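To make the dual-system idea concrete, below is a minimal, self-contained sketch of how learnable latent embeddings might bridge a high-level vision-language model and a low-level manipulation policy, with an additional head for rejecting defective instructions. The module names, dimensions, stand-in VLM backbone, and the mean-pooled rejection head are all illustrative assumptions for this sketch, not the authors' actual architecture or code.

```python
# Illustrative sketch only: a stand-in transformer encoder replaces the
# pretrained VLM, and all sizes/heads are hypothetical choices.
import torch
import torch.nn as nn


class DualSystemVLA(nn.Module):
    def __init__(self, d_model=512, n_latents=8, action_dim=7, n_heads=8):
        super().__init__()
        # Stand-in for a high-level vision-language backbone (assumption).
        self.vlm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2,
        )
        # Learnable latent embeddings that query the VLM's hidden states.
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Low-level policy head: maps latent summaries to arm actions.
        self.policy = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, action_dim)
        )
        # Rejection head: decides whether the instruction is executable.
        self.reject = nn.Linear(d_model, 1)

    def forward(self, vl_tokens):
        # vl_tokens: (batch, seq_len, d_model) fused vision-language tokens.
        h = self.vlm(vl_tokens)
        q = self.latents.unsqueeze(0).expand(h.size(0), -1, -1)
        z, _ = self.cross_attn(q, h, h)            # latents attend to VLM states
        pooled = z.mean(dim=1)
        action = self.policy(pooled)               # (batch, action_dim)
        reject_logit = self.reject(pooled)         # > 0 => refuse the instruction
        return action, reject_logit


if __name__ == "__main__":
    model = DualSystemVLA()
    tokens = torch.randn(2, 32, 512)               # dummy fused tokens
    action, reject_logit = model(tokens)
    print(action.shape, reject_logit.shape)        # (2, 7) and (2, 1)
```

In this toy version, the latent embeddings act as the only interface between the two systems: the policy never sees raw VLM tokens, so the high-level model can reason over (and veto) defective instructions while the low-level policy consumes a compact conditioning signal.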

BibTeX


@article{song2025rationalvla,
  title={RationalVLA: A Rational Vision-Language-Action Model with Dual System},
  author={Song, Wenxuan and Chen, Jiayi and Li, Wenxue and He, Xu and Zhao, Han and Cui, Can and Ding, Pengxiang and Su, Shiyan and Tang, Feilong and Cheng, Xuelian and Wang, Donglin and Ge, Zongyuan and others},
  journal={arXiv preprint arXiv:2506.10826},
  year={2025}
}