Scaling by Diversified Experience for Vision-Language-Action Models

Leiyu Wang*,1,2, Zhaofengnian Wang*,2,4, Xueqi Li2,3, Luoyi Fan1,2, Cewu Lu1,2, Nanyang Ye†,1,2
1Shanghai Jiao Tong University   2Shanghai Innovation Institute   3Southern University of Science and Technology   4Tongji University
*Equal Contribution   †Corresponding Author
Teaser figure

Abstract

Vision-Language-Action models face significant challenges in real-world deployment due to the entanglement of high-level reasoning with low-level control, and the instability of policy optimization. In this paper, we introduce SyVLA, a robust VLA model trained with diversified experiences. We propose an Intention Decoupling algorithm to isolate control-relevant features from reasoning contexts and a similar-sample guided RL pipeline to stabilize policy updates and mitigate distribution shift. Extensive experiments on real-world robotic tasks and multi-modal benchmarks demonstrate that SyVLA achieves superior task success rates and stronger out-of-distribution generalization compared to existing methods, while effectively preserving core vision-language capabilities.

Highlights

SyVLA architecture

We propose SyVLA, a novel Vision-Language-Action (VLA) model that achieves a strong balance between robotic task competence and the preservation of general vision-language capabilities.

Recent SOTA VLA models remain limited by two fundamental challenges: (1) decoupling control intention from high-level reasoning, and (2) enabling stable real-world reinforcement learning to bridge the gap between imitation learning objectives and actual task success.

To tackle these challenges, we introduce an Intention Decoupling algorithm that gradient-masks control-irrelevant Feature Query Tokens, preventing the Action Expert from being misled by high-level reasoning information. We further develop a Similar-Sample Guided RL pipeline, which retrieves semantically similar expert trajectories to stabilize policy updates and mitigate policy collapse during real-world reinforcement learning.
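The two mechanisms above can be illustrated with a toy sketch. It is not the authors' implementation: this page does not say how the mask is chosen, where in the network it is applied, or which embedding and similarity measure the retrieval step uses. Here the mask is hand-written, gradients are plain lists, and retrieval is cosine similarity over made-up embeddings. The names `masked_token_gradients`, `retrieve_similar`, and the trajectory labels are hypothetical.

```python
import math

def masked_token_gradients(token_grads, control_relevant):
    """Zero the gradient of every query token flagged as control-irrelevant.

    token_grads: one gradient vector (list of floats) per query token.
    control_relevant: one bool per token; False means "mask this token".
    """
    return [
        [g if keep else 0.0 for g in grad]
        for grad, keep in zip(token_grads, control_relevant)
    ]

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def retrieve_similar(query_embedding, expert_bank, k=2):
    """Return the names of the k expert trajectories most similar to the query.

    expert_bank: list of (name, embedding) pairs; an assumed data layout.
    """
    ranked = sorted(
        expert_bank,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Toy gradients for three query tokens; the second is treated as
# control-irrelevant (an assumption for illustration only).
grads = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
mask = [True, False, True]
masked = masked_token_gradients(grads, mask)

# Toy bank of expert trajectories with made-up 3-d embeddings.
bank = [
    ("fold_left", [1.0, 0.0, 0.0]),
    ("fold_right", [0.9, 0.1, 0.0]),
    ("bag_snack", [0.0, 0.0, 1.0]),
]
nearest = retrieve_similar([1.0, 0.05, 0.0], bank, k=2)
```

In this sketch the masked token contributes nothing to the update that reaches the action pathway, and the retrieved trajectories are the ones a guided RL step would use as references for the current rollout.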

Extensive experiments are conducted on three challenging real-world tasks (folding shirts under location change, solving arithmetic instructions while wrapping cubes, and bagging snacks with vague natural language commands) as well as on multiple multi-modal benchmarks. SyVLA achieves the highest average success rates in both in-domain (0.73) and out-of-distribution (0.64) settings, using less than 5% of the pretraining data of prior industrial models. At the same time, it preserves strong vision-language reasoning and outperforms other VLA models on several visual question answering benchmarks.

Overall, this work demonstrates that with careful architectural design and stable RL integration, VLA models can excel simultaneously at dexterous manipulation and general-purpose multimodal understanding, a substantial step toward open, generalizable embodied intelligence.

Real-World Robot Results

Method               In Domain                                       Out of Distribution
                     Task 1   Task 2   Task 3   Avg. Success Rate    Task 1   Task 2   Task 3   Avg. Success Rate
OpenVLA-oft          0.71     0.29     0.36     0.45                 0.55     0.11     0.07     0.24
GR00T                0.71     0.21     0.21     0.38                 0.64     0.07     0.00     0.24
Wall-Oss             0.50     0.18     0.14     0.27                 0.14     0.04     0.00     0.06
Pi0 (pretrained)     0.93     0.39     0.57     0.63                 0.78     0.29     0.36     0.48
Pi0 (from scratch)   0.64     0.21     0.29     0.38                 0.50     0.14     0.14     0.26
ChatVLA              0.21     0.29     0.14     0.21                 0.00     0.21     0.07     0.09
SyVLA (ours)         0.86     0.68     0.64     0.73                 0.78     0.57     0.57     0.64

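Results on real-world tasks. We evaluate all models under both In Domain and OoD (out-of-distribution) settings. SyVLA achieves the highest average success rate in both settings, 0.73 in domain and 0.64 OoD, compared with 0.63 and 0.48 for the strongest baseline, Pi0 (pretrained). Its margin is largest on Tasks 2 and 3; on Task 1, Pi0 (pretrained) is ahead in domain (0.93 vs. 0.86) and tied OoD (0.78).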
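The Avg. Success Rate columns appear to be the unweighted mean of the three per-task rates, rounded to two decimals. The page does not state this, so treat it as an assumption; the short sketch below recomputes the averages from the per-task numbers in the table. The `RESULTS` layout and the helper name are illustrative only.

```python
# Per-task success rates copied from the table above:
# method -> (in-domain rates, out-of-distribution rates) for Tasks 1-3.
RESULTS = {
    "OpenVLA-oft": ([0.71, 0.29, 0.36], [0.55, 0.11, 0.07]),
    "GR00T": ([0.71, 0.21, 0.21], [0.64, 0.07, 0.00]),
    "Wall-Oss": ([0.50, 0.18, 0.14], [0.14, 0.04, 0.00]),
    "Pi0 (pretrained)": ([0.93, 0.39, 0.57], [0.78, 0.29, 0.36]),
    "Pi0 (from scratch)": ([0.64, 0.21, 0.29], [0.50, 0.14, 0.14]),
    "ChatVLA": ([0.21, 0.29, 0.14], [0.00, 0.21, 0.07]),
    "SyVLA (ours)": ([0.86, 0.68, 0.64], [0.78, 0.57, 0.57]),
}

def average_success(rates):
    """Unweighted mean of per-task success rates, rounded to two decimals."""
    return round(sum(rates) / len(rates), 2)

# (in-domain average, OoD average) per method.
averages = {
    method: (average_success(in_domain), average_success(ood))
    for method, (in_domain, ood) in RESULTS.items()
}

best_in_domain = max(averages, key=lambda m: averages[m][0])
best_ood = max(averages, key=lambda m: averages[m][1])

for method, (avg_in, avg_ood) in averages.items():
    print(f"{method:20s} in-domain {avg_in:.2f}   OoD {avg_ood:.2f}")
```

Under this assumption the recomputed values match the reported columns for every method, and `best_in_domain` and `best_ood` both resolve to SyVLA.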

Multi-modal Benchmarks

Method        DocVQA   AI2D    MMMU    MME       HallBench
OpenVLA-oft   -        -       -       -         -
Wall-Oss      63.62    58.60   37.11   1146.56   36.57
GR00T         -        -       -       -         -
Pi0           -        -       -       -         -
ChatVLA       83.30    67.36   37.40   1435      39.90
Ours          80.01    67.70   35.78   1795      42.53
VQA example

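Multi-modal benchmark results. Our SyVLA outperforms the baselines in commonsense and fine-grained visual understanding, while remaining comparable on document and multidisciplinary knowledge: it scores highest on AI2D, MME, and HallBench, and trails ChatVLA on DocVQA (80.01 vs. 83.30) and MMMU (35.78 vs. 37.40). A dash (-) indicates no VQA capability.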

Real-World Robot Demonstrations

Task 1: Folding Shirts (2×)

Task 2: Calculating & Wrapping (2×)

Task 3: Bagging Snacks (2×)

BibTeX

@misc{wang2026scalingdiversifiedexperiencevisionlanguageaction,
      title={Scaling by Diversified Experience for Vision-Language-Action Models}, 
      author={Leiyu Wang and Zhaofengnian Wang and Xueqi Li and Luoyi Fan and Cewu Lu and Nanyang Ye},
      year={2026},
      eprint={2606.09009},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.09009}, 
}