Exploring the Future of Dexterous Robotics with Vision-Language-Action Models

Advances in robotics are unlocking new possibilities in automating complex, high-variability tasks that were once deemed too challenging for conventional programming methods. At Southwest Research Institute (SwRI), we are pushing the boundaries of robotic dexterity through the use of innovative imitation learning techniques powered by diffusion models and vision-language-action (VLA) architectures. Vision-language-action models combine large language models (LLMs) with vision tools, enabling robots to learn through interactions with multimodal datasets of images and video.

In one recent exploratory project, we taught robots dexterous skills by mimicking human demonstrations. This approach leverages diffusion models to imitate training examples provided by humans, programming robot motions without explicit knowledge of its surroundings and the associated physics of those surroundings. In this setup, the robot generates end-effector Cartesian positions directly from pixel inputs within its planning horizon, simplifying learning processes for tasks involving high-degree-of-freedom manipulator motions. While promising, this method comes with limitations in capability due to its foundational reliance on context from visual data alone.

Four photos laid out in a grid with left photo showing example data collection and right photos showing robotic setup

SwRI used vision-language-action (VLA) techniques to enable robots to learn through interactions with multimodal datasets of images and video. Left: demonstration of the assembly task by the operator with the training tool. Right: subsequent robot execution after training.

Pretraining VLA Models & Integrating Digital Twins

To take the next step, we are developing systems that involve pre-trained VLA models, artificial intelligence systems trained on internet-scale robotic demonstration datasets such as Gr00t, Toyota Research Institute’s Large Behavior Models (LBMs), Google’s Gemini Robotics and Generalist. Using an Apple Vision Pro headset, we teleoperate two collaborative robot arms (UR5Es with Robotiq grippers) to provide hands-on training examples. From there, we augment these initial datasets using synthetic data simulations from platforms like SkillGen, Cosmos-Transfer and Marble by World Labs. This method enables scalability while maintaining precision and opens doors to applications such as industrial conveyor sorting, high-mix assembly lines, and vehicle refueling systems.

In addition to the teleoperation hardware setup, we have also developed a digital twin of the environment in NVidia’s Isaac Sim simulator. This adds to our capabilities and enables us to use the synthetic data simulation platforms mentioned above for visual and action data augmentation. The video below shows a live demonstration of a ROS 2 implementation of a cartesian impedance controller running on both robots linked to a minimal digital twin setup in Isaac Sim. This demonstration enables data augmentation and variation through a digital twin.

Robustness and failure recovery are key focuses for these projects, where synthetic data from digital twins simulates real-world variability and enriches imitation learning pipelines. By pairing image inputs from cameras (two wrist-mounted and one overhead scene camera) with recorded robot actions, we aim to train policies capable of complex, real-world tasks. While early results show potential to improve accuracy and repeatability, there remains work to ensure consistency at industrial scale.

These innovations reflect a broader ambition to revolutionize robotics through cutting-edge AI strategies. By exploring synthetic data generation, teleoperation tools and scalable imitation learning, we are charting a path to deploy adaptable, high-performing robotic solutions across industries. Stay tuned as we continue to unlock the future of intelligent automation.

Questions about this article? Contact Paul Evans or call +1 210 522 2994.