Vision-Language-Action (VLA) Models: LLMs for robots

April 17, 2025

Sudhir Pratap Yadav

5 min

In the ever-evolving world of robotics, making machines interact with their environments as naturally and intelligently as humans do remains a central challenge. At Black Coffee Robotics, we've been exploring how large AI models can bridge that gap, especially through the emerging field of Vision-Language-Action (VLA) systems.

These models bring us one step closer to robotic assistants that can understand user intent from just an image and a sentence, eliminating the need to handcraft control policies or manually program each behavior.

In this post, we dive into how we tested and fine-tuned a powerful open-source model, OpenVLA-7B, for real-world robotic manipulation tasks, including pick-and-place, drawer opening, and tool usage.

Why Vision-Language-Action (VLA) Models?

Traditional robotic learning pipelines, such as Reinforcement Learning (RL) and behavior cloning, often demand extensive domain-specific engineering, large datasets of robot-environment interactions, precisely shaped reward functions, and prolonged training on GPU resources. These constraints hinder generalization, scalability, and practical deployment in real-world settings.

Vision-Language-Action (VLA) models offer a compelling alternative by leveraging the synergy of multi-modal learning to overcome these limitations. Here's why they are becoming increasingly important in robotics:

  • Data Efficiency via Pretraining:
    VLA models can be pretrained on large-scale internet or simulation datasets using vision-language pairs (e.g., images and captions or instructional videos), reducing the reliance on task-specific robot data. This pretraining helps bootstrap capabilities that generalize well to downstream robotic tasks with minimal fine-tuning.
  • Natural Instruction Following:
    By incorporating natural language as an input modality, VLA models can interpret and execute high-level instructions like "pick up the red block" or "open the left drawer", a paradigm shift from manually coded action policies or symbolic planning pipelines.
  • Cross-Domain and Cross-Platform Generalization:
    The shared embedding space for vision and language enables a single model to transfer knowledge across tasks, objects, and robotic embodiments. This means that a model trained in one domain (e.g., simulation or a different robot) can still be effective in another with limited adaptation.
  • Unified Policy Learning:
    Instead of decoupling perception, task understanding, and control into separate modules, VLA models learn an end-to-end policy that jointly reasons about the scene, the goal, and how to act. This holistic approach improves robustness and simplifies the system design.

In short, VLA models present a promising foundation for building general-purpose robots capable of understanding and executing tasks in diverse and dynamic real-world environments, making them a cornerstone of next-generation embodied AI.

How We Applied OpenVLA

We used OpenVLA-7B, a transformer-based model with a vision encoder and action decoder that processes a 224×224 RGB image and predicts a 7-dimensional action vector (for 3D movement, rotation, and gripper control).
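
For reference, running inference follows the Hugging Face usage published with the OpenVLA release. The snippet below is a minimal sketch, assuming a saved camera frame and a hypothetical instruction; the camera capture and downstream controller are placeholders, and the unnorm_key should match the dataset statistics the checkpoint was trained (or fine-tuned) on.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the processor and the OpenVLA-7B checkpoint
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

# Placeholder: a 224x224 RGB frame from your camera or simulator
image = Image.open("frame.png").convert("RGB")
prompt = "In: What action should the robot take to pick up the red cube?\nOut:"

# Predict a 7-D action: delta position (x, y, z), delta orientation, gripper command
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
# `action` is a length-7 array; pass it to your low-level controller
```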

To evaluate its performance, we tested OpenVLA on three different environments with increasing complexity:

Simulation Environments

  • MuJoCo (OpenVLA default tasks): For initial evaluation and baseline results.
  • PyBullet (Pick task):
    • Simple cube-picking using Widow-X arm
    • CPU-friendly and easy to set up (a minimal scene sketch follows this list)
  • Isaac Lab (Dishwasher unloading):
    • Contact-rich manipulation involving object-object collisions and fine motor control
    • Our most complex evaluation environment
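
To illustrate how lightweight the PyBullet pick setup is, the sketch below builds a minimal scene. The Widow-X URDF path, cube placement, and simulation rates are illustrative assumptions, not our exact configuration.

```python
import pybullet as p
import pybullet_data

# Connect headless (use p.GUI for a visual window)
p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)
p.setTimeStep(1.0 / 240.0)

# Ground plane and a small cube to pick up (both ship with pybullet_data)
plane_id = p.loadURDF("plane.urdf")
cube_id = p.loadURDF("cube_small.urdf", basePosition=[0.35, 0.0, 0.025])

# Hypothetical path: point this at your Widow-X robot description
arm_id = p.loadURDF("widowx_description/wx250s.urdf",
                    basePosition=[0.0, 0.0, 0.0], useFixedBase=True)

# Step the physics at 240 Hz; the policy itself runs at a much lower control rate
for _ in range(240):
    p.stepSimulation()
```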

Data Collection Strategy

  • Episodic collection using RLDS format
  • Collected RGB image and matching action at each step
  • Tasks included:
    • Pick task: 30 steps per episode, 30 episodes
    • Dishwasher task: 60 steps per episode, 50 episodes
  • Control frequency: 5 Hz
  • All data collected via hand-designed motion policies, with injected noise for diversity and robustness
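
To make the collection loop concrete, here is a rough sketch of how one episode is gathered. The environment and scripted-policy interfaces are hypothetical stand-ins, and the noise scale is an assumption; each episode is a list of step dictionaries that can later be serialized into RLDS/TFDS format with the standard tooling.

```python
import numpy as np

CONTROL_HZ = 5          # control frequency used during collection
STEPS_PER_EPISODE = 30  # 30 for the pick task, 60 for the dishwasher task
NUM_EPISODES = 30
NOISE_STD = 0.01        # Gaussian action noise injected for diversity (assumed scale)

def collect_episode(env, scripted_policy):
    """Roll out one episode and return RLDS-style step dictionaries."""
    steps = []
    obs = env.reset()
    for t in range(STEPS_PER_EPISODE):
        # Hand-designed motion policy plus injected noise for robustness
        action = scripted_policy(obs) + np.random.normal(0.0, NOISE_STD, size=7)
        next_obs = env.step(action)  # env is assumed to advance at CONTROL_HZ
        steps.append({
            "observation": {"image": obs["rgb"]},   # 224x224x3 RGB frame
            "action": action.astype(np.float32),    # 7-D action vector
            "language_instruction": "pick up the cube",
            "is_first": t == 0,
            "is_last": t == STEPS_PER_EPISODE - 1,
        })
        obs = next_obs
    return steps

# episodes = [collect_episode(env, scripted_policy) for _ in range(NUM_EPISODES)]
# Each episode is then written out through an RLDS/TFDS dataset builder.
```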

Fine-tuning

  • Used LoRA (Low-Rank Adaptation) for efficient model tuning
  • Each task fine-tuned for ~10k steps (~4–5 hours)
  • Trained on a single GPU with 24 GB VRAM
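
For context, LoRA freezes the pretrained weights and trains only small low-rank adapter matrices, which is what makes fine-tuning a 7B-parameter model feasible on a single 24 GB GPU. The sketch below shows one way to attach adapters with the peft library; the rank, alpha, and target-module choices are illustrative assumptions rather than our exact configuration (in practice we used the fine-tuning script shipped with the OpenVLA repository).

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Load the pretrained policy in bfloat16 to fit within 24 GB of VRAM
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Attach low-rank adapters; only these small matrices receive gradients
lora_cfg = LoraConfig(
    r=32,                        # adapter rank (assumed value)
    lora_alpha=16,               # scaling factor (assumed value)
    lora_dropout=0.0,
    target_modules="all-linear", # adapt every linear projection in the model
    init_lora_weights="gaussian",
)
vla = get_peft_model(vla, lora_cfg)
vla.print_trainable_parameters()  # only a small fraction of the 7B parameters train
```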

Results & Observations

Our experiments yielded a blend of exciting breakthroughs and instructive limitations.

Video Demonstration

Below is a demonstration of OpenVLA-7B in action, showcasing its capabilities across different tasks:

OpenVLA-7B demonstrating pick-and-place task.
OpenVLA-7B demonstrating dishwasher unloading task

What Worked

  • Simple pick-and-place tasks:
    After fine-tuning, the model achieved near-perfect success rates (~100%) at picking up objects in uncluttered simulation scenes such as our PyBullet setup.
  • Transferability:
    The model adapted to different arms (Widow-X and Kinova) and environments with only modest task-specific data.
  • Ease of Use:
    The workflow, from data collection to fine-tuning, was relatively smooth and repeatable. Once the pipeline was in place, training new tasks was quick.

What Didn't Work Well (Yet)

  • Out-of-the-box performance:
    Without fine-tuning, the pre-trained OpenVLA struggled with unseen environments (~0% success), reinforcing the importance of task adaptation.
  • Complex, contact-heavy manipulation:
    Tasks like unloading a plate from a dishwasher, which involve object-object collisions, precise trajectory planning, and fine motor control, achieved only 5–10% success, even after fine-tuning.

Here are some examples of failure cases we observed:

Failure case 1 - Robot unable to grab the plate

Failure case 2 - Plate falls when trying to take it out

  • Lack of long-horizon planning:
    Since OpenVLA predicts only the next-step action, it struggles in tasks where future planning is critical. This is where newer models like Pi0, which predict action sequences, may outperform it.
  • Data sensitivity:
    We observed that the quality, diversity, and noise profile of the collected data dramatically affected learning. Adding variation helped, but too much noise or repetition degraded performance.

Key Takeaways & Next Steps

  • Action precision matters:
    Fine manipulation is hard. Collision-aware actions and tactile feedback may be needed for complex tasks.
  • Better data = better performance:
    We're now focusing on refining our data filtering and augmentation pipelines, ensuring the model sees the right kind of variety during training.
  • Next-gen models are promising:
    We're exploring Pi0 and similar models that:
    • Accept multiple visual frames as input
    • Predict full action sequences instead of just one step
  • Infrastructure is key:
    To scale experiments, we're working on automated pipelines for data collection, visualization, and model debugging.

Final Thoughts

VLA models like OpenVLA are still maturing—but they already offer a glimpse into a future where robots can learn faster, adapt better, and interact more naturally. Our journey with OpenVLA showed both the promise and the path forward.

At Black Coffee Robotics, we're excited to push the boundaries of intelligent robot control. If you're exploring AI-driven robotics, manipulation, or custom automation solutions, reach out to us!
