MarkTechPost@AI · June 4, 01:40
Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics

Hugging Face has introduced SmolVLA, a compact vision-language-action (VLA) model designed to lower the hardware and data requirements of robot control. Unlike conventional VLA models that rely on large transformer backbones, SmolVLA targets affordability and deployment efficiency and can run in single-GPU or CPU environments. The model is trained on community-collected data and integrates a trimmed pretrained vision-language model with a transformer-based action expert, enabling the low-latency control needed for real-time applications. Its open-source release gives researchers a reproducible framework and pushes robot learning forward.

🤖 SmolVLA is a lightweight vision-language-action model built for affordable robot control that can run in single-GPU or CPU environments.

💡 The model has two main parts: a perception module (SmolVLM-2) that processes visual and language inputs, and an action expert that predicts continuous control actions. Layer pruning and chunked action prediction keep compute requirements low.

🚀 SmolVLA uses an asynchronous inference stack that decouples action prediction from execution, reducing average task time and improving efficiency for edge deployment.

✅ In both simulation and the real world, SmolVLA performs strongly, posting competitive results on the LIBERO and Meta-World simulation benchmarks and completing real-world pick-place, stacking, and sorting tasks on the SO100 and SO101 platforms.

🔓 SmolVLA is released under an open-source license with code, training data, and deployment tools, laying a foundation for efficient and accessible robot-learning research.

Despite recent progress in robotic control via large-scale vision-language-action (VLA) models, real-world deployment remains constrained by hardware and data requirements. Most VLA models depend on transformer-based backbones with billions of parameters, resulting in significant memory and compute costs. This limits experimentation to well-resourced labs and clouds, excluding practitioners working with lower-cost hardware. Additionally, much of the current progress in VLA research remains either proprietary or based on non-reproducible methodologies, impeding open research. Finally, data heterogeneity across robotic platforms—differences in morphology, sensors, and control modes—poses a further challenge to generalizability and cross-platform learning.

Hugging Face Introduces SmolVLA: A Lightweight, Open VLA Framework

Hugging Face presents SmolVLA, a compact vision-language-action model developed for affordability and deployment efficiency. Unlike conventional VLAs, SmolVLA is trained entirely on community-collected datasets and is optimized to run on single-GPU or CPU environments. The model architecture integrates a trimmed version of a pretrained vision-language model (SmolVLM-2) and a transformer-based action expert. This structure enables efficient low-level control from natural language instructions and RGB camera inputs.
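In practice, the interface this implies is simple: a natural-language instruction and RGB camera frames go in, and a continuous low-level action comes out. The following sketch illustrates that control loop with hypothetical `policy`, `camera`, and `robot` objects; it is a stand-in for the deployment tooling released with the model, not the official API.

```python
# Minimal control-loop sketch. `policy`, `camera`, and `robot` are hypothetical
# stand-ins; the released SmolVLA deployment tooling provides the real interfaces.
import numpy as np

def run_episode(policy, camera, robot, instruction: str, max_steps: int = 200):
    """Drive the robot from RGB frames and a natural-language instruction."""
    for _ in range(max_steps):
        frame = camera.read()                   # H x W x 3 uint8 RGB frame
        obs = {"image": frame, "task": instruction}
        action = policy.select_action(obs)      # continuous joint / gripper targets
        robot.send_action(np.asarray(action))   # low-level command to the arm
        if robot.task_done():
            break
```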

A distinguishing feature of SmolVLA is its asynchronous inference stack, which decouples action prediction from execution. This design enables low-latency control suitable for real-time applications, even in resource-constrained settings. SmolVLA is released under an open license with accompanying code, training data, and deployment tools.

Architectural Overview and Design Trade-Offs

The SmolVLA model is structured into two primary components:

- A perception module built on SmolVLM-2, a trimmed pretrained vision-language model that encodes RGB camera frames and natural-language instructions into a compact token sequence.
- An action expert, a lightweight transformer that predicts continuous control actions conditioned on those perception features.

To reduce computational overhead, linear projections are used to align the modalities’ token dimensions. Action chunks are generated instead of single-step predictions, reducing the frequency of inference calls. The model is trained using bfloat16 precision and Torch’s JIT compilation for runtime optimization.
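As a concrete illustration, the sketch below wires a stand-in vision-language feature sequence into a small transformer action expert: a linear projection aligns the VLM token width with the expert's hidden size, and a single forward pass decodes a whole chunk of continuous actions. All dimensions and layer counts here are illustrative assumptions, not the published SmolVLA configuration.

```python
# Illustrative two-component wiring in PyTorch; sizes are assumptions, not the
# published SmolVLA configuration.
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Small transformer that maps projected VLM tokens to a chunk of actions."""

    def __init__(self, vlm_dim=960, expert_dim=576, action_dim=7, chunk_len=50):
        super().__init__()
        # Linear projection aligns the VLM token width with the expert's width.
        self.project = nn.Linear(vlm_dim, expert_dim)
        layer = nn.TransformerEncoderLayer(d_model=expert_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        # One learned query per future timestep in the action chunk.
        self.chunk_queries = nn.Parameter(torch.randn(chunk_len, expert_dim))
        self.to_action = nn.Linear(expert_dim, action_dim)

    def forward(self, vlm_tokens):                  # (B, T, vlm_dim)
        ctx = self.project(vlm_tokens)              # (B, T, expert_dim)
        queries = self.chunk_queries.expand(ctx.size(0), -1, -1)
        h = self.blocks(torch.cat([ctx, queries], dim=1))
        h = h[:, -self.chunk_queries.size(0):]      # keep only the query positions
        return self.to_action(h)                    # (B, chunk_len, action_dim)

# The real model trains in bfloat16; float32 is used here for portability.
expert = ActionExpert()
vlm_tokens = torch.randn(1, 64, 960)                # stand-in for SmolVLM-2 features
chunk = expert(vlm_tokens)                          # one call yields a whole chunk
print(chunk.shape)                                  # torch.Size([1, 50, 7])
```

Because each forward pass emits a chunk of actions rather than a single step, the policy only needs to be queried once per chunk of control ticks, which is what keeps inference frequency, and therefore compute, low.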

Empirical Evaluation: Simulation and Real-World Performance

SmolVLA is evaluated across both simulation benchmarks (LIBERO and Meta-World) and real-world robotic tasks using low-cost SO100 and SO101 platforms. The model is trained from scratch on ~23K episodes across 481 community datasets, with task labels auto-generated using a VLM. Evaluation metrics include task-level success rates under both in-distribution and out-of-distribution conditions.
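Measuring performance this way comes down to tallying episode outcomes per condition. The sketch below shows a minimal harness of that shape; `run_rollout` is a hypothetical single-episode evaluator supplied by the caller, not part of any released tooling.

```python
# Minimal success-rate tally, split by evaluation condition (e.g. in-distribution
# vs. out-of-distribution). `run_rollout` is a hypothetical episode evaluator.
from collections import defaultdict

def evaluate(policy, tasks, run_rollout, episodes_per_task=10):
    totals, successes = defaultdict(int), defaultdict(int)
    for task in tasks:                          # each task carries a condition tag
        for _ in range(episodes_per_task):
            ok = run_rollout(policy, task)      # True if the goal was reached
            totals[task["condition"]] += 1
            successes[task["condition"]] += int(ok)
    return {cond: successes[cond] / totals[cond] for cond in totals}
```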

In the LIBERO benchmark, SmolVLA (0.45B) achieves an average success rate of 87.3%, closely matching or surpassing larger models such as π₀ (3.3B). In Meta-World, the model outperforms diffusion policies and smaller-scale VLAs across task difficulty levels. These results are notable considering SmolVLA’s smaller training footprint and absence of robotics-specific pretraining.

In real-world settings, SmolVLA achieves average success rates of 78.3% across pick-place, stacking, and sorting tasks—outperforming both ACT (trained from scratch) and π₀ (finetuned). Moreover, SmolVLA generalizes across robotic embodiments, maintaining performance on SO101 despite training exclusively on SO100 data.

Performance Implications of Asynchronous Inference

SmolVLA’s asynchronous inference stack improves control efficiency by overlapping prediction and execution. Compared to traditional synchronous inference, this approach reduces average task time by ~30% and doubles the number of completed actions in fixed-time scenarios. This is particularly beneficial for edge deployments where inference delays degrade real-time performance.
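The gain comes from pipelining: while the robot is still executing the current chunk of actions, the next chunk is already being predicted. The sketch below shows one simple way to realize that overlap with a background worker thread and a queue; it is a simplified stand-in for the released asynchronous inference stack, not the actual implementation.

```python
# Simplified asynchronous control loop: prediction of the next action chunk runs
# in a background thread while the main loop executes the current chunk.
# `policy`, `camera`, and `robot` are hypothetical stand-ins.
import queue
import threading

def async_control(policy, camera, robot, instruction, num_chunks=20):
    chunks = queue.Queue(maxsize=1)     # holds at most one pre-computed chunk

    def predictor():
        for _ in range(num_chunks):
            obs = {"image": camera.read(), "task": instruction}
            chunks.put(policy.predict_chunk(obs))   # blocks if a chunk is pending

    threading.Thread(target=predictor, daemon=True).start()

    for _ in range(num_chunks):
        chunk = chunks.get()             # usually ready before it is needed
        for action in chunk:             # execute this chunk while the next one
            robot.send_action(action)    # is being predicted in parallel
```

In a synchronous loop, the robot would sit idle during every prediction call; overlapping the two stages is what yields the reported reduction in average task time.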

Conclusion

SmolVLA demonstrates that compact, reproducible, and open-source VLA models can support competent robotic control on low-cost hardware. Through careful architectural choices—layer pruning, chunked action prediction, and asynchronous execution—SmolVLA maintains performance while significantly reducing computational demands.

The model’s open training and deployment stack, paired with real-world evaluations, offers a practical foundation for further research in efficient and accessible robot learning. Future directions include expanding cross-embodiment datasets, scaling model capacity without sacrificing latency, and exploring joint training on multimodal corpora beyond robotics data.


Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.

