Communications of the ACM - Artificial Intelligence | December 24, 2024
Images Give Robots a Sharper Focus

This article examines a new approach to training robots to emulate human behavior using visual data. Traditional HITL and LLM training is time- and labor-intensive and leaves performance gaps, whereas training on image data offers a more efficient, scalable alternative. Researchers are using image generation models and video data to teach robots how humans perform everyday tasks such as cooking and folding laundry. In this way, robots acquire broader knowledge and can adapt across different environments, opening new possibilities for robotics. The approach not only improves robots' performance in the real world, but also helps them carry out tasks in unfamiliar settings.

🤖 Traditional robot training relies on HITL and LLMs, which are time- and resource-intensive, hit performance bottlenecks, and cannot prepare a robot for every situation. LLMs, for example, are good at directing a robot toward high-level activities but fall short on fine-grained motor and sensory actions.

🖼️ Researchers are instead turning to visual data, using images and videos to let robots learn how humans perform tasks such as cooking or folding laundry. This approach supplies far richer data and lets robots better grasp the nuances of a task.

🎬 Trained on video and image data, robots can learn to imitate human motions. For example, researchers use AI image generation platforms such as Stable Diffusion, introducing colored spheres that indicate where the robot's joints should go and then translating them into real robot movements.

🚀 Diffusion models give robots broader, more general knowledge, letting them adapt to different environments and situations. In manufacturing, for example, the approach could help train a system to handle a complex assembly process; in the home, visual data could help a robot learn to clean, organize spaces, and prepare meals.

Teaching robots to navigate the world the way humans do is a formidable challenge. Over time, it has become clear that human-in-the-loop (HITL) and large language model (LLM) training methods require enormous time and resources, yet still leave significant performance gaps.

It is nearly impossible to prepare a robot for every situation using HITL or an LLM. For example, “A large language model is useful for directing a robot to pursue a high-level activity like cooking food or folding laundry, but it isn’t particularly good with fine grain motor and sensory actions you need to complete the task,” said Mohit Shridhar, a robotic research scientist with Google DeepMind.

For this reason, Shridhar and other researchers are taking an entirely different tack: they’re turning to image data to train robots. The idea is to imbue robots with a more complete understanding of how humans approach various tasks—whether it’s preparing scrambled eggs or folding a stack of laundry.

This research could fundamentally change the way mechanical arms, humanoid robots, and other devices motor through activities. “Conventional data required to train robots is limited and expensive—while visual images are plentiful and highly effective. It’s a way to scale up learning,” said Ruoshi Liu, a fourth-year Ph.D. computer science student who has explored the topic at Columbia University.

Added Anirudha Majumdar, an associate professor in the Mechanical & Aerospace Engineering Department at Princeton University, “Image generation and video models represent a huge opportunity for robotics.”

A Robotic Vision

Unlike fields such as computer vision and natural language processing, robotics faces a data scarcity problem. Researchers often rely on limited, carefully curated datasets. Much of the learning takes place in a lab, where a robot is connected to a human who performs a task; software captures the motion, which is then incorporated into the robot's programming.

More recently, researchers have also turned to LLMs. However, by training a robot with videos or simulations, researchers can generate synthetic training data that mimics real-world actions and behaviors—without the overhead of HITL. Researchers typically fine-tune a pre-trained model like Stable Diffusion using methods similar to LLM training. Eventually, “The robot learns how to follow the trajectory of a task. It can understand how an arm, elbow, wrist, hand, and fingers work together,” Shridhar said.
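The training objective itself is standard denoising diffusion applied to action data. The sketch below is a minimal illustration under assumptions, not code from any of the groups mentioned here: noise a demonstrated action trajectory, then teach a network to predict that noise while conditioning on the camera observation. The denoiser network and the trajectory and observation tensors are hypothetical placeholders.

import torch
import torch.nn as nn

T = 1000                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)       # standard DDPM noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(denoiser, traj, obs):
    """One behavior-cloning step: learn to denoise a demonstrated trajectory."""
    b = traj.shape[0]
    t = torch.randint(0, T, (b,))                         # random timestep per sample
    noise = torch.randn_like(traj)                        # Gaussian noise to add
    a = alphas_cumprod[t].view(b, *([1] * (traj.dim() - 1)))
    noisy_traj = a.sqrt() * traj + (1 - a).sqrt() * noise
    pred = denoiser(noisy_traj, t, obs)                   # condition on the camera image
    return nn.functional.mse_loss(pred, noise)            # predict the added noise

In practice, groups fine-tune much larger pre-trained image or video models rather than training from scratch, but the loss takes the same shape.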

In practical terms, a robot can receive a command—”cook scrambled eggs,” for instance—and proceed through all the steps required to prepare them. This includes retrieving eggs from the refrigerator, removing them from the carton, cracking them into a bowl, beating them, pouring them into a pan, cooking them, and then plating them.

For a human, the task doesn’t require much thinking. However, for a robot that lacks a complete understanding of the task, there are plenty of things that can go wrong. “It’s very difficult to explain to another person how to crack an egg,” Shridhar said. “So, while an LLM is good at directing the robot to do something, it falls short when you try to use it to handle the nitty-gritty details.”
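The division of labor Shridhar describes can be pictured in a few lines of Python: a language model supplies the high-level plan, but each step still needs a learned low-level visuomotor policy. Everything here (the skill_library, robot, and camera objects) is a hypothetical stand-in, not an interface from any system mentioned in this article.

HIGH_LEVEL_PLAN = [                         # the kind of plan an LLM produces well
    "take eggs from the refrigerator",
    "crack eggs into a bowl",
    "beat the eggs",
    "pour into a pan and cook",
    "plate the eggs",
]

def run_plan(plan, skill_library, robot, camera):
    """Execute each high-level step with a learned low-level policy."""
    for step in plan:
        policy = skill_library[step]         # visuomotor skill, e.g. learned from visual data
        while not policy.done():
            action = policy.act(camera.capture())   # the fine-grained part the LLM cannot supply
            robot.apply(action)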

Diffusion methods are appealing because they enable robots to gain generalized knowledge, making them adaptable across diverse environments and situations. For example, in manufacturing, the approach could help train a system to handle a complex assembly process. In the home, visual data could help a robot learn how to clean, organize spaces, and prepare meals.

Motion Matters

The same diffusion models that have revolutionized image editing are coming to robotics. Shridhar and a group of researchers from the former Stephen James Robot Learning Lab at the U.K.’s Imperial College London have developed a framework called Genima. Essentially, the system serves as a behavior-cloning agent by mapping sequences of movements from images and turning them into visuomotor controls.

The team turned to AI image generation platform Stable Diffusion—which broadly understands how objects appear in the real world—to “draw” actions for robots. They fed the model a series of images and introduced color spheres to indicate where specific joints of the robot should be, one second into the future. An ACT-based controller mapped the spheres and translated the motions into real-world robotic movements.
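A rough sketch of the second half of that pipeline, written under assumptions rather than taken from the Genima codebase, looks like this: find the colored target spheres the diffusion model has "drawn" onto the camera image, then hand their pixel locations to a low-level controller. The JOINT_COLORS table and the centroid-based detection are illustrative simplifications.

import numpy as np

# Hypothetical RGB colors assigned to each joint's target sphere.
JOINT_COLORS = {"shoulder": (255, 0, 0), "elbow": (0, 255, 0), "wrist": (0, 0, 255)}

def locate_spheres(generated_image: np.ndarray, tolerance: int = 30) -> dict:
    """Return the pixel centroid of each joint's colored sphere, if visible."""
    targets = {}
    for joint, color in JOINT_COLORS.items():
        diff = np.abs(generated_image.astype(int) - np.array(color)).sum(axis=-1)
        ys, xs = np.nonzero(diff < tolerance)        # pixels close to this joint's color
        if len(xs) > 0:
            targets[joint] = (xs.mean(), ys.mean())  # where the joint should be in one second
    return targets

# In the paper, an ACT-based controller then converts these image-space targets
# into joint motions on the real robot.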

Researchers tested visual data for nine tasks spread across 25 simulations. The system achieved a success rate as high as 79.3% for a specific task, though the average rate hovered between 50% and 64%. The takeaway, Shridhar said, is that the approach allows robots to adapt to new objects by exploiting the prior knowledge of the Internet pre-trained model. It also emphasizes the potential of this method to create interactive agents that can take actions in the physical world.

Meanwhile, researchers from Columbia University and several other groups have devised a system called Dreamitate. “There are numerous videos of humans folding clothes and doing other chores,” Liu said. “It’s possible to use these videos to train a robot how to imitate humans.” Added Junbang Liang, a second-year master’s student at Columbia who participated in the research: “High-resolution video creates a shortcut for training robots.”

The researchers fed 300 video clips depicting specific tasks into a pre-trained diffusion model. The resulting motion data—collected from diverse environments—allowed an AI model to use an initial image to predict actions three seconds into the future—even in unfamiliar settings. Using object tracking software, the robot could then execute the assigned task.
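The resulting control loop can be summarized as predict, track, replay. The sketch below is a conceptual outline only; video_model, track_tool_pose, camera, and robot are hypothetical stand-ins, not the Dreamitate API.

def execute_task(robot, video_model, track_tool_pose, camera, horizon_s=3.0):
    """Imagine a short video of the task, then follow the tool's motion in it."""
    first_frame = camera.capture()                          # initial observation
    predicted_frames = video_model.generate(first_frame, seconds=horizon_s)
    for frame in predicted_frames:
        pose = track_tool_pose(frame)                       # estimated 6-DoF tool pose
        robot.move_end_effector_to(pose)                    # replay the imagined motion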

The visuomotor policy learning framework achieved improvements that ranged from about 55% to 640% over more conventional training methods. “The robot’s ability to generalize activity boosts its performance,” Liu said. “What’s more, the research showed that the method could work with a real robot in the real world.”

Framing Progress

For now, the idea of using visual data to train robots remains in its infancy, and the field is currently limited by tooling, GPU constraints, and other issues. One significant challenge, Majumdar said, is that images aren’t always reliable. “Video and image models are trained on internet data, which often looks very different from visual observations that the robot collects using its sensors,” he said.

Noisy data is also a problem. This includes irrelevant objects and backgrounds that can make it difficult for the AI system to focus on the crucial visual data and extract the essential information. In response, Majumdar has developed a video editing software tool called Bring Your Own VLA, which removes irrelevant visual regions from the robot’s observations. Other research groups are also exploring ways to improve training methods, including TinyVLA, ViLa, and ReVLA.
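A simplified illustration of that kind of pre-processing, not the Bring Your Own VLA tool itself, is to blank out whatever a relevance mask marks as irrelevant before the observation reaches the policy:

import numpy as np

def scrub_observation(image: np.ndarray, relevance_mask: np.ndarray) -> np.ndarray:
    """Replace task-irrelevant pixels with the image's mean color."""
    cleaned = image.copy()
    fill = image.reshape(-1, image.shape[-1]).mean(axis=0).astype(image.dtype)
    cleaned[~relevance_mask] = fill          # keep only regions relevant to the task
    return cleaned

# How the relevance mask is produced (for example, with a vision-language model
# that flags distractor objects) is the hard part, and it varies across groups.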

Nevertheless, it’s clear that video diffusion methods represent a promising avenue for developing smarter and more versatile robots. Combined with LLMs and HITL training, it’s possible to achieve the best of all worlds: simpler interactions with humans along with more human-like performance.

Concluded Liu: “The goal is to build a future prediction model that gives the robot enough information that it can handle tasks, even in unfamiliar situations and circumstances. That’s a key to building a robot that will function well in the real world.”

Samuel Greengard is an author and journalist based in West Linn, OR, USA.
