Communications of the ACM - Artificial Intelligence (February 21)
Putting the Smarts into Robot Bodies

 

This article examines the key tradeoffs in building foundation models for embodied artificial intelligence (EAI) systems. EAI aims to embed AI into physical entities such as robots, giving them the capacity to perceive, learn from, and interact dynamically with their environments. The article contrasts two main approaches: pretrained large models (Pretraining + Fine-Tuning + ICL) and meta-training combined with general-purpose in-context learning (Meta-Training + GPICL). The former offers strong zero-shot capability but limited generalization; the latter emphasizes learning capability and generalization but demands more resources. The article also introduces innovations such as the Infini-attention mechanism and the StreamingLLM framework, which aim to overcome computing and memory bottlenecks so that EAI systems can learn and infer efficiently on resource-constrained edge devices.

💡 EAI systems should rely on evolutionary learning mechanisms that continuously adapt to their environments rather than on predefined, complex logic; the environment significantly shapes both physical behavior and cognitive structure.

📚 The strength of pretrained large models (Pretraining + Fine-Tuning + ICL) is strong zero-shot capability and good generalization on in-distribution tasks, but only rudimentary capability on out-of-distribution tasks. Their scalability comes from scaling up parameters and pretraining datasets, with an emphasis on world knowledge and hardware understanding.

🧠 Meta-training combined with general-purpose in-context learning (Meta-Training + GPICL) emphasizes learning capability, memorization, and abstraction, and scales by expanding meta-training tasks, context length, memory, and hidden states. It can learn continuously through context while avoiding catastrophic forgetting, making it a better fit for EAI systems.

🚀 The Infini-attention mechanism integrates masked local attention and long-term linear attention within a single Transformer block, efficiently handling both short- and long-range contextual dependencies, while a compressive memory system maintains and retrieves information at bounded storage and computation cost.

⏱️ The StreamingLLM framework preserves the KV states of initial tokens as attention sinks and combines them with the most recent tokens, enabling large models trained with a finite attention window to generalize to infinite sequence lengths without fine-tuning, significantly accelerating long-text modeling.

Fan Wang and Shaoshan Liu
Building Foundation Models for Embodied Artificial Intelligence
https://bit.ly/3Wn2FY5
July 15, 2024

Embodied Artificial Intelligence (EAI) involves embedding artificial intelligence into tangible entities, such as robots, equipping them with the capacity to perceive, learn from, and engage dynamically with their surroundings. In this article we delve into the key tradeoffs of building foundation models for EAI systems.

Foundation Models for Embodied AI

Previously, we outlined three guiding principles for developing embodied artificial intelligence (EAI) systems.1 EAI systems should not depend on predefined, complex logic to handle specific scenarios; instead, they must incorporate evolutionary learning mechanisms that enable continuous adaptation to their operational environments. Additionally, the environment significantly influences not only physical behaviors but also cognitive structures. While the third principle focuses on simulation, the first two emphasize building EAI foundation models capable of learning from the EAI systems’ operating environments.

A common approach for EAI foundation models is to directly utilize pretrained large models. For example, pretrained GPT models can serve as a baseline, followed by fine-tuning and in-context learning (ICL) to enhance performance.9 These large models typically possess a substantial number of parameters to encode extensive world knowledge and feature a small context window for fast response times. This extensive pre-encoding allows these models to deliver excellent zero-shot performance. However, their limited context windows pose challenges for continuous learning from the EAI systems’ operating environments and connecting various usage scenarios.
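The pretraining + fine-tuning + ICL recipe can be illustrated with a minimal prompt-construction sketch. The function and the robot-control wording below are illustrative assumptions, not from the article; the point is that the frozen pretrained parameters carry the world knowledge, while a handful of in-context demonstrations steer the model toward a task with no weight update:

```python
def build_icl_prompt(instruction, demonstrations, query):
    """Assemble a few-shot in-context learning (ICL) prompt for a pretrained LLM.

    The model's frozen parameters supply world knowledge; the demonstrations
    specialize its behavior for this task without any fine-tuning step.
    """
    lines = [instruction, ""]
    for observation, action in demonstrations:
        lines.append(f"Observation: {observation}")
        lines.append(f"Action: {action}")
        lines.append("")
    lines.append(f"Observation: {query}")
    lines.append("Action:")  # the model completes this line
    return "\n".join(lines)

# Hypothetical usage for an EAI controller:
prompt = build_icl_prompt(
    "You control a mobile robot. Choose the next action.",
    [("obstacle ahead", "turn_left"), ("path clear", "move_forward")],
    "obstacle ahead",
)
```

Note that the whole prompt must fit in the model's context window, which is exactly the limitation the article identifies for continuous learning.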

Alternatively, another approach leverages models with significantly fewer parameters but a larger context window. These models, rather than encoding comprehensive world knowledge, focus on learning how to learn, or meta-learning.2 With large context windows, these models can perform general-purpose in-context learning (GPICL), enabling continuous learning from their operating environments and establishing connections across a broad context.

The Figure below illustrates these two different approaches. The meta-training + GPICL approach, while exhibiting poorer zero-shot performance and having a smaller model size, excels in continuously learning from its environment, eventually specializing EAI systems for specific tasks. In contrast, the pretraining + fine-tuning + ICL approach, characterized by a larger model size and smaller context windows, offers superior zero-shot performance but inferior learning capabilities.

Figure.  Foundation Model Options for EAI.

Empirical evidence supporting this is found in the GPT-3 paper, where a 7B Few-Shot model outperforms a 175B Zero-Shot model.3 If few-shot learning is replaced by a long context window enabling EAI systems to learn from their operating environments, performance may further improve.

We envision an ideal foundation model for EAI that should meet several critical criteria. Firstly, it should be capable of universally learning from complex instructions, demonstrations, and feedback without relying on crafted optimization techniques. Secondly, it should demonstrate high sample efficiency in its learning and adaptation processes. Thirdly, it must possess the ability to continuously learn through contextual information, effectively avoiding the issue of catastrophic forgetting. Therefore, we conclude that the meta-training + GPICL approach is suitable for EAI systems. However, before committing to this approach, let us first examine the tradeoffs between the two.

Key Tradeoffs

In this section, we review the tradeoffs between pretrained large models vs. meta-training + GPICL as foundation models for EAI.4 The results are summarized in the Table below.

Table. 
Tradeoffs of pretrained large models vs. meta-training + GPICL

Comparison | Pretraining + Fine-Tuning + ICL | Meta-Training + GPICL
Zero-shot capability | High | Low
Generalizability | In-distribution tasks; rudimentary out-of-distribution tasks | Diverse and complex out-of-distribution tasks
Knowledge carrier | Parameters | Memory / hidden states
Scalability enhancement approach | Scaling up parameters and pre-training datasets | Scaling up meta-training tasks, context length, memories, and hidden states
Methodology of task adaptation | Data collection (fine-tuning, inefficient); rudimentary instruction & prompt (ICL) | Very complex instruction; explore & exploit automatically
Emphasis of pre-training / meta-training stage | World knowledge, knowledge regarding the hardware | The capability of learning, memorization, and abstraction
Emphasis of post-training stage | Human alignment, task-specific knowledge | World knowledge, human alignment, task-specific knowledge
Inference latency | Low | High
Memory size | Small | Large
Credit: Fan Wang

For zero-shot capability, the Pretraining + Fine-Tuning + ICL approach9 offers high performance, allowing models to generalize well to new tasks without any task-specific fine-tuning. In contrast, the Meta-Training + GPICL approach exhibits low zero-shot capability, as it focuses on learning to adapt to a wide variety of tasks using in-context learning rather than zero-shot generalization.

In terms of generalizability, the Pretraining + Fine-Tuning + ICL approach performs well on in-distribution tasks but has rudimentary capabilities for out-of-distribution tasks. Meta-Training + GPICL, on the other hand, exhibits diverse and complex generalization capabilities for out-of-distribution tasks due to its emphasis on meta-training over varied contexts.

The scalability enhancement approach for Pretraining + Fine-Tuning + ICL involves scaling up parameters and pre-training datasets to improve performance. Meta-Training + GPICL enhances scalability by scaling up meta-training tasks, context length, memories, and hidden states to improve the model’s adaptability.

Regarding task adaptation, Pretraining + Fine-Tuning + ICL relies on data collection and fine-tuning, which can be inefficient. In contrast, Meta-Training + GPICL adapts through complex instructions and by automatically exploring and exploiting its context.

During the pre-training or meta-training stage, Pretraining + Fine-Tuning + ICL focuses on world knowledge and understanding the hardware. Meta-Training + GPICL emphasizes the capability of learning, memorization, and abstraction over a wide variety of tasks.

In the post-training stage, Pretraining + Fine-Tuning + ICL involves aligning the model to specific human-centric tasks, emphasizing human-alignment and task-specific knowledge. Meta-Training + GPICL continues to emphasize world knowledge, human-alignment, and task-specific knowledge.

Inference latency is generally low for Pretraining + Fine-Tuning + ICL as the model parameters are fixed after training. However, for Meta-Training + GPICL, inference can be slower due to the need to utilize and update memory and hidden states dynamically.

Memory size requirements for Pretraining + Fine-Tuning + ICL are small, as most knowledge is embedded in fixed model parameters. Conversely, Meta-Training + GPICL requires significant memory to handle complex instructions, extended context, and hidden states.

Meta-Training + GPICL offers the advantage of enabling the system to continuously learn various tasks through contexts, i.e., learning to continuously learn.7 This essentially requires the system to learn new tasks without forgetting old ones, which typically poses a great challenge for gradient-based fine-tuning (catastrophic forgetting8) but can be less of a challenge with in-context learning.
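The contrast can be sketched with a toy memory buffer (an illustrative assumption, not the authors' implementation): new experience is appended to a bounded context that conditions later predictions, rather than overwriting shared weights, so earlier tasks remain retrievable instead of being forgotten.

```python
from collections import deque

class InContextLearner:
    """Toy sketch of learning-to-continually-learn via context.

    Instead of taking gradient steps (which risk catastrophic forgetting),
    each (task, experience) pair is appended to a bounded context window
    that conditions every subsequent prediction.
    """
    def __init__(self, max_context: int = 1024):
        self.context = deque(maxlen=max_context)  # rolling context window

    def observe(self, task: str, experience: str) -> None:
        self.context.append((task, experience))   # no weight update needed

    def recall(self, task: str) -> list:
        # Old tasks stay available as long as they fit in the window.
        return [exp for t, exp in self.context if t == task]

learner = InContextLearner()
learner.observe("grasping", "two-finger grip worked on mugs")
learner.observe("navigation", "hallway B is blocked after 6pm")
learner.observe("grasping", "suction failed on porous objects")
```

The obvious cost, discussed next, is that the context itself must be stored and attended over at inference time.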

Overcoming the Computing and Memory Bottlenecks

From the above comparison, it is evident that meta-training combined with GPICL offers superior adaptability and generalization across diverse and complex tasks. However, this approach demands higher resources, posing a challenge for most EAI systems, which are often real-time edge devices with limited computational capabilities and memory. The large context windows required for this approach can significantly increase inference time and memory footprint, potentially hindering its feasibility for EAI foundation models.

Fortunately, recent advancements have introduced innovative solutions to scale Transformer-based Large Language Models (LLMs) for processing infinitely long inputs while maintaining bounded memory and computational efficiency. A notable innovation is the Infini-attention mechanism, which integrates masked local attention and long-term linear attention within a single Transformer block. This enables the efficient processing of both short and long-range contextual dependencies. Additionally, the compressive memory system allows the model to maintain and retrieve information with bounded storage and computation costs, reusing old Key-Value (KV) states to enhance memory efficiency and enable fast streaming inference. Experimental results demonstrate that the Infini-attention model outperforms baseline models in long-context language modeling benchmarks, showing superior performance in tasks involving extremely long input sequences (up to 1 million tokens) and significant improvements in memory efficiency and perplexity scores.
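A minimal single-head NumPy sketch of the compressive-memory part can make the bounded-cost claim concrete. The update and retrieval rules follow the linear-attention form (memory M, normalizer z, nonlinearity ELU + 1) described for Infini-attention; the dimensions, the scalar gate, and the epsilon are illustrative assumptions:

```python
import numpy as np

def elu_plus_one(x):
    # Nonlinearity sigma used by linear attention: ELU(x) + 1 (keeps values positive).
    return np.where(x > 0, x + 1.0, np.exp(x))

class InfiniAttentionSketch:
    """Sketch of Infini-attention's compressive memory (single head).

    Memory M (d_k x d_v) and normalizer z (d_k) have fixed size, so storage
    and per-segment compute stay bounded no matter how long the stream is.
    """
    def __init__(self, d_k: int, d_v: int):
        self.M = np.zeros((d_k, d_v))   # compressive memory
        self.z = np.zeros(d_k)          # normalization term
        self.beta = 0.0                 # gating scalar (learned in the real model)

    def retrieve(self, Q):
        # Read long-range content for queries Q of shape (n, d_k).
        sq = elu_plus_one(Q)
        return (sq @ self.M) / (sq @ self.z + 1e-6)[:, None]

    def update(self, K, V):
        # Fold a segment's old KV states into memory (bounded cost).
        sk = elu_plus_one(K)
        self.M += sk.T @ V
        self.z += sk.sum(axis=0)

    def combine(self, A_mem, A_local):
        # Gate between memory retrieval and masked local attention output.
        g = 1.0 / (1.0 + np.exp(-self.beta))
        return g * A_mem + (1.0 - g) * A_local
```

Per segment, the model would run masked local attention as usual, call `retrieve` for long-range context, `combine` the two, and then `update` the memory with that segment's keys and values.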

Similarly, the StreamingLLM framework enables large models trained with a finite attention window to generalize to infinite sequence lengths without the need for fine-tuning. This is achieved by preserving the Key and Value (KV) states of initial tokens as attention sinks, along with the most recent tokens, stabilizing attention computation and maintaining performance over extended texts. StreamingLLM excels at modeling texts up to 4 million tokens, providing a remarkable speedup of up to 22.2 times.
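The eviction policy behind the attention sinks can be sketched for one layer's KV cache as follows. The class, the default sizes, and the string-valued KV pairs are illustrative assumptions; the real framework operates on tensor KV states inside the model:

```python
class SinkKVCache:
    """Sketch of StreamingLLM's KV-cache eviction policy for one layer.

    Keeps the KV states of the first `n_sink` tokens (the attention sinks)
    plus a rolling window of the `n_recent` most recent tokens; everything
    in between is evicted, so the cache stays bounded regardless of how
    long the input stream grows.
    """
    def __init__(self, n_sink: int = 4, n_recent: int = 1020):
        self.n_sink = n_sink
        self.n_recent = n_recent
        self.cache = []  # list of (key, value) pairs, oldest first

    def append(self, kv):
        self.cache.append(kv)
        if len(self.cache) > self.n_sink + self.n_recent:
            # Evict the oldest non-sink entry; sink tokens are never dropped.
            del self.cache[self.n_sink]

    def __len__(self):
        return len(self.cache)

# Toy usage: stream 100 tokens through a cache of at most 4 + 8 entries.
cache = SinkKVCache(n_sink=4, n_recent=8)
for t in range(100):
    cache.append((f"k{t}", f"v{t}"))
```

Because the sink entries are always present, attention computation stays stable over arbitrarily long streams without retraining.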

Conclusion

In conclusion, we believe that learning from the environment is the essential feature of EAI systems, and thus the meta-training + GPICL approach is promising for building EAI foundation models because it provides better long-term adaptability and generalization. Although this approach currently faces significant challenges in computing and memory usage, we believe innovations such as Infini-attention and StreamingLLM will soon make it viable for real-time, resource-constrained environments.

 
