MarkTechPost@AI · October 13, 2024
Stochastic Prompt Construction for Effective In-Context Reinforcement Learning in Large Language Models

This paper explores whether large language models (LLMs) can learn and improve their performance through in-context reinforcement learning (ICRL), without any parameter updates. The researchers propose two ICRL methods: Naive ICRL and Explorative ICRL. Naive ICRL takes a straightforward approach, storing new examples, predicted outputs, and rewards in a buffer that is used to construct the context for future predictions. However, this approach fails because it cannot explore the output space effectively. Explorative ICRL addresses these limitations by introducing stochasticity and focusing on positive reinforcement: it randomly selects past examples to include in the prompt, exploiting LLMs' sensitivity to prompt composition, and it only adds examples with positive rewards to the context. The algorithm uses a Bernoulli variable parameterized by pkeep to decide which past examples to include, creating a distinct context for each input. To manage context-window saturation, Explorative ICRL employs three downsampling strategies: unbiased random removal, start-biased prefix selection, and end-biased suffix selection. The study shows that Explorative ICRL can learn effectively in context from rewards, delivering substantial gains over zero-shot performance across a range of tasks and models.

🤔 **The ICRL challenge:** Adapting ICL methods built on input-output pairs to an RL framework involving input-output-reward triplets requires overcoming unique difficulties in prompt construction and model adaptation.

💡 **Explorative ICRL's innovations:** It addresses the exploration problem by injecting stochasticity into prompt construction, exploiting LLMs' sensitivity to prompt composition; it simplifies learning by filtering negative examples out of the context, making the setup closer to the standard in-context learning format.

📊 **Explorative ICRL's gains:** It achieves impressive results across a range of classification tasks, with substantial improvements over zero-shot accuracy. On the Banking77 classification task, for example, Llama's accuracy rises from 17.2% to 66.0%.

📈 **Continual improvement:** Its performance keeps growing over time, especially on more challenging datasets with many labels. In some settings its accuracy approaches that of supervised in-context learning, underscoring its potential as a powerful learning technique.

📉 **Naive ICRL's limitation:** It fails to explore the output space effectively, which degrades its performance.

🚀 **Explorative ICRL's future potential:** It offers AI systems a general way to strengthen their adaptive learning capabilities, learning and adapting from the environment even without traditional parameter updates.

Large language models (LLMs) have demonstrated impressive capabilities in in-context learning (ICL), a form of supervised learning that doesn’t require parameter updates. However, researchers are now exploring whether this ability extends to reinforcement learning (RL), introducing the concept of in-context reinforcement learning (ICRL). The challenge lies in adapting the ICL approach, which relies on input-output pairs, to an RL framework that involves input-output-reward triplets. This shift from a static dataset to a dynamic, online learning scenario presents unique difficulties in prompt construction and model adaptation. The key problem is to determine if LLMs can effectively learn and improve their performance through ICRL, potentially opening new avenues for AI systems to adapt and learn from their environment without traditional parameter updates.

Existing attempts to explore ICL have focused primarily on supervised learning. Researchers have extensively studied the underlying mechanisms and effectiveness of ICL, demonstrating that LLMs can learn new tasks within their context window, but these efforts leave the potential for ICRL largely unexplored.

Recent advancements in extending LLMs’ context window lengths have enabled studies involving hundreds or thousands of demonstrations, showing continued performance improvements. While some research suggests that models can learn from mistakes, this finding is not universally supported and may require explicit reasoning about errors.

In reinforcement learning, previous work has investigated LLMs’ ability to solve multi-armed bandit problems in a simplified RL setting. These studies encountered challenges with naive approaches and highlighted LLMs’ difficulties with exploration. However, they were limited to simple scenarios and did not address more complex contextual bandit problems or general RL tasks.

Researchers from Cornell University, EPFL, and Harvard University proposed a unique method for ICRL that addresses the limitations of naive approaches by introducing two key innovations. First, it tackles the exploration problem by incorporating stochasticity into prompt construction, utilizing LLMs’ sensitivity to prompt composition. Second, it simplifies the learning process by filtering out negative examples from the context, making the prompt more similar to traditional in-context learning formats.

This approach effectively prevents degeneration in experiments and enables LLMs to perform ICRL successfully. The method demonstrates a strong correlation between performance and computational resources, allowing for flexible trade-offs between accuracy and efficiency. To mitigate the increasing computational costs associated with observing more examples, the researchers developed an approximation technique that maintains performance while reducing resource requirements.

The proposed ICRL method has shown impressive results across various classification tasks, significantly improving model performance compared to zero-shot accuracy. For instance, on the Banking77 classification task, Llama’s accuracy increased from 17.2% to 66.0% through ICRL. The approach has proven effective with different LLM architectures, showcasing its potential as a versatile technique for enhancing AI systems’ adaptive learning capabilities.

This method introduces two key approaches for ICRL: Naive ICRL and Explorative ICRL. Naive ICRL follows a straightforward implementation where the model observes new examples, predicts outputs, and receives rewards. These episodes are stored in a buffer and used to construct the context for future predictions. However, this approach fails due to its inability to explore the output space effectively.
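The Naive loop can be summarized in a short sketch. The code below is illustrative only, assuming hypothetical `llm_predict(prompt) -> str` and `reward_fn(x, y) -> int` callables and a simple text template rather than the authors' exact prompt format.

```python
# Minimal sketch of the Naive ICRL loop: every episode is stored and
# replayed in the context, regardless of its reward.

def naive_prompt(buffer, x):
    """Concatenate every stored (input, output, reward) episode, then the new input."""
    lines = [f"Input: {ep['x']}\nOutput: {ep['y']}\nReward: {ep['r']}" for ep in buffer]
    lines.append(f"Input: {x}\nOutput:")
    return "\n\n".join(lines)

def naive_icrl(stream, llm_predict, reward_fn):
    buffer = []                          # all episodes are kept, positive or not
    for x in stream:
        prompt = naive_prompt(buffer, x)
        y = llm_predict(prompt)          # model predicts an output for x
        r = reward_fn(x, y)              # e.g. 1 if the label is correct, 0 otherwise
        buffer.append({"x": x, "y": y, "r": r})
    return buffer
```

Because the buffer is replayed deterministically, the model tends to repeat its earlier guesses, which is the exploration failure described above.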

Explorative ICRL addresses these limitations by introducing stochasticity and focusing on positive reinforcement. It randomly selects past episodes to include in the prompt, utilizing LLMs’ sensitivity to prompt composition, and it only includes episodes with positive rewards in the context, simplifying the learning process. The algorithm uses a Bernoulli variable parameterized by pkeep to determine which past episodes to include, creating a distinct context for each input.
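The stochastic prompt construction can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `llm_predict`, `reward_fn`, the prompt template, and the default `p_keep` value are all placeholders.

```python
import random

def explorative_prompt(buffer, x, p_keep, rng=random):
    """Include each stored positive-reward episode independently with
    probability p_keep, so every input sees a different context."""
    chosen = [ep for ep in buffer if rng.random() < p_keep]
    lines = [f"Input: {ep['x']}\nOutput: {ep['y']}" for ep in chosen]
    lines.append(f"Input: {x}\nOutput:")
    return "\n\n".join(lines)

def explorative_icrl(stream, llm_predict, reward_fn, p_keep=0.1):
    buffer = []                                         # positive-reward episodes only
    for x in stream:
        prompt = explorative_prompt(buffer, x, p_keep)  # fresh stochastic context per input
        y = llm_predict(prompt)
        r = reward_fn(x, y)
        if r > 0:                                       # negative episodes never enter the buffer
            buffer.append({"x": x, "y": y, "r": r})
    return buffer
```

The Bernoulli resampling is what provides exploration: the same input paired with different randomly drawn contexts can elicit different predictions from the model.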

To manage context window saturation, Explorative ICRL employs three downsampling strategies: unbiased random removal, start-biased prefix selection, and end-biased suffix selection. While this approach effectively introduces exploration and improves performance, it comes at a higher computational cost due to the need for fresh context construction for each input, limiting the benefits of caching used in the Naive approach.
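The three downsampling strategies might look like the sketch below. The episode cap and function names are illustrative assumptions; the actual method may trim by token budget rather than by episode count.

```python
import random

def downsample(episodes, max_episodes, strategy="random", rng=random):
    """Trim stored episodes so the constructed prompt fits the context window."""
    if len(episodes) <= max_episodes:
        return list(episodes)
    if strategy == "random":   # unbiased random removal
        keep = sorted(rng.sample(range(len(episodes)), max_episodes))
        return [episodes[i] for i in keep]
    if strategy == "prefix":   # start-biased: keep the earliest episodes
        return list(episodes[:max_episodes])
    if strategy == "suffix":   # end-biased: keep the most recent episodes
        return list(episodes[-max_episodes:])
    raise ValueError(f"unknown strategy: {strategy!r}")
```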

The results demonstrate that LLMs can effectively learn in context from rewards alone using the Explorative ICRL method. This approach shows significant improvements over zero-shot performance across various tasks and models. For instance, Explorative ICRL improved Llama’s accuracy by 48.8 percentage points on Banking77 and 56.8 points on Clinic-150, with similar gains observed for the Phi model.

Explorative ICRL consistently outperforms zero-shot baselines and shows continual growth in performance over time, especially on more challenging datasets with numerous labels. In some settings, its accuracy approaches that of supervised in-context learning, highlighting its potential as a powerful learning technique.

In contrast, the Naive ICRL approach fails to learn and often performs worse than zero-shot due to its inability to explore effectively. Visualization of prediction confusion matrices and output distributions clearly illustrates Explorative ICRL’s superior exploration capabilities compared to the Naive approach.

Further analysis reveals that both key modifications in Explorative ICRL – stochasticity for exploration and focusing on positive reward episodes – contribute significantly to its success. The method shows some robustness to noisy rewards, maintaining performance even with a 10% probability of inverted rewards.
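The noisy-reward setting can be reproduced with a simple wrapper around the reward function. The 10% figure comes from the article; the binary 0/1 reward convention and the wrapper itself are illustrative assumptions.

```python
import random

def noisy_reward(reward_fn, flip_prob=0.10, rng=random):
    """Wrap a binary reward function so its output is inverted with
    probability flip_prob, simulating noisy feedback."""
    def wrapped(x, y):
        r = reward_fn(x, y)   # assumed to be 0 or 1
        return 1 - r if rng.random() < flip_prob else r
    return wrapped
```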

This research demonstrates the potential of LLMs to perform ICRL. The study introduces three algorithms: Naive, Explorative, and Approximate ICRL. While the Naive approach fails due to poor exploration, the Explorative method successfully introduces stochasticity in prompt construction and focuses on positive examples, leading to consistent ICRL performance. The Approximate method addresses the high computational costs of Explorative ICRL by providing a trade-off between efficiency and robustness.

The study’s findings highlight the importance of exploration in ICRL and the effectiveness of stochastic prompt construction. However, the researchers acknowledge several limitations and areas for future work. These include the need to investigate ICRL in more complex problem domains beyond classification, exploring the use of nuanced reward signals beyond binary feedback, and addressing the challenge of reasoning about episodes with negative rewards.

In addition to this, the computational intensity of the proposed methods, especially as the number of observed episodes increases, presents an ongoing challenge. While the Approximate method offers a partial solution, questions remain about optimizing ICRL for limited context windows and extended interactions. These limitations outline crucial directions for future research to advance the field of in-context reinforcement learning with LLMs.


