MarkTechPost@AI · April 21, 10:35
LLMs Can Think While Idle: Researchers from Letta and UC Berkeley Introduce ‘Sleep-Time Compute’ to Slash Inference Costs and Boost Accuracy Without Sacrificing Latency

This article introduces "sleep-time compute," a method proposed by researchers from Letta and UC Berkeley to make large language models (LLMs) more efficient. The approach pre-processes the context during the idle time before a user asks a question, reducing inference cost and latency while improving accuracy. The results show strong performance across several datasets: accuracy improves by 13% on GSM-Symbolic and 18% on AIME, while test-time compute requirements drop by roughly 5×. The cost advantage is especially pronounced when many related queries share the same context, pointing the way toward smarter and more cost-effective LLM applications.

🧠 The core of sleep-time compute is to use the idle time between user interactions to pre-process the context and build an enriched version of it (c′).

📈 Compared with conventional test-time compute, sleep-time compute significantly improves performance across several datasets, raising accuracy by 13% on GSM-Symbolic and 18% on AIME.

💰 When handling multiple related queries, sleep-time compute offers a clear cost advantage: with 10 queries sharing the same context, the average cost per query drops by 2.5×.

💡 Experiments show that sleep-time compute works best on highly predictable queries. After scoring how predictable each query is, the researchers found that the more predictable the query, the larger the gain from sleep-time compute.

Large language models (LLMs) have gained prominence for their ability to handle complex reasoning tasks, transforming applications from chatbots to code-generation tools. These models are known to benefit significantly from scaling their computation during inference, often producing higher accuracy by dedicating more resources to hard problems. However, this approach brings along considerable drawbacks. Longer processing times and higher computing costs make it challenging to scale such solutions in real-world settings, where responsiveness and affordability are crucial. As technology advances toward more intelligent systems, there is a growing need to explore how LLMs can become not only smarter but also more efficient, especially when operating within repetitive or familiar contexts.

One of the biggest inefficiencies in current LLM deployment occurs during query resolution. Typically, when a user poses a question, the model processes it simultaneously with the necessary background context. This test-time compute assumes that the context and question always arrive together. But in real scenarios, such as document Q&A or debugging code, context is usually persistent and can be accessed well before a specific question is asked. Yet, the model processes everything from scratch for each query, even if it has seen the context before. This redundancy results in increased computational costs and response delays, particularly in scenarios involving multiple queries within a single context.

To address this inefficiency, various methods have been developed. Sequential and parallel test-time computation are the two major strategies: sequential approaches extend the model’s reasoning path so it can consider more possibilities, while parallel approaches sample multiple outputs simultaneously, known as pass@k. Techniques like speculative decoding aim to cut latency by making early guesses, but their usefulness is limited when the model still has to think from scratch. While helpful, these methods do not eliminate the need to repeatedly process the context with every new question, and they typically assume test-time conditions that are not always feasible, such as access to an oracle or an ideal verifier.
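For reference, a simplified version of the pass@k idea discussed above can be sketched in a few lines. The `sample_answer` helper is a hypothetical stand-in for an LLM sampling call; the point is that the metric counts a query as solved if any one of k samples matches the reference, which is exactly the oracle-verifier assumption that sleep-time compute avoids.

```python
from typing import Callable

def pass_at_k(query: str, reference: str,
              sample_answer: Callable[[str], str], k: int = 8) -> bool:
    """Count the query as solved if ANY of k independent samples matches
    the reference answer, i.e. assume a perfect verifier at test time."""
    return any(sample_answer(query).strip() == reference.strip() for _ in range(k))
```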

Researchers from Letta and the University of California, Berkeley, introduced a novel solution they call sleep-time compute. The method uses the idle time between user interactions to do useful reasoning in advance. Instead of waiting for a user question, the model begins analyzing the context beforehand. It anticipates possible future queries and builds a new version of the context enriched with relevant inferences. When a user finally asks a question, the model can simply refer to this pre-processed context. Since much of the thinking is already done, it requires less computational effort to produce accurate answers. This approach becomes even more effective when multiple questions relate to the same context, allowing for shared inferences and distributed computational cost.

The implementation of sleep-time compute relies on decomposing the traditional prompt into two parts: a static context and a dynamic query. During the sleep-time window, only the context is used to generate a pre-processed version. This enhanced context, called c′, is built using test-time compute techniques like reasoning chains or summarization. Once this enriched version is stored, it replaces the raw context during real-time queries. The final answers are then generated using much fewer resources. This system not only minimizes redundant reasoning but also paves the way for more proactive LLMs that can think ahead and be better prepared.
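A minimal sketch of this two-phase decomposition is shown below. It assumes a generic `llm()` helper that wraps whatever chat-completion API is available; the prompts, token budgets, and function names are illustrative, not the authors' implementation.

```python
# Minimal sketch of the sleep-time / test-time split described above.
# `llm` is a hypothetical helper wrapping any chat-completion API.

def llm(prompt: str, max_tokens: int) -> str:
    """Placeholder for a call to a language model."""
    raise NotImplementedError

def sleep_time_compute(raw_context: str, budget: int = 2048) -> str:
    """Offline phase: spend idle-time tokens turning the raw context c into
    an enriched context c' with likely-useful inferences precomputed."""
    prompt = (
        "You will later answer questions about the context below.\n"
        "Re-state the context and add intermediate results, summaries, and "
        "deductions that would help answer plausible future questions.\n\n"
        f"Context:\n{raw_context}"
    )
    return llm(prompt, max_tokens=budget)  # this is c'

def answer_query(enriched_context: str, query: str, budget: int = 200) -> str:
    """Online phase: answer with a much smaller test-time budget, since most
    of the reasoning is already baked into c'."""
    prompt = f"Context (pre-processed):\n{enriched_context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt, max_tokens=budget)

# Usage: one sleep-time pass, then many cheap queries over the same context.
# c_prime = sleep_time_compute(document_text)
# for q in user_questions:
#     print(answer_query(c_prime, q))
```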

To evaluate the effectiveness of sleep-time compute, the research team tested it using two specially designed benchmarks: Stateful GSM-Symbolic and Stateful AIME. Both datasets are derived by splitting existing problem sets into separate contexts and questions. In experiments using models like GPT-4o and GPT-4o-mini, researchers observed a 5× reduction in test-time compute for similar accuracy levels. Notably, accuracy improved by up to 13% for the GSM-Symbolic P2 dataset and by 18% on Stateful AIME when sleep-time compute was scaled. Multi-Query GSM-Symbolic, a new dataset introduced for this evaluation, helped demonstrate that the cost per query could be reduced by 2.5× when 10 queries shared the same context.

When pitted against popular strategies like pass@k, sleep-time compute consistently outperformed them. Unlike pass@k, which assumes access to a perfect evaluator, sleep-time compute works under more realistic conditions. Results show that even at low test-time compute budgets, sleep-time compute produced comparable or better accuracy while consuming fewer tokens. For instance, the GPT-4o-mini model achieved higher accuracy with fewer than 200 test-time tokens using sleep-time compute compared to over 500 tokens needed in the baseline. Even when models like Claude Sonnet 3.7 and DeepSeek R1 were evaluated, similar improvements were observed.

Scaling the amount of compute dedicated to sleep-time further improved outcomes. By running five parallel generations during sleep-time on complex tasks, the researchers pushed the accuracy-compute Pareto curve further out, though they noted diminishing returns beyond that point. Importantly, stronger models handling more difficult tasks benefited more from additional sleep-time compute. Amortizing sleep-time computation also became highly cost-effective when a context served multiple related queries: weighting test-time tokens as ten times more expensive than sleep-time tokens, in line with industry latency-cost ratios, the researchers confirmed a reduction of up to 2.5× in the average cost per query.
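The amortization argument can be made concrete with a small back-of-the-envelope calculation. The 200-versus-500 test-time token figures echo the GPT-4o-mini comparison above, but the sleep-time token budget below is an assumption chosen only to illustrate how sharing one sleep-time pass across ten queries drives down the weighted cost per query.

```python
# Illustrative cost model for amortizing sleep-time tokens over N queries.
# Token counts are hypothetical except the 200 vs. 500 test-time figures,
# which mirror the GPT-4o-mini comparison in the article.

W_TEST, W_SLEEP = 10.0, 1.0  # test-time tokens weighted 10x more expensive

def avg_cost_per_query(test_tokens: int, sleep_tokens: int, n_queries: int) -> float:
    """Weighted cost per query when one sleep-time pass serves n_queries."""
    return W_TEST * test_tokens + (W_SLEEP * sleep_tokens) / n_queries

baseline   = avg_cost_per_query(test_tokens=500, sleep_tokens=0,    n_queries=1)
sleep_time = avg_cost_per_query(test_tokens=200, sleep_tokens=3000, n_queries=10)
print(baseline / sleep_time)  # ~2.2x cheaper per query under these assumptions
```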

Another interesting finding was that sleep-time compute worked best when user queries were predictable. Using Llama2-70B, researchers scored the predictability of each query given its context and found a strong correlation: the more predictable the query, the greater the benefit. In examples where the question logically followed from the given context, sleep-time computation yielded higher gains. Conversely, less predictable or abstract queries experienced reduced effectiveness, although they still showed benefits compared to traditional test-time-only methods.
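One common way to operationalize "predictability" is the average log-likelihood of the query tokens conditioned on the context under a scoring model. The sketch below follows that recipe with Hugging Face Transformers; the paper uses Llama2-70B, so the smaller checkpoint here and the plain string concatenation are assumptions for illustration, and the context/query token boundary is only approximate because tokenization is not strictly prefix-stable.

```python
# Hedged sketch: score query predictability as the mean log p(query | context)
# under a causal LM. A small checkpoint stands in for Llama2-70B.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # stand-in for Llama2-70B
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def query_predictability(context: str, query: str) -> float:
    """Average log-probability of the query tokens given the context."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + query, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Next-token log-probs for positions 1..T-1, then keep the query span.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    query_lp = token_lp[:, ctx_len - 1:]  # approximate context/query boundary
    return query_lp.mean().item()
```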

Altogether, this research presents a smart and scalable technique for improving LLM efficiency without compromising accuracy. By leveraging otherwise idle time, sleep-time compute reduces the burden on real-time systems, lowers operational costs, and improves response time. The clear quantitative improvements, such as a 5× reduction in test-time compute, 13–18% accuracy gains, and up to a 2.5× drop in cost per query, demonstrate that forward-thinking approaches like this could shape the next generation of intelligent, context-aware assistants.

Several Key Takeaways from the Research are as follows:

- Sleep-time compute pre-processes a persistent context during idle time, so each incoming query requires far less test-time reasoning.
- On Stateful GSM-Symbolic and Stateful AIME, it cuts test-time compute by about 5× at comparable accuracy, and scaling sleep-time compute lifts accuracy by up to 13% and 18%, respectively.
- Amortizing one sleep-time pass across 10 related queries reduces the average cost per query by roughly 2.5×.
- Gains are largest when queries are predictable from their context; less predictable queries still benefit, but by less.
- Unlike pass@k, the approach does not assume access to an oracle verifier, and improvements hold across GPT-4o, GPT-4o-mini, Claude Sonnet 3.7, and DeepSeek R1.
Check out the Paper.


