MarkTechPost@AI October 3, 2024
Researchers from KAIST and Google AI Introduce Blockwise Parallel Decoding (BPD): An AI Method for Rescoring Algorithms for Improved Efficiency and Fluency in Language Models

Researchers from KAIST and Google have developed Blockwise Parallel Decoding (BPD) to speed up language model inference. The method predicts several future tokens at once; while this raises inference speed, it is difficult to keep the block drafts accurate and well-ordered. The team analyzed token distributions and built improved rescoring algorithms to raise draft quality, and tests showed block-efficiency gains of 5-21%.

🧩 Blockwise Parallel Decoding (BPD) predicts several future tokens simultaneously: multiple prediction heads build block drafts in parallel, and the autoregressive model then selects and conditionally accepts the best-fitting tokens. By cutting the time spent waiting on sequential token predictions, the technique greatly accelerates inference.

🚧 BPD's difficulty lies in keeping block drafts accurate and well-ordered. The team examined the token distributions produced by BPD's multiple prediction heads to better understand how the model generates several tokens at once, and optimized these predictions for fluency and accuracy.

🎯 To raise block draft quality, the team proposed algorithms using neural language models and n-gram models: the former supply richer contextual awareness, while the latter help guarantee local consistency in token predictions.

🎉 Testing produced encouraging results: the improved block drafts raised block efficiency by 5-21%, with gains holding across several different datasets, indicating the method's robustness.

Recent advances in autoregressive language models have brought a remarkable transformation to the field of Natural Language Processing (NLP). These models, such as GPT and others, have shown excellent performance on text generation tasks, including question answering and summarization. However, their high inference latency poses a significant barrier to broad deployment, particularly for very deep models with hundreds of billions of parameters. The lag is inherent to their design: autoregressive models generate text one token at a time, in sequence. This drives up computing demand and restricts the models’ ability to be deployed in real time.

To address this problem, a team of researchers from KAIST and Google has developed Blockwise Parallel Decoding (BPD), a method designed to speed up inference in these models. In contrast to typical autoregressive methods, BPD permits the simultaneous prediction of several future tokens in what are known as block drafts. Multiple prediction heads construct these block drafts in parallel, and the autoregressive model then selects and conditionally accepts the best-fitting tokens.

Because several tokens are proposed simultaneously, this technique greatly accelerates inference by decreasing the time spent waiting for sequential token predictions. But BPD comes with its own set of difficulties, especially in making sure the block drafts are accurate and well-organized enough for the model to accept them.
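To make the draft-and-verify loop concrete, here is a minimal Python sketch assuming greedy verification. The names verify_block_draft and base_model are illustrative assumptions, not the paper’s implementation, and the per-position verifier calls stand in for what is, in practice, a single parallel forward pass:

    # Minimal sketch of BPD's draft-and-verify step (greedy verification).
    # `base_model(tokens)` is a hypothetical stand-in that returns the
    # single most likely next token; it is not the paper's API.
    def verify_block_draft(prefix, draft, base_model):
        """Accept the longest prefix of the block draft that the
        autoregressive verifier agrees with."""
        accepted = []
        for proposed in draft:
            # In practice all draft positions are scored in one parallel
            # forward pass; we call the verifier per position for clarity.
            if base_model(prefix + accepted) != proposed:
                break  # first mismatch: fall back to sequential decoding
            accepted.append(proposed)
        return accepted  # 0 to len(draft) tokens advance per step

    # Toy verifier that always predicts "the": one token is accepted.
    print(verify_block_draft([], ["the", "cat"], lambda toks: "the"))

The more of the draft the verifier accepts, the fewer sequential steps remain, which is where the speedup comes from.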

The team advanced the effectiveness of the block drafts in two key ways. First, they examined the token distributions generated by BPD’s several prediction heads. The goal of this analysis is to better understand how the model generates several tokens simultaneously and how to optimize these predictions for greater fluency and accuracy. By studying these token distributions, trends or irregularities that impair block draft performance can be spotted.
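As a toy illustration of such diagnostics, the sketch below computes two signals of this kind: each head’s top-1 confidence and the rate of consecutive-token repetition across drafts. The distributions and drafts are fabricated for demonstration:

    # Two toy diagnostics over per-head next-token distributions.
    def head_confidence(dist):
        """Top-1 probability of a head's distribution; falling values
        across later heads signal less reliable draft positions."""
        return max(dist.values())

    def repetition_rate(drafts):
        """Fraction of adjacent draft positions repeating the same token."""
        pairs = [(a, b) for d in drafts for a, b in zip(d, d[1:])]
        return sum(a == b for a, b in pairs) / len(pairs)

    heads = [
        {"the": 0.80, "a": 0.20},              # head 1: confident
        {"the": 0.55, "a": 0.45},              # head 2: less so
        {"the": 0.35, "a": 0.34, "of": 0.31},  # head 3: nearly uniform
    ]
    print([head_confidence(d) for d in heads])       # [0.8, 0.55, 0.35]
    print(repetition_rate([["the", "the", "cat"]]))  # 0.5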

Second, building on this analysis, the study creates algorithms that improve the block drafts. The team specifically suggests employing neural language models and n-gram models to enhance the block drafts’ quality prior to the autoregressive model’s verification. Neural language models provide more sophisticated context awareness, which helps bring block drafts in line with the model’s expectations, while n-gram models help guarantee local consistency in token predictions.
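A hedged sketch of the n-gram side of this idea: score every candidate block draft under a cheap bigram model and keep the highest-scoring one. The bigram_logprob table and the flat backoff penalty are made-up toys, not the paper’s model:

    import math

    # Toy bigram table; a real system would train this on a corpus.
    bigram_logprob = {
        ("the", "cat"): math.log(0.40),
        ("the", "the"): math.log(0.01),  # repetition scores poorly
        ("cat", "sat"): math.log(0.30),
    }

    def draft_score(prev_token, draft):
        """Sum bigram log-probabilities along the draft; unseen bigrams
        receive a flat backoff penalty."""
        score, prev = 0.0, prev_token
        for tok in draft:
            score += bigram_logprob.get((prev, tok), math.log(1e-4))
            prev = tok
        return score

    candidates = [["the", "cat", "sat"], ["the", "the", "sat"]]
    print(max(candidates, key=lambda d: draft_score("saw", d)))
    # ['the', 'cat', 'sat'] -- the repetitive draft is rescored away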

The study’s testing yielded encouraging results: the improved block drafts increased block efficiency, a measure of how many tokens from the block draft are ultimately accepted by the autoregressive model, by 5-21%. These gains were shown on several different datasets, indicating the method’s resilience.
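Read this way, block efficiency reduces to a simple average; the sketch below reflects our reading of the metric (draft tokens accepted per verification step), not necessarily the paper’s exact formulation:

    def block_efficiency(accepted_lengths):
        """Average number of draft tokens accepted per decoding step."""
        return sum(accepted_lengths) / len(accepted_lengths)

    # Accepting 3, 1, and 2 draft tokens over three steps:
    print(block_efficiency([3, 1, 2]))  # 2.0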

The team has summarized their primary contributions as follows. 

    The study examines how prediction heads behave in blockwise parallel decoding (BPD), finding evidence of falling confidence in predictions for later tokens and significant consecutive-token repetition (20% to 75%). This highlights poor block draft quality.
    The team proposes the notion of oracle top-k block efficiency, demonstrating that block efficiency can be greatly increased by lowering repetition and uncertainty and by considering the top-k most likely tokens from each head (a sketch follows this list).
    Two algorithms are introduced: global rescoring using n-gram models, which efficiently rescores many candidate drafts, and local rescoring using neural LMs, which refines block drafts for fluency and coherence. These techniques maximize resource utilization while increasing block efficiency by up to 21.3%.
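As referenced in the second item above, here is a minimal sketch of the oracle top-k idea, reusing the toy base_model signature from the earlier draft-and-verify sketch. The oracle accepts at each position whichever candidate the verifier itself would choose, yielding an upper bound on attainable block efficiency; the names are illustrative, not the paper’s code:

    def oracle_accepted_length(prefix, topk_per_head, base_model):
        """Longest accepted draft when every head offers its top-k
        candidates and an oracle always picks the right one."""
        accepted = []
        for candidates in topk_per_head:  # one top-k candidate set per head
            target = base_model(prefix + accepted)
            if target not in candidates:
                break                     # even the oracle must stop here
            accepted.append(target)
        return len(accepted)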

Check out the Paper. All credit for this research goes to the researchers of this project.


