MarkTechPost@AI | July 11, 2024
Google DeepMind Introduces a Parameter-Efficient Expert Retrieval Mechanism that Leverages the Product Key Technique for Sparse Retrieval from a Million Tiny Experts

Google DeepMind has introduced a method called Parameter Efficient Expert Retrieval (PEER), which uses the product key technique to perform sparse retrieval from a pool of one million tiny experts. This increases the granularity of Mixture-of-Experts (MoE) models and yields a better performance-compute trade-off. PEER routes through a learned index structure, enabling efficient and scalable expert retrieval that decouples computational cost from parameter count, and it delivers notable efficiency and performance gains on language modeling tasks.

🤔 PEER is a novel expert retrieval mechanism that uses the product key technique to retrieve sparsely from one million tiny experts, addressing the computational and optimization challenges existing MoE models face when scaling to a large number of experts.

🚀 PEER routes through a learned index structure, enabling efficient and scalable expert retrieval; unlike traditional MoE models, this decouples computational cost from parameter count and achieves a better performance-compute trade-off.

📊 Experiments show that PEER layers clearly outperform dense FFW layers and coarse-grained MoEs in the performance-compute trade-off. On the C4 dataset, a PEER model reached a perplexity of 16.34 under a 2e19 FLOP budget, versus 17.70 for the dense model and 16.88 for the MoE model, pointing to PEER's potential to improve the scalability and performance of Transformer models.

💡 PEER's efficiency allows it to handle large streams of data, making it well suited to lifelong learning and other AI applications that demand substantial computational resources.

🌐 PEER's application scenarios include language modeling, natural language processing, and other tasks involving large-scale model training, where it can improve model efficiency and performance and help advance AI research.

In transformer architectures, the computational costs and activation memory grow linearly with the increase in the hidden layer width of feedforward (FFW) layers. This scaling issue poses a significant challenge, especially as models become larger and more complex. Overcoming this challenge is essential for advancing AI research, as it directly impacts the feasibility of deploying large-scale models in real-world applications, such as language modeling and natural language processing tasks.
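To make the scaling issue concrete, here is a rough back-of-the-envelope count (an illustration under simple assumptions, not a figure from the paper), writing d_model for the model width and d_ff for the FFW hidden width: a dense FFW block applies an up-projection from d_model to d_ff and a down-projection back, so both per-token compute and the d_ff-sized activation grow linearly with d_ff.

```python
def dense_ffw_flops_per_token(d_model: int, d_ff: int) -> int:
    """Approximate FLOPs for one dense FFW block on one token,
    counting each multiply-add as two floating-point operations:
    an up-projection (d_model -> d_ff) plus a down-projection (d_ff -> d_model).
    Both terms are linear in the hidden width d_ff."""
    return 2 * d_model * d_ff + 2 * d_ff * d_model


# Doubling the hidden width doubles the per-token compute (and the activation
# memory, which must hold the d_ff intermediate values for each token).
assert dense_ffw_flops_per_token(1024, 16384) == 2 * dense_ffw_flops_per_token(1024, 8192)
```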

Current methods addressing this challenge utilize Mixture-of-Experts (MoE) architectures, which deploy sparsely activated expert modules instead of a single dense FFW layer. This approach allows model size to be decoupled from computational cost. Despite the promise of MoEs, as demonstrated by researchers like Shazeer et al. (2017) and Lepikhin et al. (2020), these models face computational and optimization challenges when scaling beyond a small number of experts. The efficiency gains often plateau with increasing model size due to a fixed number of training tokens. These limitations prevent the full potential of MoEs from being realized, especially in tasks requiring extensive and continual learning.

Researchers from Google DeepMind propose a novel approach called Parameter Efficient Expert Retrieval (PEER), which addresses the limitations of existing MoE models. PEER leverages the product key technique for sparse retrieval from a vast pool of tiny experts, numbering over a million. This approach enhances the granularity of MoE models, resulting in a better performance-compute trade-off. The innovation lies in the use of a learned index structure for routing, enabling efficient and scalable expert retrieval. This method decouples computational cost from parameter count, representing a significant advancement over previous architectures. PEER layers demonstrate substantial improvements in efficiency and performance for language modeling tasks.
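The product key technique at the heart of this routing can be sketched in a few lines. The snippet below is a minimal illustration rather than DeepMind's implementation, and the names (product_key_topk, sub_keys_1, sub_keys_2) are placeholders: it shows how splitting the query into two halves and scoring each half against a small codebook of sub-keys selects the top k of N = n² product keys without ever materializing all N scores.

```python
import torch


def product_key_topk(query, sub_keys_1, sub_keys_2, k):
    """Pick the top-k indices from an implicit grid of n*n product keys.

    query:      (d,)      query vector, split into two halves of size d/2
    sub_keys_1: (n, d/2)  first codebook of sub-keys
    sub_keys_2: (n, d/2)  second codebook of sub-keys

    Instead of scoring all n*n full keys, score each query half against its
    own codebook (2n dot products) and combine only the k best candidates
    from each side (k*k sums).
    """
    d = query.shape[0]
    q1, q2 = query[: d // 2], query[d // 2:]

    scores_1, idx_1 = (sub_keys_1 @ q1).topk(k)        # (k,) best sub-keys, half 1
    scores_2, idx_2 = (sub_keys_2 @ q2).topk(k)        # (k,) best sub-keys, half 2

    combined = scores_1[:, None] + scores_2[None, :]   # (k, k) candidate scores
    top_scores, flat = combined.flatten().topk(k)      # overall top-k of the k*k grid
    rows, cols = flat // k, flat % k

    n = sub_keys_2.shape[0]
    expert_ids = idx_1[rows] * n + idx_2[cols]         # indices into the n*n expert grid
    return top_scores, expert_ids
```

With n = 1024 sub-keys per codebook, this covers a grid of over a million experts while computing only 2n sub-key scores plus k² candidate combinations per query, which is what makes routing over a million experts tractable.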

The PEER layer operates by mapping an input vector to a query vector, which is then compared with a set of product keys to retrieve the top k experts. These experts are single-neuron multi-layer perceptrons (MLPs) that contribute to the final output through a weighted combination based on router scores. The product key retrieval technique reduces the complexity of expert retrieval, making it feasible to handle over a million experts efficiently. The dataset used for experiments is the C4 dataset, with isoFLOP analysis conducted to compare PEER with dense FFW, coarse-grained MoEs, and Product Key Memory (PKM) layers. The experiments involved varying the model size and the number of training tokens to identify compute-optimal configurations.
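That forward pass can be sketched as follows. This is a minimal, single-token sketch assuming the product_key_topk helper from the previous snippet; the class name PEERSketch, its parameters, and the GELU nonlinearity are assumptions for illustration, and batching and other refinements described in the paper are omitted.

```python
import math
import torch
import torch.nn.functional as F


class PEERSketch(torch.nn.Module):
    """Single-token sketch of a PEER-style layer.

    Each of the num_experts experts is a single-neuron MLP: a down-projection
    row u_i and an up-projection row v_i, so expert_i(x) = v_i * gelu(u_i . x).
    The layer output is a softmax-weighted sum over the k experts selected by
    product-key retrieval (product_key_topk from the previous sketch).
    """

    def __init__(self, d_model: int, num_experts: int, k: int):
        super().__init__()
        n = math.isqrt(num_experts)
        assert n * n == num_experts, "expert count must be a perfect square"
        self.k = k
        self.query_proj = torch.nn.Linear(d_model, d_model)
        self.sub_keys_1 = torch.nn.Parameter(torch.randn(n, d_model // 2) * 0.02)
        self.sub_keys_2 = torch.nn.Parameter(torch.randn(n, d_model // 2) * 0.02)
        self.down = torch.nn.Parameter(torch.randn(num_experts, d_model) * 0.02)  # u_i rows
        self.up = torch.nn.Parameter(torch.randn(num_experts, d_model) * 0.02)    # v_i rows

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (d_model,)
        q = self.query_proj(x)                              # map the input to a query vector
        scores, ids = product_key_topk(q, self.sub_keys_1, self.sub_keys_2, self.k)
        weights = F.softmax(scores, dim=-1)                 # router weights over the k experts
        hidden = F.gelu(self.down[ids] @ x)                 # (k,) one scalar activation per expert
        return (weights * hidden) @ self.up[ids]            # weighted combination back to d_model
```

Because only k experts are touched per token, per-token compute depends on k and d_model rather than on the total expert count, which is how parameter count is decoupled from computational cost; a layer with 1024² (over a million) experts and, say, k = 16 activates only 16 pairs of expert rows per token.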

The results show that PEER layers significantly outperform dense FFWs and coarse-grained MoEs in terms of the performance-compute trade-off. When applied to several language modeling datasets, including the Curation Corpus, Lambada, the Pile, Wikitext, and C4, PEER models achieved notably lower perplexity scores. For instance, with a FLOP budget of 2e19, PEER models reached a perplexity of 16.34 on the C4 dataset, compared with 17.70 for dense models and 16.88 for MoE models. These findings highlight the efficiency and effectiveness of the PEER architecture in enhancing the scalability and performance of transformer models.

In conclusion, this proposed method represents a significant contribution to AI research by introducing the PEER architecture. This novel approach addresses the computational challenges associated with scaling transformer models by leveraging a vast number of tiny experts and efficient routing techniques. The PEER model’s superior performance-compute trade-off, demonstrated through extensive experiments, highlights its potential to advance AI research by enabling more efficient and powerful language models. The findings suggest that PEER can effectively scale to handle extensive and continuous data streams, making it a promising solution for lifelong learning and other demanding AI applications.


Check out the Paper. All credit for this research goes to the researchers of this project.

