MarkTechPost@AI Jul 29, 12:05
Amazon Develops an AI Architecture that Cuts Inference Time 30% by Activating Only Relevant Neurons

Amazon researchers have developed an innovative AI architecture that cuts inference time by 30% by activating only the neurons relevant to the task at hand. The technique mimics the brain's efficient working style, pruning the model dynamically and contextually instead of firing the entire network, which significantly reduces compute cost and latency. Experiments show the method preserves model quality while substantially improving performance on tasks such as multilingual speech recognition, opening new opportunities for energy-efficient, scalable AI with potential applications in natural language processing, computer vision, and beyond.

🧠 Dynamic, context-aware pruning: At the core of this AI architecture is "dynamic, context-aware pruning," a technique that selects and activates the most relevant neurons or modules in real time during inference, rather than pruning statically at training time. A lightweight gate predictor generates a "mask" from input features such as task type and language, deciding which neurons or modules switch on at any given moment to achieve high efficiency.

⚡️ Modular strategy and hardware compatibility: To improve hardware efficiency, the architecture prunes at the module level, skipping entire modules or layers rather than individual weights. This preserves the model's structural integrity and ensures compatibility with GPUs and modern hardware accelerators. The gate predictor is trained with a sparsity loss, using techniques such as Gumbel-Softmax to keep training differentiable while ultimately yielding crisp, binary neuron selection at inference.

⏱️ Significant speedups with quality preserved: Experimental results show the architecture can cut inference time on multilingual speech-to-text tasks by up to 34%, while reducing FLOPs (floating-point operations) by more than 60% at higher sparsity levels, effectively lowering cloud and hardware costs. More importantly, model quality (measured by BLEU score and word error rate) holds up even under a substantial degree of sparsification, with no notable performance degradation.

🔀 Task and language adaptability: The researchers found that the optimal neuron-activation strategy varies significantly across tasks and languages. In automatic speech recognition (ASR), for example, local-context modules are critical while the decoder can be heavily sparsified; in speech translation (ST), the encoder and decoder require more balanced attention. This adaptability reflects the specialization the model learns internally.

🚀 Broad application prospects: This dynamic, modular pruning technique makes more energy-efficient, scalable AI models possible, which is especially significant for ever-growing LLMs and multimodal models. It also opens the door to personalized compute paths (based on user, region, or device) and could transfer to natural language processing, computer vision, and other application domains.

Amazon researchers developed a new AI architecture that cuts inference time by 30% by selecting only task-relevant neurons, much as the brain relies on specialized regions for specific tasks. This approach addresses one of the biggest challenges facing large AI models: the computational expense and latency of activating every neuron for every request, regardless of relevance.

The traditional deployment of large language models (LLMs) and foundational AI systems has relied on activating the full network for every input. While this guarantees versatility, it is significantly inefficient: much of the network's activity is superfluous for any given prompt. The human brain, by contrast, flexibly recruits only the circuits a given cognitive task requires. Amazon's architecture mimics this behavior by activating only the neurons most relevant to the current input context.

Dynamic, Context-Aware Pruning

At the heart of this innovation is dynamic, context-aware pruning. Rather than trimming the model statically during training and locking in those changes, Amazon's solution prunes the network on the fly, during inference itself. This lets the model remain large and versatile, yet efficient and fast on any specific task.

How the System Works

The architecture introduces a context-aware gating mechanism. This mechanism analyzes input features (and, for speech models, auxiliary information such as language and task tokens) to decide which modules—such as self-attention blocks, feed-forward networks, or specialized convolutions—are essential for the current step. For example, in a speech recognition task, it may activate local context modules for detailed sound analysis while skipping unnecessary components that are only beneficial for other tasks.
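
To make the mechanism concrete, here is a minimal PyTorch-style sketch of such a gate predictor. The class name, dimensions, and mean-pooling over time are illustrative assumptions, not details from Amazon's paper:

```python
import torch
import torch.nn as nn

class GatePredictor(nn.Module):
    """Lightweight gate predictor (illustrative; names and sizes are assumptions).

    Pools the input sequence into a context vector, concatenates task and
    language embeddings, and emits one gate logit per skippable module.
    """

    def __init__(self, feat_dim: int, num_tasks: int, num_langs: int, num_modules: int):
        super().__init__()
        self.task_emb = nn.Embedding(num_tasks, feat_dim)
        self.lang_emb = nn.Embedding(num_langs, feat_dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, num_modules),  # one gate logit per module
        )

    def forward(self, x, task_id, lang_id):
        ctx = x.mean(dim=1)  # (batch, time, feat) -> (batch, feat) context summary
        z = torch.cat([ctx, self.task_emb(task_id), self.lang_emb(lang_id)], dim=-1)
        return self.mlp(z)   # (batch, num_modules) gate logits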

This pruning strategy is structured and modular: instead of removing individual weights (which can lead to hardware inefficiency), it skips entire modules or layers. This preserves the model’s structural integrity and ensures compatibility with GPU and modern hardware accelerators.
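
Building on the gate predictor sketched above, a hypothetical forward pass might skip whole residual blocks when their gate is closed. The batch-level skip rule below is a simplification; the real system's per-sample behavior is not specified here:

```python
import torch
import torch.nn as nn

class GatedEncoder(nn.Module):
    """Encoder that skips entire residual blocks when their gate is closed.

    Hypothetical sketch: skipping whole modules (not single weights)
    keeps tensor shapes static and therefore GPU-friendly.
    """

    def __init__(self, blocks: list, gate_predictor: nn.Module):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.gate_predictor = gate_predictor  # e.g. the GatePredictor above

    def forward(self, x, task_id, lang_id):
        logits = self.gate_predictor(x, task_id, lang_id)
        gates = logits > 0  # crisp binary on/off decisions at inference
        for i, block in enumerate(self.blocks):
            # Simplification: one decision per batch; a real system might
            # gate per sample or share one mask across an utterance.
            if bool(gates[:, i].any()):
                x = x + block(x)   # gate open: residual block runs
            # gate closed: identity bypass, the block's FLOPs are saved
        return x
```

Because each block is either fully executed or fully bypassed, the saved computation translates directly into lower latency on accelerators, unlike unstructured weight pruning, where irregular sparsity rarely maps onto hardware speedups.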

The gate predictor model is trained with a sparsity loss to achieve a target sparsity: the proportion of modules skipped. Training uses techniques like the Gumbel-Softmax estimator to keep gating behavior differentiable during optimization, while ultimately yielding crisp, binary neuron selection at inference.
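
A minimal sketch of this training recipe, assuming PyTorch's built-in `torch.nn.functional.gumbel_softmax` and a hypothetical quadratic sparsity penalty (the exact loss form and target value are placeholders, not figures from the paper):

```python
import torch
import torch.nn.functional as F

def gate_step(logits: torch.Tensor, target_sparsity: float = 0.6, tau: float = 1.0):
    """Differentiable gate sampling plus a sparsity penalty (illustrative).

    target_sparsity is the desired fraction of skipped modules; 0.6 is an
    arbitrary placeholder, not a value reported by the authors.
    """
    # Express each gate as a two-class (off, on) choice for Gumbel-Softmax.
    two_class = torch.stack([-logits, logits], dim=-1)
    # hard=True returns discrete 0/1 samples in the forward pass while the
    # straight-through estimator keeps gradients flowing in the backward pass.
    gates = F.gumbel_softmax(two_class, tau=tau, hard=True)[..., 1]
    keep_rate = gates.mean()  # fraction of modules kept "on"
    sparsity_loss = (keep_rate - (1.0 - target_sparsity)) ** 2
    return gates, sparsity_loss
```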

Demonstrated Results: Speed Without Sacrificing Quality

Experiments show that dynamically skipping irrelevant modules can:

- Cut inference time on multilingual speech-to-text tasks by up to 34%.
- Reduce FLOPs (floating-point operations) by more than 60% at higher sparsity levels, lowering cloud and hardware costs.
- Preserve output quality, with BLEU scores and word error rates showing no notable degradation under substantial sparsification.

Task and Language Adaptation

A core insight is that the optimal pruning strategy, meaning which modules to retain or skip, can change dramatically depending on the task and language. For instance:

- In automatic speech recognition (ASR), local-context modules proved critical, while the decoder could be heavily sparsified.
- In speech translation (ST), the encoder and decoder required a more balanced allocation of attention.

This adaptability reflects the specialization the model learns internally.

Broader Implications

This dynamic, modular pruning opens the door for:

- More energy-efficient, scalable AI models, which matters increasingly as LLMs and multimodal models keep growing.
- Personalized compute paths tailored to the user, region, or device.
- Transfer of the approach to other domains, including natural language processing and computer vision.

By selectively activating only task-relevant modules in real time, inspired by biological neural efficiency, Amazon’s architecture points the way toward AI that is both powerful and practical for global, real-world use.


Check out the Paper and technical details. All credit for this research goes to the researchers of this project.

