How Rufus doubled their inference speed and handled Prime Day traffic with AWS AI chips and parallel decoding

Amazon's AI shopping assistant Rufus faced an enormous challenge on Prime Day: handling a massive volume of queries under strict latency and throughput requirements. To meet it, Rufus adopted parallel decoding on AWS AI chips, significantly improving LLM performance and efficiency. By combining AWS Trainium and Inferentia chips, Rufus achieved two times faster responses, cut inference costs by 50%, and scaled seamlessly during peak traffic. This approach not only improved the user experience but also set a new benchmark for large-scale LLM deployment.

🚀 **Meeting the Prime Day challenge**: During Prime Day 2024, Rufus faced an unprecedented engineering challenge: handling millions of queries per minute and generating billions of tokens in real time, all while maintaining a 300 ms latency SLA and minimizing power consumption.

💡 **Faster generation with parallel decoding**: Rufus adopted parallel decoding, which breaks the sequential dependency of traditional LLM text generation by adding extra decoding heads that predict multiple tokens in parallel, significantly improving generation efficiency.

⚙️ **Powered by AWS AI chips**: Rufus combined Inferentia2 and Trainium AI chips, Amazon EC2, and Application Load Balancer, and partnered with NVIDIA to use NVIDIA's Triton Inference Server to host the model on AWS chips and accelerate parallel decoding.

✅ **Significant performance gains**: By integrating parallel decoding with AWS AI chips and the NxDI framework, Rufus doubled text generation speed compared to autoregressive decoding, cut inference costs by 50%, and scaled seamlessly during Prime Day.

Large language models (LLMs) have revolutionized the way we interact with technology, but their widespread adoption has been hindered by high inference latency, limited throughput, and the high costs associated with text generation. These inefficiencies are particularly pronounced during high-demand events like Amazon Prime Day, when systems like Rufus—the Amazon AI-powered shopping assistant—must handle massive scale while adhering to strict latency and throughput requirements. Rufus is designed to help customers make informed purchasing decisions. Powered by LLMs, it answers customer questions about a wide variety of shopping needs and products and simplifies the shopping experience.

Rufus relies on many components to deliver its customer experience, including a foundation LLM (for response generation) and a query planner (QP) model for query classification and retrieval enhancement. The QP model parses customer questions to understand their intent, whether keyword-based or conversational natural language. QP is on the critical path for Rufus because Rufus cannot initiate token generation until QP provides its full output. Reducing QP's end-to-end text generation latency is therefore critical for reducing the first chunk latency in Rufus, which refers to the time taken to generate and send the first response to a user request. Lowering this latency improves perceived responsiveness and overall user experience. This post focuses on how the QP model used draft-centric speculative decoding (SD), also called parallel decoding, with AWS AI chips to meet the demands of Prime Day. By combining parallel decoding with AWS Trainium and Inferentia chips, Rufus achieved two times faster response times, a 50% reduction in inference costs, and seamless scalability during peak traffic.
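
To make the critical path concrete, the following minimal sketch shows how first chunk latency can be measured around a QP call followed by streaming generation. The helper names `query_planner` and `llm_stream` are hypothetical placeholders, not Rufus production code; the point is that generation cannot start until QP returns, so any QP speedup shows up directly in the first chunk latency.

```python
import time


def first_chunk_latency(query, query_planner, llm_stream):
    """Measure the time from request arrival to the first streamed response chunk.

    query_planner: blocking call that classifies the query and enriches retrieval (QP).
    llm_stream: generator that yields response chunks once QP output is available.
    Both are hypothetical stand-ins for the real Rufus components.
    """
    start = time.perf_counter()
    qp_output = query_planner(query)  # on the critical path: must finish first
    for _chunk in llm_stream(query, qp_output):
        # The first yielded chunk defines the first chunk latency.
        return time.perf_counter() - start
    return None  # the stream produced no output
```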

Scaling LLMs for Prime Day

Prime Day is one of the most demanding events for the Amazon infrastructure, pushing systems to their limits. In 2024, Rufus faced an unprecedented engineering challenge: handling millions of queries per minute and generating billions of tokens in real time, all while maintaining a 300 ms latency SLA for QP tasks and minimizing power consumption. These demands required a fundamental rethinking of how LLMs are deployed at scale in order to overcome cost and performance bottlenecks. The key challenge was the inefficiency of standard text generation:

Traditional LLM text generation is inherently inefficient because of its sequential nature. Each token generation requires a full forward pass through the model, leading to high latency and underutilization of computational resources. While techniques like speculative decoding have been proposed to address these inefficiencies, their complexity and training overhead have limited their adoption.
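
For reference, here is a minimal sketch of the standard autoregressive loop described above, written against a generic Hugging Face-style causal LM. It is illustrative only (greedy decoding, batch size 1), but it makes the cost visible: every new token requires a full forward pass over the whole sequence.

```python
import torch


@torch.no_grad()
def autoregressive_decode(model, input_ids, max_new_tokens, eos_token_id):
    """Greedy autoregressive decoding: one full forward pass per generated token.

    model: any causal LM returning logits of shape [batch, seq_len, vocab_size].
    input_ids: [1, seq_len] prompt tokens (batch size 1 for simplicity).
    """
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits                 # full forward pass each step
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == eos_token_id:            # stop at end-of-sequence
            break
    return input_ids
```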

AWS AI chips and parallel decoding

To overcome these challenges, Rufus adopted parallel decoding, a simple yet powerful technique for accelerating LLM generation. Parallel decoding breaks the sequential dependency that makes autoregressive generation slow. The approach adds extra decoding heads to the base model, eliminating the need for a separate draft model to propose speculated tokens. These heads predict multiple tokens for future positions in parallel, before the preceding tokens are known, which significantly improves generation efficiency.
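
As a rough illustration of what such heads look like, here is a minimal Medusa-style sketch: each head is a small network over the base model's final hidden state that proposes a token for one future position, so several positions are drafted from a single forward pass of the frozen base model. The head architecture, hidden size, and head count below are illustrative assumptions, not the actual Rufus heads.

```python
import torch
import torch.nn as nn


class ParallelDecodingHeads(nn.Module):
    """Illustrative draft heads for parallel decoding (Medusa-style sketch)."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        # One small MLP per future position; the base model itself stays frozen.
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size),
            )
            for _ in range(num_heads)
        )

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: [batch, hidden_size] for the final input position.
        # Returns one logits tensor per head: [num_heads, batch, vocab_size].
        return torch.stack([head(last_hidden_state) for head in self.heads])
```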

To accelerate the performance of parallel decoding for online inference, Rufus used a combination of AWS solutions: Inferentia2 and Trainium AI chips, Amazon Elastic Compute Cloud (Amazon EC2), and Application Load Balancer. In addition, the Rufus team partnered with NVIDIA to power the solution using NVIDIA's Triton Inference Server, which provides the capabilities to host the model on AWS chips.
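
For readers unfamiliar with Triton's hosting model, below is a generic Triton Inference Server Python-backend skeleton of the kind used to serve a model behind Triton. The tensor names and the elided Neuron model-loading step are assumptions for illustration; this is not the actual Rufus backend.

```python
import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Minimal Triton Python-backend skeleton (illustrative, not the Rufus backend)."""

    def initialize(self, args):
        # Triton passes the model configuration as a JSON string.
        self.model_config = json.loads(args["model_config"])
        # A real backend would load the Neuron-compiled model here (omitted).

    def execute(self, requests):
        responses = []
        for request in requests:
            # Hypothetical input tensor name; must match the model's config.pbtxt.
            _query = pb_utils.get_input_tensor_by_name(request, "INPUT_TEXT").as_numpy()
            # A real backend would run parallel decoding on the Neuron device here.
            answer = np.array([b"placeholder response"], dtype=object)
            out = pb_utils.Tensor("OUTPUT_TEXT", answer)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```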

To get the maximum efficiency from parallel decoding on AWS Neuron cores, we worked with the AWS Neuron team to add architectural support for parallel decoding to the NeuronX Distributed Inference (NxDI) framework for a batch size of one.

Rufus extended the base LLM with multiple decoding heads. Each head is a small neural network layer trained on the base model's learned representations to predict the next tokens in parallel. These heads are trained together with the original model, keeping the base model unchanged.

Because the tokens aren't generated sequentially, they must be verified to make sure that they all fit together. To validate the tokens predicted by the draft heads, Rufus uses a tree-based attention mechanism to verify and integrate tokens. Each draft head produces several options for each position, and these options are organized into a tree-like structure to select the most promising combination. This allows multiple candidate tokens to be processed in parallel, reducing latency and increasing Neuron core utilization. The following figure shows a sparse tree constructed using our calibration set, with a depth of four, indicating that four heads are involved in the calculation. Each node represents a token from a top-k prediction of a draft head, and the edges depict the connections between these nodes.
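
To illustrate the verification step in its simplest form, here is a sketch that checks a single drafted candidate sequence against the base model and accepts the longest prefix the base model agrees with. The production path scores many tree-organized candidates in one forward pass using tree-based attention, which this simplified single-branch version does not show.

```python
import torch


@torch.no_grad()
def verify_single_candidate(model, input_ids, draft_tokens):
    """Accept the longest prefix of drafted tokens that the base model agrees with.

    model: causal LM returning logits [batch, seq_len, vocab_size] (batch size 1).
    input_ids: [1, seq_len] tokens generated so far.
    draft_tokens: [1, num_draft] tokens proposed by the parallel decoding heads.
    """
    prompt_len = input_ids.shape[-1]
    candidate = torch.cat([input_ids, draft_tokens], dim=-1)
    logits = model(candidate).logits          # one forward pass verifies all drafted positions
    # Greedy base-model prediction for each drafted position.
    preds = logits[:, prompt_len - 1:-1, :].argmax(dim=-1)
    matches = (preds == draft_tokens)[0]
    accepted = 0
    for ok in matches:
        if not ok:
            break
        accepted += 1
    return draft_tokens[:, :accepted]         # the caller appends the accepted tokens
```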

Results of using parallel decoding

By integrating parallel decoding with AWS AI chips and the NxDI framework, we doubled the speed of text generation compared to autoregressive decoding, making it an ideal solution for the high-demand environment of Prime Day. During Amazon Prime Day 2024, Rufus demonstrated the power of AWS AI chips, doubling response speed, cutting inference costs by 50%, and scaling seamlessly throughout peak traffic.

These advances not only enhanced the customer experience but also showcased the potential of the NxDI framework and the adaptability of AWS AI chips for optimizing large-scale LLM performance.

How to use parallel decoding on Trainium and Inferentia

The flexibility of NxDI combined with AWS Neuron chips makes it a powerful solution for LLM text generation in production. Whether you’re using Trainium or Inferentia for inference, NxDI provides a unified interface to implement parallel decoding optimizations. This integration reduces operational complexity and provides a straightforward path for organizations looking to deploy and scale their LLM applications efficiently.

You can explore parallel decoding techniques such as Medusa to accelerate your inference workflows on Inf2 or Trn1 instances. To get started, you'll need a Medusa-compatible model (such as text-generation-inference/Mistral-7B-Instruct-v0.2-medusa) and a Medusa tree configuration. Enable Medusa by setting is_medusa=True, configuring your medusa_speculation_length and num_medusa_heads, and specifying your medusa_tree. When using the Hugging Face generate() API, set the assistant_model to your target model. Note that Medusa currently supports only a batch size of 1.

import json

# NeuronConfig is provided by the NxDI package; the exact import path can vary
# across neuronx-distributed-inference versions.
from neuronx_distributed_inference.models.config import NeuronConfig


def load_json_file(json_path):
    with open(json_path, "r") as f:
        return json.load(f)


# Load the Medusa tree definition and enable Medusa-style parallel decoding.
medusa_tree = load_json_file("medusa_mc_sim_7b_63.json")

neuron_config = NeuronConfig(
    is_medusa=True,
    medusa_speculation_length=64,
    num_medusa_heads=4,
    medusa_tree=medusa_tree,
)
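
As a follow-up, here is a hedged sketch of the generation call described above. The `model` argument stands in for your Neuron-compiled Medusa target model built with the `neuron_config` from the previous snippet; the loading and compilation steps, and the exact NxDI model classes, depend on your NxDI version and are omitted here, and the tokenizer checkpoint below is an assumption (the base Mistral tokenizer).

```python
from transformers import AutoTokenizer


def generate_with_medusa(model, prompt: str, max_new_tokens: int = 64) -> str:
    """Run Medusa-accelerated generation through the Hugging Face generate() API.

    model: assumed to be the Neuron-compiled Medusa target model (loading omitted).
    """
    # Tokenizer assumption: the base model's tokenizer is compatible with the
    # Medusa-adapted checkpoint.
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
    inputs = tokenizer(prompt, return_tensors="pt")

    # As described above, pass the target model itself as assistant_model.
    # Medusa currently supports a batch size of 1 only.
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        assistant_model=model,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```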

Conclusion

Prime Day is a testament to the power of innovation to overcome technical challenges. By using AWS AI chips, Rufus not only met the stringent demands of Prime Day but also set a new standard for LLM efficiency. As LLMs continue to evolve, frameworks such as NxDI will play a crucial role in making them more accessible, scalable, and cost-effective. We’re excited to see how the community will build on the NxDI foundation and AWS AI chips to unlock new possibilities for LLM applications. Try it out today and experience the difference for yourself!

Acknowledgments

We extend our gratitude to the AWS Annapurna team responsible for AWS AI chips and framework development. Special thanks to the researchers and engineers whose contributions made this achievement possible. The improvements in latency, throughput, and cost efficiency achieved with parallel decoding compared to autoregressive decoding have set a new benchmark for LLM deployments at scale.


About the authors

Shruti Dubey is a Software Engineer on Amazon’s Core Search Team, where she optimizes LLM inference systems to make AI faster and more scalable. She’s passionate about Generative AI and loves turning cutting-edge research into real-world impact. Outside of work, you’ll find her running, reading, or trying to convince her dog that she’s the boss.

Shivangi Agarwal is an Applied Scientist on Amazon's Prime Video team, where she focuses on optimizing LLM inference and developing intelligent ranking systems for Prime Video using query-level signals. She's driven by a passion for building efficient, scalable AI that delivers real-world impact. When she's not working, you'll likely find her catching a good movie, discovering new places, or keeping up with her adventurous 3-year-old kid.

Sukhdeep Singh Kharbanda is an Applied Science Manager at Amazon Core Search. In his current role, Sukhdeep leads the Amazon Inference team in building generative AI inference optimization solutions and inference systems at scale for fast inference at low cost. Outside work, he enjoys playing with his kid and cooking different cuisines.

Rahul Goutam is an Applied Science Manager at Amazon Core Search, where he leads teams of scientists and engineers to build scalable AI solutions that power flexible and intuitive shopping experiences. When he’s off the clock, he enjoys hiking a trail or skiing down one.

Yang Zhou is a software engineer working on building and optimizing machine learning systems. His recent focus is enhancing the performance and cost efficiency of generative AI inference. Beyond work, he enjoys traveling and has recently discovered a passion for running long distances.

RJ is an engineer at Amazon. He builds and optimizes distributed systems for training and works on optimizing systems to reduce latency for ML inference. Outside work, he is exploring the use of generative AI for building food recipes.

James Park is a Principal Machine Learning Specialist Solutions Architect at Amazon Web Services. He works with Amazon to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, experiences, and staying up to date with the latest technology trends.
