AWS Machine Learning Blog, October 11, 2024
Scaling Rufus, the Amazon generative AI-powered conversational shopping assistant with over 80,000 AWS Inferentia and AWS Trainium chips, for Prime Day

Amazon Rufus is a shopping assistant experience powered by generative AI. It generates answers using relevant information from across Amazon and the web to help Amazon customers make better, more informed shopping decisions. With Rufus, customers can shop alongside a generative AI-powered expert that knows Amazon’s selection inside and out, and can bring it all together with information from across the web to help shoppers make more informed purchase decisions.

To meet the needs of Amazon customers at scale, Rufus required a low-cost, performant, and highly available infrastructure for inference. The solution needed the capability to serve multi-billion parameter large language models (LLMs) with low latency across the world to service its expansive customer base. Low latency makes sure users have a positive experience chatting with Rufus and can start getting responses in less than a second. To achieve this, the Rufus team is using multiple AWS services and AWS AI chips, AWS Trainium and AWS Inferentia.

Inferentia and Trainium are purpose-built chips developed by AWS that accelerate deep learning workloads with high performance and lower overall costs. With these chips, Rufus achieved 4.5 times lower cost than other evaluated solutions while maintaining low latency for its customers. In this post, we dive into the Rufus inference deployment using AWS chips and how this enabled one of the most demanding events of the year—Amazon Prime Day.

Solution overview

At its core, Rufus is powered by an LLM trained on Amazon’s product catalog and information from across the web. LLM deployment can be challenging, requiring you to balance factors such as model size, model accuracy, and inference performance. Larger models generally have better knowledge and reasoning capabilities but come at a higher cost due to more demanding compute requirements and increasing latency. Rufus would need to be deployed and scaled to meet the tremendous demand of peak events like Amazon Prime Day. Considerations for this scale include how well it needs to perform, its environmental impact, and the cost of hosting the solution. To meet these challenges, Rufus used a combination of AWS solutions: Inferentia2 and Trainium, Amazon Elastic Container Service (Amazon ECS), and Application Load Balancer (ALB). In addition, the Rufus team partnered with NVIDIA to power the solution using NVIDIA’s Triton Inference Server, providing capabilities to host the model using AWS chips.

Rufus inference is a Retrieval Augmented Generation (RAG) system with responses enhanced by retrieving additional information such as product information from Amazon search results. These results are based on the customer query, making sure the LLM generates reliable, high-quality, and precise responses.
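The retrieval step can be pictured as a simple prompt-augmentation flow: fetch product context for the customer query, then hand the augmented prompt to the LLM. The following is a minimal sketch of that idea, not the production Rufus pipeline; the `search_products` helper and the prompt template are illustrative placeholders.

```python
# Minimal RAG sketch: retrieve product context for a query, then build an
# augmented prompt for the LLM. `search_products` and the prompt template are
# illustrative placeholders, not the production Rufus pipeline.
from typing import Dict, List


def search_products(query: str, top_k: int = 5) -> List[Dict[str, str]]:
    """Placeholder for a call to a product search/retrieval service."""
    raise NotImplementedError


def build_augmented_prompt(query: str) -> str:
    products = search_products(query)
    context = "\n".join(f"- {p['title']}: {p['description']}" for p in products)
    return (
        "You are a shopping assistant. Use the product information below "
        "to answer the customer.\n"
        f"Products:\n{context}\n\n"
        f"Customer question: {query}\n"
        "Answer:"
    )
```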

To make sure Rufus was best positioned for Prime Day, the Rufus team built a heterogeneous inference system using multiple AWS Regions powered by Inferentia2 and Trainium. Building a system across multiple Regions allowed Rufus to benefit in two key areas. First, it provided additional capacity that could be used during times of high demand, and second, it improved the overall resiliency of the system.

The Rufus team was also able to use both Inf2 and Trn1 instance types. Because both instance types use the same AWS Neuron SDK, the team could serve the same Rufus model on either one. The only configuration setting to adjust was the tensor parallelism degree (24 for Inf2, 32 for Trn1). Using Trn1 instances also led to an additional 20% latency reduction and throughput improvement compared to Inf2.
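Because vLLM exposes the tensor parallelism degree as a single engine argument, switching between the two instance families is mostly a configuration change. The snippet below is a hedged sketch of what that might look like; the model identifier, the Neuron device option, and the exact flags are assumptions that depend on the vLLM and Neuron SDK versions in use.

```python
# Hedged sketch: serving the same model on Inf2 or Trn1 by changing only the
# tensor parallelism degree. The model name, the Neuron device option, and
# max_num_seqs are illustrative and version-dependent.
from vllm import LLM, SamplingParams

# Tensor parallelism degree per instance family, as described above.
INSTANCE_TP = {"inf2": 24, "trn1": 32}


def load_engine(instance_type: str) -> LLM:
    return LLM(
        model="my-org/rufus-like-llm",                    # placeholder model ID
        device="neuron",                                  # assumes vLLM's Neuron backend
        tensor_parallel_size=INSTANCE_TP[instance_type],
        max_num_seqs=32,                                  # illustrative concurrency limit
    )


engine = load_engine("trn1")
outputs = engine.generate(
    ["What should I look for in a trail running shoe?"],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```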

The following diagram illustrates the solution architecture.

To support real-time traffic routing across multiple Regions, Rufus built a novel traffic orchestrator. Amazon CloudWatch supported the underlying monitoring, helping the team adjust the traffic ratio across the different Regions in less than 15 minutes based on the traffic pattern changes. By using this type of orchestration, the Rufus team had the ability to direct requests to other Regions when needed, with a small trade-off of latency to the first token. Due to Rufus’s streaming architecture and the performant AWS network between Regions, the perceived latency was minimal for end-users.
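The orchestrator itself is internal to Rufus, but the basic loop of reading a load signal from CloudWatch and recomputing per-Region traffic weights can be sketched as follows. The metric namespace, metric name, and weighting rule are hypothetical, chosen only to illustrate the pattern.

```python
# Illustrative sketch of a cross-Region traffic adjuster: read a load metric
# per Region from CloudWatch and compute new traffic weights. Metric names and
# the weighting rule are hypothetical; the real Rufus orchestrator is internal.
from datetime import datetime, timedelta

import boto3

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]


def region_load(region: str) -> float:
    cw = boto3.client("cloudwatch", region_name=region)
    stats = cw.get_metric_statistics(
        Namespace="Rufus/Inference",          # hypothetical custom namespace
        MetricName="InFlightRequests",        # hypothetical custom metric
        StartTime=datetime.utcnow() - timedelta(minutes=5),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    return points[0]["Average"] if points else 0.0


def compute_weights() -> dict:
    loads = {r: region_load(r) for r in REGIONS}
    # Send proportionally more traffic to Regions with more spare headroom.
    headroom = {r: max(1.0, 100.0 - load) for r, load in loads.items()}
    total = sum(headroom.values())
    return {r: round(100 * h / total) for r, h in headroom.items()}
```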

These choices allowed Rufus to scale up to over 80,000 Trainium and Inferentia chips across three Regions, serving an average of 3 million tokens a minute while keeping P99 latency to the first response under 1 second for Prime Day customers. In addition, by using these purpose-built chips, Rufus achieved 54% better performance per watt than other evaluated solutions, which helped the Rufus team meet its energy efficiency goals.

Optimizing inference performance and host utilization

Within each Region, the Rufus inference system used Amazon ECS, which managed the underlying Inferentia and Trainium powered instances. By managing the underlying infrastructure, the Rufus team only needed to bring their container and configuration by defining an ECS task. Within each container, an NVIDIA Triton Inference Server with a Python backend runs vLLM with the Neuron SDK. vLLM is a memory-efficient inference and serving engine that is optimized for high throughput. The Neuron SDK makes it straightforward for teams to adopt AWS chips and supports many different libraries and frameworks such as PyTorch Lightning.
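The Triton Python backend expects a model.py that implements a TritonPythonModel class, and the vLLM engine sits inside that contract. The skeleton below is a simplified, synchronous sketch rather than the production Rufus backend; the model identifier, tensor names, and engine arguments are placeholders.

```python
# Simplified Triton Python backend (model.py) hosting a vLLM engine.
# The production Rufus backend streams tokens; this sketch returns the full
# completion in one response for brevity. Tensor names and engine arguments
# are placeholders.
import numpy as np
import triton_python_backend_utils as pb_utils
from vllm import LLM, SamplingParams


class TritonPythonModel:
    def initialize(self, args):
        self.engine = LLM(model="my-org/rufus-like-llm",  # placeholder model ID
                          device="neuron",
                          tensor_parallel_size=32)

    def execute(self, requests):
        responses = []
        for request in requests:
            prompt = (pb_utils.get_input_tensor_by_name(request, "PROMPT")
                      .as_numpy()[0].decode("utf-8"))
            result = self.engine.generate([prompt],
                                          SamplingParams(max_tokens=256))[0]
            completion = result.outputs[0].text
            out_tensor = pb_utils.Tensor(
                "COMPLETION",
                np.array([completion.encode("utf-8")], dtype=object))
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```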

The Neuron SDK provides a straightforward LLM inference solution on Trainium and Inferentia hardware with optimized performance supporting a wide range of transformer-based LLM architectures. To reduce latency, Rufus has collaborated with the AWS Annapurna team to develop various optimizations such as INT8 (weight-only) quantization, continuous batching with vLLM, and resource, compute, and memory bandwidth optimizations in the Neuron compiler and runtime. These optimizations are currently deployed in Rufus production and are available to use in the Neuron SDK 2.18 and onward.

To reduce overall waiting time for customers to start seeing a response from Rufus, the team also developed an inference streaming architecture. With the high compute and memory load needed for LLM inference, the total time it takes to finish generating the full response for a customer query can take multiple seconds. With a streaming architecture, Rufus is able to return the tokens right after they’re generated. This optimization allows the customer to start consuming the response in less than 1 second. In addition, multiple services work together using gRPC connections to intelligently aggregate and enhance the streaming response in real time for customers.
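Token-level streaming falls out naturally from an asynchronous engine loop that yields partial outputs as they are produced, with the gRPC services forwarding and enriching each chunk downstream. The following is a minimal sketch of that loop using vLLM's AsyncLLMEngine; the engine arguments are placeholders, the exact API varies by vLLM version, and the gRPC plumbing is omitted.

```python
# Minimal token-streaming sketch with vLLM's async engine. Engine arguments
# are placeholders; the gRPC services that aggregate and enrich the stream
# for Rufus are omitted here.
import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="my-org/rufus-like-llm",   # placeholder model ID
                    device="neuron",
                    tensor_parallel_size=32))


async def stream_answer(prompt: str):
    previous = ""
    async for request_output in engine.generate(
            prompt, SamplingParams(max_tokens=256),
            request_id=str(uuid.uuid4())):
        text = request_output.outputs[0].text
        delta = text[len(previous):]   # only the newly generated tokens
        previous = text
        if delta:
            yield delta                # hand each chunk to the gRPC stream


async def main():
    async for chunk in stream_answer("What makes a good espresso machine?"):
        print(chunk, end="", flush=True)


asyncio.run(main())
```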

As shown in the following figure, images and links are embedded in the response, which allow customers to engage and continue exploring with Rufus.

Scaling up

Although we have to maintain low latency for the best customer experience, it’s also crucial to scale the service throughput by achieving high hardware resource utilization. High hardware utilization makes sure accelerators don’t sit idle and needlessly increase costs. To optimize the inference system throughput, the team improved both single-host throughput as well as load balancing efficiency.

Load balancing for LLM inference is tricky due to the following challenges. First, a single host can only handle a limited number of concurrent requests. Second, the end-to-end latency to complete one request can vary, spanning many seconds depending on the LLM response length.

To address the challenges, the team optimized throughput by considering both single-host throughput and throughput across many hosts using load balancing.

The team used the least outstanding requests (LOR) routing algorithm from ALB, increasing throughput by five times in comparison to an earlier baseline measurement. This allows each host to have enough time to process in-flight requests and stream back responses using a gRPC connection, without getting overwhelmed by multiple requests received at the same time. Rufus also collaborated with AWS and vLLM teams to improve single-host concurrency using vLLM integration with the Neuron SDK and NVIDIA Triton Inference Server.
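On the ALB side, LOR is a target group attribute, so moving off the default round robin algorithm is a small configuration change. A sketch with boto3 follows; the target group ARN is a placeholder.

```python
# Enable least outstanding requests (LOR) routing on an existing ALB target
# group. The target group ARN is a placeholder.
import boto3

elbv2 = boto3.client("elbv2")
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:region:account-id:targetgroup/rufus-inference/placeholder",
    Attributes=[
        {"Key": "load_balancing.algorithm.type",
         "Value": "least_outstanding_requests"},
    ],
)
```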

Figure 1. ECS tasks scale horizontally hosting the Triton Inference Server and dependencies

With this integration, Rufus was able to benefit from a critical optimization: continuous batching. Continuous batching allows a single host to greatly increase throughput. In addition, continuous batching provides unique capabilities in comparison to other batch techniques, such as static batching. For example, when using static batching, the time to first token (TTFT) increases linearly with the number of requests in one batch. Continuous batching prioritizes the prefill stage for LLM inference, keeping TTFT under control even with more requests running at the same time. This helped Rufus provide a pleasant experience with low latency when generating the first response, and improve the single-host throughput to keep serving costs under control.
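The effect on TTFT can be illustrated with a toy scheduler: rather than waiting for the current batch to drain (static batching), new requests are admitted into the running batch at every decoding step and their prefill is scheduled promptly. The simulation below is purely illustrative, with made-up step counts; it is not the vLLM scheduler.

```python
# Toy illustration of continuous batching: new requests join the running
# batch at each step instead of waiting for the current batch to drain.
# Step counts are made up; this is not vLLM's actual scheduler.
from collections import deque


def continuous_batching(arrivals, max_batch=4, decode_steps=10):
    """arrivals: dict of request_id -> arrival step. Returns TTFT per request."""
    waiting = deque(sorted(arrivals, key=arrivals.get))
    running, remaining, ttft = [], {}, {}
    step = 0
    while waiting or running:
        # Admit newly arrived requests into the batch (prefill prioritized).
        while waiting and len(running) < max_batch and arrivals[waiting[0]] <= step:
            rid = waiting.popleft()
            running.append(rid)
            remaining[rid] = decode_steps
            ttft[rid] = step + 1 - arrivals[rid]   # first token after prefill step
        # One decoding step for every request currently in the batch.
        for rid in list(running):
            remaining[rid] -= 1
            if remaining[rid] == 0:
                running.remove(rid)
        step += 1
    return ttft


# Requests arriving mid-flight still see a TTFT of one step in this toy model,
# whereas static batching would make them wait for the previous batch to finish.
print(continuous_batching({"r1": 0, "r2": 0, "r3": 3, "r4": 5}))
```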

Conclusion

In this post, we discussed how Rufus is able to reliably deploy and serve its multi-billion-parameter LLM using the Neuron SDK with Inferentia2 and Trainium chips and AWS services. Rufus continues to evolve with advancements in generative AI and customer feedback, and we encourage you to try Inferentia and Trainium.

Learn more about how we are innovating with generative AI across Amazon.


About the author

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.

RJ is an Engineer within Amazon. He builds and optimizes distributed systems for training and works on optimizing adopted systems to reduce latency for ML inference. Outside work, he is exploring the use of generative AI for building food recipes.

Yang Zhou is a software engineer working on building and optimizing machine learning systems. His recent focus is enhancing the performance and cost efficiency of generative AI inference. Beyond work, he enjoys traveling and has recently discovered a passion for running long distances.

Adam (Hongshen) Zhao is a Software Development Manager at Amazon Stores Foundational AI. In his current role, Adam leads the Rufus Inference team to build GenAI inference optimization solutions and an inference system at scale for fast inference at low cost. Outside work, he enjoys traveling with his wife and creating art.

Faqin Zhong is a software engineer at Amazon Stores Foundational AI, working on Large Language Model (LLM) inference infrastructure and optimizations. Passionate about Generative AI technology, Faqin collaborates with leading teams to drive innovations, making LLMs more accessible and impactful, ultimately enhancing customer experiences across diverse applications. Outside of work she enjoys cardio exercise and baking with her son.

Nicolas Trown is an engineer in Amazon Stores Foundational AI. His recent focus is lending his systems expertise across Rufus to aid the Rufus Inference team and drive efficient utilization across the Rufus experience. Outside of work, he enjoys spending time with his wife and taking day trips to the nearby coast, Napa, and Sonoma areas.

Bing Yin is a director of science at Amazon Stores Foundational AI. He leads the effort to build LLMs that are specialized for shopping use cases and optimized for inference at Amazon scale. Outside of work, he enjoys running marathon races.
