AWS Machine Learning Blog 2024年08月09日
Cisco achieves 50% latency improvement using Amazon SageMaker Inference faster autoscaling feature

The Cisco Webex AI team uses Amazon SageMaker to advance its generative AI efforts and improved the latency of its generative AI inference endpoints with SageMaker's faster autoscaling feature. By migrating its large language models to Amazon SageMaker Inference, the team achieved faster inference, better scalability, and improved cost efficiency.

This post is co-authored with Travis Mehlinger and Karthik Raghunathan from Cisco.

Webex by Cisco is a leading provider of cloud-based collaboration solutions, which include video meetings, calling, messaging, events, polling, asynchronous video, and customer experience solutions like contact center and purpose-built collaboration devices. Webex’s focus on delivering inclusive collaboration experiences fuels its innovation, which leverages AI and machine learning to remove the barriers of geography, language, personality, and familiarity with technology. Its solutions are underpinned with security and privacy by design. Webex works with the world’s leading business and productivity apps – including AWS.

Cisco’s Webex AI (WxAI) team plays a crucial role in enhancing these products with AI-driven features and functionalities, leveraging LLMs to improve user productivity and experiences. In the past year, the team has increasingly focused on building artificial intelligence (AI) capabilities powered by large language models (LLMs) to improve productivity and experience for users. Notably, the team’s work extends to Webex Contact Center, a cloud-based omni-channel contact center solution that empowers organizations to deliver exceptional customer experiences. By integrating LLMs, the WxAI team enables advanced capabilities such as intelligent virtual assistants, natural language processing, and sentiment analysis, allowing Webex Contact Center to provide more personalized and efficient customer support. However, as these LLMs grew to contain hundreds of gigabytes of data, the WxAI team faced challenges in efficiently allocating resources and starting applications with the embedded models. To optimize its AI/ML infrastructure, Cisco migrated its LLMs to Amazon SageMaker Inference, improving speed, scalability, and price-performance.

This blog post highlights how Cisco adopted the faster autoscaling release. For more details on Cisco’s use cases, solution, and benefits, see How Cisco accelerated the use of generative AI with Amazon SageMaker Inference.

In this post, we will discuss the following:

    Overview of Cisco’s use case and architecture
    Introduction of the new faster autoscaling feature
        Single-model real-time endpoint deployment using Amazon SageMaker InferenceComponents
    Results of the performance improvements Cisco saw with the faster autoscaling feature for generative AI inference
    Next steps

Cisco’s Use Case: Enhancing Contact Center Experiences

Webex is applying generative AI to its contact center solutions, enabling more natural, human-like conversations between customers and agents. The AI can generate contextual, empathetic responses to customer inquiries, as well as automatically draft personalized emails and chat messages. This helps contact center agents work more efficiently while maintaining a high level of customer service.

Architecture

Initially, WxAI embedded LLM models directly into the application container images running on Amazon Elastic Kubernetes Service (Amazon EKS). However, as the models grew larger and more complex, this approach faced significant scalability and resource utilization challenges. Operating the resource-intensive LLMs through the applications required provisioning substantial compute resources, which slowed down processes like allocating resources and starting applications. This inefficiency hampered WxAI’s ability to rapidly develop, test, and deploy new AI-powered features for the Webex portfolio.

To address these challenges, the WxAI team turned to SageMaker Inference – a fully managed AI inference service that allows seamless deployment and scaling of models independently from the applications that use them. By decoupling the LLM hosting from the Webex applications, WxAI could provision the necessary compute resources for the models without impacting the core collaboration and communication capabilities.

“The applications and the models work and scale fundamentally differently, with entirely different cost considerations. By separating them rather than lumping them together, it’s much simpler to solve issues independently.”

– Travis Mehlinger, Principal Engineer at Cisco. 

This architectural shift has enabled Webex to harness the power of generative AI across its suite of collaboration and customer engagement solutions.

Previously, SageMaker endpoints autoscaled based on the invocations-per-instance metric. However, with that metric it takes approximately 6 minutes to detect the need for autoscaling.

Introducing New Predefined Metric Types for Faster Autoscaling

The Cisco Webex AI team wanted to improve their inference autoscaling times, so they worked with the Amazon SageMaker team to improve inference autoscaling.

Amazon SageMaker’s real-time inference endpoint offers a scalable, managed solution for hosting Generative AI models. This versatile resource can accommodate multiple instances, serving one or more deployed models for instant predictions. Customers have the flexibility to deploy either a single model or multiple models using SageMaker InferenceComponents on the same endpoint. This approach allows for efficient handling of diverse workloads and cost-effective scaling.
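To make the deployment model concrete, the following is a minimal sketch of building the request for deploying a model as an inference component on an existing real-time endpoint with boto3. The endpoint, variant, and model names (such as `webex-llm-endpoint`) are illustrative placeholders, not Cisco's actual resources, and the resource requirements are assumptions.

```python
# Sketch: deploying a model as an inference component on a SageMaker
# real-time endpoint. Names and resource sizes here are illustrative
# assumptions, not Cisco's actual configuration.

def inference_component_spec(model_name: str,
                             accelerators_per_copy: int = 1,
                             copies: int = 1) -> dict:
    """Build the kwargs for sagemaker.create_inference_component()."""
    return {
        "InferenceComponentName": f"{model_name}-ic",
        "EndpointName": "webex-llm-endpoint",   # hypothetical endpoint name
        "VariantName": "AllTraffic",
        "Specification": {
            "ModelName": model_name,
            "ComputeResourceRequirements": {
                "NumberOfAcceleratorDevicesRequired": accelerators_per_copy,
                "MinMemoryRequiredInMb": 16384,  # assumed for an 8B model
            },
        },
        # CopyCount is the knob autoscaling adjusts when using
        # inference components.
        "RuntimeConfig": {"CopyCount": copies},
    }

spec = inference_component_spec("llama3-8b")
# In a real deployment you would then call:
#   boto3.client("sagemaker").create_inference_component(**spec)
```

Because each inference component declares its own compute requirements, several models can share one endpoint's instances while scaling their copy counts independently.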

To optimize real-time inference workloads, SageMaker employs application automatic scaling (auto scaling). This feature dynamically adjusts both the number of instances in use and the quantity of model copies deployed (when using inference components), responding to real-time changes in demand. When traffic to the endpoint surpasses a predefined threshold, auto scaling increases the available instances and deploys additional model copies to meet the heightened demand. Conversely, as workloads decrease, the system automatically removes unnecessary instances and model copies, effectively reducing costs. This adaptive scaling ensures that resources are optimally utilized, balancing performance needs with cost considerations in real-time.

Working with Cisco, Amazon SageMaker released a new sub-minute, high-resolution predefined metric type, SageMakerVariantConcurrentRequestsPerModelHighResolution, for faster autoscaling and reduced detection time. This newer high-resolution metric has been shown to reduce scaling detection times by up to 6x (compared to the existing SageMakerVariantInvocationsPerInstance metric), thereby improving overall end-to-end inference latency by up to 50% on endpoints hosting generative AI models like Llama3-8B.
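A target-tracking scaling policy using the new metric is attached through Application Auto Scaling. The sketch below builds such a policy request; the endpoint and variant names, target concurrency, and cooldown values are assumptions for illustration, not Cisco's production settings.

```python
# Sketch: target-tracking autoscaling policy for a SageMaker endpoint
# variant using the new high-resolution concurrency metric. Endpoint
# name, target value, and cooldowns are illustrative assumptions.

def scaling_policy_kwargs(endpoint: str, variant: str,
                          target_concurrency: float = 5.0) -> dict:
    """Build kwargs for application-autoscaling put_scaling_policy()."""
    return {
        "PolicyName": f"{endpoint}-concurrency-scaling",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Scale out when concurrent requests per model exceed this.
            "TargetValue": target_concurrency,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType":
                    "SageMakerVariantConcurrentRequestsPerModelHighResolution",
            },
            # Shorter cooldowns pair well with the faster detection times.
            "ScaleInCooldown": 300,
            "ScaleOutCooldown": 60,
        },
    }

policy = scaling_policy_kwargs("webex-llm-endpoint", "AllTraffic")
# Applied with:
#   boto3.client("application-autoscaling").put_scaling_policy(**policy)
```

The scalable target (minimum and maximum instance counts) would be registered separately with `register_scalable_target` before attaching the policy.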

With this new release, SageMaker real-time endpoints now also emit the new ConcurrentRequestsPerModel and ConcurrentRequestsPerModelCopy CloudWatch metrics, which are better suited for monitoring and scaling Amazon SageMaker endpoints hosting LLMs and FMs.

Cisco’s Evaluation of the Faster Autoscaling Feature for GenAI Inference

Cisco evaluated Amazon SageMaker’s new predefined metric types for faster autoscaling on their generative AI workloads. They observed up to a 50% improvement in end-to-end inference latency by using the new SageMakerVariantConcurrentRequestsPerModelHighResolution metric, compared to the existing SageMakerVariantInvocationsPerInstance metric.

The setup involved using their Generative AI models, on SageMaker’s real-time inference endpoints. SageMaker’s autoscaling feature dynamically adjusted both the number of instances and the quantity of model copies deployed to meet real-time changes in demand. The new high-resolution SageMakerVariantConcurrentRequestsPerModelHighResolution metric reduced scaling detection times by up to 6x, enabling faster autoscaling and lower latency.

In addition, SageMaker now emits new CloudWatch metrics, including ConcurrentRequestsPerModel and ConcurrentRequestsPerModelCopy, which are better suited for monitoring and scaling endpoints hosting large language models (LLMs) and foundation models (FMs). This enhanced autoscaling capability has been a game-changer for Cisco, helping to improve the performance and efficiency of their critical Generative AI applications.

“We are really pleased with the performance improvements we’ve seen from Amazon SageMaker’s new autoscaling metrics. The higher-resolution scaling metrics have significantly reduced latency during initial load and scale-out on our Gen AI workloads. We’re excited to do a broader rollout of this feature across our infrastructure.”

– Travis Mehlinger, Principal Engineer at Cisco.

Cisco further plans to work with the SageMaker Inference team to drive improvements in the remaining variables that impact autoscaling latency, such as model download and load times.

Conclusion

Cisco’s Webex AI team is continuing to leverage Amazon SageMaker Inference to power generative AI experiences across its Webex portfolio. Evaluation with faster autoscaling from SageMaker has shown Cisco up to 50% latency improvements in its GenAI inference endpoints. As the WxAI team continues to push the boundaries of AI-driven collaboration, its partnership with Amazon SageMaker will be crucial in informing upcoming improvements and advanced GenAI inference capabilities. With this new feature, Cisco looks forward to further optimizing its AI inference performance by rolling it out broadly in multiple regions and delivering even more impactful generative AI features to its customers.


About the Authors

Travis Mehlinger is a Principal Software Engineer in the Webex Collaboration AI group, where he helps teams develop and operate cloud-native AI and ML capabilities to support Webex AI features for customers around the world. In his spare time, Travis enjoys cooking barbecue, playing video games, and traveling around the US and UK to race go karts.

Karthik Raghunathan is the Senior Director for Speech, Language, and Video AI in the Webex Collaboration AI Group. He leads a multidisciplinary team of software engineers, machine learning engineers, data scientists, computational linguists, and designers who develop advanced AI-driven features for the Webex collaboration portfolio. Prior to Cisco, Karthik held research positions at MindMeld (acquired by Cisco), Microsoft, and Stanford University.

Praveen Chamarthi is a Senior AI/ML Specialist with Amazon Web Services. He is passionate about AI/ML and all things AWS. He helps customers across the Americas to scale, innovate, and operate ML workloads efficiently on AWS. In his spare time, Praveen loves to read and enjoys sci-fi movies.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, multi-tenant models, cost optimizations, and making deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.

Ravi Thakur is a Sr Solutions Architect Supporting Strategic Industries at AWS, and is based out of Charlotte, NC. His career spans diverse industry verticals, including banking, automotive, telecommunications, insurance, and energy. Ravi’s expertise shines through his dedication to solving intricate business challenges on behalf of customers, utilizing distributed, cloud-native, and well-architected design patterns. His proficiency extends to microservices, containerization, AI/ML, Generative AI, and more. Today, Ravi empowers AWS Strategic Customers on personalized digital transformation journeys, leveraging his proven ability to deliver concrete, bottom-line benefits.
