MarkTechPost@AI · October 15, 2024
Revolutionizing Fine-Tuned Small Language Model Deployments: Introducing Predibase’s Next-Gen Inference Engine

Predibase has announced the Predibase Inference Engine, a new infrastructure offering designed to be the best platform for serving fine-tuned small language models (SLMs). By making SLM deployments faster, easily scalable, and more cost-effective, the Inference Engine helps enterprises manage the complexities of productionizing AI. It is built from the ground up on Predibase's innovations, Turbo LoRA and LoRA eXchange (LoRAX), to offer a best-in-class experience for serving fine-tuned SLMs.

🚀 **Performance Bottlenecks:** Most cloud providers' entry-level GPUs struggle with production use cases, especially those with spiky or variable workloads, resulting in slow response times and a diminished customer experience. In addition, because many cloud environments lack GPU autoscaling, scaling LLM deployments to meet peak demand without prohibitive costs or degraded performance is a major challenge.

🏗️ **Engineering Complexity:** Adopting open-source models for production use requires enterprises to manage the entire serving infrastructure themselves, a high-stakes, resource-intensive proposition. This adds significant engineering complexity, demands specialized expertise, and forces teams to devote substantial resources to ensuring reliable performance and scalability in production.

💰 **High Infrastructure Costs:** High-performing GPUs such as the NVIDIA H100 and A100 are in high demand and often in limited supply from cloud providers, which can lead to shortages. These GPUs are typically offered in "always-on" deployment models that guarantee availability but can be expensive, since billing continues regardless of actual usage.

💡 **LoRAX:** LoRA eXchange (LoRAX) allows hundreds of fine-tuned SLMs to be served from a single GPU. By minimizing the number of GPUs required, it significantly reduces infrastructure costs and is especially useful for enterprises that need to deploy many specialized models without the overhead of dedicating a GPU to each one.

⚡ **Turbo LoRA:** Turbo LoRA is Predibase's parameter-efficient fine-tuning method that improves throughput by 2-3x while matching or exceeding GPT-4 in response quality. These throughput gains dramatically reduce inference cost and latency, even for high-volume use cases.

📉 **FP8 Quantization:** Implementing FP8 quantization cuts the memory footprint of deploying fine-tuned SLMs by 50%, yielding nearly 2x higher throughput. This optimization improves not only performance but also cost efficiency, allowing up to 2x as many concurrent requests to be handled on the same number of GPUs.

📈 **GPU Autoscaling:** Predibase SaaS deployments can adjust GPU resources dynamically based on real-time demand. This flexibility ensures efficient resource utilization, reducing waste and cost during fluctuations in demand.

Predibase announces the Predibase Inference Engine, their new infrastructure offering designed to be the best platform for serving fine-tuned small language models (SLMs). The Predibase Inference Engine dramatically improves SLM deployments by making them faster, easily scalable, and more cost-effective for enterprises grappling with the complexities of productionizing AI. Built on Predibase’s innovations–Turbo LoRA and LoRA eXchange (LoRAX)–the Predibase Inference Engine is designed from the ground up to offer a best-in-class experience for serving fine-tuned SLMs.

The need for such an innovation is clear. As AI becomes more entrenched in the fabric of enterprise operations, the challenges associated with deploying and scaling SLMs have grown increasingly daunting. Homegrown infrastructure is often ill-equipped to handle the dynamic demands of high-volume AI workloads, leading to inflated costs, diminished performance, and operational bottlenecks. The Predibase Inference Engine addresses these challenges head-on, offering a tailor-made solution for enterprise AI deployments.

Join the Predibase webinar on October 29th to learn more about the Predibase Inference Engine!

The Key Challenges in Deploying LLMs at Scale

As businesses continue to integrate AI into their core operations and need to prove ROI, the demand for efficient, scalable solutions has skyrocketed. The deployment of LLMs, and fine-tuned SLMs in particular, has become a critical component of successful AI initiatives but presents significant challenges at scale:

- Performance Bottlenecks: Most cloud providers' entry-level GPUs struggle with production use cases, especially those with spiky or variable workloads, resulting in slow response times and a diminished customer experience. Additionally, scaling LLM deployments to meet peak demand without incurring prohibitive costs or performance degradation is a significant challenge due to the lack of GPU autoscaling capabilities in many cloud environments.
- Engineering Complexity: Adopting open-source models for production use requires enterprises to manage the entire serving infrastructure themselves, a high-stakes, resource-intensive proposition. This adds significant engineering complexity, demanding specialized expertise and forcing teams to devote substantial resources to ensure reliable performance and scalability in production environments.
- High Infrastructure Costs: High-performing GPUs like the NVIDIA H100 and A100 are in high demand and often have limited availability from cloud providers, leading to potential shortages. These GPUs are typically offered in "always-on" deployment models, which ensure availability but can be costly due to continuous billing, regardless of actual usage.

These challenges underscore the need for a solution like the Predibase Inference Engine, which is designed to streamline the deployment process and provide a scalable, cost-effective infrastructure for managing SLMs.

Technical Breakthroughs in the Predibase Inference Engine

At the heart of the Predibase Inference Engine is a set of innovative features that collectively enhance the deployment of SLMs: LoRA eXchange (LoRAX), Turbo LoRA, FP8 quantization, and optimized GPU autoscaling, each covered in detail below.

These technical innovations are crucial for enterprises looking to deploy AI solutions that are both powerful and economical. By addressing the core challenges associated with traditional model serving, the Predibase Inference Engine sets a new standard for efficiency and scalability in AI deployments.

LoRA eXchange: Scale 100+ Fine-Tuned LLMs Efficiently on a Single GPU

LoRAX is a cutting-edge serving infrastructure designed to address the challenges of deploying multiple fine-tuned SLMs efficiently. Unlike traditional methods that require each fine-tuned model to run on dedicated GPU resources, LoRAX allows organizations to serve hundreds of fine-tuned SLMs on a single GPU, drastically reducing costs. By utilizing dynamic adapter loading, tiered weight caching, and multi-adapter batching, LoRAX optimizes GPU memory usage and maintains high throughput for concurrent requests. This innovative infrastructure enables cost-effective deployment of fine-tuned SLMs, making it easier for enterprises to scale AI models specialized to their unique tasks.
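
As a concrete illustration of this multi-adapter pattern, here is a minimal sketch that sends prompts for two different fine-tuned adapters to a single LoRAX deployment. The endpoint URL and adapter IDs are placeholders, and the request shape follows the open-source LoRAX `/generate` API, so check it against the LoRAX documentation for your version before relying on it.

```python
# Minimal sketch: querying one LoRAX deployment with several fine-tuned adapters.
# Assumes a LoRAX server is listening on localhost:8080 and that the adapter IDs
# below point to real LoRA adapters; both are illustrative placeholders.
import requests

LORAX_URL = "http://localhost:8080/generate"  # hypothetical local deployment

# Many task-specific adapters share the same base model and the same GPU;
# LoRAX loads the requested adapter on demand and batches across adapters.
ADAPTERS = {
    "support-ticket-classifier": "my-org/ticket-classifier-lora",  # placeholder IDs
    "contract-summarizer": "my-org/contract-summarizer-lora",
}

def generate(prompt: str, adapter_id: str, max_new_tokens: int = 128) -> str:
    payload = {
        "inputs": prompt,
        "parameters": {
            "adapter_id": adapter_id,        # selects which fine-tuned LoRA to apply
            "max_new_tokens": max_new_tokens,
        },
    }
    resp = requests.post(LORAX_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["generated_text"]

if __name__ == "__main__":
    print(generate("Classify this ticket: 'My invoice is wrong.'",
                   ADAPTERS["support-ticket-classifier"]))
    print(generate("Summarize: 'The parties agree to ...'",
                   ADAPTERS["contract-summarizer"]))
```

Because both requests hit the same deployment, the second adapter does not require another GPU; only the adapter weights are swapped or batched in.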

Get more out of your GPU: 4x speed improvements for SLMs with Turbo LoRA and FP8

Optimizing SLM inference is crucial for scaling AI deployments, and two key techniques are driving major throughput performance gains. Turbo LoRA boosts throughput by 2-3x through speculative decoding, making it possible to predict multiple tokens in one step without sacrificing output quality. Additionally, FP8 quantization further increases GPU throughput, enabling much more cost-effective deployments when using modern hardware like NVIDIA L40S GPUs.

Turbo LoRA Increases Throughput by 2-3x

Turbo LoRA combines Low Rank Adaptation (LoRA) and speculative decoding to enhance the performance of SLM inference. LoRA improves response quality by adding new parameters tailored to specific tasks, but it typically slows down token generation due to the extra computational steps. Turbo LoRA addresses this by enabling the model to predict multiple tokens in one step, significantly increasing throughput by 2-3 times compared to base models without compromising output quality.

Turbo LoRA is particularly effective because it adapts to all types of GPUs, including high-performing models like H100s and entry-level models like the A10G. This universal compatibility ensures that organizations can deploy Turbo LoRA across different hardware setups (whether in Predibase's cloud or their VPC environment) without needing specific adjustments for each GPU type. This makes Turbo LoRA a cost-effective solution for enhancing the performance of SLMs across a wide range of computing environments.

In addition, Turbo LoRA achieves these benefits through a single model, whereas the majority of speculative decoding implementations require a draft model alongside the main model. This further reduces GPU requirements and network overhead.
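
Turbo LoRA's single-model speculative decoding is Predibase's own implementation, but the general accept-or-reject idea can be illustrated with the classic two-model variant described above: a small draft model proposes several tokens and the main model verifies them, keeping the longest prefix it agrees with. The sketch below is a greedy toy with stubbed model functions, not Predibase's method.

```python
# Toy greedy speculative decoding: a cheap draft model proposes k tokens, the
# target model verifies them, and the longest agreed-upon prefix is kept.
# Turbo LoRA gets a similar "many tokens per step" effect from a single model;
# this sketch only illustrates the general accept/verify idea.
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],   # greedy next-token fn of the draft model
    target_next: Callable[[List[int]], int],  # greedy next-token fn of the target model
    k: int = 4,
    max_new_tokens: int = 16,
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) The draft model proposes k tokens autoregressively (cheap to run).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) The target model verifies the proposal; in a real engine these k
        #    checks happen in one batched forward pass, which is the speedup.
        for i in range(k):
            expected = target_next(tokens + proposal[:i])
            if expected != proposal[i]:
                # 3) First disagreement: keep the verified prefix plus the
                #    target's own token, then start the next round.
                tokens = tokens + proposal[:i] + [expected]
                break
        else:
            tokens = tokens + proposal  # every drafted token was accepted
    return tokens[: len(prompt) + max_new_tokens]

if __name__ == "__main__":
    # Stub "models" that agree most of the time; the output matches plain greedy
    # decoding with the target model, just produced in fewer verification rounds.
    target = lambda ctx: (len(ctx) * 7) % 100
    draft = lambda ctx: 0 if len(ctx) % 5 == 0 else (len(ctx) * 7) % 100
    print(speculative_decode([1, 2, 3], draft, target))
```

The speedup comes from how often the verifier accepts the drafted tokens: the more it agrees, the more tokens are produced per expensive forward pass, without any change to the final output.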

Further Increase Throughput with FP8

FP8 quantization is a technique that reduces the precision of a model’s data format from a standard floating-point representation, such as FP16, to an 8-bit floating-point format. This compression reduces the model’s memory footprint by up to 50%, allowing it to process data more efficiently and increasing throughput on GPUs. The smaller size means that less memory is required to store weights and perform matrix multiplications, which consequently can nearly double the throughput of a given GPU.
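
As a rough, self-contained illustration of the memory claim, the sketch below quantizes a half-precision weight matrix to FP8 (E4M3) with a simple per-tensor scale and compares the storage footprint. It assumes PyTorch 2.1+ for the `torch.float8_e4m3fn` dtype and is a weight-only toy; a production engine would also run the matrix multiplications in FP8 on supporting hardware.

```python
# Weight-only FP8 (E4M3) toy: cast an FP16 weight matrix to 8-bit floats with a
# per-tensor scale and compare memory. This shows only the 2x storage reduction;
# kernel-level FP8 matmuls on H100/L40S-class GPUs deliver the throughput gains.
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8(w: torch.Tensor):
    scale = w.abs().amax().float() / E4M3_MAX           # per-tensor scale
    w_fp8 = (w.float() / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

def dequantize(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (w_fp8.float() * scale).half()

if __name__ == "__main__":
    w16 = torch.randn(4096, 4096, dtype=torch.float16)  # a stand-in weight matrix
    w8, scale = quantize_fp8(w16)

    mib = lambda t: t.nelement() * t.element_size() / 2**20
    print(f"FP16: {mib(w16):.1f} MiB, FP8: {mib(w8):.1f} MiB")  # ~32 MiB vs ~16 MiB

    err = (dequantize(w8, scale).float() - w16.float()).abs().mean().item()
    print(f"mean absolute quantization error: {err:.4f}")
```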

Beyond just performance enhancements, FP8 quantization also impacts the cost-efficiency of deploying SLMs. By increasing the number of concurrent requests a GPU can handle, organizations can meet their performance SLAs with fewer compute resources. While only the latest generations of NVIDIA GPUs support FP8, applying FP8 to L40S GPUs, which are now more readily available in Amazon EC2, increases throughput enough to outperform an A100 GPU while costing roughly 33% less.

Optimized GPU Scaling for Performance and Cost Efficiency

GPU autoscaling is a critical feature for managing AI workloads, ensuring that resources are dynamically adjusted based on real-time demand. Our Inference Engine’s ability to scale GPU resources as needed helps enterprises optimize usage, reducing costs by only scaling up when demand increases and scaling down during quieter periods. This flexibility allows organizations to maintain high-performance AI operations without over-provisioning resources.

For applications that require consistent performance, our platform offers the option to reserve GPU capacity, guaranteeing availability during peak loads. This is particularly valuable for use cases where response times are crucial, ensuring that even during traffic spikes, AI models perform without interruptions or delays. Reserved capacity ensures enterprises meet their performance SLAs without unnecessary over-allocation of resources.

Additionally, the Inference Engine minimizes cold start times by rapidly scaling resources, reducing delays in startup and ensuring quick adjustments to sudden increases in traffic. This feature enhances the responsiveness of the system, allowing organizations to handle unpredictable traffic surges efficiently and without compromising on performance.

In addition to optimizing performance, GPU autoscaling significantly reduces deployment costs. Unlike traditional "always-on" GPU deployments, which incur continuous expenses regardless of actual usage, autoscaling ensures resources are allocated only when needed. In one representative enterprise workload, a standard always-on deployment would cost over $213,000 per year, while an autoscaling deployment reduces that to less than $155,000 annually, a savings of nearly 30%. (It's important to note that both deployment configurations cost less than half as much as using fine-tuned GPT-4o-mini.) By dynamically adjusting GPU resources based on real-time demand, enterprises can achieve high performance without the burden of overpaying for idle infrastructure, making AI deployments far more cost-effective.
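
The quick calculation below reproduces the savings quoted above from the article's own figures and adds a generic helper for comparing always-on and autoscaled GPU spend; the hourly rate and utilization fraction in the second example are illustrative assumptions, not Predibase benchmarks.

```python
# Back-of-the-envelope check of the savings quoted above, plus a generic helper
# for comparing "always-on" vs. autoscaled GPU spend.

def savings_pct(always_on_cost: float, autoscaled_cost: float) -> float:
    return 100.0 * (always_on_cost - autoscaled_cost) / always_on_cost

# Figures from the example in the text: ~$213k always-on vs. ~$155k autoscaled.
print(f"{savings_pct(213_000, 155_000):.1f}% annual savings")  # ~27%, i.e. nearly 30%

def annual_gpu_cost(hourly_rate: float, utilization: float = 1.0) -> float:
    """Annual cost of one GPU billed only while it is scaled up.

    utilization=1.0 models an always-on deployment; a lower value models
    autoscaling that releases the GPU during quiet periods.
    """
    return hourly_rate * 24 * 365 * utilization

# Hypothetical example: a $2.50/hr GPU that autoscaling keeps busy 60% of the time.
always_on = annual_gpu_cost(2.50)
autoscaled = annual_gpu_cost(2.50, utilization=0.60)
print(f"always-on ${always_on:,.0f} vs autoscaled ${autoscaled:,.0f} "
      f"({savings_pct(always_on, autoscaled):.0f}% saved)")
```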

Enterprise readiness

Designing AI infrastructure for enterprise applications is complex, with many critical details to manage if you’re building your own. From security compliance to ensuring high availability across regions, enterprise-scale deployments require careful planning. Teams must balance performance, scalability, and cost-efficiency while integrating with existing IT systems.

Predibase’s Inference Engine simplifies this by offering enterprise-ready solutions that address these challenges, including VPC integration, multi-region high availability, and real-time deployment insights. These features help enterprises like Convirza deploy and manage AI workloads at scale without the operational burden of building and maintaining infrastructure themselves.

“At Convirza, our workload can be extremely variable, with spikes that require scaling up to double-digit A100 GPUs to maintain performance. The Predibase Inference Engine and LoRAX allow us to efficiently serve 60 adapters while consistently achieving an average response time of under two seconds,” said Giuseppe Romagnuolo, VP of AI at Convirza. “Predibase provides the reliability we need for these high-volume workloads. The thought of building and maintaining this infrastructure on our own is daunting—thankfully, with Predibase, we don’t have to.”

Our cloud or yours: Virtual Private Clouds

The Predibase Inference Engine is available in our cloud or yours. Enterprises can choose between deploying within their own private cloud infrastructure or utilizing Predibase’s fully managed SaaS platform. This flexibility ensures seamless integration with existing enterprise IT policies, security protocols, and compliance requirements. Whether companies prefer to keep their data and models entirely within their Virtual Private Cloud (VPC) for enhanced security and to take advantage of cloud provider spend commitments or leverage Predibase’s SaaS for added flexibility, the platform adapts to meet diverse enterprise needs.

Multi-Region High Availability

The Inference Engine’s multi-region deployment feature ensures that enterprises can maintain uninterrupted service, even in the event of regional outages or disruptions. In the event of a disruption, the platform automatically reroutes traffic to a functioning region and spins up additional GPUs to handle the increased demand. This rapid scaling of resources minimizes downtime and ensures that enterprises can maintain their service-level agreements (SLAs) without compromising performance or reliability.

By dynamically provisioning extra GPUs in the failover region, the Inference Engine provides immediate capacity to support critical AI workloads, allowing businesses to continue operating smoothly even in the face of unexpected failures. This combination of multi-region redundancy and autoscaling guarantees that enterprises can deliver consistent, high-performance services to their users, no matter the circumstances.
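
The platform handles this routing automatically, but the basic pattern, preferring a primary regional endpoint and falling back to a secondary one when it is unhealthy, can be sketched from the client's point of view as follows. The endpoint URLs and retry policy are hypothetical.

```python
# Hypothetical client-side view of multi-region failover: prefer the primary
# region, fall back to a secondary if the request fails. In the Inference Engine
# this routing (plus provisioning extra GPUs in the healthy region) happens on
# the platform side; the URLs and policy below are illustrative only.
import requests

REGION_ENDPOINTS = [
    "https://us-east-1.example-inference.com/generate",  # primary (placeholder)
    "https://eu-west-1.example-inference.com/generate",  # failover (placeholder)
]

def generate_with_failover(payload: dict, timeout: float = 10.0) -> dict:
    last_error = None
    for url in REGION_ENDPOINTS:
        try:
            resp = requests.post(url, json=payload, timeout=timeout)
            resp.raise_for_status()
            return resp.json()                 # first healthy region wins
        except requests.RequestException as err:
            last_error = err                   # region unhealthy: try the next one
    raise RuntimeError(f"all regions failed: {last_error}")
```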

Maximizing Efficiency with Real-Time Deployment Insights

In addition to the Inference Engine’s powerful autoscaling and multi-region capabilities, Predibase’s Deployment Health Analytics provide essential real-time insights for monitoring and optimizing your deployments. This tool tracks critical metrics like request volume, throughput, GPU utilization, and queue duration, giving you a comprehensive view of how well your infrastructure is performing. By using these insights, enterprises can easily balance performance with cost efficiency, scaling GPU resources up or down as needed to meet fluctuating demand while avoiding over-provisioning.

With customizable autoscaling thresholds, Deployment Health Analytics allows you to fine-tune your strategy based on specific operational needs. Whether it’s ensuring that GPUs are efficiently utilized during traffic spikes or scaling down resources to minimize costs, these analytics empower businesses to maintain high-performance deployments that run smoothly at all times. For more details on optimizing your deployment strategy, check out the full blog post.
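
To make the threshold idea concrete, here is a toy scaling policy over the kinds of metrics mentioned above (GPU utilization and queue duration). The metric fields, thresholds, and replica bounds are hypothetical illustrations, not Predibase's Deployment Health Analytics schema.

```python
# Toy threshold-based scaling policy over the kinds of metrics mentioned above.
# The dataclass fields and thresholds are hypothetical, chosen for illustration.
from dataclasses import dataclass

@dataclass
class DeploymentMetrics:
    gpu_utilization: float   # 0.0 - 1.0, averaged over the scaling window
    queue_seconds: float     # average time requests wait before execution
    replicas: int            # currently provisioned GPU replicas

def desired_replicas(m: DeploymentMetrics,
                     min_replicas: int = 0,
                     max_replicas: int = 8) -> int:
    if m.queue_seconds > 2.0 or m.gpu_utilization > 0.85:
        target = m.replicas + 1            # scale up: requests are queuing
    elif m.queue_seconds < 0.2 and m.gpu_utilization < 0.30:
        target = m.replicas - 1            # scale down: capacity is sitting idle
    else:
        target = m.replicas                # within the comfort band: hold steady
    return max(min_replicas, min(max_replicas, target))

# Example: a busy deployment with long queues gets one more replica.
print(desired_replicas(DeploymentMetrics(gpu_utilization=0.92, queue_seconds=3.1, replicas=2)))  # 3
```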

Why Choose Predibase?

Predibase is the leading platform for enterprises serving fine-tuned LLMs, offering unmatched infrastructure designed to meet the specific needs of modern AI workloads. Our Inference Engine is built for maximum performance, scalability, and security, ensuring enterprises can deploy fine-tuned models with confidence. With built-in compliance and a focus on cost-effective, reliable model serving, Predibase is the top choice for companies looking to serve fine-tuned LLMs at scale while maintaining enterprise-grade security and efficiency.

If you’re ready to take your LLM deployments to the next level, visit Predibase.com to learn more about the Predibase Inference Engine, or try it for free to see firsthand how our solutions can transform your AI operations.


Thanks to the Predibase team for the thought leadership and resources that informed this article. The Predibase AI team supported us in creating this content.

The post Revolutionizing Fine-Tuned Small Language Model Deployments: Introducing Predibase’s Next-Gen Inference Engine appeared first on MarkTechPost.
