MarkTechPost@AI | July 21, 2024
Together AI Unveils Revolutionary Inference Stack: Setting New Standards in Generative AI Performance

🚀 **Together Turbo endpoints**: These endpoints deliver fast FP8 performance while maintaining quality on par with FP16 models, beating other FP8 solutions on AlpacaEval 2.0 by up to 2.5 points. Together Turbo is priced at $0.18 per million tokens for the 8B model and $0.88 for the 70B model, 17x cheaper than GPT-4o.

💡 **Together Lite endpoints**: These endpoints combine several optimizations to deliver the most cost-effective and scalable Llama 3 models available, with excellent quality relative to full-precision implementations. The Llama 3 8B Lite model is priced at $0.10 per million tokens.

⚡️ **Together Reference endpoints**: These endpoints provide the fastest full-precision FP16 serving of Meta Llama 3 models, up to 4x faster than vLLM.

🌟 **Technical advances**: The Together Inference Engine incorporates numerous technical advances, including proprietary kernels such as FlashAttention-3, custom speculators built on RedPajama, and what Together describes as the most accurate quantization techniques on the market. These innovations deliver leading performance without sacrificing quality.

💰 **Cost efficiency**: Together Turbo endpoints cost over 10x less than GPT-4o and substantially cut the cost of hosting dedicated endpoints on Together Cloud. Together Lite endpoints, in turn, are 12x cheaper than vLLM, making them the most economical option for large-scale production deployments.

📈 **Continued innovation**: The Together Inference Engine keeps absorbing cutting-edge work from the AI community and from Together AI's own research. Recent additions such as FlashAttention-3 and speculative decoding algorithms like Medusa and Sequoia reflect this ongoing optimization effort (a toy sketch of the speculative decoding idea follows this list). Quality-preserving quantization ensures that model performance and accuracy hold up even at reduced precision. Together, these advances give enterprises the flexibility to scale applications against modern requirements for performance, quality, and cost-efficiency.
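
Speculative decoding itself is public research rather than anything proprietary, so the draft-and-verify loop the highlights allude to can be illustrated with a toy sketch. This is the generic idea behind methods like Medusa and Sequoia, not those algorithms and not Together's RedPajama-based speculators: a cheap draft model proposes several tokens, the large target model verifies them in one pass, and the longest agreeing prefix is kept. The `draft_model` and `target_model` below are deterministic stand-ins invented for the example.

```python
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def target_model(context):
    """Stand-in for the large model: returns a next-token distribution."""
    h = hash(tuple(context)) % len(VOCAB)
    probs = [0.1] * len(VOCAB)
    probs[h] = 1.0 - 0.1 * (len(VOCAB) - 1)
    return probs

def draft_model(context):
    """Stand-in for the cheap draft model: a noisier copy of the target."""
    probs = target_model(context)
    probs = [max(p + random.uniform(-0.05, 0.05), 1e-6) for p in probs]
    s = sum(probs)
    return [p / s for p in probs]

def greedy(probs):
    return probs.index(max(probs))

def speculative_decode(prompt, n_tokens=12, k=4):
    """Greedy speculative decoding: draft k tokens, then verify with the target.

    Accepted tokens cost one target pass per k drafted tokens instead of
    one pass per token, which is where the throughput win comes from.
    """
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft k tokens autoregressively with the cheap model.
        drafted, ctx = [], list(out)
        for _ in range(k):
            t = greedy(draft_model(ctx))
            drafted.append(t)
            ctx.append(t)
        # 2. Verify: the target scores the drafted positions (in one batched
        #    pass in a real engine) and we keep the agreeing prefix.
        accepted, ctx = 0, list(out)
        for t in drafted:
            if greedy(target_model(ctx)) != t:
                break
            accepted += 1
            ctx.append(t)
        out.extend(drafted[:accepted])
        # 3. The verification pass also yields one corrected/bonus token.
        out.append(greedy(target_model(out)))
    return out

tokens = speculative_decode([0, 1], n_tokens=10)
print(" ".join(VOCAB[t] for t in tokens))
```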

Together AI has unveiled a groundbreaking advancement in AI inference with its new inference stack. This stack, which boasts a decoding throughput four times faster than the open-source vLLM, surpasses leading commercial solutions like Amazon Bedrock, Azure AI, Fireworks, and Octo AI by 1.3x to 2.5x. The Together Inference Engine, capable of processing over 400 tokens per second on Meta Llama 3 8B, integrates the latest innovations from Together AI, including FlashAttention-3, faster GEMM and MHA kernels, and quality-preserving quantization, as well as speculative decoding techniques.
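
These endpoints sit behind Together's OpenAI-compatible API, so trying the engine takes only a few lines. The sketch below is not from the announcement: it uses the public `together` Python SDK, and the model identifier is illustrative, since the Turbo, Lite, and Reference variants each have their own names in Together's model catalog.

```python
# pip install together
import os

from together import Together

# Assumes TOGETHER_API_KEY is set in the environment.
client = Together(api_key=os.environ["TOGETHER_API_KEY"])

# Model name is illustrative -- check Together's catalog for current IDs.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct-Turbo",
    messages=[{"role": "user",
               "content": "In one sentence, what is speculative decoding?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```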

Additionally, Together AI has introduced the Together Turbo and Together Lite endpoints, starting with Meta Llama 3 and expanding to other models shortly. These endpoints offer enterprises a balance of performance, quality, and cost-efficiency. Together Turbo provides performance that closely matches full-precision FP16 models, making it the fastest engine for Nvidia GPUs and the most accurate, cost-effective solution for building generative AI at production scale. Together Lite endpoints leverage INT4 quantization for the most cost-efficient and scalable Llama 3 models available, priced at just $0.10 per million tokens, which is six times lower than GPT-4o-mini.
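
Together's quality-preserving quantization recipe is proprietary, but the basic mechanics of INT4 weight quantization behind the Lite endpoints are standard: weights are mapped to 4-bit integers with a per-channel scale and dequantized on the fly at inference time. The NumPy sketch below shows plain round-to-nearest symmetric quantization, a far simpler baseline than what a production engine uses.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)  # toy weight matrix

# Symmetric per-output-channel INT4: values in [-8, 7], one scale per row.
scales = np.abs(W).max(axis=1, keepdims=True) / 7.0
W_q = np.clip(np.round(W / scales), -8, 7).astype(np.int8)  # packed to 4 bits in practice

# Dequantize on the fly when the layer runs.
W_dq = W_q.astype(np.float32) * scales

rel_err = np.linalg.norm(W - W_dq) / np.linalg.norm(W)
print(f"relative weight error: {rel_err:.4f}")  # typically a few percent
```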

The new release includes several key components, summarized in the highlights above: the Together Turbo, Together Lite, and Together Reference endpoints for Meta Llama 3, plus the engine-level advances (FlashAttention-3, custom speculators, and quality-preserving quantization) that power them.

