TechCrunch News, December 5, 2024
AWS makes its SageMaker HyperPod AI platform more efficient for training LLMs

At this year's re:Invent conference, AWS announced a series of updates to its SageMaker HyperPod platform aimed at helping enterprises train and fine-tune foundation models more efficiently. The updates include flexible training plans, which let users set a timeline and budget and automatically allocate GPU resources on demand; HyperPod Recipes, which provide optimized setups for common model architectures such as Llama and Mistral; and a resource-pooling feature that lets enterprises centrally manage and allocate GPU resources, improving utilization and lowering costs. Together, these updates address the capacity, cost, and resource-allocation challenges enterprises face when training large language models, helping them use resources more effectively and reduce the cost of AI projects.

🤔 **Flexible training plans:** AWS introduced flexible training plans that let users set a timeline and budget for model training; SageMaker HyperPod then automatically finds the best combination of capacity and creates a plan, helping users avoid overprovisioning servers and reduce costs.

💡 **HyperPod Recipes:** For common model architectures such as Llama and Mistral, AWS offers HyperPod Recipes, preset configurations that encapsulate best practices and optimize checkpoint frequency, helping users fine-tune models more effectively.

🔄 **Resource pooling:** AWS lets enterprises create GPU resource pools and allocate resources automatically based on project priority, improving utilization. For example, a company can dedicate resources to inference during the day and to training at night to maximize utilization.

📊 **Significant cost savings:** Through flexible training plans, HyperPod Recipes, and resource pooling, enterprises can substantially cut the cost of AI projects; per AWS, the reduction can reach up to 40%.

At last year’s AWS re:Invent conference, Amazon’s cloud computing unit launched SageMaker HyperPod, a platform for building foundation models. It’s no surprise then that at this year’s re:Invent, the company is announcing a number of updates to the platform, with a focus on making model training and fine-tuning on HyperPod more efficient and cost-effective for enterprises.

HyperPod is now in use by companies like Salesforce, Thomson Reuters, and BMW, as well as AI startups like Luma, Perplexity, Stability AI, and Hugging Face. It's the needs of these customers that AWS is now addressing with today's updates, Ankur Mehrotra, the GM in charge of HyperPod at AWS, told me.

One of the challenges these companies face is that there often simply isn’t enough capacity for running their LLM training workloads.

“Oftentimes, because of high demand, capacity can be expensive as well as it can be hard to find capacity when you need it, how much you need, and exactly where you need it,” Mehrotra said. “Then, what may happen is you may find capacity in specific blocks, which may be split across time and also location. Customers may need to start at one place and then move their workload to another place and all that — and then also set up and reset their infrastructure to do that again and again.”

To make this easier, AWS is launching what it calls 'flexible training plans.' With this, HyperPod users can set a timeline and budget. Say they want to complete the training of a model within the next two months and expect to need 30 full days of training with a specific GPU type to achieve that. SageMaker HyperPod can then go out, find the best combination of capacity blocks, and create a plan to make this happen. SageMaker handles the infrastructure provisioning and runs the jobs (and pauses them when capacity is not available).

Image Credits: AWS

Ideally, Mehrotra noted, this can help these businesses avoid overspending by overprovisioning servers for their training jobs.
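The plan-building step Mehrotra describes, assembling capacity blocks that are split across time into a schedule that covers the required training days before a deadline, can be pictured as a small scheduling problem. The sketch below is illustrative only: the `CapacityBlock` type, `plan_training` function, and greedy cheapest-first strategy are assumptions for the example, not AWS's actual algorithm or API.

```python
from dataclasses import dataclass

@dataclass
class CapacityBlock:
    start_day: int        # day the block becomes available
    days: int             # length of the block
    price_per_day: float  # cost of using the block for one day

def plan_training(blocks, gpu_days_needed, deadline_day):
    """Greedily pick the cheapest capacity blocks that finish before the
    deadline and together cover the required training days.

    Returns (picked_blocks, total_cost), or None if no feasible plan exists.
    """
    usable = [b for b in blocks if b.start_day + b.days <= deadline_day]
    usable.sort(key=lambda b: b.price_per_day)  # cheapest capacity first
    picked, covered, cost = [], 0, 0.0
    for b in usable:
        if covered >= gpu_days_needed:
            break
        take = min(b.days, gpu_days_needed - covered)
        picked.append((b, take))
        covered += take
        cost += take * b.price_per_day
    if covered < gpu_days_needed:
        return None  # capacity within the deadline is insufficient
    return picked, cost
```

The real service additionally handles provisioning, moving workloads between locations, and pausing jobs between blocks; the sketch only captures the "find the best combination of capacity blocks" idea.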

Many times, though, these businesses aren’t training models from scratch. Instead, they are fine-tuning models using their own data on top of open weight models and model architectures like Meta’s Llama. For them, the SageMaker team is launching HyperPod Recipes. These are benchmarked and optimized recipes for common architectures like Llama and Mistral that encapsulate the best practices for using these models.

Mehrotra stressed that these recipes also figure out the right checkpoint frequency for a given workload to ensure that the progress of the training job is saved regularly.

Image Credits: AWS

As the number of teams working with generative AI in a company grows, different teams will likely provision their own capacity, which in return means that some of those GPUs will sit idle and eat into a company’s overall AI budget. To combat this, AWS is now allowing enterprises to essentially pool those resources and create a central command center for allocating GPU capacity based on a project’s priority. The system can then allocate resources automatically as needed (or determined by the internal pecking order, which may not always be the same thing).

Another capability this enables is for companies to use most of their allocation for running inference during the day to serve their customers and then allocate more of those resources to training during the night, when there is less demand for inferencing.
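The day/night pattern amounts to priority-based allocation from a shared pool, with the priorities of inference and training flipped on a schedule. A minimal sketch of such an allocator (the `allocate` function and request format are hypothetical, not an AWS API):

```python
def allocate(pool_size, requests):
    """Grant GPUs from a shared pool in priority order.

    requests: list of (name, priority, gpus_wanted); lower priority
    number means higher priority. Returns {name: gpus_granted}.
    """
    grants = {}
    free = pool_size
    for name, priority, want in sorted(requests, key=lambda r: r[1]):
        grant = min(want, free)  # higher-priority requests drain the pool first
        grants[name] = grant
        free -= grant
    return grants
```

During the day a company would rank inference first; at night it would rank training first, so the same 100-GPU pool serves customers by day and training jobs by night instead of sitting partly idle.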

As it turns out, AWS first built this capability for Amazon itself, and the company saw the utilization of its cluster climb to over 90% as a result of this new tool.

"Organizations really want to innovate, and they have so many ideas. Generative AI is such a new technology. There are so many new ideas. And so they do run into these resource and budget constraints. So it's about doing the work more efficiently, and we can really help customers reduce costs — and this overall helps reduce costs by, we've looked at it, up to 40% for organizations."
