AWS Machine Learning Blog · July 23, 00:47
Beyond accelerators: Lessons from building foundation models on AWS with Japan’s GENIAC program

The Generative AI Accelerator Challenge (GENIAC), launched by Japan's Ministry of Economy, Trade and Industry (METI), aims to advance generative AI by providing funding, mentorship, and compute resources. As the cloud provider for the program's second cycle (cycle 2), AWS supplied infrastructure and technical support to 12 participating organizations. This article shares AWS's key lessons from supporting large-scale foundation model (FM) training, emphasizing the importance of building reliable systems and overcoming distributed training challenges. Through cross-functional team collaboration, efficient communication channels, and reference architectures with deployment guides, AWS helped customers clear technical hurdles and successfully train multiple large models, yielding practical experience for enterprise and national-scale AI initiatives.

💡 **Cross-functional collaboration is the cornerstone of AI projects at scale**: AWS established a virtual cross-functional team for GENIAC that brought together customer teams, AWS account teams (Solutions Architects and Account Managers), and the WWSO Frameworks team, which specializes in large-scale machine learning. This multi-layered structure, with Lead Solutions Architects (Lead SAs) serving as the customer-facing interface, allowed technical guidance to scale effectively and gave customers timely support and issue resolution.

🚀 **Efficient communication and thorough documentation accelerate problem-solving**: Internal and external Slack channels connected customer issues to technical experts in real time, shortening resolution times. Comprehensive workload tracking documents recorded each customer's training details and infrastructure configuration, and regular review meetings promoted knowledge sharing and reuse, continuously refining the engagement model and providing insights for future programs.

🏗️ **Reference architectures and automated deployment lower the technical barrier**: AWS provided pre-validated reference architecture templates and automation tooling, including AWS ParallelCluster (user-managed HPC clusters) and SageMaker HyperPod (a managed cluster service), covering the full stack from compute and networking to storage. GitHub-hosted repositories and CloudFormation templates greatly simplified deploying and configuring training environments, letting customers focus on model development rather than infrastructure management.

📚 **Structured enablement and hands-on practice build user capability**: Through mass enablement sessions and dedicated workshops, GENIAC taught participants best practices for large-scale FM training on AWS, including infrastructure setup, distributed training frameworks, performance monitoring, and troubleshooting. The structured curriculum and hands-on labs gave customers and AWS engineers a shared knowledge base and technical toolkit, laying a solid foundation for independent deployment and problem-solving.

In 2024, the Ministry of Economy, Trade and Industry (METI) launched the Generative AI Accelerator Challenge (GENIAC), a Japanese national program to boost generative AI by providing companies with funding, mentorship, and massive compute resources for foundation model (FM) development. AWS was selected as the cloud provider for GENIAC's second cycle (cycle 2), providing infrastructure and technical guidance for 12 participating organizations. On paper, the challenge seemed straightforward: give each team access to hundreds of GPUs or Trainium chips and let innovation ensue. In practice, successful FM training required far more than raw hardware.

AWS discovered that allocating over 1,000 accelerators was merely the starting point; the real challenge lay in architecting a reliable system and overcoming distributed training obstacles. During GENIAC cycle 2, 12 customers successfully deployed 127 Amazon EC2 P5 instances (NVIDIA H100 Tensor Core GPU servers) and 24 Amazon EC2 Trn1 instances (AWS Trainium1 servers) in a single day. Over the following 6 months, multiple large-scale models were trained, including Stockmark-2-100B-Instruct-beta, Llama 3.1 Shisa V2 405B, and Llama-3.1-Future-Code-Ja-8B.

This post shares the key insights from this engagement and valuable lessons for enterprises or national initiatives aiming to build FMs at scale.

Cross-functional engagement teams

A crucial early lesson from the GENIAC technical engagement was that running a multi-organization, national-scale machine learning (ML) initiative requires coordinated support across diverse internal teams. AWS established a virtual team that brought together account teams, specialist Solutions Architects, and service teams. The GENIAC engagement model thrives on close collaboration between customers and a multi-layered AWS team structure, as illustrated in the following figure.

Customers (Cx) typically consist of business and technical leads, including ML and platform engineers, and are responsible for executing training workloads. AWS account teams (Solutions Architects and Account Managers) manage the relationship, maintain documentation, and keep communication flowing with customers and internal specialists. The World Wide Specialist Organization (WWSO) Frameworks team specializes in large-scale ML workloads, with a focus on core HPC and container services such as AWS ParallelCluster, Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker HyperPod. The WWSO Frameworks team established this engagement structure and supervised technical engagements in the program, leading each engagement in partnership with the other stakeholders and serving as their escalation point. They work directly with the service teams (Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), Amazon FSx, and SageMaker HyperPod) to navigate engagements and escalations, both business and technical, and to keep the engagement framework in working order. They also provide guidance on training and inference to customers and educate other teams on the technology.

The WWSO Frameworks team worked closely with Lead Solutions Architects (Lead SAs), a role specifically designated to support GENIAC engagements. These Lead SAs served as a cornerstone of the engagement. As an extension of the Frameworks specialist team, they worked directly with customers and the account teams, engaging their Frameworks specialist counterparts whenever in-depth technical discussions or troubleshooting required further expertise. With this layered structure, AWS can scale technical guidance effectively across complex FM training workloads.

Another critical success factor for GENIAC was establishing robust communication channels between customers and AWS team members. The foundation of our communication strategy was a dedicated internal Slack channel for GENIAC program coordination, connecting AWS account teams with Lead SAs. This channel enabled real-time troubleshooting, knowledge sharing, and rapid escalation of customer issues to the appropriate technical specialists and service team members. Complementing this was an external Slack channel that bridged AWS teams with customers, creating a collaborative environment where participants could ask questions, share insights, and receive immediate support. This direct line of communication significantly reduced resolution times and fostered a community of practice among participants.

AWS maintained comprehensive workload tracking documents, which clarified each customer's training implementation details (model architecture, distributed training frameworks, and related software components) alongside infrastructure specifications (instance types and quantities, cluster configurations for AWS ParallelCluster or SageMaker HyperPod deployments, and storage solutions including Amazon FSx for Lustre and Amazon S3). The tracking system also maintained a chronological history of customer interactions and support cases. In addition, the engagement team held weekly review meetings to track outstanding customer inquiries and technical issues. This regular cadence made it possible for team members to share lessons learned and apply them to their own customer engagements, fostering continuous improvement and knowledge transfer across the program.

With a structured approach to communication and documentation, we could identify common challenges, such as a misconfigured NCCL library impacting multi-node performance, share solutions across teams, and continuously refine our engagement model. The detailed tracking system provided valuable insights for future GENIAC cycles, helping us anticipate customer needs and proactively address potential bottlenecks in the FM development process.

Reference architectures

Another early takeaway was the importance of solid reference architectures. Rather than let each team configure their own cluster from scratch, AWS created pre-validated templates and automation for two main approaches: AWS ParallelCluster (for a user-managed HPC cluster) and SageMaker HyperPod (for a managed, resilient cluster service). These reference architectures covered the full stack—from compute, network, and storage to container environments and monitoring—and were delivered as a GitHub repository so teams could deploy them with minimal friction.

AWS ParallelCluster proved invaluable as an open source cluster management tool for multi-node GPU training. It automates the setup of a Slurm-based HPC cluster on AWS, using a simple YAML configuration to stand up the environment. For the GENIAC program, AWS also offered SageMaker HyperPod as another option for some teams. SageMaker HyperPod is a managed service that provisions GPU and Trainium clusters for large-scale ML. HyperPod integrates with orchestrators like Slurm or Kubernetes (Amazon EKS) for scheduling, providing additional managed functionality around cluster resiliency. By including reference architectures for both AWS ParallelCluster and SageMaker HyperPod, the GENIAC program gave participants flexibility: some opted for the fine-grained control of managing their own HPC cluster, whereas others preferred the convenience and resilience of a managed SageMaker HyperPod cluster.
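To make this concrete, the following is a minimal sketch of a ParallelCluster 3.x configuration for a Slurm-based P5 cluster of the kind used in the program. It is illustrative rather than the GENIAC reference template: the Region, subnet IDs, key pair name, instance counts, and storage sizing are placeholders to replace with your own values.

```yaml
# Illustrative ParallelCluster 3.x config (not the official GENIAC template).
# Subnet IDs, key pair name, and capacity values are placeholders.
Region: ap-northeast-1
Image:
  Os: ubuntu2204
HeadNode:
  InstanceType: m5.8xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0    # placeholder public subnet
  Ssh:
    KeyName: my-keypair                   # placeholder EC2 key pair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu
      ComputeResources:
        - Name: p5
          InstanceType: p5.48xlarge       # 8x NVIDIA H100 per instance
          MinCount: 0
          MaxCount: 16
          Efa:
            Enabled: true                 # EFA for high-bandwidth inter-node communication
      Networking:
        SubnetIds:
          - subnet-0fedcba9876543210      # placeholder private subnet
        PlacementGroup:
          Enabled: true                   # co-locate nodes for NCCL performance
SharedStorage:
  - MountDir: /fsx
    Name: fsxlustre
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 4800               # GiB; size to your dataset
      DeploymentType: PERSISTENT_2
      PerUnitStorageThroughput: 250       # MB/s per TiB of storage
```

With a configuration like this saved as `cluster.yaml`, a single `pcluster create-cluster --cluster-name my-cluster --cluster-configuration cluster.yaml` command stands up the head node, Slurm scheduler, compute fleet, and FSx for Lustre mount.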

The reference architecture (shown in the following diagram) seamlessly combines compute, networking, storage, and monitoring into an integrated system specifically designed for large-scale FM training.

The base infrastructure is available as an AWS CloudFormation template that provisions the complete stack with minimal effort. This template automatically configures a dedicated virtual private cloud (VPC) with optimized networking settings and implements a high-performance FSx for Lustre file system for training data (complemented by optional Amazon FSx for OpenZFS support for shared home directories). The architecture employs a hierarchical storage approach that balances performance and cost-effectiveness: an S3 bucket provides durable, long-term storage for datasets and model checkpoints, maintaining data availability well beyond individual training cycles, and this bucket is linked to the Lustre file system through a data repository association (DRA). The DRA enables automatic and transparent data transfer between Amazon S3 and FSx for Lustre, allowing high-performance access without manual copying. You can use the following CloudFormation template to create the S3 bucket used in this architecture.
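The published template itself is not reproduced in this post, but the storage pattern it implements can be sketched in a few CloudFormation resources. In this illustrative fragment, the resource names, capacity, and paths are assumptions rather than the program's published values; it wires an S3 bucket to an FSx for Lustre file system through a DRA:

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Illustrative S3 + FSx for Lustre storage layer with a DRA (placeholder values)
Parameters:
  PrivateSubnetId:
    Type: AWS::EC2::Subnet::Id
  FsxSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
Resources:
  TrainingBucket:                    # durable store for datasets and checkpoints
    Type: AWS::S3::Bucket
  LustreFileSystem:                  # high-performance tier for training jobs
    Type: AWS::FSx::FileSystem
    Properties:
      FileSystemType: LUSTRE
      StorageCapacity: 4800          # GiB
      SubnetIds:
        - !Ref PrivateSubnetId
      SecurityGroupIds:
        - !Ref FsxSecurityGroupId
      LustreConfiguration:
        DeploymentType: PERSISTENT_2
        PerUnitStorageThroughput: 250
  TrainingDataDra:                   # transparently maps the bucket to /data on Lustre
    Type: AWS::FSx::DataRepositoryAssociation
    Properties:
      FileSystemId: !Ref LustreFileSystem
      FileSystemPath: /data
      DataRepositoryPath: !Sub s3://${TrainingBucket}
      BatchImportMetaDataOnCreate: true
      S3:
        AutoImportPolicy:
          Events: [NEW, CHANGED, DELETED]
        AutoExportPolicy:
          Events: [NEW, CHANGED, DELETED]
```

The auto-import and auto-export policies are what make the association transparent: objects written to the bucket appear under /fsx/data, and checkpoints written to the file system flow back to Amazon S3 without manual copying.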

The optional monitoring infrastructure combines Amazon Managed Service for Prometheus and Amazon Managed Grafana (or a self-managed Grafana service running on Amazon EC2) to provide comprehensive observability. It integrates the DCGM Exporter for GPU metrics and the EFA Exporter for network metrics, enabling real-time monitoring of system health and performance. This setup allows continuous tracking of GPU health, network performance, and training progress, with automated alerting for anomalies through Grafana dashboards. For example, the GPU Health Dashboard (see the following screenshot) provides metrics for common GPU errors, including Uncorrectable Remapped Rows, Correctable Remapped Rows, XID Error Codes, Row Remap Failure, Thermal Violations, and Missing GPUs (from nvidia-smi), helping users identify hardware failures as quickly as possible.
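As a rough sketch of how those metrics are gathered (hostnames and ports below are assumptions based on common exporter defaults, not the program's exact configuration), a Prometheus scrape configuration covering the two exporters might look like this:

```yaml
# prometheus.yml scrape config (illustrative; targets and ports are placeholders)
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: dcgm-exporter          # GPU health: XID errors, remapped rows, thermals
    static_configs:
      - targets:
          - compute-node-001:9400    # DCGM Exporter default port
          - compute-node-002:9400
  - job_name: efa-exporter           # EFA network counters via a node exporter
    static_configs:
      - targets:
          - compute-node-001:9100    # node exporter default port
          - compute-node-002:9100
```

Grafana then queries the resulting series (for example, the DCGM Exporter's DCGM_FI_DEV_XID_ERRORS counter for XID events) to drive the dashboards and alerts described above.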

Reproducible deployment guides and structured enablement sessions

Even the best reference architectures are only useful if teams know how to use them. A critical element of GENIAC's success was reproducible deployment guides and structured enablement through workshops. On October 3, 2024, AWS Japan and the WWSO Frameworks team conducted a mass enablement session for GENIAC Cycle 2 participants, inviting Frameworks team members from the United States to share best practices for FM training on AWS.

The enablement session welcomed over 80 participants and provided a comprehensive mix of lectures, hands-on labs, and group discussions—earning a CSAT score of 4.75, reflecting its strong impact and relevance to attendees. The lecture sessions covered infrastructure fundamentals, exploring orchestration options such as AWS ParallelCluster, Amazon EKS, and SageMaker HyperPod, along with the software components necessary to build and train large-scale FMs using AWS. The sessions highlighted practical challenges in FM development—including massive compute requirements, scalable networking, and high-throughput storage—and mapped them to appropriate AWS services and best practices. (For more information, see the slide deck from the lecture session.) Another session focused on best practices, where attendees learned to set up performance dashboards with Prometheus and Grafana, monitor EFA traffic, and troubleshoot GPU failures using NVIDIA’s DCGM toolkit and custom Grafana dashboards based on the Frameworks team’s experience managing a cluster with 2,000 P5 instances.

Additionally, the WWSO team prepared workshops for both AWS ParallelCluster (Machine Learning on AWS ParallelCluster) and SageMaker HyperPod (Amazon SageMaker HyperPod Workshop), providing detailed deployment guides for the aforementioned reference architecture. Using these materials, participants conducted hands-on exercises deploying their training clusters using Slurm with file systems including FSx for Lustre and FSx for OpenZFS, running multi-node PyTorch distributed training. Another segment of the workshop focused on observability and performance tuning, teaching participants how to monitor resource utilization, network throughput (EFA traffic), and system health. By the end of these enablement sessions, customers and supporting AWS engineers had established a shared baseline of knowledge and a toolkit of best practices. Using the assets and knowledge gained during the workshops, customers participated in onboarding sessions—structured, hands-on meetings with their Lead SAs. These sessions differed from the earlier workshops by focusing on customer-specific cluster deployments tailored to each team’s unique use case. During each session, Lead SAs worked directly with teams to deploy training environments, validate setup using NCCL tests, and resolve technical issues in real time.

Customer feedback

“To fundamentally solve data entry challenges, we significantly improved processing accuracy and cost-efficiency by applying two-stage reasoning and autonomous learning with SLM and LLM for regular items, and visual learning with VLM using 100,000 synthetic data samples for detailed items. We also utilized Amazon EC2 P5 instances to enhance research and development efficiency. These ambitious initiatives were made possible thanks to the support of many people, including AWS. We are deeply grateful for their extensive support.”

– Takuma Inoue, Executive Officer, CTO at AI Inside

“Future chose AWS to develop large-scale language models specialized for Japanese and software development at GENIAC. When training large-scale models using multiple nodes, Future had concerns about environment settings such as inter-node communication, but AWS had a wide range of tools, such as AWS ParallelCluster, and we received strong support from AWS Solutions Architects, which enabled us to start large-scale training quickly.”

– Makoto Morishita, Chief Research Engineer at Future

Results and looking ahead

GENIAC has demonstrated that training FMs at scale is fundamentally an organizational challenge, not merely a hardware one. Through structured support, reproducible templates, and a cross-functional engagement team (the WWSO Frameworks team, Lead SAs, and account teams), even small teams can successfully execute massive workloads in the cloud. Thanks to this structure, 12 customers launched over 127 P5 instances and 24 Trn1 instances across multiple AWS Regions, including Asia Pacific (Tokyo), in a single day. Multiple large language models (LLMs) and custom models were trained successfully, including a 32B multimodal model on Trainium and a 405B tourism-focused multilingual model.

The technical engagement framework established through GENIAC Cycle 2 has provided crucial insights into large-scale FM development. Building on this experience, AWS is advancing improvements across multiple dimensions: engagement models, technical assets, and implementation guidance. We are strengthening cross-functional collaboration and systematizing knowledge sharing to establish a more efficient support structure. Reference architectures and automated training templates continue to be enhanced, and practical technical workshops and best practices are being codified based on lessons learned.

AWS has already begun preparations for the next cycle of GENIAC. As part of the onboarding process, AWS hosted a comprehensive technical event in Tokyo on April 3, 2025, to equip FM builders with hands-on experience and architectural guidance. The event, attended by over 50 participants, showcased the commitment AWS has to supporting scalable, resilient generative AI infrastructure.

The event highlighted the technical engagement model of AWS for GENIAC, alongside other support mechanisms, including the LLM Development Support Program and Generative AI Accelerator. The day featured an intensive workshop on SageMaker HyperPod and Slurm, where participants gained hands-on experience with multi-node GPU clusters, distributed PyTorch training, and observability tools. Sessions covered essential topics, including containerized ML, distributed training strategies, and AWS purpose-built silicon solutions. Classmethod Inc. shared practical SageMaker HyperPod insights, and AWS engineers demonstrated architectural patterns for large-scale GPU workloads. The event showcased AWS’s end-to-end generative AI support landscape, from infrastructure to deployment tools, setting the stage for GENIAC Cycle 3. As AWS continues to expand its support for FM development, the success of GENIAC serves as a blueprint for enabling organizations to build and scale their AI capabilities effectively.

Through these initiatives, AWS will continue to provide robust technical support, facilitating the smooth execution of large-scale FM training. We remain committed to contributing to the advancement of generative AI development all over the world through our technical expertise.

This post was contributed by AWS GENIAC Cycle 2 core members Masato Kobayashi, Kenta Ichiyanagi, and Satoshi Shirasawa, Accelerated Computing Specialist Mai Kiuchi, as well as Lead SAs Daisuke Miyamoto, Yoshitaka Haribara, Kei Sasaki, Soh Ohara, and Hiroshi Tokoyo with Executive Sponsorship from Toshi Yasuda. Hiroshi Hata and Tatsuya Urabe also provided support as core member and Lead SA during their time at AWS.

The authors extend their gratitude to WWSO Frameworks members Maxime Hugues, Matthew Nightingale, Aman Shanbhag, Alex Iankoulski, Anoop Saha, Yashesh Shroff, Natarajan Chennimalai Kumar, Shubha Kumbadakone, and Sundar Ranganathan for their technical contributions. Pierre-Yves Aquilanti provided in-depth support during his time at AWS.


About the authors

Keita Watanabe is a Senior Specialist Solutions Architect on the AWS WWSO Frameworks team. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. He leads GENIAC technical engagements.

Masaru Isaka is a Principal Business Developer on the AWS WWSO Frameworks team, specializing in machine learning and generative AI solutions. Having engaged with GENIAC since its inception, he leads go-to-market strategies for AWS's generative AI offerings.
