NVIDIA Developer · February 16
Accelerate Custom Video Foundation Model Pipelines with New NVIDIA NeMo Framework Capabilities

 

The NVIDIA NeMo framework introduces new video foundation model capabilities to help transform industries. The framework provides an end-to-end solution spanning data processing, model training, and inference, including efficient data curation, multimodal data loading, scalable model training, and parallelized inference. NeMo Curator improves generative AI model accuracy through optimized data pipelines, processing large-scale video datasets 89x faster. The framework supports both autoregressive and diffusion models, with parallelism optimizations for video diffusion models such as efficient pipeline parallelism and the spatio-temporal DiT architecture, enabling high-performance training and inference.

🚀 NeMo Curator optimizes data pipelines with NVDEC, NVENC, and Ray to deliver high-throughput video data curation, processing video 89x faster than unoptimized CPU pipelines and dramatically shortening video data processing time.

💽 The NeMo framework uses the Megatron-Energon data loader, which shards large-scale data, supports deterministic save and load, and packs sequences to achieve efficient multimodal data loading, reducing I/O overhead, ensuring training consistency, and minimizing wasted compute.

💡 The NeMo framework supports multiple model parallelism techniques, including tensor, sequence, pipeline, and context parallelism. To address the challenges specific to video diffusion models, it adds efficient pipeline parallelism for conditioning, support for the spatio-temporal DiT architecture, and a customized random seeding mechanism, enabling scalable, high-performance training.

🤖 NVIDIA Cosmos is a world foundation model platform that accelerates the development of physical AI applications such as humanoid robots and autonomous vehicles.

Generative AI has evolved from text-based models to multimodal models, with a recent expansion into video, opening up new potential uses across various industries. Video models can create new experiences for users or simulate scenarios for training autonomous agents at scale. They are helping revolutionize various industries, including robotics, autonomous vehicles, and entertainment. The development of video foundation models presents unique challenges due to the vast and varied nature of video data, which underscores the need for scalable pipelines to curate data and to effectively train models that can comprehend temporal and spatial dynamics.

We are announcing brand-new video foundation model capabilities in the NVIDIA NeMo framework, an end-to-end training framework that enables you to pretrain and fine-tune your own video foundation models. The framework includes high-throughput data curation, efficient multimodal data loading, scalable model training, and parallelized in-framework inference.

Video 1. NVIDIA Cosmos is a world foundation model platform that accelerates the development of physical AI applications like humanoid robots and autonomous vehicles.

High-throughput video curation through optimized pipelines

NeMo Curator improves generative AI model accuracy by efficiently processing and preparing high-quality data, including large video datasets. Using NeMo Curator's scalable data pipelines, you can efficiently clip, annotate, and filter 100 PB or more of video. To remove bottlenecks and optimize performance, NeMo Curator combines:

- NVDEC: hardware decoder
- NVENC: hardware encoder
- Ray: compute framework for scaling AI applications

NeMo Curator's auto-balancing techniques can leverage heterogeneous clusters with multiple GPU types, taking advantage of NVENC on L40S GPUs and the performance of H100 and GB200 GPUs. Figure 1 shows how NeMo Curator can process 20M hours of video data, reducing processing time from years to days and achieving an 89x speedup on 1K GPUs compared to unoptimized CPU pipelines at the same (ISO) power usage.

Figure 1. NeMo Curator delivers 89x faster video data processing

NeMo Curator provides the following pipelines for building video foundation model training and fine-tuning datasets:

- Clipping: decodes raw videos and splits them into short, continuous clips by analyzing frame-to-frame color changes. A stitching stage then smooths the results, using image embedding similarities to merge adjacent clips where appropriate. The clips are transcoded to a high-quality video encoding (H.264) and annotated with video embeddings and captions, either existing or synthetically generated by a VLM, to enable semantic search.
- Sharding: generates text embeddings for the captions to create the final WebDataset used for training.

Figure 2. Video curation clipping and sharding pipelines

NeMo Curator also uses Ray streaming to build an auto-balancing system that deploys an optimal number of workers for each stage in the pipeline, so that no single stage becomes a bottleneck (Figure 3).

Figure 3. Auto-balancing system to match the throughput of the overall pipeline
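To make the stage-balancing idea concrete, here is a minimal Ray sketch, not NeMo Curator's implementation, in which each stage gets its own pool of workers sized so that the slowest stage does not throttle the rest. The stage functions, worker counts, and batch-at-a-time execution are illustrative assumptions.

```python
# Minimal sketch of stage-wise worker pools with Ray (illustrative only).
import ray
from ray.util import ActorPool

ray.init(ignore_reinit_error=True)

@ray.remote
class StageWorker:
    """Generic worker that applies one pipeline-stage function to an item."""
    def __init__(self, stage_fn):
        self.stage_fn = stage_fn

    def process(self, item):
        return self.stage_fn(item)

# Toy stage functions standing in for decode/clip, caption, and embed stages.
def decode_and_clip(path):
    return {"source": path, "clip": f"{path}#clip0"}

def caption_clip(record):
    record["caption"] = "placeholder caption"
    return record

def embed_clip(record):
    record["embedding"] = [0.0] * 8
    return record

# Hypothetical worker counts: the (assumed) slow captioning stage gets the
# most workers so it does not bottleneck the overall pipeline.
stage_plan = [(decode_and_clip, 2), (caption_clip, 4), (embed_clip, 2)]

def run_pipeline(items):
    for stage_fn, num_workers in stage_plan:
        pool = ActorPool([StageWorker.remote(stage_fn) for _ in range(num_workers)])
        # For simplicity this pushes the whole batch through one stage at a
        # time; NeMo Curator streams items so all stages run concurrently.
        items = list(pool.map(lambda actor, item: actor.process.remote(item), items))
    return items

if __name__ == "__main__":
    print(run_pipeline([f"video_{i:03d}.mp4" for i in range(8)])[0])
```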
Efficient multimodal dataloading

Video models can be trained on billions of images and millions of videos, necessitating an efficient data loading strategy to achieve high throughput during training. The NeMo framework accomplishes this with the Megatron-Energon data loader:

- Shard large-scale data: uses the WebDataset format to shard a TB-scale dataset into compressed files, reducing I/O overhead during training.
- Deterministic save and load: enables the dataset to be visited in one pass without repetition when a training job is disrupted, ensuring consistency across different training cluster setups.
- Sequence packing: packs variable-length or variable-resolution images and videos together up to the maximum sequence length, minimizing compute wasted on padding while simplifying data loading logic. NeMo uses the THD attention kernel from NVIDIA Transformer Engine to support accelerated training with sequence packing.

Figure 4. Mixed image-video training with sequence packing

- Reduce network bandwidth strain: each model-parallel rank downloads a different subset of the data instead of the whole dataset, then all-gathers the data across ranks so that every rank ends up with an identical dataloader.

Figure 5. Reducing network bandwidth strain to improve training throughput

Scaling video foundation model training

Video foundation models can be either autoregressive or diffusion models. The well-established suite of NeMo tools for large language models (LLMs) can be reused for autoregressive models, while support for diffusion transformers such as DiT, MovieGen, and the latest NVIDIA Cosmos world foundation models for physical AI has been newly added.

The NeMo tech stack is highly optimized and delivers more than 40% Model FLOPs Utilization (MFU) in the latest benchmark (Table 1).

| Model size | Context length | Training config | GPU used (TFLOPS/s) | Throughput (token/s/GPU) |
| DiT 7B | 8k | Baseline, no optimization | OOM | — |
| DiT 7B | 8k | CP=2 | 457 | 8,969 |
| DiT 7B | 74k | TP=4 SP CP=4 | 414 | 2,933 |
| DiT 28B | 8k | TP=2 SP PP=2 | 435 | 2,392 |
| DiT 28B | 74k | TP=8 SP CP=4 PP=4 | 411 | 994 |

Table 1. GPU utilization and throughput benchmark for the NVIDIA NeMo framework on diffusion transformers (DiT)
Legend: CP=context parallelism; TP=tensor parallelism; SP=sequence parallelism; PP=pipeline parallelism

Overview of the video diffusion pipeline

A video diffusion training pipeline is generally composed of the following major steps:

1. Tokenize the input image and video with a causal temporal 3D tokenizer to generate 3D spatio-temporal tokens.
2. Use a transformer decoder conditioned on the diffusion noise schedule timestep t and the text input. Timestep conditioning is applied through an Adaptive Layer Normalization (AdaLN) mechanism, with an option to use AdaLN-LoRA, which further improves Model FLOPs Utilization (MFU) during training. Text conditioning is applied through a cross-attention layer in each transformer block. The NeMo framework enables you to initialize your transformer decoder based on the canonical DiT architecture or the MovieGen Llama architecture, which uses Grouped-Query Attention (GQA).
3. Compute the diffusion loss with the parallelized EDM diffusion pipeline, using the noise prediction from the diffusion transformer.

NeMo also applies Root Mean Square Layer Normalization (RMSNorm) to the queries and keys before the attention blocks to stabilize diffusion training. RMSNorm is applied per attention head to remain compatible with tensor parallelism.

Figure 6. NeMo video diffusion training pipeline
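The following is a minimal PyTorch sketch of the AdaLN-style timestep conditioning described in step 2, not the NeMo implementation. The hidden size, the two-layer modulation MLP, and the single feed-forward sub-layer standing in for the full attention plus MLP block are illustrative assumptions.

```python
# Minimal sketch of AdaLN timestep conditioning in a DiT-like block.
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # LayerNorm without its own affine parameters; modulation comes from t.
        self.norm = nn.LayerNorm(hidden_size, elementwise_affine=False)
        # Project the timestep embedding to per-block shift, scale, and gate.
        self.adaln = nn.Sequential(
            nn.SiLU(),
            nn.Linear(hidden_size, 3 * hidden_size),
        )
        # Stand-in for the attention + feed-forward sub-layers of a real block.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden) spatio-temporal tokens; t_emb: (batch, hidden)
        shift, scale, gate = self.adaln(t_emb).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * self.mlp(h)

# Usage with assumed shapes: 1,024 tokens, hidden size 256.
block = AdaLNBlock(hidden_size=256)
tokens = torch.randn(2, 1024, 256)
t_emb = torch.randn(2, 256)
print(block(tokens, t_emb).shape)  # torch.Size([2, 1024, 256])
```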
Parallelism optimizations for video diffusion models

NeMo and Megatron-Core enable various model parallelism techniques:

- Tensor parallelism (TP)
- Sequence parallelism (SP)
- Pipeline parallelism (PP)
- Context parallelism (CP)

However, these techniques face unique challenges when applied to video diffusion transformers. Here is how NeMo solves these challenges to achieve scalable and performant training:

- Efficient pipeline parallelism for conditioning
- Support for the Spatio-Temporal DiT (ST-DiT) architecture
- A customized random seeding mechanism

The traditional approach is to communicate conditioning information across pipeline stages, which incurs additional communication cost and requires nontrivial modifications to the pipeline schedule. NeMo avoids this by computing the conditional embeddings at each pipeline stage. The cost of recomputing the conditioning is much lower than the cost of communicating it, which improves training throughput.

Figure 7. Trading communication for compute in conditioning pipeline parallelism

The Spatio-Temporal DiT (ST-DiT) architecture adds spatial and temporal self-attention layers to each transformer block as an alternative to training with full self-attention on long video sequences. Because these layers compute over short input sequences, they expose communication overhead under context parallelism. NeMo addresses this by using local attention computation with all-to-all (A2A) communication for spatial and temporal attention, while maintaining a point-to-point (P2P) ring topology for full self-attention. This hybrid approach reduces bandwidth requirements for temporal and spatial attention while still benefiting from context parallelism for the full self-attention layers (Table 2).

Figure 8. Spatial-temporal DiT transformer block

| Layer | Input sequence | Communication primitive | Communication bandwidth |
| Temporal self-attention | Short seq | Local compute & A2A | (bhw/cp, t, d) |
| Spatial self-attention | Short seq | Local compute & A2A | (bt/cp, hw, d) |
| Full attention | Long seq | CP with P2P | (b, hwt/cp, d) |

Table 2. NeMo communication strategies for each kind of layer
Legend: b=batch size; hw=spatial size; t=temporal size; cp=context parallel size; d=hidden size, with input size (b, t*h*w, d)

The goal of the customized random seeding mechanism is to ensure that random seeds are correctly initialized across the following components:

- Timestep
- Gaussian noise
- The actual model weights

Table 3 shows NeMo's initialization strategy.

| RNG seed | Data parallel | Context parallel | Pipeline parallel | Tensor parallel |
| Timestep (t) | Diff | Same | Same | Same |
| Gaussian noise | Diff | Diff | Same | Same |
| Weight initialization | Same | Same | Diff | Diff |

Table 3. Customized random seeding for parallelized diffusion transformers
Legend: Diff=different random seed from other parallel ranks; Same=same random seed as other parallel ranks

Efficient in-framework inference

The NeMo framework accelerates inference by distributing denoising operations across multiple GPUs through context parallelism. After parallel denoising, the latent tensors are combined to reconstruct the video sequence before decoding with the Cosmos video tokenizer. Benchmarks show 80–90% scaling efficiency on up to 32 H100 GPUs, with FP8 multi-head attention providing 28% and 48% performance improvements over BF16 on 1 and 32 GPUs, respectively.

Figure 9. Parallelized video generation with context parallelism

Figure 10. Inference performance at different GPU counts
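The sketch below illustrates the general pattern of context-parallel denoising with torch.distributed, not the NeMo implementation: each rank keeps one shard of the latent sequence, runs the denoising loop locally, and the shards are all-gathered before decoding. The toy denoise_step, tensor shapes, and step count are assumptions, and the cross-rank attention exchange a real model would need is omitted.

```python
# Minimal sketch of context-parallel denoising (illustrative only).
# Launch with: torchrun --nproc_per_node=<num_gpus> cp_inference.py
import torch
import torch.distributed as dist

def denoise_step(latent_shard: torch.Tensor, timestep: int) -> torch.Tensor:
    # Stand-in for one diffusion-model denoising step on a local shard.
    return latent_shard * 0.99

def main():
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    device = torch.device(f"cuda:{rank}")
    torch.cuda.set_device(device)

    # Full latent video: (batch, seq, hidden); seq = frames * height * width tokens.
    batch, seq, hidden = 1, 4096, 512
    assert seq % world == 0
    # Every rank starts from the same Gaussian noise (same seed), then keeps
    # only its contiguous shard of the sequence dimension.
    torch.manual_seed(0)
    latent = torch.randn(batch, seq, hidden, device=device)
    shard = latent.chunk(world, dim=1)[rank].contiguous()

    for t in reversed(range(50)):  # toy 50-step denoising schedule
        shard = denoise_step(shard, t)
        # A real model would exchange keys/values across ranks here
        # (ring P2P or all-to-all); omitted in this sketch.

    # Gather the denoised shards back into the full latent before decoding.
    gathered = [torch.empty_like(shard) for _ in range(world)]
    dist.all_gather(gathered, shard)
    full_latent = torch.cat(gathered, dim=1)
    if rank == 0:
        print("denoised latent:", tuple(full_latent.shape))
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```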
Conclusion

In this post, we covered the features of the NVIDIA NeMo framework that help you pretrain or fine-tune video foundation models effectively and efficiently. NeMo Curator offers high-throughput data curation through clipping and sharding pipelines, and the Megatron-Energon library offers efficient multimodal data loading. The NeMo framework enables scalable video foundation model training by supporting various model parallelism techniques optimized for diffusion and autoregressive models. In addition, it provides efficient in-framework inference by distributing denoising operations across multiple GPUs and incorporating FP8 multi-head attention.

You can curate your video data with the NeMo Curator early access program, then tokenize it, pretrain (diffusion, autoregressive), fine-tune (diffusion, autoregressive), and perform multi-GPU in-framework inference (diffusion, autoregressive) with the NeMo framework today. You can also try the NVIDIA Cosmos world foundation models at build.nvidia.com and watch the CES keynote from NVIDIA CEO Jensen Huang to learn more about the NVIDIA Cosmos world foundation model platform.

Acknowledgements

Thanks to the following contributors: Parth Mannan, Xiaowei Ren, Zhuoyao Wang, Carl Wang, Jack Chang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Linnan Wang, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, Pallab Bhattacharya, David Page, Jacob Huffman, Tommy Huang, Nima Tajbakhsh, and Ashwath Aithal.
