NVIDIA Developer · February 16
Accelerate Custom Video Foundation Model Pipelines with New NVIDIA NeMo Framework Capabilities

 

The NVIDIA NeMo framework introduces new video foundation model capabilities to help transform industries. The framework provides an end-to-end solution spanning data processing, model training, and inference, including efficient data curation, multimodal data loading, scalable model training, and parallelized inference. NeMo Curator improves generative AI model accuracy through optimized data pipelines, processing large-scale video datasets 89x faster. The framework supports both autoregressive and diffusion models, with parallelism optimizations for video diffusion models such as efficient pipeline parallelism and the spatio-temporal DiT architecture, enabling high-performance training and inference.

🚀 NeMo Curator optimizes data pipelines with NVDEC, NVENC, and Ray to deliver high-throughput video data curation, processing video 89x faster than unoptimized CPU pipelines and dramatically shortening video data processing time.

💽 The NeMo framework uses the Megatron-Energon data loader, which shards large-scale data, supports deterministic save and load, and packs sequences to achieve efficient multimodal data loading, reducing I/O overhead, ensuring training consistency, and minimizing wasted compute.

💡 The NeMo framework supports multiple model parallelism techniques, including tensor, sequence, pipeline, and context parallelism. To address the challenges specific to video diffusion models, it adds efficient pipeline parallelism for conditioning, support for the spatio-temporal DiT architecture, and a customized random seeding mechanism, enabling scalable, high-performance training.

🤖 NVIDIA Cosmos is a world foundation model platform that accelerates the development of physical AI applications such as humanoid robots and autonomous vehicles.

Generative AI has evolved from text-based models to multimodal models, with a recent expansion into video, opening up new potential uses across various industries. Video models can create new experiences for users or simulate scenarios for training autonomous agents at scale. They are helping revolutionize various industries, including robotics, autonomous vehicles, and entertainment. The development of video foundation models presents unique challenges due to the vast and varied nature of video data, which underscores the need for scalable pipelines to curate data and to effectively train models that can comprehend temporal and spatial dynamics.

We are announcing brand-new video foundation model capabilities in the NVIDIA NeMo framework, an end-to-end training framework that enables you to pretrain and fine-tune your own video foundation models. The framework includes high-throughput data curation, efficient multimodal data loading, scalable model training, and parallelized in-framework inference.

Video 1. NVIDIA Cosmos is a world foundation model platform that accelerates the development of physical AI applications like humanoid robots and autonomous vehicles.

High-throughput video curation through optimized pipelines

NeMo Curator improves generative AI model accuracy by efficiently processing and preparing high-quality data, including large video datasets. Using NeMo Curator's scalable data pipelines, you can efficiently clip, annotate, and filter 100 PB or more of video. To remove bottlenecks and optimize performance, NeMo Curator combines:

- NVDEC: hardware decoder
- NVENC: hardware encoder
- Ray: compute framework for scaling AI applications

NeMo Curator's auto-balancing techniques can leverage heterogeneous clusters with multiple GPU types, taking advantage of NVENC on L40S GPUs and the performance of H100 and GB200 GPUs. Figure 1 shows how NeMo Curator can process 20M hours of video data, reducing processing time from years to days and achieving an 89x speedup on 1K GPUs compared to unoptimized CPU pipelines at the same (ISO) power usage.

Figure 1. NeMo Curator delivers 89x faster video data processing

NeMo Curator provides the following pipelines for building video foundation model training and fine-tuning datasets:

- Clipping: decodes raw videos and splits them into short, continuous clips by analyzing frame-to-frame color changes. A stitching stage then smooths the results, using image embedding similarities to merge adjacent clips where appropriate. The clips are transcoded to a high-quality video encoding (H.264) and annotated with video embeddings and captions, either existing or synthetically generated by a VLM, to enable semantic search.
- Sharding: generates text embeddings for the captions to create the final WebDataset used for training.

Figure 2. Video curation clipping and sharding pipelines

NeMo Curator also uses Ray streaming to build an auto-balancing system that deploys an optimal number of workers for each stage in the pipeline, so that no single stage becomes a bottleneck (Figure 3).

Figure 3. Auto-balancing system to match the throughput of the overall pipeline
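To make the stage-balancing idea concrete, here is a minimal Ray sketch, not NeMo Curator's implementation, in which each stage gets its own pool of workers sized so that the slowest stage does not throttle the rest. The stage functions, worker counts, and batch-at-a-time execution are illustrative assumptions.

```python
# Minimal sketch of stage-wise worker pools with Ray (illustrative only).
import ray
from ray.util import ActorPool

ray.init(ignore_reinit_error=True)

@ray.remote
class StageWorker:
    """Generic worker that applies one pipeline-stage function to an item."""
    def __init__(self, stage_fn):
        self.stage_fn = stage_fn

    def process(self, item):
        return self.stage_fn(item)

# Toy stage functions standing in for decode/clip, caption, and embed stages.
def decode_and_clip(path):
    return {"source": path, "clip": f"{path}#clip0"}

def caption_clip(record):
    record["caption"] = "placeholder caption"
    return record

def embed_clip(record):
    record["embedding"] = [0.0] * 8
    return record

# Hypothetical worker counts: the (assumed) slow captioning stage gets the
# most workers so it does not bottleneck the overall pipeline.
stage_plan = [(decode_and_clip, 2), (caption_clip, 4), (embed_clip, 2)]

def run_pipeline(items):
    for stage_fn, num_workers in stage_plan:
        pool = ActorPool([StageWorker.remote(stage_fn) for _ in range(num_workers)])
        # For simplicity this pushes the whole batch through one stage at a
        # time; NeMo Curator streams items so all stages run concurrently.
        items = list(pool.map(lambda actor, item: actor.process.remote(item), items))
    return items

if __name__ == "__main__":
    print(run_pipeline([f"video_{i:03d}.mp4" for i in range(8)])[0])
```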
Efficient multimodal dataloading

Video models can be trained on billions of images and millions of videos, necessitating an efficient data loading strategy to achieve high throughput during training. The NeMo framework accomplishes this with the Megatron-Energon data loader:

- Shard large-scale data: uses the WebDataset format to shard a TB-scale dataset into compressed files, reducing I/O overhead during training.
- Deterministic save and load: enables the dataset to be visited in one pass without repetition when a training job is disrupted, ensuring consistency across different training cluster setups.
- Sequence packing: packs variable-length or variable-resolution images and videos together up to the maximum sequence length, minimizing compute wasted on padding while simplifying data loading logic. NeMo uses the THD attention kernel from NVIDIA Transformer Engine to support accelerated training with sequence packing.

Figure 4. Mixed image-video training with sequence packing

- Reduce network bandwidth strain: each model-parallel rank downloads a different subset of the data instead of the whole dataset, then all-gathers the data across ranks so that every rank ends up with an identical dataloader.

Figure 5. Reducing network bandwidth strain to improve training throughput

Scaling video foundation model training

Video foundation models can be either autoregressive or diffusion models. The well-established suite of NeMo tools for large language models (LLMs) can be reused for autoregressive models, while support for diffusion transformers such as DiT, MovieGen, and the latest NVIDIA Cosmos world foundation models for physical AI has been newly added.

The NeMo tech stack is highly optimized and delivers more than 40% Model FLOPs Utilization (MFU) in the latest benchmark (Table 1).

| Model size | Context length | Training config | GPU used (TFLOPS/s) | Throughput (token/s/GPU) |
| DiT 7B | 8k | Baseline, no optimization | OOM | — |
| DiT 7B | 8k | CP=2 | 457 | 8,969 |
| DiT 7B | 74k | TP=4 SP CP=4 | 414 | 2,933 |
| DiT 28B | 8k | TP=2 SP PP=2 | 435 | 2,392 |
| DiT 28B | 74k | TP=8 SP CP=4 PP=4 | 411 | 994 |

Table 1. GPU utilization and throughput benchmark for the NVIDIA NeMo framework on diffusion transformers (DiT)
Legend: CP=context parallelism; TP=tensor parallelism; SP=sequence parallelism; PP=pipeline parallelism

Overview of the video diffusion pipeline

A video diffusion training pipeline is generally composed of the following major steps:

1. Tokenize the input image and video with a causal temporal 3D tokenizer to generate 3D spatio-temporal tokens.
2. Use a transformer decoder conditioned on the diffusion noise schedule timestep t and the text input. Timestep conditioning is applied through an Adaptive Layer Normalization (AdaLN) mechanism, with an option to use AdaLN-LoRA, which further improves Model FLOPs Utilization (MFU) during training. Text conditioning is applied through a cross-attention layer in each transformer block. The NeMo framework enables you to initialize your transformer decoder based on the canonical DiT architecture or the MovieGen Llama architecture, which uses Grouped-Query Attention (GQA).
3. Compute the diffusion loss with the parallelized EDM diffusion pipeline, using the noise prediction from the diffusion transformer.

NeMo also applies Root Mean Square Layer Normalization (RMSNorm) to the queries and keys before the attention blocks to stabilize diffusion training. RMSNorm is applied per attention head to remain compatible with tensor parallelism.

Figure 6. NeMo video diffusion training pipeline
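The following is a minimal PyTorch sketch of the AdaLN-style timestep conditioning described in step 2, not the NeMo implementation. The hidden size, the two-layer modulation MLP, and the single feed-forward sub-layer standing in for the full attention plus MLP block are illustrative assumptions.

```python
# Minimal sketch of AdaLN timestep conditioning in a DiT-like block.
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # LayerNorm without its own affine parameters; modulation comes from t.
        self.norm = nn.LayerNorm(hidden_size, elementwise_affine=False)
        # Project the timestep embedding to per-block shift, scale, and gate.
        self.adaln = nn.Sequential(
            nn.SiLU(),
            nn.Linear(hidden_size, 3 * hidden_size),
        )
        # Stand-in for the attention + feed-forward sub-layers of a real block.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden) spatio-temporal tokens; t_emb: (batch, hidden)
        shift, scale, gate = self.adaln(t_emb).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * self.mlp(h)

# Usage with assumed shapes: 1,024 tokens, hidden size 256.
block = AdaLNBlock(hidden_size=256)
tokens = torch.randn(2, 1024, 256)
t_emb = torch.randn(2, 256)
print(block(tokens, t_emb).shape)  # torch.Size([2, 1024, 256])
```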
Parallelism optimizations for video diffusion models

NeMo and Megatron-Core enable various model parallelism techniques:

- Tensor parallelism (TP)
- Sequence parallelism (SP)
- Pipeline parallelism (PP)
- Context parallelism (CP)

However, these techniques face unique challenges when applied to video diffusion transformers. Here is how NeMo solves these challenges to achieve scalable and performant training:

- Efficient pipeline parallelism for conditioning
- Support for the Spatio-Temporal DiT (ST-DiT) architecture
- A customized random seeding mechanism

The traditional approach is to communicate conditioning information across pipeline stages, which incurs additional communication cost and requires nontrivial modifications to the pipeline schedule. NeMo avoids this by computing the conditional embeddings at each pipeline stage. The cost of recomputing the conditioning is much lower than the cost of communicating it, which improves training throughput.

Figure 7. Trading communication for compute in conditioning pipeline parallelism

The Spatio-Temporal DiT (ST-DiT) architecture adds spatial and temporal self-attention layers to each transformer block as an alternative to training with full self-attention on long video sequences. Because these layers compute over short input sequences, they expose communication overhead under context parallelism. NeMo addresses this by using local attention computation with all-to-all (A2A) communication for spatial and temporal attention, while maintaining a point-to-point (P2P) ring topology for full self-attention. This hybrid approach reduces bandwidth requirements for temporal and spatial attention while still benefiting from context parallelism for the full self-attention layers (Table 2).

Figure 8. Spatial-temporal DiT transformer block

| Layer | Input sequence | Communication primitive | Communication bandwidth |
| Temporal self-attention | Short seq | Local compute & A2A | (bhw/cp, t, d) |
| Spatial self-attention | Short seq | Local compute & A2A | (bt/cp, hw, d) |
| Full attention | Long seq | CP with P2P | (b, hwt/cp, d) |

Table 2. NeMo communication strategies for each kind of layer
Legend: b=batch size; hw=spatial size; t=temporal size; cp=context parallel size; d=hidden size, with input size (b, t*h*w, d)

The goal of the customized random seeding mechanism is to ensure that random seeds are correctly initialized across the following components:

- Timestep
- Gaussian noise
- The actual model weights

Table 3 shows NeMo's initialization strategy.

| RNG seed | Data parallel | Context parallel | Pipeline parallel | Tensor parallel |
| Timestep (t) | Diff | Same | Same | Same |
| Gaussian noise | Diff | Diff | Same | Same |
| Weight initialization | Same | Same | Diff | Diff |

Table 3. Customized random seeding for parallelized diffusion transformers
Legend: Diff=different random seed from other parallel ranks; Same=same random seed as other parallel ranks

Efficient in-framework inference

The NeMo framework accelerates inference by distributing denoising operations across multiple GPUs through context parallelism. After parallel denoising, the latent tensors are combined to reconstruct the video sequence before decoding with the Cosmos video tokenizer. Benchmarks show 80–90% scaling efficiency on up to 32 H100 GPUs, with FP8 multi-head attention providing 28% and 48% performance improvements over BF16 on 1 and 32 GPUs, respectively.

Figure 9. Parallelized video generation with context parallelism

Figure 10. Inference performance at different GPU counts
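The sketch below illustrates the general pattern of context-parallel denoising with torch.distributed, not the NeMo implementation: each rank keeps one shard of the latent sequence, runs the denoising loop locally, and the shards are all-gathered before decoding. The toy denoise_step, tensor shapes, and step count are assumptions, and the cross-rank attention exchange a real model would need is omitted.

```python
# Minimal sketch of context-parallel denoising (illustrative only).
# Launch with: torchrun --nproc_per_node=<num_gpus> cp_inference.py
import torch
import torch.distributed as dist

def denoise_step(latent_shard: torch.Tensor, timestep: int) -> torch.Tensor:
    # Stand-in for one diffusion-model denoising step on a local shard.
    return latent_shard * 0.99

def main():
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    device = torch.device(f"cuda:{rank}")
    torch.cuda.set_device(device)

    # Full latent video: (batch, seq, hidden); seq = frames * height * width tokens.
    batch, seq, hidden = 1, 4096, 512
    assert seq % world == 0
    # Every rank starts from the same Gaussian noise (same seed), then keeps
    # only its contiguous shard of the sequence dimension.
    torch.manual_seed(0)
    latent = torch.randn(batch, seq, hidden, device=device)
    shard = latent.chunk(world, dim=1)[rank].contiguous()

    for t in reversed(range(50)):  # toy 50-step denoising schedule
        shard = denoise_step(shard, t)
        # A real model would exchange keys/values across ranks here
        # (ring P2P or all-to-all); omitted in this sketch.

    # Gather the denoised shards back into the full latent before decoding.
    gathered = [torch.empty_like(shard) for _ in range(world)]
    dist.all_gather(gathered, shard)
    full_latent = torch.cat(gathered, dim=1)
    if rank == 0:
        print("denoised latent:", tuple(full_latent.shape))
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```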
Conclusion

In this post, we covered the features of the NVIDIA NeMo framework that help you pretrain or fine-tune video foundation models effectively and efficiently. NeMo Curator offers high-throughput data curation through clipping and sharding pipelines, and the Megatron-Energon library offers efficient multimodal data loading. The NeMo framework enables scalable video foundation model training by supporting various model parallelism techniques optimized for diffusion and autoregressive models. In addition, it provides efficient in-framework inference by distributing denoising operations across multiple GPUs and incorporating FP8 multi-head attention.

You can curate your video data with the NeMo Curator early access program, then tokenize it, pretrain (diffusion, autoregressive), fine-tune (diffusion, autoregressive), and perform multi-GPU in-framework inference (diffusion, autoregressive) with the NeMo framework today. You can also try the NVIDIA Cosmos world foundation models at build.nvidia.com and watch the CES keynote from NVIDIA CEO Jensen Huang to learn more about the NVIDIA Cosmos world foundation model platform.

Acknowledgements

Thanks to the following contributors: Parth Mannan, Xiaowei Ren, Zhuoyao Wang, Carl Wang, Jack Chang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Linnan Wang, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, Pallab Bhattacharya, David Page, Jacob Huffman, Tommy Huang, Nima Tajbakhsh, and Ashwath Aithal.
