Better Generative AI Video by Shuffling Frames During Training

 

This article introduces FluxFlow, a new data preprocessing scheme aimed at the temporal anomalies that commonly afflict AI video generation, in which generated videos abruptly speed up, conflate, omit, or scramble crucial moments. By shuffling the order of temporal frames during training, FluxFlow improves a model's ability to generalize, reducing flickering, jarring transitions between frames, and repetitive or overly simplistic motion patterns in generated videos. Experiments show that FluxFlow delivers clear gains across multiple video generation architectures, offering a practical improvement for AI video generation.

🎬 AI video generators frequently suffer from temporal anomalies, such as sped-up footage and skipped frames, which degrade video quality.

🔄 FluxFlow is a data preprocessing technique that addresses these problems by shuffling the temporal order of frames in the training data.

💡 The core idea of FluxFlow is to introduce temporal perturbations so that the model learns genuinely varied dynamics, rather than over-relying on the simplistic temporal patterns of the training data.

🔬 The researchers tested FluxFlow on several video generation architectures and found that it improves temporal quality while preserving spatial fidelity.

✅ FluxFlow requires no changes to existing video generation model architectures and works as a plug-and-play enhancement.

A new paper out this week on Arxiv addresses an issue that anyone who has adopted the Hunyuan Video or Wan 2.1 AI video generators will have come across by now: temporal aberrations, where the generative process tends to abruptly speed up, conflate, omit, or otherwise mess up crucial moments in a generated video:

Click to play. Some of the temporal glitches that are becoming familiar to users of the new wave of generative video systems, highlighted in the new paper. To the right, the ameliorating effect of the new FluxFlow approach.  Source: https://haroldchen19.github.io/FluxFlow/

The video above features excerpts from example test videos at the (be warned: rather chaotic) project site for the paper. We can see several increasingly familiar issues being remediated by the authors' method (pictured on the right in the video), which is effectively a dataset preprocessing technique applicable to any generative video architecture.

In the first example, featuring ‘two children playing with a ball', generated by CogVideoX, we see (on the left in the compilation video above, and in the specific example below) that the native generation rapidly jumps through several essential micro-movements, speeding the children's activity up to a ‘cartoon' pitch. By contrast, the same dataset and generative framework yield better results with the new preprocessing technique, dubbed FluxFlow (on the right in the video below):

Click to play.

In the second example (using NOVA-0.6B) we see that a central motion involving a cat has in some way been corrupted or significantly under-sampled at the training stage, to the point that the generative system becomes ‘paralyzed' and is unable to make the subject move:

Click to play.

This syndrome, where the motion or subject gets ‘stuck', is one of the most frequently-reported bugbears of HV and Wan, in the various image and video synthesis groups.

Some of these problems are related to video captioning issues in the source dataset, which we took a look at this week; but the authors of the new work focus their efforts on the temporal qualities of the training data instead, and make a convincing argument that addressing the challenges from that perspective can yield useful results.

As mentioned in the earlier article about video captioning, certain sports are particularly difficult to distil into key moments, meaning that critical events (such as a slam-dunk) do not get the attention they need at training time:

Click to play.

In the above example, the generative system does not know how to get to the next stage of movement, and transits illogically from one pose to the next, changing the attitude and geometry of the player in the process.

These are large movements that got lost in training – but equally vulnerable are far smaller but pivotal movements, such as the flapping of a butterfly's wings:

Click to play.  

Unlike the slam-dunk, the flapping of the wings is not a ‘rare' event, but a persistent and monotonous one. Even so, its consistency is lost in the sampling process, since the movement is so rapid that it is very difficult to establish temporally.

These are not particularly new issues, but they are receiving greater attention now that powerful generative video models are available to enthusiasts for local installation and free generation.

The communities at Reddit and Discord initially treated these issues as ‘user-related'. This is an understandable presumption, since the systems in question are very new and minimally documented. Various pundits have therefore suggested diverse (and not always effective) remedies for some of the glitches documented here, such as altering the settings in various components of diverse ComfyUI workflows for Hunyuan Video (HV) and Wan 2.1.

In some cases, rather than producing rapid motion, both HV and Wan will produce slow motion. Suggestions from Reddit and ChatGPT (which mostly leverages Reddit) include changing the number of frames in the requested generation, or radically lowering the frame rate*.

This is all desperate stuff; the emerging truth is that we don't yet know the exact cause of, or the exact remedy for, these issues. Clearly, tormenting the generation settings to work around them (particularly when this degrades output quality, for instance with a too-low frame rate) is only a stopgap, and it's good to see that the research scene is addressing emerging issues this quickly.

So, besides this week's look at how captioning affects training, let's take a look at the new paper about temporal regularization, and what improvements it might offer the current generative video scene.

The central idea is rather simple and slight, and none the worse for that; nonetheless the paper is somewhat padded in order to reach the prescribed eight pages, and we will skip over this padding as necessary.

The fish in the native generation of the VideoCrafter framework is static, while the FluxFlow-altered version captures the requisite changes. Source: https://arxiv.org/pdf/2503.15417

The new work is titled Temporal Regularization Makes Your Video Generator Stronger, and comes from eight researchers across Everlyn AI, Hong Kong University of Science and Technology (HKUST), the University of Central Florida (UCF), and The University of Hong Kong (HKU).

(At the time of writing, there are some issues with the paper's accompanying project site.)

FluxFlow

The central idea behind FluxFlow, the authors' new pre-training schema, is to overcome the widespread problems of flickering and temporal inconsistency by shuffling blocks and groups of blocks in the temporal frame order as the source data is exposed to the training process:

The central idea behind FluxFlow is to move blocks and groups of blocks into unexpected and non-temporal positions, as a form of data augmentation.
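To make the mechanism concrete, here is a minimal sketch of what a frame-level temporal perturbation of this kind might look like in PyTorch. The function name, the (T, C, H, W) tensor layout and the num_frames parameter are illustrative assumptions, not the authors' reference implementation:

import torch

def perturb_frames(video: torch.Tensor, num_frames: int = 4) -> torch.Tensor:
    """Randomly pick `num_frames` positions along the time axis of a
    (T, C, H, W) clip and shuffle the frames at those positions."""
    t = video.shape[0]
    k = min(num_frames, t)
    idx = torch.randperm(t)[:k]          # which frames to disturb
    shuffled = idx[torch.randperm(k)]    # a random reordering of those positions
    out = video.clone()
    out[idx] = video[shuffled]           # frames change order; pixel content is untouched
    return out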

The paper explains:

‘[Artifacts] stem from a fundamental limitation: despite leveraging large-scale datasets, current models often rely on simplified temporal patterns in the training data (e.g., fixed walking directions or repetitive frame transitions) rather than learning diverse and plausible temporal dynamics.

‘This issue is further exacerbated by the lack of explicit temporal augmentation during training, leaving models prone to overfitting to spurious temporal correlations (e.g., “frame #5 must follow #4”) rather than generalizing across diverse motion scenarios.'

Most video generation models, the authors explain, still borrow too heavily from image synthesis, focusing on spatial fidelity while largely ignoring the temporal axis. Though techniques such as cropping, flipping, and color jittering have helped improve static image quality, they are not adequate solutions when applied to videos, where the illusion of motion depends on consistent transitions across frames.

The resulting problems include flickering textures, jarring cuts between frames, and repetitive or overly simplistic motion patterns.

Click to play.

The paper argues that though some models – including Stable Video Diffusion and LlamaGen – compensate with increasingly complex architectures or engineered constraints, these come at a cost in terms of compute and flexibility.

Since temporal data augmentation has already proven useful in video understanding tasks (in frameworks such as FineCliper, SeFAR and SVFormer) it is surprising, the authors assert, that this tactic is rarely applied in a generative context.

Disruptive Behavior

The researchers contend that simple, structured disruptions in temporal order during training help models generalize better to realistic, diverse motion:

‘By training on disordered sequences, the generator learns to recover plausible trajectories, effectively regularizing temporal entropy. FLUXFLOW bridges the gap between discriminative and generative temporal augmentation, offering a plug-and-play enhancement solution for temporally plausible video generation while improving overall [quality].

‘Unlike existing methods that introduce architectural changes or rely on post-processing, FLUXFLOW operates directly at the data level, introducing controlled temporal perturbations during training.'

Click to play.
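Because the method operates purely at the data level, it can in principle be dropped in front of any existing training pipeline. The sketch below illustrates one way such a wrapper might look; the class name, the probability value and the assumption of (video, caption) samples are hypothetical, not taken from the paper:

import random
from typing import Callable

from torch.utils.data import Dataset

class TemporalPerturbWrapper(Dataset):
    """Wraps any (video, caption) dataset and applies a temporal perturbation
    to the clip with probability `p` before it reaches the training loop."""
    def __init__(self, base_dataset: Dataset, perturb: Callable, p: float = 0.5):
        self.base = base_dataset
        self.perturb = perturb            # e.g. the frame-shuffling sketch above
        self.p = p

    def __len__(self):
        return len(self.base)

    def __getitem__(self, i):
        clip, caption = self.base[i]      # assumes (video, text) samples
        if random.random() < self.p:
            clip = self.perturb(clip)     # data-level change only
        return clip, caption              # the model and loss stay untouched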

Frame-level perturbations, the authors state, introduce fine-grained disruptions within a sequence. This kind of disruption is not dissimilar to masking augmentation, where sections of data are randomly blocked out in order to prevent the system from overfitting on particular data points, and to encourage better generalization.
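For comparison, a masking-style analogue would blank out a few frames rather than reorder them. The snippet below is purely illustrative of that analogy and is not part of FluxFlow itself, which reorders frames rather than masking them:

import torch

def mask_frames(video: torch.Tensor, num_frames: int = 4) -> torch.Tensor:
    """Zero out `num_frames` randomly chosen frames of a (T, C, H, W) clip."""
    t = video.shape[0]
    idx = torch.randperm(t)[:min(num_frames, t)]
    out = video.clone()
    out[idx] = 0.0                        # the 'blocked out' sections
    return out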

Tests

Though the central idea here is too simple to fill a full-length paper on its own, there is nonetheless a test section that we can take a look at.

The authors tested four questions: whether FluxFlow improves temporal quality while maintaining spatial fidelity; whether it helps models learn motion/optical flow dynamics; whether temporal quality holds up in longer-term generation; and how sensitive the method is to key hyperparameters.

The researchers applied FluxFlow to three generative architectures: U-Net-based, in the form of VideoCrafter2; DiT-based, in the form of CogVideoX-2B; and AR-based, in the form of NOVA-0.6B.

For fair comparison, they fine-tuned the architectures' base models with FluxFlow as an additional training phase, for one epoch, on the OpenVidHD-0.4M dataset.

The models were evaluated against two popular benchmarks: UCF-101 and VBench.

For UCF, the Fréchet Video Distance (FVD) and Inception Score (IS) metrics were used. For VBench, the researchers concentrated on temporal quality, frame-wise quality, and overall quality.
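For readers unfamiliar with FVD: it is the Fréchet distance between Gaussian fits of features extracted from real and generated clips by a pretrained video network (typically I3D). Below is a hedged sketch of the final computation, assuming the feature-extraction step has already been done and that the arrays have shape (num_clips, feature_dim):

import numpy as np
from scipy.linalg import sqrtm

def frechet_video_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to the two feature sets."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))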

Initial quantitative evaluation of FluxFlow-Frame. “+ Original” indicates training without FluxFlow, while “+ Num × 1” indicates different FluxFlow-Frame configurations. Best results are shaded; second-best are underlined for each model.

Commenting on these results, the authors state:

‘Both FLUXFLOW-FRAME and FLUXFLOW-BLOCK significantly improve temporal quality, as evidenced by the metrics in Tabs. 1, 2 (i.e., FVD, Subject, Flicker, Motion, and Dynamic) and qualitative results in [image below].

‘For instance, the motion of the drifting car in VC2, the cat chasing its tail in NOVA, and the surfer riding a wave in CVX become noticeably more fluid with FLUXFLOW. Importantly, these temporal improvements are achieved without sacrificing spatial fidelity, as evidenced by the sharp details of water splashes, smoke trails, and wave textures, along with spatial and overall fidelity metrics.'

Below we see selections from the qualitative results the authors refer to (please see the original paper for full results and better resolution):

Selections from the qualitative results.

The paper suggests that while both frame-level and block-level perturbations enhance temporal quality, frame-level methods tend to perform better. This is attributed to their finer granularity, which enables more precise temporal adjustments. Block-level perturbations, by contrast, may introduce noise due to tightly coupled spatial and temporal patterns within blocks, reducing their effectiveness.
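To make the contrast concrete, a block-level perturbation might look something like the sketch below, which cuts the clip into contiguous chunks and reorders the chunks while leaving each one internally intact; again, the names and the block size are illustrative assumptions rather than the paper's own code:

import torch

def perturb_blocks(video: torch.Tensor, block_size: int = 4) -> torch.Tensor:
    """Split the time axis of a (T, C, H, W) clip into blocks of `block_size`
    frames and shuffle the order of the blocks."""
    blocks = list(torch.split(video, block_size, dim=0))
    order = torch.randperm(len(blocks))
    return torch.cat([blocks[int(i)] for i in order], dim=0)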

Conclusion

This paper, along with the Bytedance-Tsinghua captioning collaboration released this week, has made it clear to me that the apparent shortcomings in the new generation of generative video models may not result from user error, institutional missteps, or funding limitations, but rather from a research focus that has understandably prioritized more urgent challenges, such as temporal coherence and consistency, over these lesser concerns.

Until recently, the results from freely-available and downloadable generative video systems were so compromised that no great locus of effort emerged from the enthusiast community to redress the issues (not least because the issues were fundamental and not trivially solvable).

Now that we are so much closer to the long-predicted age of purely AI-generated photorealistic video output, it's clear that both the research and casual communities are taking a deeper and more productive interest in resolving remaining issues; with any luck, these are not intractable obstacles.

 

* Wan's native frame rate is a paltry 16fps, and in response to my own issues, I note that forums have suggested lowering the frame rate as low as 12fps, and then using FlowFrames or other AI-based re-flowing systems to interpolate the gaps between such a sparse number of frames.

First published Friday, March 21, 2025

