MarkTechPost@AI · 27 minutes ago
Highlighted at CVPR 2025: Google DeepMind’s ‘Motion Prompting’ Paper Unlocks Granular Video Control

Researchers from Google DeepMind, the University of Michigan, and Brown University have developed a new method called “Motion Prompting” that controls video generation through specific motion trajectories. The technique uses motion prompts to translate a user's high-level requests into detailed motion instructions, enabling precise control over the video. The model can perform a variety of tasks, including object and camera control, motion transfer, and interactive image editing, without retraining for each specific capability, opening up new possibilities for markets such as advertising, filmmaking, and interactive entertainment.

🎬 **At the core of Motion Prompting:** a flexible motion representation that can capture everything from the subtle flutter of hair to complex camera movements. The researchers represent motion with spatio-temporally sparse or dense motion trajectories.

🖱️ **Motion Prompt Expansion:** converts high-level user input (such as a mouse drag) into the detailed motion prompts the model needs, allowing users to control the action in a video through simple interactions.

⚙️ **Versatile applications:** the model can perform a variety of tasks, including precise object and camera control, motion transfer, and interactive image editing. For example, a user can drag the mouse to make an object in an image move, or transfer the motion from one video onto a different subject in another.

🔬 **Experimental validation:** the team validated the effectiveness of Motion Prompting through quantitative evaluations and human studies, outperforming existing models in both image quality and motion accuracy. The human studies showed that participants preferred videos generated by Motion Prompting for their better adherence to motion commands, more realistic motion, and higher overall visual quality.


As generative AI continues to evolve, gaining precise control over video creation is a critical hurdle for its widespread adoption in markets like advertising, filmmaking, and interactive entertainment. While text prompts have been the primary method of control, they often fall short in specifying the nuanced, dynamic movements that make video compelling. A new paper from Google DeepMind, the University of Michigan, and Brown University, presented and highlighted at CVPR 2025, introduces a groundbreaking solution called “Motion Prompting,” which offers an unprecedented level of control by allowing users to direct the action in a video using motion trajectories.

This new approach moves beyond the limitations of text, which struggles to describe complex movements accurately. For instance, a prompt like “a bear quickly turns its head” is open to countless interpretations. How fast is “quickly”? What is the exact path of the head’s movement? Motion Prompting addresses this by allowing creators to define the motion itself, opening the door for more expressive and intentional video content.

Please note that the results are not real time (10 minutes of processing time).

Introducing Motion Prompts

At the core of this research is the concept of a “motion prompt.” The researchers identified that spatio-temporally sparse or dense motion trajectories—essentially tracking the movement of points over time—are an ideal way to represent any kind of motion. This flexible format can capture anything from the subtle flutter of hair to complex camera movements.
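To make this representation concrete, here is a minimal sketch, assuming a motion prompt is stored as 2D point tracks with a per-frame visibility flag; the shapes and values are illustrative, not the paper's actual data format.

```python
# A minimal sketch of a track-based motion prompt: N tracked points, each with
# an (x, y) position per frame and a visibility flag (points can be occluded).
# Shapes and values are illustrative assumptions, not the paper's data format.
import numpy as np

num_frames, num_points = 16, 8
positions = np.zeros((num_frames, num_points, 2))        # (T, N, 2) pixel coordinates
visible = np.ones((num_frames, num_points), dtype=bool)  # (T, N) visibility flags

# All points start at pixel (100, 80); point 0 drifts 3 px right per frame
# (think of a strand of hair fluttering) while the rest stay still. A sparse
# prompt touches a handful of points; a dense one could track every pixel.
positions[:] = [100.0, 80.0]
positions[:, 0, 0] += 3.0 * np.arange(num_frames)
```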

To enable this, the team trained a ControlNet adapter on top of a powerful, pre-trained video diffusion model called Lumiere. The ControlNet was trained on a massive internal dataset of 2.2 million videos, each with detailed motion tracks extracted by an algorithm called BootsTAP. This diverse training allows the model to understand and generate a vast range of motions without specialized engineering for each task.
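A rough sketch of the general ControlNet recipe is shown below. Lumiere and the paper's actual adapter are not public, so the module structure, channel counts, and the rasterized track encoding here are assumptions; the point is only that motion-track features are injected into a frozen backbone through zero-initialized layers, so training starts as a no-op.

```python
# Hypothetical sketch of ControlNet-style motion conditioning; names and shapes
# are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class MotionControlAdapter(nn.Module):
    """Encodes a rasterized motion-track signal and adds it to a frozen
    video-diffusion backbone's features via a zero-initialized projection."""

    def __init__(self, track_channels: int = 4, hidden: int = 64):
        super().__init__()
        # Spatio-temporal encoder for the motion raster (e.g. dx, dy,
        # visibility, occupancy channels per pixel per frame).
        self.encoder = nn.Sequential(
            nn.Conv3d(track_channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
        )
        # Zero-init so the frozen backbone's behaviour is unchanged at step 0.
        self.zero_proj = nn.Conv3d(hidden, hidden, kernel_size=1)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, backbone_feats: torch.Tensor, track_raster: torch.Tensor) -> torch.Tensor:
        # backbone_feats, track_raster: (batch, channels, frames, height, width)
        control = self.zero_proj(self.encoder(track_raster))
        return backbone_feats + control  # residual conditioning signal

# Toy usage with random tensors standing in for real features.
feats = torch.randn(1, 64, 8, 32, 32)
tracks = torch.randn(1, 4, 8, 32, 32)
adapter = MotionControlAdapter()
print(adapter(feats, tracks).shape)  # torch.Size([1, 64, 8, 32, 32])
```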

From Simple Clicks to Complex Scenes: Motion Prompt Expansion

While specifying every point of motion for a complex scene would be impractical for a user, the researchers developed a process they call “motion prompt expansion.” This clever system translates simple, high-level user inputs into the detailed, semi-dense motion prompts the model needs.
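A toy version of this expansion might look like the sketch below, where a single mouse drag is spread onto a semi-dense grid of points with a Gaussian spatial falloff. The real system is considerably more sophisticated, and the function name and parameters here are hypothetical.

```python
# Hypothetical sketch of "motion prompt expansion": turning one mouse drag into
# a semi-dense set of point tracks. Only meant to illustrate the idea.
import numpy as np

def expand_drag(start_xy, end_xy, num_frames=16, grid_step=16,
                image_size=(256, 256), radius=40.0):
    """Return tracks of shape (num_frames, N, 2): grid points near the drag
    start move with the drag; everything else stays put."""
    h, w = image_size
    ys, xs = np.mgrid[0:h:grid_step, 0:w:grid_step]
    points = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(float)   # (N, 2)

    drag = np.asarray(end_xy, dtype=float) - np.asarray(start_xy, dtype=float)
    dist = np.linalg.norm(points - np.asarray(start_xy, dtype=float), axis=-1)
    weight = np.exp(-(dist / radius) ** 2)[:, None]                      # (N, 1) falloff

    t = np.linspace(0.0, 1.0, num_frames)[:, None, None]                 # (T, 1, 1)
    tracks = points[None] + t * weight[None] * drag[None, None]
    return tracks                                                        # (T, N, 2)

tracks = expand_drag(start_xy=(100, 120), end_xy=(160, 120))
print(tracks.shape)  # (16, 256, 2) for a 16x16 grid on a 256x256 image
```

Feeding expanded tracks like these to the model is what enables the intuitive interactions described next.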

This allows for a variety of intuitive applications:

“Interacting” with an Image: A user can simply click and drag their mouse across an object in a still image to make it move. For example, a user could drag a parrot’s head to make it turn, or “play” with a person’s hair, and the model generates a realistic video of that action. Interestingly, this process revealed emergent behaviors, where the model would generate physically plausible motion, like sand realistically scattering when “pushed” by the cursor.

Object and Camera Control: By interpreting mouse movements as instructions to manipulate a geometric primitive (like an invisible sphere), users can achieve fine-grained control, such as precisely rotating a cat’s head. Similarly, the system can generate sophisticated camera movements, like orbiting a scene, by estimating the scene’s depth from the first frame and projecting a desired camera path onto it (see the sketch after this list). The model can even combine these prompts to control an object and the camera simultaneously.

Motion Transfer: This technique allows the motion from a source video to be applied to a completely different subject in a static image. For instance, the researchers demonstrated transferring the head movements of a person onto a macaque, effectively “puppeteering” the animal. 
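For the camera control described above, a bare-bones version of the depth-based projection could look like the following: lift sampled pixels into 3D using a depth map, rotate them relative to the camera along an orbit, and re-project to get one 2D track per sample. The paper estimates depth from the first frame with a monocular model; the constant depth map, intrinsics, and orbit angle here are made-up stand-ins.

```python
# Hypothetical sketch of turning a camera orbit into 2D motion tracks.
import numpy as np

H, W, f = 256, 256, 200.0                     # image size and focal length (assumed)
cx, cy = W / 2.0, H / 2.0

ys, xs = np.mgrid[0:H:32, 0:W:32]
depth = np.full_like(xs, 4.0, dtype=float)    # stand-in for an estimated depth map

# Back-project sampled pixels into 3D camera coordinates of the first frame.
X = (xs - cx) * depth / f
Y = (ys - cy) * depth / f
pts3d = np.stack([X, Y, depth], axis=-1).reshape(-1, 3)   # (N, 3)

def orbit_tracks(pts3d, num_frames=16, max_angle=0.3):
    """Rotate the scene about its centroid (which, from the camera's view, looks
    like the camera orbiting it) and re-project each 3D point to pixels,
    yielding 2D tracks of shape (num_frames, N, 2)."""
    center = pts3d.mean(axis=0)
    tracks = []
    for t in range(num_frames):
        a = max_angle * t / (num_frames - 1)
        R = np.array([[np.cos(a), 0.0, np.sin(a)],
                      [0.0,       1.0, 0.0      ],
                      [-np.sin(a), 0.0, np.cos(a)]])
        p = (pts3d - center) @ R.T + center   # rotated scene points
        u = f * p[:, 0] / p[:, 2] + cx        # pinhole projection
        v = f * p[:, 1] / p[:, 2] + cy
        tracks.append(np.stack([u, v], axis=-1))
    return np.stack(tracks)                   # (T, N, 2)

print(orbit_tracks(pts3d).shape)              # (16, 64, 2)
```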

Putting it to the Test

The team conducted extensive quantitative evaluations and human studies to validate their approach, comparing it against recent models like Image Conductor and DragAnything. In nearly all metrics, including image quality (PSNR, SSIM) and motion accuracy (EPE), their model outperformed the baselines. 
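As a point of reference, the end-point error (EPE) used for motion accuracy is simply the mean Euclidean distance between predicted and ground-truth point positions; a minimal sketch, assuming tracks are stored as (frames, points, 2) arrays:

```python
# Minimal sketch of end-point error (EPE): mean Euclidean distance between
# predicted and ground-truth point positions. Shapes are assumptions.
import numpy as np

def end_point_error(pred_tracks: np.ndarray, gt_tracks: np.ndarray) -> float:
    """pred_tracks, gt_tracks: (num_frames, num_points, 2) pixel coordinates."""
    return float(np.linalg.norm(pred_tracks - gt_tracks, axis=-1).mean())

gt = np.zeros((16, 8, 2))
pred = gt + np.array([3.0, 4.0])     # every point is off by a 3-4-5 triangle
print(end_point_error(pred, gt))     # 5.0
```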

A human study further confirmed these results. When asked to choose between videos generated by Motion Prompting and other methods, participants consistently preferred the results from the new model, citing better adherence to the motion commands, more realistic motion, and higher overall visual quality.

Limitations and Future Directions

The researchers are transparent about the system’s current limitations. Sometimes the model can produce unnatural results, like stretching an object unnaturally if parts of it are mistakenly “locked” to the background. However, they suggest that these very failures can be used as a valuable tool to probe the underlying video model and identify weaknesses in its “understanding” of the physical world.

This research represents a significant step toward creating truly interactive and controllable generative video models. By focusing on the fundamental element of motion, the team has unlocked a versatile and powerful tool that could one day become a standard for professionals and creatives looking to harness the full potential of AI in video production.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.


