MarkTechPost@AI · 12 hours ago
This AI Paper Introduces PEVA: A Whole-Body Conditioned Diffusion Model for Predicting Egocentric Video from Human Motion

This article introduces PEVA, a new framework developed by researchers from UC Berkeley, Meta's FAIR, and New York University for predicting first-person (egocentric) video from whole-body motion. PEVA uses 3D full-body pose trajectories to link body movement with visual perception, enabling machines to anticipate and understand their environment much as humans do. The model employs a conditional diffusion transformer and is trained on the Nymeria dataset; results show that PEVA outperforms existing models in both short-term and long-term video prediction, giving embodied AI systems more accurate and physically plausible foresight.

🧍 At the core of the study is the link between body movement and visual perception, which is essential for building intelligent systems that can understand and interact with their environment. PEVA focuses on how human actions, such as walking and arm movements, shape what is seen from a first-person viewpoint.

🤔 The main challenge PEVA faces is teaching a system how body actions affect what is seen. Movements such as turning the head or bending change the field of view in subtle ways. PEVA addresses this by linking physical motion to the resulting changes in visual input.

💡 At the heart of the PEVA framework is its structured action representation. Each action input is a 48-dimensional vector describing the root translation and the joint-level rotations of 15 upper-body joints in 3D space. This comprehensive representation captures the continuity and nuance of real motion.

🎬 PEVA is designed as an autoregressive diffusion model: a video encoder converts frames into latent state representations, and subsequent frames are predicted from prior states and body actions. To support long-horizon video generation, the system introduces random time-skips during training, learning both the immediate and the delayed visual consequences of motion.

✅ PEVA was evaluated on both short-term and long-term video prediction. Experiments show that it delivers strong visual consistency and semantic accuracy, especially over long prediction horizons. Incorporating whole-body control into the model markedly improves the realism and controllability of the generated video.

Understanding the Link Between Body Movement and Visual Perception

The study of human visual perception through egocentric views is crucial in developing intelligent systems capable of understanding and interacting with their environment. This area emphasizes how movements of the human body—ranging from locomotion to arm manipulation—shape what is seen from a first-person perspective. Understanding this relationship is essential for enabling machines and robots to plan and act with a human-like sense of visual anticipation, particularly in real-world scenarios where visibility is dynamically influenced by physical motion.

Challenges in Modeling Physically Grounded Perception

A major hurdle in this domain arises from the challenge of teaching systems how body actions affect perception. Actions such as turning or bending change what is visible in subtle and often delayed ways. Capturing this requires more than simply predicting what comes next in a video—it involves linking physical movements to the resulting changes in visual input. Without the ability to interpret and simulate these changes, embodied agents struggle to plan or interact effectively in dynamic environments.

Limitations of Prior Models and the Need for Physical Grounding

Until now, tools designed to predict video from human actions have been limited in scope. Models have often relied on low-dimensional inputs, such as velocity or head direction, and ignored the complexity of whole-body motion. These simplified approaches miss the fine-grained control and coordination required to simulate human actions accurately. Even in video generation models, body motion has usually been treated as the output rather than the driver of prediction. This lack of physical grounding has restricted the usefulness of these models for real-world planning.

Introducing PEVA: Predicting Egocentric Video from Action

Researchers from UC Berkeley, Meta’s FAIR, and New York University introduced a new framework called PEVA to overcome these limitations. The model predicts future egocentric video frames based on structured full-body motion data, derived from 3D body pose trajectories. PEVA aims to demonstrate how entire-body movements influence what a person sees, thereby grounding the connection between action and perception. The researchers employed a conditional diffusion transformer to learn this mapping and trained it using Nymeria, a large dataset comprising real-world egocentric videos synchronized with full-body motion capture.
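To make this mapping concrete, the following is a minimal PyTorch sketch of what such a conditional autoregressive rollout could look like: past frames are encoded into latent states, and each future frame is sampled conditioned on those states and a whole-body action. The names `encode`, `sample_next_latent`, and `decode` are illustrative stand-ins (replaced here by toy callables), not the authors' implementation.

```python
# Hypothetical sketch of a PEVA-style autoregressive rollout (not the authors' code).
# In the paper, the encoder is a video encoder producing latent states, and the
# next-latent sampler is a conditional diffusion transformer conditioned on past
# latents and a 48-D whole-body action; here both are toy stand-ins.
import torch

def rollout(encode, sample_next_latent, decode, context_frames, actions):
    """Predict future egocentric frames, one step per body action.

    encode:             frames (T, C, H, W) -> latents (T, D)
    sample_next_latent: (past latents (t, D), action (A,)) -> next latent (D,)
    decode:             latent (D,) -> frame (C, H, W)
    """
    latents = encode(context_frames)            # latent states of the observed context
    frames = []
    for action in actions:                      # one whole-body action per future step
        nxt = sample_next_latent(latents, action)
        latents = torch.cat([latents, nxt.unsqueeze(0)], dim=0)
        frames.append(decode(nxt))
    return torch.stack(frames)                  # predicted future frames

# Toy stand-ins so the sketch runs end to end.
D, A, C, H, W = 16, 48, 3, 8, 8
encode = lambda f: torch.randn(f.shape[0], D)
sample_next_latent = lambda lat, act: torch.randn(D)
decode = lambda z: torch.zeros(C, H, W)

future = rollout(encode, sample_next_latent, decode,
                 context_frames=torch.zeros(4, C, H, W),
                 actions=torch.randn(10, A))
print(future.shape)  # torch.Size([10, 3, 8, 8])
```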

Structured Action Representation and Model Architecture

The foundation of PEVA lies in its ability to represent actions in a highly structured manner. Each action input is a 48-dimensional vector that includes the root translation and joint-level rotations across 15 upper body joints in 3D space. This vector is normalized and transformed into a local coordinate frame centered at the pelvis to remove any positional bias. By utilizing this comprehensive representation of body dynamics, the model captures the continuous and nuanced nature of real motion.

PEVA is designed as an autoregressive diffusion model that uses a video encoder to convert frames into latent state representations and predicts subsequent frames based on prior states and body actions. To support long-term video generation, the system introduces random time-skips during training, allowing it to learn from both immediate and delayed visual consequences of motion.
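The sketch below illustrates, under stated assumptions, how such a 48-dimensional action vector could be assembled (3 root-translation values plus 15 joints × 3 rotation values) and how a random time-skip might be drawn during training. The function names, the per-joint rotation format, and the `max_skip` bound are hypothetical choices for illustration, not details taken from the paper.

```python
# Illustrative construction of a PEVA-style 48-D action vector. Assumed layout:
# 3 root-translation values + 15 upper-body joints x 3 rotation values = 48.
import numpy as np

NUM_JOINTS = 15  # upper-body joints

def build_action_vector(root_translation, joint_rotations, pelvis_rotation,
                        mean=None, std=None):
    """root_translation: (3,) world-frame root displacement for this step
       joint_rotations:  (15, 3) per-joint rotations (e.g. Euler angles; assumed format)
       pelvis_rotation:  (3, 3) rotation matrix of the pelvis-centered local frame
    """
    # Express the root translation in the pelvis-centered local frame to remove
    # dependence on global position and heading.
    local_translation = pelvis_rotation.T @ root_translation
    action = np.concatenate([local_translation, joint_rotations.reshape(-1)])  # (48,)
    if mean is not None and std is not None:
        action = (action - mean) / (std + 1e-8)  # per-dimension normalization
    return action

def sample_time_skip(rng, max_skip=8):
    """Random gap between conditioning frame and prediction target, so the model
    sees both immediate and delayed visual consequences of an action."""
    return int(rng.integers(1, max_skip + 1))

rng = np.random.default_rng(0)
a = build_action_vector(np.zeros(3), np.zeros((NUM_JOINTS, 3)), np.eye(3))
print(a.shape, sample_time_skip(rng))  # (48,) and a skip in [1, 8]
```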

Performance Evaluation and Results

In terms of performance, PEVA was evaluated on several metrics that test both short-term and long-term video prediction capabilities. The model was able to generate visually consistent and semantically accurate video frames over extended periods of time. For short-term predictions, evaluated at 2-second intervals, it achieved lower LPIPS scores and higher DreamSim consistency compared to baselines, indicating superior perceptual quality. The system also decomposed human movement into atomic actions such as arm movements and body rotations to assess fine-grained control. Furthermore, the model was tested on extended rollouts of up to 16 seconds, successfully simulating delayed outcomes while maintaining sequence coherence. These experiments confirmed that incorporating full-body control led to substantial improvements in video realism and controllability.
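As a rough illustration of how the short-term perceptual comparison can be scored, the sketch below computes LPIPS between predicted and ground-truth frames with the open-source `lpips` package (DreamSim consistency is measured analogously with its own model); the frame tensors here are random placeholders rather than actual model outputs.

```python
# Minimal sketch of per-frame LPIPS scoring between predicted and ground-truth
# egocentric frames (lower is better). Real frames would come from the model
# rollout and the Nymeria recordings; random tensors stand in for them here.
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net='alex')  # AlexNet-based perceptual distance

# Placeholder frame batches in [-1, 1], shape (N, 3, H, W).
pred = torch.rand(4, 3, 64, 64) * 2 - 1
gt = torch.rand(4, 3, 64, 64) * 2 - 1

with torch.no_grad():
    scores = loss_fn(pred, gt)            # per-frame distances, shape (N, 1, 1, 1)
print(scores.flatten().mean().item())     # mean LPIPS over the batch
```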

Conclusion: Toward Physically Grounded Embodied Intelligence

This research highlights a significant advancement in predicting future egocentric video by grounding the model in physical human movement. The problem of linking whole-body action to visual outcomes is addressed with a technically robust method that uses structured pose representations and diffusion-based learning. The solution introduced by the team offers a promising direction for embodied AI systems that require accurate, physically grounded foresight.


Check out the Paper here. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and YouTube, and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.

