Latent · January 15
Generative Video WorldSim, Diffusion, Vision, Reinforcement Learning and Robotics — ICML 2024 Part 1

This article summarizes frontier technical progress from ICML across generative modeling, diffusion models, vision, reinforcement learning, and robotics. It highlights video generation models such as Sora, Genie, and VideoPoet, along with applications of diffusion models to image and speech synthesis. It also covers the development of vision foundation models and multimodal large language models, and digs into recent breakthroughs in reinforcement learning and robotics around autonomous learning and environment shaping, giving readers a comprehensive overview of the conference.

🎬 Video generation models such as Sora, Genie, and VideoPoet took center stage, demonstrating strong capabilities in generating interactive environments and zero-shot video. Questions about how much inductive bias Sora introduces drew attention, while Genie and VideoPoet won Best Paper awards at ICML, underscoring Google DeepMind's leadership in the field.

🔊 Diffusion models showed enormous potential in image and speech synthesis. Sander Dieleman's "spectral autoregression" perspective offers a new way to understand diffusion models, while Stable Diffusion 3's Scaling Rectified Flow Transformers advanced high-resolution image synthesis. NaturalSpeech 3 and self-supervised speech learning also made notable progress with diffusion models.

👁️ The vision track produced many notable results, including DeCAF being hailed as "the OG vision foundation model" and the data-centric perspective of Lucas Beyer's talk "Vision in the age of LLMs". The performance of multimodal large language models on vision-language benchmarks also drew attention.

🤖 On reinforcement learning and robotics, Ashley Edwards's work on the DeepMind team showed that actions, policies, rewards, and environments can be learned from videos alone. Chelsea Finn's four talks on robot learning emphasized robots adapting autonomously and providing feedback to language models. Finally, automatic environment shaping was framed as the next frontier in RL.

Regular tickets are now sold out for Latent Space LIVE! at NeurIPS! We have just announced our last speaker and newest track, friend of the pod Nathan Lambert who will be recapping 2024 in Reasoning Models like o1! We opened up a handful of late bird tickets for those who are deciding now — use code DISCORDGANG if you need it. See you in Vancouver!


We’ve been sitting on our ICML recordings for a while (featuring today’s first-ever SOLO guest cohost, Brittany Walker), and in light of Sora Turbo’s launch (blogpost, tutorials) today, we figured it was a good time to drop Part 1: a deep dive into the state of generative video world simulation, with a seamless transition to vision (the opposite modality) and finally robots (their ultimate application).

Sora, Genie, and the field of Generative Video World Simulators

Bill Peebles, author of Diffusion Transformers, gave his most recent Sora talk at ICML, which begins our episode:

A question often asked about Sora is how much inductive bias was introduced to achieve these results. Bill references the same principle raised by Hyung Won Chung from the o1 team: “sooner or later those biases come back to bite you.”

We also recommend these reads from throughout 2024 on Sora.

Google DeepMind had a remarkably strong presence at ICML on video generation models, winning TWO Best Paper awards, for Genie and VideoPoet.

We end this part by taking in Tali Dekel’s talk on The Future of Video Generation: Beyond Data and Scale.

Part 2: Generative Modeling and Diffusion

Since 2023, Sander Dieleman’s perspective (blogpost, tweet), developed while working on Imagen and Veo, that diffusion is “spectral autoregression in the frequency domain” has caught the public imagination, so we highlight his talk:
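To make the “spectral autoregression” intuition concrete, here is a small numpy sketch (our illustration, not Dieleman’s code): natural images have power spectra that fall off with frequency, so adding white Gaussian noise drowns out the high frequencies first, and denoising therefore recovers content coarse-to-fine, like autoregression over frequency bands.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D "natural image": amplitude spectrum ~ 1/f, i.e. power ~ 1/f^2,
# mimicking the spectral decay of natural images.
n = 1024
freqs = np.fft.rfftfreq(n, d=1.0)
amp = np.zeros_like(freqs)
amp[1:] = 1.0 / freqs[1:]
phases = rng.uniform(0.0, 2.0 * np.pi, size=freqs.shape)
signal = np.fft.irfft(amp * np.exp(1j * phases), n)

def band_snr(x, sigma, lo, hi):
    """Ratio of the signal's mean per-bin power in the band [lo, hi)
    to the expected per-bin power of added white Gaussian noise
    (std sigma), which is flat across frequencies."""
    power = np.abs(np.fft.rfft(x)) ** 2
    band = (freqs >= lo) & (freqs < hi)
    noise_power = sigma**2 * n  # E[|rfft(noise)|^2] per bin
    return power[band].mean() / noise_power

sigma = signal.std()  # heavy noise, as at the start of diffusion sampling
low_snr = band_snr(signal, sigma, 0.001, 0.01)  # coarse structure
high_snr = band_snr(signal, sigma, 0.3, 0.5)    # fine detail

# Coarse structure survives heavy noise while fine detail is drowned out,
# so denoising proceeds coarse-to-fine: autoregression over frequencies.
print(low_snr > 1.0, high_snr < 1.0)  # True True
```

The 1/f spectrum and the band edges are our assumptions for illustration; the point is only the ordering of the two SNRs at a fixed noise level.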

Then we go to Ben Poole for his talk on Inferring 3D Structure with 2D Priors, including his work on NeRFs and DreamFusion:

Then we investigate two flow matching papers, one from Flow Matching co-author Ricky T. Q. Chen (FAIR, Meta),

and how it is implemented in Stable Diffusion 3 with Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.
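For context on what “rectified flow” means here, a minimal numpy sketch (our illustration, not the SD3 code): data and noise are joined by a straight line, the network’s regression target is the constant velocity along that line, and sampling integrates that velocity back from noise. With the exact velocity, a single Euler step recovers the data, because the path is straight.

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_pair(x0, t, rng):
    """Return (x_t, velocity target) for data x0 at time t in [0, 1],
    where t=0 is clean data and t=1 is pure noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0 + t * eps  # straight-line interpolation path
    v_target = eps - x0             # d x_t / d t, constant along the path
    return x_t, v_target

# Sampling: start from noise at t=1 and Euler-integrate dx/dt = v back
# to t=0. Here we use the oracle velocity in place of a trained network.
x0 = np.array([3.0, -1.0])          # toy "data" point
eps = rng.standard_normal(x0.shape)
x = eps.copy()                      # start at t=1 (pure noise)
v = eps - x0                        # oracle velocity for this pairing
x_recovered = x - 1.0 * v           # one Euler step from t=1 to t=0
print(np.allclose(x_recovered, x0))  # True: straight path, exact recovery
```

In practice a network v_theta(x_t, t) is trained to regress v_target over many (x0, eps, t) samples, and sampling takes many small Euler steps; the one-step exactness above holds only because we hand the sampler the true velocity.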

Our last hit on Diffusion is a couple of oral presentations on speech, which we leave you to explore via our audio podcast.

Part 3: Vision

The ICML Test of Time winner was DeCAF, which Trevor Darrell notably called “the OG vision foundation model”.

Lucas Beyer’s talk on “Vision in the age of LLMs — a data-centric perspective” was also well received online, and he talked about his journey from Vision Transformers to PaliGemma.

We give special honorable mention to MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark.

Part 4: Reinforcement Learning and Robotics

We segue vision into robotics with the help of Ashley Edwards, whose work on both the Gato and the Genie teams at DeepMind is summarized in Learning actions, policies, rewards, and environments from videos alone.

Brittany highlighted two poster session papers:

However, we must give the lion’s share of space to Chelsea Finn, now founder of Physical Intelligence, who gave FOUR talks:

We end the podcast with a position paper that links generative environments and RL/robotics: Automatic Environment Shaping is the Next Frontier in RL.

