The 15 papers covered in this episode are:
[00:23] Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
[01:04] LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
[01:43] YuE: Scaling Open Foundation Models for Long-Form Music Generation
[02:17] UniF²ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models
[02:59] MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice
[03:42] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
[04:19] Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model
[05:03] Gemini Embedding: Generalizable Embeddings from Gemini
[05:45] Implicit Reasoning in Transformers is Reasoning through Shortcuts
[06:21] LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization
[07:06] Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling
[07:44] Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning
[08:30] OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
[09:14] CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing
[09:52] Video Action Differencing

【Follow Us】
You can also find us on the following platforms for more information beyond the podcast:
Xiaohongshu (小红书): AI速递