The 15 papers in this episode are:
00:23 💡 Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
01:09 🖼 UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
01:53 🧪 CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs
02:37 🤖 VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments
03:15 🧠 SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
04:01 🧠 OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
04:47 🤖 Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
05:24 👀 MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs
06:10 🤖 GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
06:48 🎬 Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers
07:27 🧩 DINGO: Constrained Inference for Diffusion LLMs
08:10 🎬 AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation
08:47 🤖 Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics
09:35 🤖 Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning
10:21 🖼 Native-Resolution Image Synthesis

[Follow Us]
You can also find us on the following platform for more information beyond the podcast:
Xiaohongshu (小红书): AI速递