Full World simulation just had its ImageNet moment

Coding with Intelligence, November 4, 2024


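This post summarizes several recent developments in AI: Oasis, a playable simulated Minecraft environment from Decart that demonstrates the ability to simulate full world environments; SimpleQA, a benchmark released by OpenAI that aims to provide a more representative benchmark dataset; Recraft v3, an image generation model ranked first on the Artificial Analysis arena; and further research on language models, speech synthesis, and vision-language models, including SmolLM2, MrT5, Very Attentive Tacotron, and Transfusion. Together these span simulated environments, image generation, language understanding, speech synthesis, and vision-language models, and they point to rapid progress across the field with potentially far-reaching impact.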

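🤔**Decart launches Oasis: a playable simulated Minecraft** Decart has released Oasis, a playable simulated Minecraft environment. It is a striking achievement and marks a major step toward simulating full worlds with AI. The ability to simulate full world environments at increasingly higher fidelity will open a new era of robotics and real-world reasoning, with consequences that are hard to fully predict. Oasis ships with model weights as well as an interactive web demo that users can try through a limited queue. This suggests that AI world simulation is advancing quickly and may find broad use in games, robotics, and virtual reality. For example, Oasis-style simulation could be used to train robots to navigate and operate in complex environments, or to build more realistic and immersive virtual reality experiences. The release also reflects the growing scale and complexity of AI models, which raises the demands on compute and data storage.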

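🖼️**Recraft v3: a powerful image generation model** Recraft v3 is one of the most powerful closed image generation models available and ranks first in the text-to-image category of the Artificial Analysis arena. It also launches with an API, so developers can integrate it into their own applications. Image generation is therefore both improving and becoming easier for developers and users to access. Recraft v3 can generate many kinds of images, including artwork, product designs, and game scenes. Its release also shows that model quality keeps rising, and future models may produce even more realistic, higher-quality images. With the API, developers can build their own image generation applications or add image generation to existing ones.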

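🤖**SmolLM2: a powerful small language model** SmolLM2 is a powerful 1.7B-parameter small language model (SLM) by Loubna Ben Allal from Hugging Face. It beats Qwen2.5-1.5B in multiple categories. This shows that small language models keep improving and can rival large language models on some tasks. SmolLM2 can be used for a range of natural language processing tasks such as text classification, question answering, and machine translation. Because it has few parameters, it can run on resource-constrained devices such as phones and tablets, which makes small language models practical for edge computing and mobile use. Its success also shows that researchers are actively looking for ways to build language models that are more efficient and easier to deploy.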

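📚**MrT5: an efficient byte-level language model** Researchers at Stanford have published the MrT5 paper, which proposes a tokenization-free approach to language modeling. The approach targets the problem of tokenization becoming a bottleneck for language models. MrT5 uses dynamic token merging to make byte-level language models efficient. This shows that researchers are exploring more effective and more flexible approaches to language modeling. MrT5 can be used for a range of natural language processing tasks such as text classification, question answering, and machine translation. Because it needs no tokenizer, it can handle many languages and character sets, including Chinese, Japanese, and Korean. The paper also reflects how language model research keeps deepening, and more efficient and capable models may follow.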

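🎤**Very Attentive Tacotron: robust speech synthesis with unbounded length generalization** Autoregressive Transformer speech synthesis models tend to drop or repeat words or produce unstable output, especially for longer utterances. This paper introduces enhancements for autoregressive Transformer-based encoder-decoder speech synthesis systems that address these robustness and length generalization issues. This shows that researchers are working to make speech synthesis models more robust and better at generalizing. Very Attentive Tacotron could be used for speech synthesis tasks such as voice assistants, audiobooks, and speech translation. Because it handles longer utterances, it can produce more natural speech output. The work also reflects steady progress in speech synthesis, and future systems may sound even more natural and lifelike.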

I don’t believe folks are grasping the implications of this achievement yet. But the ability to simulate full world environments at increasingly higher levels of fidelity will usher in an era of robotics and real-world reasoning the consequences of which are hard to fully comprehend. Incredible work from the folks at Decart.

News

OpenAI introduces SimpleQA benchmark

An attempt to curate unsaturated benchmarks. Note that OpenAI has still done significantly more for open source than Anthropic. Something to ponder :) OpenAI's o1-preview gets ~42% and, interestingly, also refuses to answer (instead of just hallucinating).
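Since refusing to answer is scored differently from answering wrongly, it helps to look at accuracy both over all questions and over attempted questions only. Below is a minimal sketch of that bookkeeping. The three grade labels, the `summarize` helper, and the toy data are all illustrative assumptions, not OpenAI's evaluation code or real results.

```python
from collections import Counter

# Hypothetical grade labels, assumed for illustration only.
CORRECT, INCORRECT, NOT_ATTEMPTED = "correct", "incorrect", "not_attempted"


def summarize(grades):
    """Return the share of correct answers overall and among attempted questions."""
    counts = Counter(grades)
    total = len(grades)
    attempted = counts[CORRECT] + counts[INCORRECT]
    return {
        "correct": counts[CORRECT] / total if total else 0.0,
        "not_attempted": counts[NOT_ATTEMPTED] / total if total else 0.0,
        "correct_given_attempted": counts[CORRECT] / attempted if attempted else 0.0,
    }


# Toy data, not real benchmark results.
grades = [CORRECT] * 4 + [INCORRECT] * 4 + [NOT_ATTEMPTED] * 2
print(summarize(grades))
```

A model that declines when unsure can have a modest overall score but a noticeably higher score on the questions it does attempt, which is the distinction the note above is pointing at.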

Recraft v3: most powerful (closed) image generation model

They also launch with an API. It's number one on the Artificial Analysis arena leaderboard. https://artificialanalysis.ai/text-to-image/arena?tab=Leaderboard

Repos

Embedding inference engine

Very cool project by Michael Feil from Gradient AI.

Progress on o1 repro: MCTSr: Mathematic as a Blackbox for LLM

Early stage of a project attempting to reproduce o1, good source of raw ideas if you're working on this yourself.

SmolLM2: powerful 1.7B SLM (small language model)

Great model by Loubna Ben Allal from Hugging Face. Beats Qwen2.5-1.5B in multiple categories.

Stagehand: an AI web browsing framework by Browserbase

Papers

MrT5: Dynamic Token Merging for Efficient Byte-level Language Models

I've been very interested in a tokenization-free approach to LLMs and this paper from Stanford nails it. Check this out if you think tokenizers are bottlenecking LLMs too!
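To make "tokenization-free" concrete, here is a small sketch of what a byte-level model sees as input: raw UTF-8 bytes, one ID per byte. This is not MrT5 itself, and the helper names are made up for illustration. It only shows why byte sequences get long, which is the cost that dynamic token merging is meant to reduce.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer ID (0-255) per byte."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Invert to_byte_ids: rebuild the original string from byte IDs."""
    return bytes(ids).decode("utf-8")


# ASCII text is one byte per character; accented and CJK characters take more.
for sample in ["tokenizer", "na\u00efve", "\u5206\u8bcd"]:
    ids = to_byte_ids(sample)
    print(sample, "->", len(sample), "chars,", len(ids), "bytes")
    assert from_byte_ids(ids) == sample
```

The vocabulary never exceeds 256 entries and any language round-trips losslessly, but non-ASCII text expands to several IDs per character.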

Bayesian scaling laws for in-context learning

Interesting approach to modeling scaling laws for the in-context learning ability of LLMs.

Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech

"When applied to text-to-speech (TTS), these models (AR Transformers) tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues."

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

By Meta AI.

Image2Struct: Benchmarking Structure Extraction for Vision-Language Models

Useful new benchmark for VLMs. VLMs are often used for structured extraction in practice, so this benchmark is not very academic but well aligned with applied quality needs. By Percy Liang's group at Stanford.

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

"reduces the number of video tokens while preserving visual details of long videos" neat! By folks from Meta AI, a video content powerhouse.

Anon ICLR submission: Towards Learning to Reason at Pre-Training Scale

Interesting idea! "given the first tokens from a large pre-training corpus, the model generates a CoT and receives a reward based on how well the CoT helps predict the following tokens"
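One way to read the quoted reward is as a gain in log-likelihood: how much more probable the following tokens become once the CoT is added to the context. The sketch below assumes exactly that. Subtracting a no-CoT baseline is my assumption, not something the quote states, and `toy_logprob` is a made-up stand-in for a real language model.

```python
import math
from typing import Callable, Sequence

LogProbFn = Callable[[Sequence[str], Sequence[str]], float]


def cot_reward(logprob: LogProbFn, prefix, cot, following) -> float:
    """Log-likelihood gain on the following tokens from adding the CoT (assumed form)."""
    with_cot = logprob(list(prefix) + list(cot), following)
    without_cot = logprob(list(prefix), following)
    return with_cot - without_cot


def toy_logprob(context, target) -> float:
    """Stand-in model: a target token is likelier if it already appears in the context."""
    return sum(math.log(0.5 if tok in context else 0.1) for tok in target)


prefix = ["2", "+", "3", "="]
following = ["5"]
helpful_cot = ["so", "the", "answer", "is", "5"]
unhelpful_cot = ["hmm"]

print(cot_reward(toy_logprob, prefix, helpful_cot, following))    # positive
print(cot_reward(toy_logprob, prefix, unhelpful_cot, following))  # zero
```

With a real model, `logprob` would sum the model's token log-probabilities over the continuation; the structure of the reward would stay the same under this assumption.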

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

By Google DeepMind.

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Researchers might not have tons of compute, but luckily they are smart. This paper tackles the problem that "When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch". This computational efficiency gain can spur faster iteration on architectural ideas. Neural architecture search, let's go!

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

Very strong open source non-autoregressive TTS model. Demo space on Hugging Face https://huggingface.co/spaces/amphion/maskgct

Demos

OmniParser running in the browser with Transformers.js

Very impressive and useful demo showing how to run OmniParser in the browser directly. As others remarked on X, this has potential to be a core building block for browser extensions.

Decart launches Oasis: playable simulated Minecraft

This is a phenomenal achievement. It sets the stage for full world simulation. Remember what the first generated images/videos looked like. What makes this launch even more remarkable is that both weights and an interactive web demo with a limited queue are available. Just WOW.

AlignEval: a game/tool to help you build and optimize LLM-evaluator

Very cool project! Source is on GitHub, and there's an X thread about it.

Resources

Waymo introduces Emma

It's built on top of Gemini's multimodal capabilities.

LLM as a judge for business value by Hamel Husain

You won't find better applied AI findings than this.

NotebookLM's TTS system explained by Google DeepMind

OpenAI Audio generation endpoint

Note this is separate from real-time audio. It allows these combinations:
text in → text + audio out
audio in → text + audio out
audio in → text out
text + audio in → text + audio out
text + audio in → text out
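The combinations listed above can be encoded as a small lookup for checking a request before sending it. This is only a sketch: `SUPPORTED` and `is_supported` are hypothetical names, the table copies the list above as given, and nothing here calls the OpenAI API.

```python
# (input modalities, output modalities) pairs, copied from the list above.
SUPPORTED = {
    (frozenset({"text"}), frozenset({"text", "audio"})),
    (frozenset({"audio"}), frozenset({"text", "audio"})),
    (frozenset({"audio"}), frozenset({"text"})),
    (frozenset({"text", "audio"}), frozenset({"text", "audio"})),
    (frozenset({"text", "audio"}), frozenset({"text"})),
}


def is_supported(inputs, outputs) -> bool:
    """Return True if this input/output modality pairing appears in the list."""
    return (frozenset(inputs), frozenset(outputs)) in SUPPORTED


print(is_supported(["text"], ["text", "audio"]))  # True
print(is_supported(["audio"], ["audio"]))         # False: audio-only output is not listed
```

Plain text in → text out is ordinary chat and is not part of this list, so the helper reports it as unsupported too.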

[Video] Learning to Reason, Insights from Language Modeling

By Noah D. Goodman, a researcher at Stanford.

Want more? Follow me on X! @ricklamers
