ThursdAI - Recaps of the most high signal AI weekly spaces · October 22, 2024
ThursdAI - Oct 17 - Robots, Rockets, and Multi Modal Mania with open source voice cloning, OpenAI new voice API and more AI news

This week's AI news was packed, spanning real-world robots and rockets, new open-source LLMs and tools, and a range of non-standard AI use cases, including the Optimus robot, SpaceX's Starship launch, a batch of new models and techniques, and AI applied to voice cloning, video editing, and more.

🎉 Astonishing real-world progress: the Optimus robot walked through crowds and served drinks, robotaxis gave a taste of a driverless future, and SpaceX caught the Starship booster with Mechazilla.

💻 New open-source LLMs and tools: NVIDIA released the Nemotron 70B instruct model, Zyphra shipped Zamba 2 and Zyda 2, Mistral introduced Ministral 3B and 8B, Entropix brought new sampling techniques, and Google released Gemma-APS.

🤖 Non-standard AI use cases: Hrishi's experiments with Gemini for transcription and diarization found it both effective and cheap; screen recordings plus Gemini Flash enable low-cost data extraction; NotebookLM got an upgrade that lets you steer the speakers with custom commands.

🎙 Voice cloning and video editing: F5-TTS performs zero-shot voice cloning, Hallo 2 animates talking avatars, and Adobe's Firefly Video generates video and edits images.

Hey folks, Alex here from Weights & Biases, and this week has been absolutely bonkers. From robots walking among us to rockets landing on chopsticks (well, almost), the future is feeling palpably closer. And if real-world robots and reusable spaceship boosters weren't enough, the open-source AI community has been cooking, dropping new models and techniques faster than a Starship launch. So buckle up, grab your space helmet and noise-canceling headphones (we’ll get to why those are important!), and let's blast off into this week’s AI adventures!

TL;DR and show notes + links at the end of the post.

Robots and Rockets: A Glimpse into the Future

I gotta start with the real-world stuff because, let's be honest, it's mind-blowing. We had Robert Scoble (yes, the Robert Scoble) join us after attending the Tesla We, Robot AI event, reporting on Optimus robots strolling through crowds, serving drinks, and generally being ridiculously futuristic. Autonomous robo-taxis were also cruising around, giving us a taste of a driverless future.

Robert’s enthusiasm was infectious: "It was a vision of the future, and from that standpoint, it succeeded wonderfully." I couldn't agree more. While the market might have had a mini-meltdown (apparently investors aren't ready for robot butlers yet), the sheer audacity of Tesla’s vision is exhilarating. These robots aren't just cool gadgets; they represent a fundamental shift in how we interact with technology and the world around us. And they’re learning fast. Just days after the event, Tesla released a video of Optimus operating autonomously, showcasing the rapid progress they’re making.

And speaking of audacious visions, SpaceX decided to one-up everyone (including themselves) by launching Starship and catching the booster with Mechazilla – their giant robotic chopsticks (okay, technically a launch tower, but you get the picture). Waking up early with my daughter to watch this live was pure magic. As Ryan Carson put it, "It was magical watching this… my kid who's 16… all of his friends are getting their imaginations lit by this experience." That’s exactly what we need - more imagination and less doomerism! The future is coming whether we like it or not, and I, for one, am excited.

Open Source LLMs and Tools: The Community Delivers (Again!)

Okay, back to the virtual world (for now). This week's open-source scene was electric, with new model releases and tools that have everyone buzzing (and benchmarking like crazy!).

OpenAI adds voice to their completion API (X, Docs)

In the last second of the pod, OpenAI decided to grace us with Breaking News!

Not only did they launch their native Windows app, they also added voice input and output to their chat completions API. This appears to be the same model that powers advanced voice mode (and it's priced just as steeply), the one behind the Realtime API released a few weeks ago at DevDay.

This is of course a bit slower than the Realtime API, but it's much simpler to use and gives way more developers access to this incredible capability (I'm definitely planning to use this for... things).

These aren't their TTS or STT (Whisper) models; no, this is an actual omni model that understands audio natively and outputs audio natively, allowing for things like "count to 10 super slow".

I've played with it just now (and it's now after 6pm and I'm still writing this newsletter), and it's so, so awesome. I expect it to be huge, because the Realtime API is very cumbersome and many people don't really need that complexity.

This week's Buzz - Weights & Biases updates

OK, I wanted to send a completely different update, but instead I'll show you this: Weave, our observability framework, is now also multimodal!

This couples very well with the new update from OpenAI!

So here's an example using today's announcement: I'll walk through the OpenAI example, show you how to use it with streaming so you get the audio faster, and show off the Weave multimodality as well.

You can find the full code in this Gist, and please give us feedback, as this is brand new.
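If you just want the shape of the thing, here's a minimal sketch (not the Gist itself): it assumes `pip install openai weave`, an OPENAI_API_KEY in your environment, and a made-up Weave project name. The streaming delta shape follows OpenAI's docs for audio in chat completions, so double-check it there.

```python
# Minimal sketch: streaming audio out of the chat completions API, traced by Weave.
# The Weave project name is hypothetical; model and parameters per OpenAI's docs.
import base64
import weave
from openai import OpenAI

weave.init("thursdai-audio-demo")  # hypothetical project name
client = OpenAI()

@weave.op()  # Weave traces this call, inputs and outputs included
def ask_out_loud(prompt: str) -> bytes:
    stream = client.chat.completions.create(
        model="gpt-4o-audio-preview",
        modalities=["text", "audio"],
        audio={"voice": "alloy", "format": "pcm16"},  # streaming requires raw pcm16
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    pcm = bytearray()
    for chunk in stream:
        delta = chunk.choices[0].delta
        # audio arrives incrementally as base64-encoded PCM frames
        audio = getattr(delta, "audio", None) or {}
        if audio.get("data"):
            pcm.extend(base64.b64decode(audio["data"]))
    return bytes(pcm)

audio = ask_out_loud("Count to 10 super slow")
open("reply.pcm", "wb").write(audio)  # raw 16-bit PCM; playable with e.g. ffplay
```

The @weave.op decorator is all it takes for the call to show up in your Weave traces, and with the new multimodal support the audio payload should be inspectable there too.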

Non-standard use cases of AI corner

This week I started noticing and collecting some incredible use cases of Gemini, its long context, and its multimodality, and wanted to share them with you, so we had some great conversations about non-standard use cases that are pushing the boundaries of what's possible with LLMs.

Hrishi blew me away with his experiments using Gemini for transcription and diarization. Turns out Gemini is not only great at transcription (it beats Whisper!), it's also around 60x cheaper than dedicated ASR models. He emphasized the unexplored potential of prompting multimodal models, adding, "the prompting on these things… is still poorly understood." So much room for innovation here!
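For flavor, here's roughly what this looks like with the google-generativeai Python SDK; the file name and the prompt are placeholders, and per Hrishi's point, the prompt is the part worth experimenting with.

```python
# Rough sketch of Gemini-as-transcriber via the google-generativeai SDK.
# File name and prompt are placeholders for illustration.
import time
import google.generativeai as genai

genai.configure(api_key="...")  # your Gemini API key

audio = genai.upload_file("meeting.mp3")
while audio.state.name == "PROCESSING":  # wait for the File API to ingest it
    time.sleep(2)
    audio = genai.get_file(audio.name)

model = genai.GenerativeModel("gemini-1.5-pro")
resp = model.generate_content([
    audio,
    "Transcribe this recording with speaker diarization. "
    "Label speakers as Speaker 1, Speaker 2, ... and include timestamps.",
])
print(resp.text)
```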

Simon Willison then stole the show with his mind-bending screen-scraping technique. He recorded a video of himself clicking through emails, fed it to Gemini Flash, and got perfect structured data in return. This trick isn't just clever; it's practically free, thanks to the ridiculously low cost of Gemini Flash. I even tried it myself, recording my X bookmarks and getting a near-perfect TL;DR of the week's AI news. The future of data extraction is here, and it involves screen recordings and very cheap (or free) LLMs.

Here's Simon's example of how much this would cost him had he actually been charged for it.
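And here's a rough sketch of the video-scraping pattern itself, same SDK as above, swapping audio for video and asking for JSON back; the file name and output schema are placeholders, not Simon's actual code.

```python
# Sketch of Simon-style video scraping: screen-record, upload, ask for JSON.
# File name and the output schema are placeholders for illustration.
import time
import google.generativeai as genai

genai.configure(api_key="...")

video = genai.upload_file("screen_recording.mp4")
while video.state.name == "PROCESSING":  # video needs server-side processing
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-flash")  # Flash keeps this near-free
resp = model.generate_content([
    video,
    "Extract every email visible in this screen recording as a JSON array "
    'of objects with "sender", "subject", and "date" fields. JSON only.',
])
print(resp.text)
```

Flash's pricing is what makes this feel free; a short screen recording costs fractions of a cent.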

Speaking of Simon, he also broke the news that NotebookLM got an upgrade: you can now steer the speakers with custom commands, which he promptly used to ask the Audio Overview hosts to talk like pelicans.

Voice Cloning, Adobe Magic, and the Quest for Real-Time Avatars

Voice cloning also took center stage this week with the release of F5-TTS. This open-source model performs zero-shot voice cloning with just a few seconds of audio, raising all sorts of ethical questions (and exciting possibilities!). I played a sample on the show, and it was surprisingly convincing (though not without its problems) for a local model!

This, combined with Hallo 2's (also released this week!) ability to animate talking avatars, has Wolfram Ravenwolf dreaming of real-time AI assistants with personalized faces and voices. The pieces are falling into place, folks.

And for all you Adobe fans, Firefly Video has landed! This “commercially safe” text-to-video and image-to-video model is seamlessly integrated into Premiere, offering incredible features like extending video clips with AI-generated frames. Photoshop also got some Firefly love, with mind-bending relighting capabilities that could make AI-generated images indistinguishable from real photographs.

Wrapping Up:

Phew, that was a marathon, not a sprint! From robots to rockets, open source to proprietary, and voice cloning to video editing, this week has been a wild ride through the ever-evolving landscape of AI. Thanks for joining me on this adventure, and as always, keep exploring, keep building, and keep pushing those AI boundaries. The future is coming, and it’s going to be amazing.

P.S. Don’t forget to subscribe to the podcast and newsletter for more AI goodness, and if you’re in Seattle next week, come say hi at the AI Tinkerers meetup. I’ll be demoing my Halloween AI toy – it’s gonna be spooky!

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

TL;DR - Show Notes and Links
