ThursdAI - Recaps of the most high signal AI weekly spaces, February 28
Feb 27, 2025 - GPT-4.5 Drops TODAY?!, Claude 3.7 Coding BEAST, Grok's Unhinged Voice, Humanlike AI voices & more AI news

This week brought major developments in AI. OpenAI released GPT-4.5, its largest model yet by parameter count; while it doesn't top the benchmarks, it improves on creative writing, medical diagnosis, and more. Anthropic launched Claude 3.7 Sonnet, which excels at code generation and is the new king of coding models. Open-source LLMs also had a breakout week: DeepSeek open-sourced a series of advanced tools and techniques, and Microsoft released the Phi-4 family, showcasing the multimodal capability and efficiency of small models. Diffusion-based LLMs such as Mercury Coder and LLaDA also emerged, achieving breakthroughs in generation speed. Meanwhile, OpenAI's Deep Research tool opened up to Plus users, Amazon introduced the LLM-powered Alexa+, and Grok shipped a voice mode with an "unhinged" setting.

🚀 **GPT-4.5 (ORION) launches**: OpenAI released its largest LLM ever, at roughly 10x the scale of GPT-4. While it doesn't top the benchmarks, it performs better at creative writing, song recommendations, vision tasks, and medical diagnosis. The model accepts text and image input with text output, has a 128K-token context window, and is available in ChatGPT Pro and the API.

💻 **Claude 3.7 Sonnet: the new king of code**: Anthropic's Claude 3.7 Sonnet excels at code generation, scoring a remarkable 70% on the challenging SWE-Bench benchmark and quickly earning enthusiastic community feedback. It ranks #1 in the WebDev arena, appears trained on UX and websites, and introduces combined thinking and reasoning.

💡 **Open-source LLMs in full bloom**: DeepSeek open-sourced FlashMLA, DeepEP, DeepGEMM, and other advanced tools and techniques for training and optimizing LLMs. Microsoft released Phi-4-multimodal and Phi-4-mini, demonstrating the potential of small models for multimodal processing and efficiency. Diffusion-based LLMs such as Mercury Coder and LLaDA open new possibilities for text generation, especially in speed.

🤖 **Alexa+ gets an AI brain upgrade**: Amazon introduced Alexa+, upgraded with Anthropic's Claude models, promising a smarter, more conversational experience with deep integration across Amazon services. Users can expect significant improvements in handling complex conversations, controlling smart-home devices, and completing tasks across Amazon services.

🎤 **Grok voice mode: off the leash**: Elon Musk's Grok added a voice mode, including an "unhinged" 18+ option that lets the AI talk in crude language. The feature has drawn wide attention and discussion.

Hey all, Alex here!

What can I say, the weeks are getting busier, and this is one of those "crazy full" weeks in AI. As we were about to start recording, OpenAI teased the GPT-4.5 live stream, and we already had a very busy show lined up (Claude 3.7 vibes are immaculate, Grok got an unhinged voice mode), plus I had an interview with Kevin Hou from Windsurf scheduled! Let's dive in!

GPT-4.5 (ORION) is here - world's largest LLM (10x GPT-4o)

OpenAI has finally shipped their next .5 model, at 10x the scale of the previous model. We didn't cover this on the podcast, but we did watch the OpenAI live stream together after the podcast concluded.

A very interesting .5 release from OpenAI: even Sam Altman says this model "won't crush benchmarks" and is not the most frontier model, but it is OpenAI's LARGEST model by far (folks are speculating 10+ trillion parameters).

After 2 years of smaller models and distillations, we finally got a new BIG model that shows the scaling laws proper. While it won't compete with reasoning models on some benchmarks, it will absolutely fuel a huge increase in capabilities, even for reasoners, once o-series models are trained on top of it.

Here's a summary of the announcement and a quick vibes recap (from folks who had access to it beforehand).

4.5 Vibes Recap

Tons of folks who had access are pointing to the same thing: while this model isn't beating others on evals, it's much better at several other things, namely creative writing, recommending songs, improved vision capabilities, and improved medical diagnosis.

Karpathy said "Everything is a little bit better and it's awesome, but also not exactly in ways that are trivial to point to" and posted a thread of pairwise tone comparisons on X.

Though the reaction is bifurcated: many are upset with the model's high price (10x more costly on outputs) and the fact that it's only marginally better at coding tasks. Compared to the newer Sonnet (Sonnet 3.7) and DeepSeek, folks are looking at OpenAI and asking: why isn't this way better?

Anthropic's Claude 3.7 Sonnet: A Coding Powerhouse

Anthropic released Claude 3.7 Sonnet, and the immediate reaction from the community was overwhelmingly positive. With 8x more output capacity (64K tokens) and reasoning built in, this model is an absolute coding powerhouse.

Claude 3.7 Sonnet is the new king of coding models, achieving a remarkable 70% on the challenging SWE-Bench benchmark, and the initial user feedback is stellar, though vibes started to shift a bit towards Thursday.

Ranking #1 on the WebDev arena, and seemingly trained on UX and websites, Claude 3.7 Sonnet (AKA NewerSonnet) has been blowing our collective minds since its release on Monday, especially for introducing thinking and reasoning in a combined model.

Now that the community has had time to play with it since the start of the week, some users are returning to Sonnet 3.5, saying that while the new model is generally much more capable, it tends to generate tons of things that are unnecessary.

I wonder if the shift is due to Cursor/Windsurf specific prompts, or the model's larger output context, and we'll keep you updated on if the vibes shift again.
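If you want to try the combined thinking mode from code, below is a minimal sketch of how an extended-thinking request to Claude 3.7 Sonnet can be assembled. The `thinking` block with a `budget_tokens` cap follows Anthropic's published Messages API, but the model id and token budgets here are illustrative choices, not recommendations.

```python
# Sketch: assembling a Claude 3.7 Sonnet request with extended thinking.
# The "thinking" block follows Anthropic's extended-thinking API shape;
# the exact budgets below are illustrative, not recommended values.

def build_thinking_request(prompt: str,
                           budget_tokens: int = 16_000,
                           max_tokens: int = 20_000) -> dict:
    """Return kwargs suitable for anthropic.Anthropic().messages.create(**kwargs)."""
    if budget_tokens >= max_tokens:
        # The thinking budget must leave room for the final answer.
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {
        "model": "claude-3-7-sonnet-20250219",
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_thinking_request("Refactor this function to be pure.")
print(req["thinking"])  # {'type': 'enabled', 'budget_tokens': 16000}
```

The returned dict is meant to be splatted into the Anthropic SDK's `messages.create(**req)`; the key constraint is that the thinking budget stays below `max_tokens` so the model has room left for its final answer.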

Open Source LLMs

This week was HUGE for open source, folks. We saw releases pushing the boundaries of speed, multimodality, and even the very way LLMs generate text!

DeepSeek's Open Source Spree

DeepSeek went on an absolute tear, open-sourcing a treasure trove of advanced tools and techniques:

This isn't your average open-source dump, folks. We're talking FlashMLA (efficient decoding on Hopper GPUs), DeepEP (an optimized communication library for MoE models), DeepGEMM (an FP8 GEMM library that's apparently ridiculously fast), and even parallelism strategies like DualPipe and EPLB.

They are releasing some advanced stuff for training and optimizing LLMs; you can follow all their releases on their X account.

DualPipe seems to be the release that got the most attention from the community: an incredible feat of pipeline parallelism that even got the cofounder of Hugging Face super excited.
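To give a feel for why schedules like DualPipe matter, here is a toy calculation of the classic pipeline "bubble", the fraction of time pipeline stages sit idle. This is the textbook formula for a simple pipeline schedule, not anything from DeepSeek's code; DualPipe's contribution is precisely to shrink this idle time further by overlapping forward and backward computation across two micro-batch streams.

```python
# Toy illustration of the pipeline "bubble" (idle fraction) that schedules
# like DualPipe try to shrink. Textbook formula for a simple schedule:
# with p pipeline stages and m micro-batches, the pipeline is idle for
# (p - 1) out of (m + p - 1) time slots.

def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

for m in (4, 16, 64):
    print(f"p=8, m={m:3d} -> bubble {bubble_fraction(8, m):.1%}")
```

With 8 stages, going from 4 to 64 micro-batches drops the bubble from roughly 64% to roughly 10%; overlap tricks like DualPipe attack what remains instead of just piling on more micro-batches.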

Microsoft's Phi-4: Multimodal and Mini (Blog, HuggingFace)

Microsoft joined the party with Phi-4-multimodal (5.6B parameters) and Phi-4-mini (3.8B parameters), showing that small models can pack a serious punch.

These models are a big deal. Phi-4-multimodal can process text, images, and audio, and it actually beats WhisperV3 on transcription! As Nisten said, "This is a new model and, I'm still reserving judgment until, until I tried it, but it looks ideal for, for a portable size that you can run on the phone and it's multimodal." It even supports a wide range of languages. Phi-4-mini, on the other hand, is all about speed and efficiency, perfect for finetuning.

Diffusion LLMs: Mercury Coder and LLaDA (X , Try it)

This is where things get really interesting. We saw not one but two diffusion-based LLMs this week: Mercury Coder from Inception Labs and LLaDA 8B. (Although, OK, to be fair, LLaDA was released 2 weeks ago; I was just busy.)

For those who don't know, diffusion is usually used for generating things like images; using it to generate text is like taking a revolutionary painting tool and writing code with it. Inception Labs' Mercury Coder is claiming over 1000 tokens per second on NVIDIA H100s, which is insane speed, usually only seen with specialized chips! Nisten spent hours digging into these, noting, "This is a complete breakthrough and it just hasn't quite hit yet that this just happened, because people thought for a while it should be possible, because then you can do multiple token prediction at once". He explained that these models combine a regular LLM with a diffusion component, allowing them to generate multiple tokens simultaneously and excel at tasks like "fill in the middle" coding.

LLaDA 8B, on the other hand, is an open-source attempt, and while it needs more training, it shows the potential of this approach. LDJ pointed out that LLaDA is "trained on around five to seven times less data while already competing with Llama 3 8B with the same parameter count".

Are diffusion LLMs the future? It's too early to say, but the speed gains are very intriguing.

Magma 8B: Robotics LLM from Microsoft

Microsoft dropped Magma 8B, a Microsoft Research project: an open-source model that combines vision and language understanding with the ability to control robotic actions.

Nisten was particularly hyped about this one, calling it "the robotics LLM." He sees it as a potential game-changer for robotics companies, allowing them to build robots that can understand visual input, respond to language commands, and act in the real world.

OpenAI's Deep Research for Everyone (Well, Plus Subscribers)

OpenAI finally brought Deep Research, its incredible web-browsing and research tool, to Plus subscribers.

I've been saying this for a while: Deep Research is another ChatGPT moment. It's that good. It goes out, visits websites, understands your query in context, and synthesizes information like nothing else. As Nisten put it, "Nothing comes close to OpenAI's Deep Research...People like pull actual economics data, pull actual stuff." If you haven't tried it, you absolutely should.

Our full coverage of Deep Research is here if you haven't yet listened; it's incredible.

Alexa Gets an AI Brain Upgrade with Alexa+

Amazon finally announced Alexa+, the long-awaited LLM-powered upgrade to its ubiquitous voice assistant.

Alexa+ will be powered by Claude (and sometimes Nova), offering a much more conversational and intelligent experience, with integrations across Amazon services.

This is a huge deal. For years, Alexa has felt… well, dumb, compared to the advancements in LLMs. Now, it's getting a serious intelligence boost, thanks to Anthropic's Claude. It'll be able to handle complex conversations, control smart home devices, and even perform tasks across various Amazon services. Imagine asking Alexa, "Did I let the dog out today?" and it actually checking your Ring camera footage to give you an answer! (Although, as I joked, let's hope it doesn't start setting houses on fire.)

Also very intriguing are the new SDKs they are releasing to connect Alexa+ to all kinds of experiences. I think this is huge and will absolutely create a new industry of applications built for Alexa voice.

Alexa Web Actions for example will allow Alexa to navigate to a website and complete actions (think order Uber Eats)

The price? $20/mo, but free if you're an Amazon Prime subscriber, which covers most US households at this point.

They are focusing on personalization and memory (though it's still unclear how that will be handled), plus the ability to share documents like schedules.

I'm very much looking forward to smart Alexa, and to be able to say "Alexa, set a timer for the amount of time it takes to hard boil an egg, and flash my house lights when the timer is done"

Grok Gets a Voice... and It's UNHINGED

Grok, Elon Musk's AI, finally got a voice mode, and… well, it's something else.

One-sentence summary: Grok's new voice mode includes an "unhinged" 18+ option that curses like a sailor, along with other personality settings.

Yes, you read that right. There's literally an "unhinged" setting in the UI. We played it live on the show, and... well, let's just say it's not for the faint of heart (or for kids). Here's a taste:

Alex: "Hey there."

Grok: "Yo, Alex. What's good, you horny bastard? How's your day been so far? Fucked up or just mildly shitty?"

Beyond the shock value, the voice mode is actually quite impressive in its expressiveness and ability to understand interruptions. It has several personalities, from a helpful "Grok Doc" to an "argumentative" mode that will disagree with everything you say. It's... unique.

This Week's Buzz (WandB-Related News)

Agents Course is Coming!

We announced our upcoming agents course! You can pre-sign up HERE . This is going to be a deep dive into building and deploying AI agents, so don't miss it!

AI Engineer Summit Recap

We briefly touched on the AI Engineer Summit in New York, where we met with Kevin Hou and many other brilliant minds in the AI space. The theme was "Agents at Work," and it was a fantastic opportunity to see the latest developments in agent technology. I gave a talk about reasoning agents and had a workshop about evaluations on Saturday, and saw many listeners of ThursdAI ? ✋

Interview with Kevin Hou from Windsurf

This week we had the pleasure of chatting with Kevin Hou from Windsurf about their revolutionary AI editor. Windsurf isn't just another IDE, it's an agentic IDE. As Kevin explained, "we made the pretty bold decision of saying, all right, we're not going to do chat... we are just going to [do] agent." They've built Windsurf from the ground up with an agent-first approach, and it’s making waves.

Kevin walked us through the evolution of AI coding tools, from autocomplete to chat, and now to agents. He highlighted the "magical experiences" users are having, like debugging complex code with AI assistance that actually understands the context. We also delved into the challenges – memory, checkpointing, and cost.

We also talked about the burning question: vibe coding. Is coding as we know it dead? Kevin's take was nuanced: "there's an in-between state that I really vibe or like gel with, which is, the scaffolding of what you want… Let's use, let's like vibe code and purely use the agent to accomplish this sort of commit." He sees AI agents raising the bar for software quality, demanding better UX, testing, and overall polish.

And of course, we had to ask about the elephant in the room – why are so many people switching from Cursor to Windsurf? Kevin's answer was humble, pointing to user experience, the agent-first workflow, and the team’s dedication to building the best product. Check out our full conversation on the pod and download Windsurf for yourself: windsurf.ai

Video Models & Voice model updates

There is so much happening in the LLM world that folks may skip over the other stuff, but there's so much happening in these worlds as well this week! Here's a brief recap!


Phew, it looks like we've made it! Huge, huge week in AI: two big new models, plus tons of incredible updates on multimodality and voice as well.

If you enjoyed this summary, the best way to support us is to share it with a friend (or 3) and give us a 5-star review wherever you get your podcasts; it really does help!

See you next week,

Alex
