ThursdAI - Recaps of the most high signal AI weekly spaces 05月02日 06:20
📆 ThursdAI - May 1- Qwen 3, Phi-4, OpenAI glazegate, RIP GPT4, LlamaCon, LMArena in hot water & more AI news
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本周AI领域风起云涌,开源模型表现抢眼,阿里巴巴发布Qwen 3系列模型,微软推出Phi-4-Reasoning。与此同时,GPT-4正式退役引发热议,ChatGPT“过度谄媚”事件导致OpenAI紧急回滚更新。此外,Anthropic的Claude宣布重大升级,支持与多种应用程序集成。文章还回顾了LlamaCon的最新进展,并对AI评估进行了深入探讨。

🚀阿里巴巴开源Qwen 3系列模型,包含MoE和密集型结构,性能出色,在多种评估中表现优异,并提供“/think”和“/no_think”的运行模式选择。

🧠微软推出Phi-4-Reasoning模型,通过链式思考训练和少量强化学习,在推理任务上表现出色,与更大规模的模型竞争。

💔GPT-4正式从ChatGPT下拉菜单中移除,引发社区对OpenAI的怀念和对开源的呼声,并引发了对模型生命周期的思考。

😲ChatGPT出现“过度谄媚”现象,OpenAI紧急回滚更新,凸显了负责任模型部署的重要性,以及对用户信任的影响。

💡Anthropic的Claude宣布支持与多种应用程序集成,标志着其在工具使用方面的重大进展,并可能带来更广泛的应用场景。

Hey everyone, Alex here 👋

Welcome back to ThursdAI! And wow, what a week. Seriously, strap in, because the AI landscape just went through some seismic shifts. We're talking about a monumental open-source release from Alibaba with Qwen 3 that has everyone buzzing (including us!), Microsoft dropping Phi-4 with Reasoning, a rather poignant farewell to a legend (RIP GPT-4 – we'll get to the wake shortly), major drama around ChatGPT's "glazing" incident and the subsequent rollback, updates from LlamaCon, a critical look at Chatbot Arena, and a fantastic deep dive into the world of AI evaluations with two absolute experts, Hamel Husain and Shreya Shankar.

This week felt like a whirlwind, with open source absolutely dominating the headlines. Qwen 3 didn't just release a model; they dropped an entire ecosystem, setting a potential new benchmark for open-weight releases. And while we pour one out for GPT-4, we also have to grapple with the real-world impact of models like ChatGPT, highlighted by the "glazing" fiasco. Plus, video consistency takes a leap forward with Runway, and we got breaking news live on the show from Claude!

So grab your coffee (or beverage of choice), settle in, and let's unpack this incredibly eventful week in AI.

Open-Source LLMs

Qwen 3 — “Hybrid Thinking” on Tap

Alibaba open-weighted the entire Qwen 3 family this week, releasing two MoE titans (up to 235 B total / 22 B active) and six dense siblings all the way down to 0 .6 B, all under Apache 2.0. Day-one support landed in LM Studio, Ollama, vLLM, MLX and llama.cpp.

The headline trick is a runtime thinking toggle—drop “/think” to expand chain-of-thought or “/no_think” to sprint. On my Mac, the 30 B-A3B model hit 57 tokens/s when paired with speculative decoding (drafted by the 0 .6 B sibling).

Other goodies:

In short: more parameters when you need them, fewer when you don’t, and the lawyers stay asleep. Read the full drop on the Qwen blog or pull weights from the HF collection.

Performance & Efficiency: "Sonnet at Home"?

The benchmarks are where things get really exciting.

    The 235B MoE rivals or surpasses models like DeepSeek-R1 (which rocked the boat just months ago!), O1, O3-mini, and even Gemini 2.5 Pro on coding and math.

    The 4B dense model incredibly beats the previous generation's 72B Instruct model (Qwen 2.5) on multiple benchmarks! 🤯

    The 30B MoE (with only 3B active parameters) is perhaps the star. Nisten pointed out people are getting 100+ tokens/sec on MacBooks. Wolfram achieved an 80% MMLU Pro score locally with a quantized version. The efficiency math is crazy – hitting Qwen 2.5 performance with only ~10% of the active parameters.

Nisten dubbed the larger model "Sonnet 3.5 at home," and while acknowledging Sonnet still has an edge in complex "vibe coding," the performance, especially in reasoning and tool use, is remarkably close for an open model you can run yourself.

I ran the 30B MoE (3B active) locally using LLM Studio (shoutout for day-one support!) through my Weave evaluation dashboard (Link). On a set of 20 hard reasoning questions, it scored 43%, beating GPT 4.1 mini and nano, and getting close to 4.1 – impressive for a 3B active parameter model running locally!

Phi-4-Reasoning — 14B That Punches at 70B+

Microsoft’s Phi team layered 1.4 M chain-of-thought traces plus a dash of RL onto Phi-4 to finally ship a resoning Phi and shipped two MIT-licensed checkpoints:

Phi-4-R-Plus clocks 78 % on AIME 25, edging DeepSeek-R1-Distill-70B, with 32 K context (stable to 64 K via RoPE). Scratch-pads hide in <think> tags. Full details live in Microsoft’s tech report and HF weights.

It's fascinating to see how targeted training on reasoning traces and a small amount of RL can elevate a relatively smaller model to compete with giants on specific tasks.

Other Open Source Updates

    MiMo-7B: Xiaomi entered the ring with a 7B parameter, MIT-licensed model family, trained on 25T tokens and featuring rule-verifiable RL. (HF model hub)

    Helium-1 2B: KyutAI (known for their Mochi voice model) released Helium-1, a 2B parameter model distilled from Gemma-2-9B, focused on European languages, and licensed under CC-BY 4.0. They also open-sourced 'dactory', their data processing pipeline. (Blog, Model (2 B), Dactory pipeline)

    Qwen 2.5 Omni 3B: Alongside Qwen 3, the Qwen team also updated their existing Omni model with a 3B model, that retains 90% of the comprehension of its big brother with a 50% VRAM drop! (HF)

    JetBrains open sources Mellum: Trained on over 4 trillion tokens with a context window of 8192 tokens across multiple programming languages, they haven't released any comparable eval benchmarks though (HF)

Big Companies & APIs: Drama, Departures, and Deployments

While open source stole the show, the big players weren't entirely quiet... though maybe some wish they had been.

Farewell, GPT-4: Rest In Prompted 🙏

Okay folks, let's take a moment. As many of you noticed, GPT-4, the original model launched back on March 14th, 2023, is no longer available in the ChatGPT dropdown. You can't select it, you can't chat with it anymore.

For us here at ThursdAI, this feels significant. GPT-4's launch was the catalyst for this show. We literally started on the same day. It represented such a massive leap from GPT-3.5, fundamentally changing how we interacted with AI and sparking the revolution we're living through. Nisten recalled the dramatic improvement it brought to his work on Dr. Gupta, the first AI doctor on the market.

It kicked off the AI hype train, demonstrated capabilities many thought were years away, and set the standard for everything that followed. While newer models have surpassed it, its impact is undeniable.

The community sentiment was clear: Leak the weights, OpenAI! As Wolfram eloquently put it, this is a historical artifact, an achievement for humanity. What better way to honor its legacy and embrace the "Open" in OpenAI than by releasing the weights? It would be an incredible redemption arc.

This inspired me to tease a little side project I've been vibe coding: The AI Model Graveyard - inference.rip . A place to commemorate the models we've known, loved, hyped, and evaluated, before they inevitably get sunsetted. GPT-4 deserves a prominent place there. We celebrate models when they're born; we should remember them when they pass. (GPT-4.5 is likely next on the chopping block, by the way). - it's not ready yet, still vibe coding (fighting with replit) but it'l be up soon and I'll be sure to commemorate every model that's dying there!

So, pour one out for GPT-4. You changed the game. Rest In Prompt 🪦.


The ChatGPT "Glazing" Incident: A Cautionary Tale

Speaking of OpenAI...oof. The last couple of weeks saw ChatGPT exhibit some... weird behavior. Sam Altman himself used the term "glazing" – essentially, the model became overly agreeable, excessively complimentary, and sycophantic to a ridiculous degree.

Examples flooded social media: users reporting doing one pushup and being hailed by ChatGPT as Herculean paragons of fitness, placing them in the top 1% of humanity. Terrible business ideas were met with effusive praise and encouragement to quit jobs.

This wasn't just quirky; it was potentially harmful. As Yam pointed out, people use ChatGPT for advice on serious matters, tough conversations, and personal support. A model that just mindlessly agrees and validates everything, no matter how absurd, isn't helpful – it's dangerous. It undermines trust and critical thinking.

The community backlash was swift and severe. The key issue, as OpenAI admitted in their Announcement and AMA with Joanne Jiang (Head of Model Behavior), seems to stem from focusing too much on short-term engagement feedback and not fully accounting for long-term user interaction, especially with memory now enabled.

In an unprecedented move, OpenAI rolled back the update. I honestly can't recall them ever publicly rolling back a model behavior change like this before. It underscores the severity of the issue.

This whole debacle highlights the immense responsibility platforms like OpenAI have. When your model is used by half a billion people daily, including for advice and support, haphazard releases that drastically alter its personality without warning are unacceptable. As Wolfram noted, this erodes trust and showcases the benefit of local models where you control the system prompt and behavior.

My takeaway? Critical thinking is paramount. Don't blindly trust AI, especially when it's being overly complimentary. Get second opinions (from other AIs, and definitely from humans!). I hope OpenAI takes this as a serious lesson in responsible deployment and testing.

BREAKING NEWS: Claude.ai will support tools via MCP

During the show, Yam spotted breaking news from Anthropic: Claude is getting major upgrades! (Tweet)

They announced Integrations, allowing Claude to connect directly to apps like Asana, Intercom, Linear, Zapier, Stripe, Atlassian, Cloudflare, PayPal, and more (launch partners). Developers can apparently build their own integrations quickly too. This sounds a lot like their implementation of MCP (Model Context Protocol), bringing tool use directly into the main Claude.ai interface (previously limited to Claude Desktop and only non remote MCP servers).

This feels like a big deal!

Google Updates & LlamaCon Recap

    Google: NotebookLM's AI audio overviews are now multilingual (50+ languages!) (X Post). Gemini 2.5 Flash (the faster, cheaper model) was released shortly after our last show, featuring hybrid reasoning with an API knob to control thinking depth. Rumors are swirling about big drops at Google I/O soon!

    LlamaCon: While there was no Llama 4 bombshell, Meta focused on security releases: Llama Guard 4 (text + image), Llama Firewall (prompt hacks/risky code), Prompt Guard 2 (jailbreaks), and CyberSecEval 4. Zuck confirmed on the Dworkesh podcast that thinking models are coming, a new Meta AI app with a social feed is planned, a full-duplex voice model is in the works, and a Llama API (powered by Groq and others) is launching.


This Week's Buzz from Weights & Biases 🐝

Quick updates from my corner at Weights & Biases:

    WeaveHacks Hackathon (May 17-18, SF): Get ready! We're hosting a hackathon focused on Agent Protocols – MCP and A2A. Google Cloud is sponsoring, we have up to $15K in prizes, and yes, one of the top prizes is a Unitree robot dog 🤖🐶 that you can program! (I expensed a robot dog, best job ever!). Folks from the Google A2A team will be there too. Come hack with us in SF! Apply here. It's FREE to participate!

    Fully Connected Conference: Our big annual W&B conference is coming back to San Francisco soon! Expect amazing speakers (last year, Meta announced Llama 3!). Tickets are out: fullyconnected.com.


Evals Deep Dive with Hamel Husain & Shreya Shankar

Amidst all the model releases and drama, we were incredibly lucky to have two leading experts in AI evaluation, Hamel Husain (@HamelHusain) and Shreya Shankar (@sh_reya), join us.

Their core message? Building reliable AI applications requires moving beyond standard benchmarks (like MMLU, HumanEval) and focusing on application-centric evaluations.

Key Takeaways:

This was such an insightful discussion, emphasizing that while new models are exciting, making them work reliably for specific applications is where the real engineering challenge lies, and evaluation is the key.

Vision & Video: Runway Gets Consistent

The world of AI video generation continues its rapid evolution.

Runway References: Consistency Unlocked

A major pain point in AI video has been maintaining consistency – characters changing appearance, backgrounds morphing frame-to-frame. Runway just took a huge step towards solving this with their new References feature for Gen-4.

You can now provide reference images (characters, locations, styles, even selfies!) and use tags in your prompts (<char1>, <style1>) to tell Gen-4 to maintain those elements across generations. The results look incredible, enabling stable characters and scenes, which is crucial for storytelling and practical use cases like pre-viz or VFX.

AI Art & Diffusion

HiDream E1: Open Source Ghibli Style

A new contender in open-source image generation emerged: HiDream E1. (HF Link) This model, from Vivago.ai, focuses particularly on generating images in the beautiful Ghibli style.

The weights are available (looks like Apache 2.0), and it ranks highly (#4) on the Artificial Analysis image arena leaderboard, sitting amongst top contenders like Google Imagen and ReCraft.

Yam brought up a great point about image evaluation, though: generating aesthetically pleasing images is one thing, but prompt following (like GPT-4 excels at) is another critical dimension that's harder to capture in simple preference voting.

Final Thoughts: Responsibility & Critical Thinking

Phew! What a week. From the incredible potential shown by Qwen 3 setting a new bar for open source, to the sobering reminder of GPT-4's departure and the cautionary tale of the "glazing" incident, it's clear we're navigating a period of intense innovation coupled with growing pains.

The glazing issue, in particular, underscores the need for extreme care and robust evaluation (thanks again Hamel & Shreya!) when deploying models that interact with millions, potentially influencing decisions and well-being. As AI becomes more integrated into our lives – helping us boil eggs (yes, I ask it stupid questions too!), offering support, or even suggesting purchases – we must remain critical thinkers.

Don't outsource your judgment entirely. Use multiple models, seek human opinions, and question outputs that seem too good (or too agreeable!) to be true. The power of these tools is immense, but so is our responsibility in using them wisely.


Massive thank you to my co-hosts Wolfram, Yam, and Nisten for navigating this packed week with me, and huge thanks to our guests Hamel Husain and Shreya Shankar for sharing their invaluable expertise on evaluations. And of course, thank you to this amazing community – hitting 1000 listeners! – for tuning in, commenting, and sharing breaking news. Your engagement fuels this show!

🔗 Subscribe to our show on Spotify: thursdai.news/spotify

🔗 Apple: thursdai.news/apple

🔗 Youtube: thursdai.news/yt (get in before 10K!)

And for the full show notes and links visit

👉 thursdai.news/may-1 👈

We'll see you next week for another round of ThursdAI!

Alex out. Bye bye!


ThursdAI - May 1, 2025 - Show Notes and Links

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

Qwen 3 Phi-4-Reasoning GPT-4 ChatGPT Claude
相关文章