Society's Backend · July 16, 00:18
Transformers Ain't It

This is the weekly Machine Learning for Software Engineers roundup, covering courses, events, jobs, and other resources. It focuses on the limitations of the Transformer architecture for general intelligence and the potential of newer architectures such as diffusion models and state space models, and it recommends a blog post by Cartesia's Chief Scientist and Co-Founder analyzing the trade-offs of state space models.

🔍 The Transformer architecture has limitations for general intelligence, such as difficulty handling long context windows and a tendency to fall into repetitive loops, which constrain its potential.

🚀 Diffusion models and state space models are two newer architectures aimed at addressing the Transformer's shortcomings. Diffusion models gain speed by processing an entire block of input at once, while state space models process data in linear time for better efficiency.

📚 A blog post by Cartesia's Chief Scientist and Co-Founder analyzes the trade-offs of state space models and highlights their relevance to general intelligence.

🧠 Machine learning experts have begun exploring other architectures, from François Chollet's new company Ndea to new models developed at Google, Carnegie Mellon, and other institutions, in pursuit of more efficient general intelligence.

🎓 The article also recommends several learning resources, including Hierarchical Networks (H-Nets), Anthropic's courses, and work on robust multimodal large language models, to help engineers improve their machine learning skills.

Welcome to Machine Learning for Software Engineers! Each week I curate insights and resources specifically for software engineers learning AI. This article includes learning resources, interesting things that happened this week, and job openings that are helpful and important for software engineers learning AI. Subscribe to get these in your inbox each week.

If you find Machine Learning for Software Engineers helpful, consider becoming a paid subscriber. For a limited time you can lock in a rate of just $3/month, forever.

Get 40% off forever

If you missed us last week, go back and get a refresher! There is an important lesson for all engineers here:

Transformers may only get us so far

We’ve got a video this week! If you want to watch/listen to the most important lesson this week, check the video above.

We constantly see people claiming that LLMs won't get us to AGI. What's interesting is that most of these arguments are about scaling: the claim is that LLMs won't keep improving as we give them more resources, and that's why they won't get us to AGI.

Putting aside the fact that there isn't a proper definition of AGI (so nobody actually knows what it means to reach it), LLMs are fundamentally limited in many ways, but scaling doesn't seem to be one of them.

The data shows that we keep finding ways to scale LLMs: more compute, more training time, and more resources at inference time. They don't appear to be hitting a wall with scaling.

What's much more interesting about LLMs in the context of AGI is whether the Transformer architecture itself is capable of general intelligence. We've seen very successful multimodal LLMs (even though they should really be called large multimodal models, in my opinion) that can take in almost any medium as input tokens and produce almost any medium as output.

But we've seen some cracks in the transformer architecture and what it's capable of: trouble making good use of long context windows and a tendency to fall into repetitive loops, among other issues.

Fundamentally, there are limits to what transformer-based models can do with the data they're given. We've already seen machine learning and AI experts start looking in other directions to reach properly generalized intelligence, from François Chollet's new company Ndea to new architectures coming out of Google, Carnegie Mellon, and elsewhere.

The two directions I really want to focus on here are diffusion models and state space models (SSMs). Both take different approaches to the same generative AI problems that transformer-based LLMs tackle, and both try to fix a fundamental deficiency of the transformer architecture.

SSMs try to handle the same generative AI tasks in linear time, instead of the quadratic compute and memory cost of the Transformer's self-attention mechanism.
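
To make that complexity difference concrete, here's a minimal toy sketch of my own (not any particular model's implementation): self-attention compares every token against every other token and materializes an n × n score matrix, while an SSM-style recurrence carries a fixed-size state through the sequence in a single pass.

```python
import numpy as np

def attention_scores(x):
    # Self-attention compares every token with every other token,
    # materializing an (n, n) score matrix: O(n^2) compute and memory.
    n, d = x.shape
    q, k = x, x  # toy "projections" (identity), for illustration only
    return (q @ k.T) / np.sqrt(d)

def ssm_scan(x, a=0.9):
    # An SSM-style recurrence updates a fixed-size state once per token:
    # h_t = a * h_{t-1} + x_t, a single O(n) pass over the sequence.
    h = np.zeros(x.shape[1])
    outputs = []
    for x_t in x:
        h = a * h + x_t
        outputs.append(h.copy())
    return np.stack(outputs)

x = np.random.randn(1024, 64)      # 1024 tokens, 64-dim embeddings
print(attention_scores(x).shape)   # (1024, 1024): grows quadratically with length
print(ssm_scan(x).shape)           # (1024, 64): grows linearly with length
```

Real SSMs learn their state-transition parameters rather than using a fixed scalar, but the scaling intuition is the same.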

Diffusion models try to handle the same generative AI task by working on an entire block of text at once instead of producing it token by token. That gives them the potential to generate output faster, to avoid the issues transformer models have with long context, and to avoid the transformer tendency to fall into repetitive loops or produce near-identical outputs over and over.
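
Here's a similarly rough sketch of the decoding difference (the `step_fn` and `denoise_fn` callables are hypothetical placeholders, not a real API): an autoregressive transformer produces one token per forward pass, while a diffusion-style decoder starts from noise over the whole block and refines every position in parallel for a fixed number of steps.

```python
import numpy as np

def autoregressive_generate(step_fn, length):
    # Transformer-style decoding: one token per forward pass, left to right,
    # each step conditioning on everything generated so far.
    tokens = []
    for _ in range(length):
        tokens.append(step_fn(tokens))
    return tokens

def diffusion_generate(denoise_fn, length, steps=8):
    # Diffusion-style decoding: start from noise over the whole block and
    # refine all positions in parallel for a fixed number of steps.
    block = np.random.randn(length)
    for t in reversed(range(steps)):
        block = denoise_fn(block, t)
    return block

# Toy stand-ins for real models, just to show the call pattern.
print(autoregressive_generate(lambda prev: len(prev), length=8))
print(diffusion_generate(lambda block, t: 0.5 * block, length=8).round(3))
```

The number of refinement steps is fixed and independent of the block length, which is where the speed advantage comes from.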

I won't go any deeper into transformer-based architectures versus SSMs versus diffusion models here, because I think it warrants its own article where we can add graphics and dive in a bit further. If you want me to cover that, let me know in the comments.

The most important resource to read this week is the blog post by Cartesia's Chief Scientist and Co-Founder analyzing the trade-offs of SSMs. I think SSMs are flying under the radar and not enough people talk about them. While I think SSMs in particular are important, it's even more important to realize that maybe Transformers aren't the future. That has both hardware and software implications.

For now, enjoy the rest of these resources! We've got some great learning resources and awesome things that happened this week.

Particularly interesting are:

Learning resources to help you become a better engineer

Hierarchical Networks (H-Nets): H-Nets address the limitation of standard AI models processing all inputs uniformly, which often leads to failures with high-resolution raw data or unreliable pre-processing. They incorporate hierarchy natively through dynamic chunking and compression across an encoder, main network, and decoder, enabling more robust handling of complex inputs and direct learning from raw sources—essential for engineers pushing model boundaries.


Skilljar for Anthropic Courses: Skilljar serves as the learning management system for Anthropic's AI content, providing interactive experiences, progress tracking, and a central hub for resources. It's a practical tool for engineers seeking structured education on advanced topics like model development and safety.


Robust Multimodal Large Language Models Against Modality Conflict: This paper identifies "modality conflict"—clashing information across inputs like text and images—as a primary cause of hallucinations in multimodal LLMs. It introduces the MMMC dataset for evaluation and compares mitigation strategies: prompt engineering, supervised fine-tuning, and reinforcement learning, with RL proving most effective at reducing errors. Valuable for engineers building dependable multimodal applications in imperfect real-world conditions.


Weighted Multi-Prompt Learning with Description-free Large Language Model Distillation: DeMul bypasses the inconsistency of text descriptions by directly distilling LLM knowledge into weighted vector prompts for vision-language models. It achieves superior results on 11 recognition datasets through enhanced semantic capture, making it a go-to technique for engineers fine-tuning VLMs efficiently without extensive data.


Three Mighty Alerts for Hugging Face Infra: Hugging Face relies on targeted alerts for production stability, such as monitoring high NAT Gateway throughput to catch cloud traffic issues early. These practices are insightful for engineers scaling ML deployments and maintaining reliable systems.


Other interesting things

Reachy Mini Robot Kit: Reachy Mini is an accessible, 11-inch open-source robot programmable in Python. The $299 lite version features a wide-angle camera, speaker, and six-degree-of-freedom head; the $449 wireless adds a Raspberry Pi 5, WiFi, and battery. It's an engaging resource for engineers exploring AI in hardware, from vision to simple automation.


OpenAI's Security Clampdown: Following allegations of model copying by competitors like DeepSeek, OpenAI implemented measures including "information tenting" for restricted access, biometric controls, offline isolation of proprietary tech, a deny-by-default internet policy, and enhanced data center security. This reflects the growing emphasis on protecting intellectual property in the competitive AI space.


Meta Nabs Apple’s AI Models Head: Meta recruited Ruoming Pang, Apple's lead on foundation models powering Apple Intelligence, to bolster its superintelligence unit—part of a broader talent acquisition from places like DeepMind. It highlights the intense competition for expertise driving innovations in advanced and on-device AI.


Against AI ‘Brain Damage’: The study "Your Brain on ChatGPT" reveals that using AI for tasks like essay writing reduces engagement and retention, though it doesn't cause literal brain damage—mirroring historical concerns over technologies like calculators. It's a reminder for engineers to use AI mindfully to preserve cognitive skills in development work.


AI for Mental Health Initiatives: Initiatives include a field guide from Grand Challenges Canada and McKinsey for scaling evidence-based interventions, alongside Google, DeepMind, and Wellcome's multi-year funding for research into anxiety, depression, and psychosis. These efforts promise more effective, AI-enhanced mental health support.


Google Earth’s Journey: Starting as enterprise software at $800/year, Google Earth saved 4,000 lives during Hurricane Katrina and aided ecosystem discoveries; now, generative AI is evolving it into a predictive tool for planetary changes. It's an inspiring example of AI's transformative potential in practical applications.


AI Probing Human Minds: Research in Nature fine-tuned a large language model on 160 psychology experiments to predict behaviors, establishing a "foundation model of human cognition." This intersection of AI and psychology offers engineers new ways to understand and improve system designs inspired by human thought.


AI Training for Teachers: OpenAI partnered with the American Federation of Teachers for a five-year, $10 million National Academy for AI Instruction, aiming to train 400,000 K-12 educators with support from Microsoft and Anthropic. It's a forward-thinking move to integrate AI responsibly in education.


xAI’s Grok 4 Launch: Grok 4 from xAI supports image analysis and questions, with claims of PhD-level performance across subjects, available via a $300/month SuperGrok Heavy subscription. It represents exciting progress in making sophisticated AI more approachable.


Open vs. Closed Models: Executives from GM, Zoom, and IBM explored trade-offs in open, closed, or hybrid AI models, considering factors like cost, performance, trust, and safety. IBM's model-switching gateway suggests hybrids as a balanced option for enterprise engineers.


Microsoft AI Savings Post-Layoffs: Microsoft's AI tools generated $500 million in call center savings last year, announced shortly after 15,000 layoffs amid $26 billion quarterly profits. It illustrates the dual impact of AI on efficiency and employment.


Forget AGI, Chase GPAI: Rather than distant AGI, focus on General-Purpose AI (GPAI) as an achievable target, with a roadmap emphasizing versatility and integration for immediate, tangible benefits in engineering applications.


Kimi K2 Model: Kimi K2 is a mixture-of-experts model with 1 trillion parameters (32 billion active), trained on 15.5 trillion tokens using the Muon optimizer. It stands out in knowledge, reasoning, coding, and agentic capabilities like tool use, ideal for cutting-edge engineering tasks.


Quick Software Building Tips: Development involves inevitable trade-offs between time and quality—target an 8/10 level on schedule, addressing minor issues later. This approach aligns well with the fast-paced iterations common in AI engineering.


Meta Acquires Play AI: Meta’s acquisition of Play AI brings human-like voice generation and an intuitive platform, with the team enhancing Meta's AI characters, wearables, and audio features, advancing more lifelike interactions.


Multi-Agent Systems: Beyond single agents, multi-agent systems involve multiple entities coordinating or competing in shared environments to solve intricate problems, offering engineers a powerful way to scale AI for distributed challenges.


Negative 2K Lines Story: In 1982, the Lisa software team decided to track developer velocity by the number of lines of code added, requiring developers to log their line counts each week. To prove how silly that metric was, one developer focused on removing code to streamline the codebase and wrote -2000 on his weekly report.


Supervising AI Insight: Effective oversight of AI tools or teams requires the ability to perform the task yourself, even if infrequently—like managers retaining coding skills. It ensures grounded, practical guidance in engineering contexts.


Jobs

Interestingly, Google engineers have been seeking developers on X/Twitter instead of through standard channels. I think recruiting like this will only become more common, which makes it even more important for you to be learning and building in public.

Roles on the Jules Team at Google Labs: Google Labs is hiring for the Jules team, focused on AI agents and developer tools. Positions include front-end/full-stack (React ecosystem), macOS/iOS, and product designer. Must have experience building for developers. They are looking for someone in the PST time zone.


AI Developer Frameworks and Tools at Google: Looking for experts in TypeScript or Python to work on next-generation AI developer frameworks and tools at Google. The role sits at the intersection of foundational research and engineering, applying research to real developer products with plenty of freedom and opportunities. If interested, DM @rakyll with resumes (Googlers can reach out internally).


That's all for jobs this week. I'm working out a more useful way to share jobs in these roundups, coming soon!

Thanks for reading!

Always be (machine) learning,

Logan
