Recursal AI development blog, March 25
🪿 Qwerky-72B and 32B: Training large attention-free models with only 8 GPUs

Qwerky-72B and 32B have been updated and perform strongly across the board. The models are not based on the transformer architecture, surpass GPT-3.5 Turbo, approach GPT-4o-mini, and offer advantages such as low inference cost. Their training method is distinctive, and the post offers a new perspective on where AI models store knowledge.

🎈 Qwerky update: the 72B and 32B models are available on multiple platforms

💪 The models surpass GPT-3.5 Turbo and close in on GPT-4o-mini

📝 Distinctive training method: freeze the original model weights, replace attention with RWKV, and train in stages

🤔 Argues that an AI model's knowledge lives mainly in the FFN layers, not the attention mechanism

We are proud to announce the updated Qwerky-72B and 32B.

Both models are available on Hugging Face and featherless.ai.

Qwerky-72B is the largest model to date that is not based on the transformer attention architecture.

It surpasses existing transformer models on several benchmarks, while following close behind on others.

This builds on our previous QRWKV6 conversion experiments, where we converted the earlier Qwen 2.5 32B model to RWKV, as well as on the previous 72B preview.

This time, we applied the same process to the Qwen-QwQ-32B and Qwen-72B models, respectively.

But let's take a step back and look at what this means…


We now have a model that far surpasses
GPT-3.5 Turbo, without QKV attention,

while steadily closing in on GPT-4o-mini,

with lower inference cost, smaller parameter count, and better performance.

In 2024, when we proposed scaling up RWKV to replace attention, many believed transformer attention was the only viable path to GPT-3.5-level or better intelligence. Today, that has been disproven.

We need no supercluster, only a single server.

Because we keep most of the feed-forward network layers the same, we can perform the conversion (barely) within a single server of 8 MI300 GPUs,

requiring the full 192 GB VRAM allocation per GPU.
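
For a rough sense of where that memory goes, here is a back-of-envelope sketch. Every figure in it (the size of the trainable RWKV layers, a separate frozen teacher copy, bf16 weights, even sharding across GPUs) is an illustrative assumption, not the actual Qwerky training setup:

```python
# Back-of-envelope VRAM sketch for converting a ~72B model on one server of
# 8 x MI300 (192 GB each). All figures are illustrative assumptions.
GB = 1e9

total_params     = 72e9   # frozen base-model weights, kept in bf16
trainable_params = 6e9    # assumed size of the newly inserted RWKV layers
n_gpus, vram_gb  = 8, 192

frozen_weights  = total_params * 2          # bf16 student copy
teacher_weights = total_params * 2          # assumed separate frozen teacher copy
grads           = trainable_params * 2      # bf16 gradients, new layers only
adam_states     = trainable_params * 4 * 2  # fp32 first + second moments

fixed = frozen_weights + teacher_weights + grads + adam_states
per_gpu = fixed / n_gpus / GB               # assuming even sharding across GPUs
print(f"fixed cost per GPU         : {per_gpu:5.1f} GB")
print(f"left for 8k-ctx activations: {vram_gb - per_gpu:5.1f} GB")
# Activation memory (and full-vocab teacher logits) at 8k context consumes most
# of the remainder, which is why the run only barely fits.
```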


How the conversion is done: A summary

While more details will be revealed in an upcoming paper, the core idea is similar to the previous QRWKV6 conversion.

At a high level, you take an existing transformer model,

freeze all of its weights, delete the attention layers, replace them with RWKV, and train the new layers through multiple stages,

all while referencing the original model's logits as a “teacher model”.

More specifically, the conversion proceeds as follows:

Unfortunately, due to VRAM limitations, our training was limited to an 8k context length. However, we view this as a resource constraint, not a method constraint.
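
To make the recipe concrete, below is a minimal, hypothetical PyTorch sketch of the freeze-replace-distill idea on a toy model. It is not the actual Qwerky training code: `SimpleRecurrentMix` is a stand-in for a real RWKV time-mixing kernel, `TinyLM` is a toy stand-in for a model like Qwen 2.5, and the staged training schedule is reduced to a single distillation step.

```python
# Minimal sketch: freeze weights -> replace attention -> distill against teacher logits.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleRecurrentMix(nn.Module):
    """Toy linear-recurrent token mixer (O(1) state per step, no QKV attention)."""
    def __init__(self, dim):
        super().__init__()
        self.decay = nn.Parameter(torch.zeros(dim))   # per-channel decay
        self.key = nn.Linear(dim, dim, bias=False)
        self.value = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                             # x: (batch, seq, dim)
        k, v, w = self.key(x), self.value(x), torch.sigmoid(self.decay)
        state, outs = torch.zeros_like(x[:, 0]), []
        for t in range(x.size(1)):                    # recurrent scan
            state = w * state + (1 - w) * (k[:, t] * v[:, t])
            outs.append(state)
        return self.out(torch.stack(outs, dim=1))


class TinyBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        if isinstance(self.attn, nn.MultiheadAttention):
            h, _ = self.attn(h, h, h, need_weights=False)
        else:                                         # swapped-in recurrent mixer
            h = self.attn(h)
        x = x + h
        return x + self.ffn(self.norm2(x))


class TinyLM(nn.Module):
    def __init__(self, vocab=256, dim=64, depth=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(TinyBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, vocab, bias=False)

    def forward(self, tokens):
        x = self.embed(tokens)
        for block in self.blocks:
            x = block(x)
        return self.head(x)


def convert_and_freeze(model, dim):
    """Freeze every original weight, then swap attention for the recurrent mixer,
    so only the newly inserted mixers remain trainable."""
    for p in model.parameters():
        p.requires_grad_(False)
    for block in model.blocks:
        block.attn = SimpleRecurrentMix(dim)
    return model


def distill_step(student, teacher, tokens, optimizer, temperature=1.0):
    """One training step: match the frozen teacher's logits via KL divergence."""
    with torch.no_grad():
        teacher_logits = teacher(tokens)
    student_logits = student(tokens)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


teacher = TinyLM()
student = convert_and_freeze(copy.deepcopy(teacher), dim=64)
optimizer = torch.optim.Adam(
    [p for p in student.parameters() if p.requires_grad], lr=1e-4)
tokens = torch.randint(0, 256, (2, 16))               # fake token batch
print("distill loss:", distill_step(student, teacher, tokens, optimizer))
```

The real conversion uses actual RWKV kernels and proceeds through multiple training stages; this sketch only captures the overall shape of freezing the original weights, swapping out attention, and training the new layers against the frozen teacher's logits.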


AI knowledge is not in attention, but in the FFN

Given that the converted layers were trained on only 200-500M tokens, we do not believe the newly trained RWKV layers could have been sufficiently trained to hold “knowledge/intelligence” at this level on their own.

In other words, the vast majority of an AI model's knowledge lives not in attention but in the matrix-multiplication FFN (feed-forward network) layers.

It would be more accurate to view the attention mechanism, whether transformer-based or RWKV, as a means of guiding what the model “thinks about” in the FFN layers.
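
One supporting observation, shown in the rough count below with assumed Qwen-2.5-72B-like dimensions (hidden size 8192, SwiGLU intermediate size 29568, grouped-query attention with 64 query and 8 KV heads), is that the FFN holds several times more parameters per layer than attention does:

```python
# Rough per-layer parameter count, using assumed Qwen2.5-72B-like dimensions.
# Illustrative only; check the model config for exact values. Biases ignored.
hidden, inter = 8192, 29568        # hidden size, SwiGLU intermediate size
n_heads, n_kv_heads = 64, 8        # grouped-query attention
head_dim = hidden // n_heads       # 128

attn = (hidden * n_heads * head_dim            # q_proj
        + 2 * hidden * n_kv_heads * head_dim   # k_proj + v_proj
        + n_heads * head_dim * hidden)         # o_proj

ffn = 3 * hidden * inter                       # gate, up, down projections

print(f"attention params per layer: {attn / 1e6:6.1f} M")
print(f"FFN params per layer      : {ffn / 1e6:6.1f} M")
print(f"FFN / attention ratio     : {ffn / attn:.1f}x")
```

Parameter count alone does not prove where knowledge is stored, but it is consistent with the observation that retraining only the token-mixing layers on a few hundred million tokens, while keeping the FFN frozen, preserves most of the model's capability.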


Ideal for large-volume use cases


We can now rapidly iterate on new RWKV architectures at sub-100B scales

By dramatically reducing the compute required to scale and test a new RWKV attention architecture down to a small number of GPUs,

we will be able to test, iterate, and validate new architecture changes faster, turning experiments that previously took weeks (or even months) into days.

Historically, the RWKV group has averaged four major versions across two years, with improvements to both model accuracy and memory capability at every step.

A trend which we plan to accelerate moving forward.

This is all part of our roadmap to Personalized AI, and eventually Personalized AGI, which you can read more about in our following article…
