Artificial Fintelligence, October 22, 2024
Papers I've read this week

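This post covers a few interesting papers the author read recently, spanning routed models, internet exploration, and LLaMa. The author looks at how routing performs across language models of different sizes and notes that routing networks help more in smaller models. The author also describes a paper on internet exploration, which proposes gathering more data by dynamically exploring the internet, and discusses the potential of that approach. Finally, the author shares thoughts on the LLaMa paper released by Facebook and walks through some of its key changes.

😊 **Performance and use of routed models** The paper compares three kinds of routing (sparse MoE, non-parametric HASH, and RL routing) and finds that they follow the same scaling laws, with RL routing doing well. Routing helps more in smaller models. The paper recommends routing for models with fewer than 1.3B parameters, with S-Base as the default routing algorithm, 64 to 128 experts, K = 1 expert, and a routing-layer frequency between 0.5 and 1.

🤖 **The potential of internet exploration** The author describes a paper that collects more data by dynamically exploring the internet and trains a MoCo-v3-based model to learn image representations. The method queries Google Images, ranks the results by similarity to the target dataset, and assigns a reward to each new image. The author thinks this approach could supply more data for LLMs, and that RLHF/RLAIF-style training could be used to teach an LLM to generate the queries.

🧠 **Changes in the LLaMa models** The author shares thoughts on the LLaMa paper released by Facebook. The models perform strongly despite their smaller size: the 13B model outperforms GPT-3 on a number of benchmarks, and the 65B model is competitive with Chinchilla-70B and PaLM-540B. The changes discussed include training on public data, normalizing the input of each Transformer sub-layer, the SwiGLU activation function, and rotary embeddings. The author thinks these changes deserve further study.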

Hey folks! Welcome to my newsletter. I’ve got an article that I’m hoping to drop early in the week about a potential solution to one of the biggest problems I see with LLMs: the lack of data. Until then, I’ve got a post about some of the most interesting papers I’ve read in the last week.

I’m going to try to write a weekly summary of the most interesting papers I’ve read that week. I’d love to hear what papers you’ve been reading, if you agree/disagree about my conclusions for each paper, and/or suggestions for what papers I should read next!


Scaling laws for routed language models


I used to work with Aidan, and way back in 2021, he was insistent that LLMs were the future of AI. I thought he was crazy. In ‘22, he also insisted that conditional routing models were the future. Given how right he was about LLMs, it’s probably worth paying attention to conditional routing models.

The paper provides a great general overview of how routing networks work and performance comparisons (in terms of negative log likelihood over a validation dataset) for the 3 most common routing techniques (sparse MoE, non-parametric HASH, RL routing).
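To make the difference between the routing styles concrete, here is a minimal sketch of two of them: a hash-style router that picks an expert with a fixed function of the token id, and a learned top-1 router of the kind used in sparse MoE layers. The function names, the modulo "hash", and the plain linear gate are all assumptions made for illustration, not the paper's implementation.

```python
import math


def hash_route(token_id: int, num_experts: int) -> int:
    """Non-parametric routing: a fixed function of the token id, nothing is learned.

    The modulo is a stand-in for whatever fixed hash a real system would use.
    """
    return token_id % num_experts


def top1_route(hidden: list[float], gate_weights: list[list[float]]) -> tuple[int, float]:
    """Learned routing: score every expert with a linear gate and keep the best one.

    Returns the chosen expert index and its softmax probability.
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in gate_weights]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]
```

In both cases only the selected expert runs for a given token, which is what keeps the compute per token roughly constant as the number of experts grows.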

The authors trained a large number of conditional routing networks, and fit scaling laws to the results; they find that all 3 techniques follow the same scaling laws, with RL routing doing quite well. I’d be curious to see how much effort has been put into improving RL routing; I suspect that it could be improved significantly.

The authors observed the following results:

    Routing improves the performance of language models across all sizes and variants attempted.

    Training a Routing Network with RL is of comparable effectiveness to state-of-the-art techniques.

    The performance of all Routing Networks is accurately described by scaling laws in the number of experts and in the underlying dense model size.

I was surprised at how similar the performance of the various techniques was. The data is clean, with little variation, and the scaling laws fit it well.

One interesting result was that routing helps significantly more when the model is smaller. I found this surprising; my intuition was that routing should help regardless of size. The effect held across all the models they trained: the benefit from routing shrank as the models grew.
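One way to see how a fitted scaling law can produce this shrinking benefit is to assume the log-loss is bilinear in log N (dense model size) and log E (number of experts), with a positive interaction term. That functional form is an assumption here, and the coefficients below are made up purely for illustration; they are not the paper's fitted values.

```python
import math

# Hypothetical coefficients, chosen only for illustration; not the paper's fitted values.
A, B, C, D = -0.08, -0.03, 0.003, 0.9


def log10_loss(n_params: float, n_experts: int) -> float:
    """Assumed form: log L = A*log N + B*log E + C*log N*log E + D."""
    log_n, log_e = math.log10(n_params), math.log10(n_experts)
    return A * log_n + B * log_e + C * log_n * log_e + D


def routing_gain(n_params: float, n_experts: int) -> float:
    """Fractional loss reduction from routing, relative to the dense model (E = 1)."""
    routed = 10 ** log10_loss(n_params, n_experts)
    dense = 10 ** log10_loss(n_params, 1)
    return 1.0 - routed / dense


for n in (1e7, 1e8, 1e9):
    print(f"N = {n:.0e}: loss drops by {routing_gain(n, 64):.1%} with 64 experts")
```

Under this form the effect of experts is (B + C log N) log E, so with B negative and C positive the gain from adding experts gets smaller as N grows.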

The paper ends with recommendations, which I found really useful:

    Use routing for models with fewer than 1.3B parameters.

    S-Base is a good default routing algorithm (defined in the appendix of their paper).

    Target using E in {64, 128} experts.

    Use K = 1 expert; route layers with a frequency between 0.5 and 1; lower frequency reduces performance.
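As a rough illustration, the recommendations above could be captured in a small config object. `RoutingConfig`, its field names, and the even spacing of routed layers are hypothetical; the paper specifies the frequency, and which exact layers get routed is an assumption made here.

```python
import math
from dataclasses import dataclass


@dataclass
class RoutingConfig:
    num_experts: int = 64         # E in {64, 128}
    experts_per_token: int = 1    # K = 1
    route_frequency: float = 0.5  # fraction of layers that are routed, in [0.5, 1]


def routed_layers(num_layers: int, config: RoutingConfig) -> list[int]:
    """Spread routed layers evenly; a frequency of 0.5 routes every other layer."""
    f = config.route_frequency
    return [i for i in range(num_layers)
            if math.floor((i + 1) * f) > math.floor(i * f)]


print(routed_layers(8, RoutingConfig()))
```

With a frequency of 1.0 every layer is routed; with 0.5 the routed and dense layers alternate.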

Internet explorer


In this paper, the authors create an agent which dynamically explores the internet, running text queries to find images to use for self-supervised training. While seemingly designed to directly antagonize Yudkowsky, the paper is extremely interesting, and presents, to me, a potential future direction for AGI research. As Chinchilla showed us, LLMs could massively improve with more data. Having agents dynamically explore the internet is one excellent way to get more data, especially if they’re able to adaptively learn over time and prioritize images accordingly.

In the paper, they train a model to learn representations of images based on MoCo-v3. They query Google Images for new images, rank the query results by similarity to the target dataset, and assign each new image a reward.

In the reward, S_cos is the cosine similarity, f_k is the image encoder, D := {x_i} is the target dataset, and y is the new image to evaluate. The reward is computed over the k closest neighbours of y in the target dataset, where “closest” is determined by the encoded representation.
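The reward equation itself did not survive in this copy. Going by the definitions above, a plausible reading is the mean cosine similarity between the new image and its k nearest neighbours in the target dataset; the sketch below assumes exactly that, and also assumes the embeddings f_k(x_i) and f_k(y) have already been computed. The function names are illustrative.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """S_cos: dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def reward(new_embedding: list[float],
           target_embeddings: list[list[float]],
           k: int) -> float:
    """Assumed reward: mean similarity to the k most similar target embeddings."""
    sims = sorted(
        (cosine_similarity(new_embedding, t) for t in target_embeddings),
        reverse=True,
    )
    top_k = sims[:k]
    return sum(top_k) / len(top_k)
```

An image that looks like at least a handful of target images scores highly, while one that is unrelated to everything in the target dataset scores close to zero or below.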

They create the queries for Google Images by sampling them from a static vocabulary dataset. For the sampling process, they estimate the reward associated with the query using a Gaussian process regression, and choose the queries with the highest reward.
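The query-selection step can be sketched with a small zero-mean Gaussian process written from scratch. Everything specific here is an assumption for illustration: queries are represented as embedding vectors, the kernel is an RBF with unit length scale, the noise level is arbitrary, and the selection is purely greedy on the posterior mean (it ignores the GP's uncertainty).

```python
import math


def rbf(a, b, length_scale=1.0):
    """RBF kernel between two query embeddings."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2 * length_scale ** 2))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[row][j] -= factor * aug[col][j]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][j] * x[j] for j in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def gp_posterior_mean(seen, rewards, candidates, noise=1e-3):
    """Posterior mean: K(cand, seen) @ (K(seen, seen) + noise * I)^-1 @ rewards."""
    gram = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(seen)]
            for i, a in enumerate(seen)]
    alpha = solve(gram, rewards)
    return [sum(rbf(c, s) * w for s, w in zip(seen, alpha)) for c in candidates]


def pick_queries(seen, rewards, candidates, top_n):
    """Return indices of the candidates with the highest predicted reward."""
    means = gp_posterior_mean(seen, rewards, candidates)
    ranked = sorted(range(len(candidates)), key=lambda i: means[i], reverse=True)
    return ranked[:top_n]
```

The appeal of a GP here is that reward observed for one query informs the estimate for nearby, not-yet-tried queries, so the agent does not have to search every vocabulary entry to learn where the useful images are.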

I’d be really interested to see a fancier query generation process. One idea that comes to mind would be using RL to train an LLM to generate queries in a manner similar to what’s done in RLHF/RLAIF, i.e. use an RL algorithm like PPO to finetune a pretrained LLM to maximize reward. This would require much more compute.

LLaMa


I’ve been digesting the LLaMa paper that Facebook released this week. It was very interesting to see the performance increases they got despite the size decreases. Their 13B model outperformed GPT-3 on a number of benchmarks, and their 65B model was competitive with Chinchilla-70B and PaLM-540B (!).

I did find it incredibly frustrating that they stopped training when they did; their loss curves all look pretty far from convergence, and I’m curious to see how much the models would continue to improve.

I wish that they had just left it training.

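My biggest question about the paper is what actually caused the improvements. They discuss a few major changes compared to GPT-3, which their model is based on:

    Training on publicly available data.

    Pre-normalization: normalizing the input of each transformer sub-layer.

    The SwiGLU activation function.

    Rotary embeddings.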

There was no ablation study, unfortunately. If I can scrounge up the GPUs, I’m tempted to do my own ablation based on nanoGPT. They also use FlashAttention, which I suspect will become the default attention implementation used in LLMs going forward.
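For anyone planning a similar ablation, two of the architectural changes the paper discusses, pre-normalization and the SwiGLU feed-forward block, are small enough to sketch in plain Python. This is a toy version on lists of floats rather than tensors; the weight names are made up, and RMSNorm is used as the normalization on the assumption that it matches the paper's choice.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root-mean-square, then scale by a learned per-feature gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def silu(v):
    """SiLU / Swish: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])


def pre_norm_block(x, gain, w_gate, w_up, w_down):
    """Pre-normalization: normalize the sub-layer's input, then add the residual."""
    update = swiglu_ffn(rms_norm(x, gain), w_gate, w_up, w_down)
    return [a + b for a, b in zip(x, update)]
```

An ablation would swap each piece back individually, for example moving the normalization to the sub-layer output or replacing the gated feed-forward block with a plain two-layer one, and compare loss at a fixed compute budget.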

And that’s it! Thanks for reading. If you have thoughts on any of these, or interesting follow up papers, I’d love to hear them.

1. I haven’t seen a great paper comparing various positional encoding schemes. I don’t really understand which are better, or whether any advantage holds generally or only in certain scenarios. A proper positional encoding ablation study is on my list of experiments to do once I can scrounge up GPUs.
