GPT-2 Sometimes Fails at IOI


Published on August 14, 2024 11:24 PM GMT

tl;dr: For Lisa, GPT-2 does not do IOI. GPT-2 fails to perform the IOI task on a significantly nonzero fraction of names used in the original IOI paper.

Code for this post can be found at https://github.com/ronakrm/ioi-enumerate.

Unintentionally continuing the trend of "following up" on the IOI paper, I ran GPT-2 Small on all possible inputs that fit the original BABA templates, PLACE/OBJECT tokens, and set of names for Subjects and Indirect Objects. This results in 9 million strings, and instead of just looking at the mean logit diff between the subject and indirect object tokens, let's look at the distribution.

The resulting distributions look pretty decent, but there's obviously some mass below zero! For what percentage of the ~9 million inputs does GPT-2 incorrectly assign the Subject, rather than the Indirect Object, the higher logit? 1.348%, or about 125,000 of the ~9 million sentences!
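For a sense of how such a sweep works, here's a minimal sketch using TransformerLens. This is not the exact pipeline from the repo: the template is a single BABA example, and the name/place/object lists are truncated stand-ins for the full sets.

```python
import itertools

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

# One BABA-order template: the repeated name B is the Subject, A is the IO.
TEMPLATE = "Then, {B} and {A} went to the {PLACE}. {B} gave a {OBJECT} to"
NAMES = ["Mary", "John", "Lisa", "Katie"]  # stand-ins for the full IOI name set
PLACES = ["store", "park"]
OBJECTS = ["drink", "ring"]

rows = []
with torch.no_grad():
    for A, B in itertools.permutations(NAMES, 2):
        for place, obj in itertools.product(PLACES, OBJECTS):
            prompt = TEMPLATE.format(A=A, B=B, PLACE=place, OBJECT=obj)
            logits = model(model.to_tokens(prompt))[0, -1]  # next-token logits
            diff = (logits[model.to_single_token(" " + A)]
                    - logits[model.to_single_token(" " + B)])
            rows.append({"subject": B, "io": A, "place": place,
                         "object": obj, "logit_diff": diff.item()})

# A negative logit diff means the model prefers the Subject over the IO.
neg = sum(r["logit_diff"] < 0 for r in rows)
print(f"{neg}/{len(rows)} prompts ({neg / len(rows):.3%}) with negative logit diff")
```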

We can dig a bit deeper and try to identify whether the model consistently fails on some structured subset of the data. We can find these subsets by looking at the conditional means and flagging the ones that are furthest from either the global mean or the mean when that condition is inverted. In other words, we can split our data into groups where the Subject is X versus not X, the IO is X versus not X, and so on, and then sort by the mean difference between each group and its complement (check out the notebook in the repo; a rough sketch of the idea follows).
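Concretely, that heuristic might look something like this. This is a sketch, assuming the per-prompt results from the sweep above live in a pandas DataFrame; the column names `subject`, `io`, `place`, `object`, and `logit_diff` are my own, not necessarily the repo's.

```python
import pandas as pd

df = pd.DataFrame(rows)  # per-prompt results from the sweep above

def conditional_gaps(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """For each value x of `column`, compare the mean logit diff on the
    group where column == x against the mean on its complement."""
    out = []
    for x in df[column].unique():
        mask = df[column] == x
        out.append({
            column: x,
            "mean": df.loc[mask, "logit_diff"].mean(),
            "std": df.loc[mask, "logit_diff"].std(),
            "alt_mean": df.loc[~mask, "logit_diff"].mean(),
            "alt_std": df.loc[~mask, "logit_diff"].std(),
        })
    res = pd.DataFrame(out)
    # Most suspicious groups first: conditional mean far below its complement.
    return res.assign(gap=res["mean"] - res["alt_mean"]).sort_values("gap")

# First pass over Subjects, then a second pass over IOs within one subset:
print(conditional_gaps(df, "subject").head())
print(conditional_gaps(df[df["subject"] == "Lisa"], "io").head())
```

This mirrors the two-pass procedure described next; the notebook in the repo has the exact version.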

If we restrict our data to this subset and run the procedure again, we find that in a large portion of cases where the Subject is Lisa and the Indirect Object is Katie, GPT-2 Small fails to perform the IOI task correctly. In fact, a number of other Indirect Object names also consistently perform poorly when the Subject is Lisa:

| IO       | mean     | std      | alt_mean | alt_std  |
|----------|----------|----------|----------|----------|
| Katie    | 0.017770 | 1.187264 | 1.817443 | 1.349228 |
| Alicia   | 0.196236 | 1.065318 | 1.815603 | 1.352604 |
| Michelle | 0.206026 | 0.938098 | 1.815502 | 1.353694 |
| Samantha | 0.232368 | 1.106246 | 1.815231 | 1.352706 |
| Lindsay  | 0.275709 | 0.938980 | 1.814784 | 1.354523 |

The notebook and other code in the repository contain a bit more exploration, and are reasonably easy to run and extend, so feel free to poke around!

A Quick Check on Larger Models

For GPT-2 Medium, the number of examples with a negative logit difference is 4143, or 0.044% of all ~9M samples.

And for GPT-2 Large, 5986, or 0.064% of all ~9M samples.

For both of these, slicing by obvious first-order dataset groups did not reveal anything interesting (check the notebook in the repo).
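Rerunning the sweep on these larger checkpoints should only require swapping the model name in the earlier sketch (assuming the standard TransformerLens checkpoint names):

```python
# Hypothetical extension of the earlier sketch to larger GPT-2 checkpoints.
for name in ["gpt2-medium", "gpt2-large"]:
    model = HookedTransformer.from_pretrained(name)
    # ...rerun the enumeration above and recount negative logit diffs...
```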

Final Thoughts

When we can, we should brute-force all inputs that make reasonable sense and look at the full distribution. I'm becoming more generally interested in bounding worst-case behaviors as a safety angle: this is one toy setup where the worst case is not handled correctly. If your name is Lisa or Katie you may feel this more concretely, let alone if your name is uncommon, non-Western, or multi-token. As we worry more and more about extreme tail-risk failure modes, it's a good idea to keep things like this in mind, and perhaps ideas from fairness and more mainstream machine learning may serve as good "model organisms" for demonstrating and studying these failure modes.

I think it's good to worry about these kinds of issues as we attempt to scale interpretability approaches to large models, and I'm glad that new approaches for ensuring the robustness and faithfulness of interpretability results are becoming more popular. Specifically, I'm excited that work like Hypothesis Testing the Circuit Hypothesis in LLMs and Transformer Circuit Faithfulness Metrics are not Robust is becoming a bit more mainstream; I share a lot of their views and am optimistic about seeing this area grow!

I'd had the idea for this exploration for a while, drafted it as part of a MATS application, and subsequently wrote it up during MATS 6.0 (Summer 2024).


