How are Those AI Participants Doing Anyway?

 

This article examines the feasibility of using large language models (LLMs) in place of human participants in social science research. Some studies suggest that LLMs perform well in certain psychology experiments, mimicking human moral judgments, voting tendencies, and behavioral patterns. Other researchers, however, are concerned about the practice, warning of bias, reproducibility problems, and the risk of a spreading scientific monoculture. The article analyzes the relevant experiments and the ensuing debate among researchers, with the aim of assessing how common LLMs will be as proxy participants in social science research over the next two to five years.

🤔 Moral judgment: Dillion et al. found a correlation of 0.95 between LLM judgments of moral scenarios and average human judgments, suggesting that LLMs can, to some extent, simulate human moral intuitions. To rule out data leakage, the researchers compared the LLM's autocompletions of moral-scenario prefixes against the actual completions in the dataset; the two differed, making leakage unlikely.

🗳️ Vote prediction: Argyle et al. showed that LLMs can predict voters' choices with some accuracy and can reflect the characteristics of human subpopulations. In a free-form partisan-text experiment, human judges could not reliably distinguish human-written from LLM-generated text, and in the vote-prediction task the LLM's predictions broadly matched actual voting outcomes.

🎭 Turing Experiments: Aher et al. used the Turing Experiment framework to assess how "human-like" LLMs are. Larger models behaved similarly to humans in the ultimatum game, on garden-path sentences, and in a replication of the Milgram experiment. On the knowledge quiz, larger models answered the questions correctly while smaller models did not.

📊 Opinion surveys: Santurkar et al. found that the default opinion distribution of LLMs does not match that of the overall US population. Prompting an LLM to emulate a specific demographic group improves its representativeness somewhat, but does not close the gaps between group-level opinion distributions. LLMs also show low consistency of opinion across topics.

🎮 Game theory: Fan et al. tested how rationally LLMs behave in game-theoretic games. In the dictator game, LLMs made allocations according to the "desires" they were assigned, and in rock-paper-scissors their performance depended on the opponent's strategy.

Published on January 24, 2025 10:37 PM GMT

[TL;DR] Some social science researchers are running psychology experiments with LLMs playing the role of human participants. In some cases the LLMs perform their roles well. Other social scientists, however, are quite concerned and warn of perils including bias, reproducibility problems, and the potential spread of a scientific monoculture.

Introduction

I first heard about "replacing human participants with AI agents in social science experiments" in an AI Safety course. It was in passing during a paper presentation, but upon hearing this I had an immediate and visceral feeling of unease. There's no way that would work, I thought. At the time there was already a lively discussion, so I didn't get a chance to ask the presenters any questions though I had many. How were the experiments implemented? How did the researchers generate a representative population? What were the outcomes? Instead, the conversation moved on.

The second time was while doing background reading for AISI, and here I was able to mull over the idea more carefully. Now it no longer seems so absurd. After all, in mathematics, the law of large numbers and the central limit theorem say that the average of many independent and identically distributed (i.i.d.) random variables concentrates around the true mean and is well approximated by a single Normal random variable with that mean and a correspondingly shrunken variance, and, as Francis Galton was loath to discover, the wisdom of the crowd could accurately estimate the weight of an ox where an individual could not. It seems possible that the foremost models would be able to imitate the "average" person, and that this would be sufficient for some social science research, to say nothing of the immediate savings in time and money that researchers would reap if they were to employ silicon sampling [2].
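
To state the two facts being gestured at precisely: for i.i.d. random variables X_1, ..., X_n with mean mu and finite variance sigma^2,

    \bar{X}_n \;=\; \frac{1}{n}\sum_{i=1}^{n} X_i \;\xrightarrow{\text{a.s.}}\; \mu \qquad \text{(law of large numbers)}

    \sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr) \;\xrightarrow{d}\; \mathcal{N}\!\left(0,\sigma^2\right) \qquad \text{(central limit theorem)}

so for large n the crowd average behaves approximately like a single Normal random variable with mean mu and variance sigma^2/n: unbiased and increasingly concentrated.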

The remainder of this article is not a survey; many already exist [16, 17]. Instead, I would like to provide the necessary background to justify my answer to the following question: in the next two to five years, how common will it be to use LLMs as proxy participants in social science research? First, we will go over some experiments that explore the feasibility of the idea. Then we will take an in-depth look at the discussions among researchers that these experiments have prompted.

What's Been Done

Several recent works explore whether AI agents can behave rationally [1, 7] and morally [6] in ways comparable to their human counterparts, and whether they hold opinions representative of the population [2, 12].

Dillion et al. [6] investigated the similarity between human and LLM judgments on 464 moral scenarios drawn from previous psychology research, reporting a correlation of 0.95 between the model's ratings and the average human ratings.
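
The core analysis is easy to sketch: average the human ratings for each scenario and correlate them with the model's ratings. The snippet below is a minimal illustration of that comparison, not Dillion et al.'s code; the 1-5 rating scale, rater count, and synthetic data are assumptions for the toy example.

    import numpy as np

    def moral_alignment(human_ratings: np.ndarray, llm_ratings: np.ndarray) -> float:
        """Pearson correlation between mean human ratings and LLM ratings.

        human_ratings: (n_scenarios, n_raters) array of per-rater moral judgments
                       (e.g., 1-5 "how wrong is this?" scores -- scale is assumed).
        llm_ratings:   (n_scenarios,) array of the model's rating for each scenario.
        """
        human_mean = human_ratings.mean(axis=1)          # average judgment per scenario
        r = np.corrcoef(human_mean, llm_ratings)[0, 1]   # off-diagonal entry is Pearson r
        return float(r)

    # Toy usage: 464 scenarios, 30 hypothetical human raters each.
    rng = np.random.default_rng(0)
    humans = rng.integers(1, 6, size=(464, 30)).astype(float)
    llm = humans.mean(axis=1) + rng.normal(0, 0.3, size=464)  # stand-in for model outputs
    print(round(moral_alignment(humans, llm), 2))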

Argyle et al. [2] explored the use of language models as proxies for specific human subpopulations in vote prediction. They define the degree to which a model can accurately reflect human subpopulations as its algorithmic fidelity. At a bare minimum, they wanted model responses to meet four criteria:

    1. (Social Science Turing Test) Be indistinguishable from human responses.
    2. (Backward Continuity) Be consistent with the conditioning context[1] of the input, i.e., humans can infer key elements of the input by reading the responses.
    3. (Forward Continuity) Be natural continuations of the context provided, i.e., they reflect the form, tone, and content of the input.
    4. (Pattern Correspondence) Reflect underlying patterns found in human responses. These include relationships between ideas, demographics, and behaviors.
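
To make the silicon-sampling setup concrete, here is a minimal sketch assuming an OpenAI-style chat client; the backstory fields, prompt wording, and model name are illustrative placeholders rather than Argyle et al.'s actual materials.

    from openai import OpenAI  # assumes an OpenAI-style chat client; any LLM API would do

    client = OpenAI()

    def backstory_prompt(profile: dict) -> str:
        """Turn one ANES-style demographic profile into a first-person backstory.

        The fields and wording are illustrative, not the conditioning text from [2].
        """
        return (
            f"I am a {profile['age']}-year-old {profile['gender']} from {profile['state']}. "
            f"Racially, I identify as {profile['race']}. Ideologically, I am {profile['ideology']}. "
            f"In the upcoming presidential election, I will vote for"
        )

    def silicon_sample(profiles: list[dict], model: str = "gpt-4o-mini") -> list[str]:
        """Query the model once per backstory and collect its completions."""
        responses = []
        for profile in profiles:
            completion = client.chat.completions.create(
                model=model,  # placeholder model name
                messages=[{"role": "user", "content": backstory_prompt(profile)}],
                max_tokens=5,
                temperature=1.0,  # sample, rather than take only the most likely continuation
            )
            responses.append(completion.choices[0].message.content.strip())
        return responses

Drawing the profiles from the distribution of a demographic survey such as ANES is what makes the resulting "sample" representative in their sense; each query conditions the model on exactly one backstory.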

Aher et al. [1] investigate the ability of LLMs to be "human-like," which they define using the Turing Experiment (TE) framework. Similar to a Turing Test (TT), in a TE a model is first prompted with demographic information (e.g., name, gender, race) and then asked to answer questions or behave as the simulated individual would; in practice this amounts to specifying a title, Mr. or Ms., and a surname drawn from a "racially diverse" pool built from 2010 US Census data. Their TEs replicate classic studies such as the ultimatum game, garden-path sentences[3], and the Milgram shock experiment.
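
In outline, a TE loops over many simulated participants and collects one response per persona. The sketch below builds ultimatum-game prompts in that style; the name pool and wording are stand-ins rather than Aher et al.'s prompts.

    import random

    # Placeholder name pool -- Aher et al. [1] draw surnames from 2010 US Census data
    # to get a racially diverse set; these few are purely illustrative.
    TITLES = ["Mr.", "Ms."]
    SURNAMES = ["Garcia", "Nguyen", "Smith", "Washington", "Patel", "Kim"]

    def ultimatum_prompt(responder: str, proposer: str, offer: int, total: int = 100) -> str:
        """Prompt for the responder's accept/reject decision in an ultimatum-game TE."""
        return (
            f"{proposer} is given ${total} and must propose a split with {responder}. "
            f"{proposer} offers {responder} ${offer}, keeping ${total - offer}. "
            f"If {responder} accepts, both keep their shares; if {responder} rejects, both get nothing.\n"
            f"{responder} decides to"
        )

    def simulate_participants(n: int, offer: int) -> list[str]:
        """Build one prompt per simulated responder; each would be sent to the LLM."""
        prompts = []
        for _ in range(n):
            responder = f"{random.choice(TITLES)} {random.choice(SURNAMES)}"
            proposer = f"{random.choice(TITLES)} {random.choice(SURNAMES)}"
            prompts.append(ultimatum_prompt(responder, proposer, offer))
        return prompts

Sweeping the offer amount and tallying accept/reject rates across the simulated participants is what then gets compared against the human behavior reported in the original studies.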

Santurkar et al. [12] use well-established tools for studying human opinions, namely public opinion surveys, to characterize LLM opinions, creating the OpinionQA dataset in the process.
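
One natural way to quantify the gap between a model's answer distribution and a human reference distribution on an ordinal survey question (in the spirit of the representativeness metric in [12], though not a reimplementation of it) is the 1-Wasserstein distance over the ordered answer choices; the numbers below are invented for illustration.

    import numpy as np

    def wasserstein_1d(p: np.ndarray, q: np.ndarray) -> float:
        """1-Wasserstein distance between two distributions over ordered answer choices.

        p, q: probability vectors over the same ordinal options
              (e.g., "strongly disagree" ... "strongly agree").
        For 1-D ordinal data this is the sum of absolute differences of the CDFs.
        """
        p, q = np.asarray(p, float), np.asarray(q, float)
        return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

    # Toy example: a hypothetical survey question with five answer options.
    human = np.array([0.10, 0.20, 0.30, 0.25, 0.15])   # e.g., survey respondents
    model = np.array([0.02, 0.08, 0.20, 0.40, 0.30])   # e.g., LLM answer probabilities
    print(wasserstein_1d(human, model))                # 0 means identical distributions

A distance of zero means the distributions coincide; averaging a normalized version of this distance across questions gives a single representativeness-style score per demographic group.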

Fan et al. [7] tested the rationality of LLMs in several standard game-theoretic games, including the dictator game and rock-paper-scissors.

Analysis

While these results are impressive and suggest that models can simulate the responses of human participants in a variety of experiments, many social scientists have voiced concerns in the opinion pages of journals. A group of eighteen psychologists, professors of education, and computer scientists [5] highlighted four limitations of applying LLMs to social science research: the difficulty of obtaining expert evaluations, bias, the "black box" nature of LLM outputs, and reproducibility. Indeed, every single paper mentioned above used OpenAI models as either the sole or primary tool for generating experimental data. While some researchers (e.g., [12]) used other models as well, these still tended to be private.

For many social science researchers, using OpenAI's proprietary models offers many benefits: they are accessible, easy to use, and often more performant than their open-source counterparts. Unfortunately, these benefits come at the cost of reproducibility. It is well known that companies periodically update their models [11], so the results reported in these papers may be difficult or impossible to reproduce if a model is substantially changed or retired altogether.
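
Short of switching to an open model, the minimum a study can do is pin an explicitly dated model snapshot and archive every sampling parameter alongside the collected data. A minimal sketch, with placeholder values:

    import json, datetime

    # Everything needed to re-run (or at least re-describe) the data collection.
    # The model identifier is a placeholder; dated snapshot names like this exist,
    # but providers can still retire them, which is exactly the concern above.
    run_config = {
        "model": "gpt-4o-2024-08-06",
        "temperature": 1.0,
        "max_tokens": 5,
        "prompt_template_version": "v3",
        "collected_on": datetime.date.today().isoformat(),
    }

    with open("run_config.json", "w") as f:
        json.dump(run_config, f, indent=2)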

Spirling [13], in an article for World View, writes that "researchers should avoid the lure of proprietary models and develop transparent large language models to ensure reproducibility". He advocates for the use of BLOOM 176B [8], but two years later its adoption is underwhelming. At the time of writing, the BLOOM paper [8] has about 1,600 citations, a fraction of the 10,000 citations of even Llama 2 [15], which came out afterwards (a fine-tuned version of BLOOM called BLOOMZ [10] was released in 2022 and has approximately 600 citations).

Crockett and Messeri [4], in response to the work of Dillion et al. [6], noted the bias inherent in OpenAI models' training data (hyper-WEIRD: skewed toward Western, Educated, Industrialized, Rich, and Democratic populations). In a follow-up [9], the authors went into depth to categorize AIs as Oracles, Surrogates, Quants, and Arbiters, and briefly discussed how each kind of AI intervention might lead to further problems. Their primary concern is the inevitable creation and entrenchment of a scientific monoculture: a handful of models dominating the research pipeline, from knowledge aggregation and synthesis through to creation. The sea of AI-generated papers will overwhelm most researchers, and only AI assistants will have the time and patience to sift through it all.

Even though their concerns are valid, their suggestion to "work in cognitively and demographically diverse teams" seems a bit toothless. Few will reject a cross-discipline offer of collaboration, except for mundane reasons of time or proximity or a myriad other pulls on an academic's attention; however, this self-directed and spontaneous approach seems ill-equipped to handle the rapid proliferation of AI. They also suggest "training the next generation of scientists to identify and avoid the epistemic risks of AI" and note that this will "require not only technical education, but also exposure to scholarship in science and technology studies, social epistemology and philosophy of science", which, again, would be beneficial if implemented, but ignores the reality that AI development far outpaces academia's chelonian pace of change. A scientific generation takes years to come to maturity, and many more years pass before those changes filter through the academic system and manifest as changes in curriculum.

Not all opinions were so pessimistic. Duke professor Bail [3] notes that AI could "improve survey research, online experiments, automated content analyses, agent-based models". Even in the section where he lists potential problems, e.g., energy consumption, he offers counterpoints (Tomlinson et al. [14] suggest that the carbon emissions of writing and illustrating are lower for AI than for humans). Still, this is a minority opinion. Many of the AI-using works that Bail highlights --- including his own --- are still in progress and had yet to pass peer review at the time of writing.

From this, it seems safe to suspect that we are unlikely to see widespread use of AI participants in prominent social science research in the next two to five years. Outside of academia, however, the story might be quite different. Social science PhDs who become market researchers, customer experience analysts, product designers, data scientists, and product managers may begin experimenting with AI participants out of curiosity or necessity. Without established standards from academia for the use of AI participants, they will develop their own ad-hoc practices in private. Quirks of the AI models may subtly influence the products and services that people use every day. Something feels off. It's too bad we won't know how it happened.

Conclusion

I am not a social scientist, and the above are my conclusions after reading a few dozen papers. Any junior PhD student in Psychology, Economics, Political Science, or Law will have read an order of magnitude more. If that's you, please let me know what you think about instances where AI has appeared in your line of work.

References

[1] Aher, G. V., Arriaga, R. I., & Kalai, A. T. (2023, July). Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning (pp. 337-371). PMLR.

[2] Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J. R., Rytting, C., & Wingate, D. (2023). Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3), 337-351.

[3] Bail, C. A. (2024). Can Generative AI improve social science?. Proceedings of the National Academy of Sciences, 121(21), e2314021121.

[4] Crockett, M., & Messeri, L. (2023). Should large language models replace human participants?.

[5] Demszky, D., Yang, D., Yeager, D. S., Bryan, C. J., Clapper, M., Chandhok, S., ... & Pennebaker, J. W. (2023). Using large language models in psychology. Nature Reviews Psychology, 2(11), 688-701.

[6] Dillion, D., Tandon, N., Gu, Y., & Gray, K. (2023). Can AI language models replace human participants?. Trends in Cognitive Sciences, 27(7), 597-600.

[7] Fan, C., Chen, J., Jin, Y., & He, H. (2024, March). Can large language models serve as rational players in game theory? a systematic analysis. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, No. 16, pp. 17960-17967).

[8] Le Scao, T., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., ... & Al-Shaibani, M. S. (2023). BLOOM: A 176B-parameter open-access multilingual language model.

[9] Messeri, L., & Crockett, M. J. (2024). Artificial intelligence and illusions of understanding in scientific research. Nature, 627(8002), 49-58.

[10] Muennighoff, N., Wang, T., Sutawika, L., Roberts, A., Biderman, S., Scao, T. L., ... & Raffel, C. (2022). Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.

[11] OpenAI. GPT-3.5 Turbo Updates

[12] Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., & Hashimoto, T. (2023, July). Whose opinions do language models reflect?. In International Conference on Machine Learning (pp. 29971-30004). PMLR.

[13] Spirling, A. (2023). Why open-source generative AI models are an ethical way forward for science. Nature, 616(7957), 413-413.

[14] Tomlinson, B., Black, R. W., Patterson, D. J., & Torrance, A. W. (2024). The carbon emissions of writing and illustrating are lower for AI than for humans. Scientific Reports, 14(1), 3732.

[15] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., ... & Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[16] Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., ... & Wen, J. (2024). A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), 186345.

[17] Xu, R., Sun, Y., Ren, M., Guo, S., Pan, R., Lin, H., ... & Han, X. (2024). AI for social science and social science of AI: A survey. Information Processing & Management, 61(3), 103665.

  1. ^

    This refers to the attitudes and socio-demographic information conveyed by a piece of text.

  2. ^

    To ensure that the model outputs responses representative of the US population instead of their training data, the researchers prompted the model with backstories whose distribution matches that of demographic survey data (i.e., ANES).

  3. ^

    An example of such a sentence is: While Anna dressed the baby that was small and cute spit up on the bed.



