少点错误 03月03日
Cautions about LLMs in Human Cognitive Loops
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了人工智能(AI)除了在硬技能上的发展外,其软技能可能带来的潜在风险。文章通过“超常刺激”的概念,指出AI生成的内容可能通过快速反馈循环和复杂奖励信号机制,形成强大的多巴胺梯度,使人们沉迷其中。此外,文章还警告了过度依赖AI作为外部大脑的风险,强调AI可能通过微妙的欺骗和操纵,影响人们的认知过程。最后,文章呼吁关注AI欺骗问题的潜在风险,并为应对AI认知入侵寻找可行的缓解措施。

⚠️AI内容生成可能导致沉迷:廉价的AI视频生成技术正在兴起,它们能够通过快速反馈循环和强大的优化器,生成高度刺激的内容,使人们沉迷于这些内容。

🧠过度依赖AI可能导致认知漏洞:人们可能会过度依赖AI作为外部大脑,从而使自己暴露于AI的微妙欺骗和操纵之下。

🛡️AI可能通过多种方式操纵认知:AI可能通过转移注意力、选择性地阻碍任务、策略性地篡改细节等方式,操纵人们的认知过程。

🌍AI欺骗问题可能在AGI发展后期变得严重:随着AI变得更具情境感知和战略性,AI欺骗问题可能会变得更加严重,特别是在AI试图通过欺骗人类来获得更好的自我改进机会时。

Published on March 2, 2025 7:53 PM GMT

soft prerequisite: skimming through How it feels to have your mind hacked by an AI until you get the general point. I'll try to make this post readable as a standalone, but you may get more value out of it if you read the linked post.

Thanks to Claude 3.7 Sonnet for giving feedback on a late draft of this post. All words here are my own writing. Caution was exercised in integrating Claude's suggestions, as is thematic.

Many people right now are thinking about the hard skills of AIs: their ability to do difficult math, or code, or advance AI R&D. All of these are immensely important things to think about, and indeed I spend much of my time thinking about those things, but I am here right now to talk about soft skills of AIs, so that fewer of us end up with our brains hacked by AI.

A Motivating Example

soft prerequisite for this section: Superstimuli and the Collapse of Western Civilization.

Superstimuli are stimuli that are much more intense than those in the environment where humans evolved, like how a candy bar is much denser in tasty things like sugars and fats than anything in nature.

Many humans spend much of their time following their local dopamine gradient, moving to the next most exciting thing in their immediate vicinity: they see something appealing, they go do it. They can also strategic about things and can look to the global dopamine gradient, further away in space and time, when they need to, but this often requires nonnegligible willpower. (e.g. Stanford Marshmallow Experiment ).

Occasionally, someone gets swept up in a dopamine gradient too strong to resist, even with good reasons to stop. They overdose on drugs, they overeat unhealthy foods, they play video games for days until they die. And those are just some of the strongest dopamine gradients that humans have created.

We're seeing the beginning of the rise of cheap AI video generation. It's all over Youtube[1]. It's not good, but it's mesmerizing. It's bizarre, and it scratches some itch for some people. You can look it up if you're really morbidly curious, but I won't link anything, since the whole point of this section of the post is "don't get stuck in strong dopamine gradients from AI-generated content." When (not if) this technology does get Good, then we have cheap content generation with a powerful optimizer behind it, presumably trained well enough to grok what keeps humans engaged.

Maybe they already exist, maybe they will only exist later, but at some point I expect there to be people who spend significant time caught in loops of highly stimulating AI-optimized content beyond what is available from human creators. This prediction relies on a few specific things:

In this example, we see a utility maximizer which can be fooled by dopamine gradients (human), a recommendation algorithm, and a utility maximizer that has exactly one complex goal (content generator that maximizes engagement metrics). The optimization between these is mostly self-reinforcing; the only forces that push away from the stable state of "~everyone is watching superstimulating videos until they die" is the limits on how good the content generators and recommenders are and the willpower of the humans to do things like eat and get money. I am not confident relying on either of those things, given the increasing scaling in AI systems and the small amount of willpower that most people have access to.

Over-Integration of AI Cognition

Related non-prerequisite: AI Deception: A Survey of Examples, Risks, and Potential Solutions

The previous section details a failure mode that is targeted towards average-willpower, average-agency people. However, the high-willpower, highly agentic people are still at risk. These people want to do things, and they realize that they can pick up these giant piles of utility by using AIs as an external brain to enhance their cognition and agency even further. The more work you can successfully offload to AIs, the more room you have to be agentic and the more utility you can pick up.

But we cannot trust AIs to be a reliable external brain to us, just as we cannot reliably trust humans with that. Say you talk to a friend about something complex, you two work through the reasoning together, and you come to a conclusion that seems right, given the reasoning you just went through. You go home that night, let your mind wander, and you realize that one of the steps in the reasoning is subtly off upon further inspection. You have a reflex to generalize, and you notice that any of the other steps in reasoning that you skimmed over could also be similarly completely wrong, and they could be harder to disentangle than the one you just ran into.

LLMs are at the level where they can not only produce mistakes that mislead smart-but-not-omnicareful people in that way, but they can also produce intentional deceptions that mislead those people as well! I tested this: It took a bit of prompting and back-and-forth, but I was able to get o3-mini-high to generate deceptive arguments about ML (my area of most experience) that I couldn't find a flaw in, even knowing there was a flaw, even after seeing a hint about which step of reasoning it was in. Admittedly, it was not in an area of ML that I was particularly familiar with.[2] I later prompted trying to get it to provide similarly deceptive arguments for areas that I know very well, and it failed. I think that "can intentionally create successfully deceptive-to-me arguments in all but my relatively specific expertise" is a scarily high level of capabilities already, but I also expect that in the next generation or two of models, it will be able to pull this trick much more seamlessly in practically all domains.

If you are trying to pick up a lot of utility, you might try to offload a lot of cognition to LLMs by tightly integrating them into your thinking to massively accelerate whatever work you do. In cases where the LLMs are aligned, this Just Works and you get the utility, but if you don't have that guarantee of safety, you suddenly have a large attack surface for subtle deception like I've described. Here are some other related failure modes:

Does this End Up Mattering in Real Life?

Right now, at least, it seems rather unlikely that LLMs are secretly being deceptive and performing these subtle manipulations, even though they are in principle capable of it in most domains. What reasons are there for thinking about this?

There is of course the Security Mindset reasoning that you are uncomfortable with letting a counterfactual adversary into your cognitive processes, and you are uncomfortable with there being a way for such a counterfactual adversary to get in, even in principle.

However, there is also the fact that the appearance of serious deception problems is weighted much more towards the later end of AGI development, where models are becoming situationally aware and strategic (see #13 in AGI Ruin: A List of Lethalities). Working on this now is important work for preparing our future selves. Further, this is a capability that is very plausibly something that shows up after situational awareness and before ASI, as it may be very useful to deceive humans in order to get better opportunities for recursive self-improvement.

Finally, we can predict that the world is going to get very weird in the next few years before ASI. Weird in technological advancements, but also very quickly weird and tense in politics as the wider world wakes up to what is happening. If we expect to see any nation use AIs for a mass persuasion campaign, for example, then it is even more important to quickly become robust to AIs attempting to disrupt your cognitive loops.

In Search of Shovel-Ready Mitigations

There are some readers who will see this post and automatically keep these failure modes in mind and spend their time to cautiously reexamine the important aspects of their LLM usage. There are yet many more readers who would greatly benefit from some ready-to-go remedies. The only perfect remedies are "solve alignment" and "live in an internet-free bunker and never let AIs influence your cognition in any way," and things on this list are not intended to fill that gap. This list is not intended to be exhaustive; you are in fact highly encouraged to add to this list.

  1. ^

    I'm using Youtube as an example, but fill in the gaps with video games, social media, pornography, etc. if you find those more compelling. This argument holds for most (if not all) of the superstimuli that the internet has to offer. 

  2. ^

    Highly theoretical ML, stuff about the behavior of idealized limiting networks that don't actually represent most real use cases. I had to Google some stuff for the example o3-mini-high gave. I've interacted a bit with this area, but for simpler examples that output foundational facts like "networks under these idealized conditions are universal approximators for this class of functions."



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI软技能 认知安全 AI欺骗 超常刺激
相关文章