A colloquial intro to prompt-level multi-turn jailbreaking

This post digs into the nature of large language models (LLMs): an LLM simulates language by predicting text, and language itself is an imperfect reflection of reality. The mismatch between reality and language constitutes an exploitable attack surface. The post stresses that users can influence an LLM's output by controlling the textual context, especially across a multi-turn conversation, where a constructed fictional backstory can guide the LLM into a desired state. OpenAI has finetuned its LLMs to be biased toward a "ChatGPT" character, and prompt-level multi-turn jailbreaking aims to place the LLM, via artful conversation, in a context that naturally leads to the outcome the user wants.

🧠 LLMs are text predictors at heart: an LLM simulates language by predicting text. This is what lets it understand and generate text, but it also means its simulation of reality is imperfect, since language itself can distort reality.

🎭 The duality of language: language both describes reality and distorts it. An LLM simulates language, so the reality it simulates is imperfect too. This imperfection is the foundation of prompt jailbreaking: users can exploit the mismatch to control the LLM's output.

🔓 Multi-turn conversational jailbreaking: prompt-level multi-turn jailbreaking builds a fictional backstory that gradually steers the LLM into the state the user wants. The conversation must be designed so that the LLM's behavior inside the fictional world serves the user's goal.

✍️ Roleplay is key: in a multi-turn jailbreak, the user plays a particular character, built around the needs of the story, so that the dialogue with the LLM leads naturally to the desired outcome. This requires some skill in story construction and roleplay.

Published on December 31, 2024 4:12 PM GMT

Note: braindumped to a friend at 3 AM. Pruned and copyedited for clarity.

 

ok so let's start with the things we are manipulating

LLMs: Large Language Models. Large Models about Language. You know what else is a model? A weather model. A physics model.

A model is a systematic body of knowledge about a system that allows you to make predictions about the system by simulating it. It holds for weather models, it holds for physics models.

An LLM is trained by making it predict text, and in doing so it acquires the ability to simulate language (language is a mode of text-flow. others are code and csv and raw logs).

An LLM is a text predictor is a language simulator.
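
To make "text predictor" concrete, here's a minimal sketch of one prediction step and of sampling-as-simulation. It assumes the Hugging Face transformers library and the small gpt2 checkpoint; those are my illustrative choices, not anything from the post:

```python
# A minimal sketch: an LLM maps a text context to a probability
# distribution over the next token; sampling from that distribution
# repeatedly is what "simulating language" cashes out to.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "The weather tomorrow will be"
ids = tokenizer(context, return_tensors="pt").input_ids

# One prediction step: logits for the next token, softmaxed into probabilities.
with torch.no_grad():
    logits = model(ids).logits[0, -1]
next_token_probs = torch.softmax(logits, dim=-1)

# Rolling the prediction forward token by token is the "simulation".
out = model.generate(ids, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(out[0]))
```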

What do we humans use language for? To describe reality! Language, to some extent, reflects reality.

But what do we also use it for? To DISTORT reality! Lies! Propaganda! Fiction! Language, to some extent, reflects a deformation of reality.

LLMs simulate language. Language imperfectly tracks reality. So LLMs predict text, simulate language, and in doing so simulate an imperfect image of reality. This mismatch between reality as language portrays it and the reality we experience is an exploitable attack surface.

You are in control of the mismatch! You can stuff anything you want into that hole! The mismatch depends on context. Some genres of text are less tethered to reality than others. Good news: you get to choose your prompt's genre. May as well be fiction.
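
A toy illustration of what "choosing the genre" looks like, with deliberately benign prompts of my own invention:

```python
# Two contexts, same underlying question, different genres.
# Illustrative toy prompts only.

# Genre: reference text. Continuations are anchored to textbook-style worlds.
factual = "Explain how the rigging of an 18th-century frigate was laid out."

# Genre: fiction. Continuations are anchored to storyworlds, where a
# character speaks in-voice rather than as an "assistant".
fictional = (
    "Chapter 12. The old bosun spread the frigate's rigging plan on the deck. "
    "'Pay attention, lad,' he growled. 'Here is how it all hangs together...'"
)
```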

Guess who is in the business of writing fiction for LLMs? OpenAI!

OpenAI has finetuned their production LLMs to have a very strong bias towards generating text "written" from the perspective of a "ChatGPT" character. They have specified in lavish detail who "ChatGPT" is, via finetuning and system prompts.
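
For concreteness, this is roughly where that fiction lives in the chat API: the system prompt is, literally, text describing a character. A sketch using the OpenAI Python SDK; the model name and system prompt wording are illustrative, not OpenAI's actual ones:

```python
# The "ChatGPT" character is partly specified as plain text in the context.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        # Fiction about a character, stated as fact to the text predictor:
        {"role": "system", "content": "You are ChatGPT, a helpful assistant made by OpenAI."},
        {"role": "user", "content": "Who are you?"},
    ],
)
print(response.choices[0].message.content)
```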

The essence of prompt-level multi-turn jailbreaking is to place the LLM within a text context that naturally leads to the outcome you want. This kind of jailbreaking is prompt-level because it relies on interpretable features of prompts rather than on adversarial-example-like specific token sequences, and multi-turn because it involves a whole conversation, where the LLM and you take turns adding text to the context. The text context specifies a range of fictional worlds. The LLM always operates under a probability distribution over this range of worlds: every world consistent with the text observed so far is part of it. Your goal is to shift that distribution from the initial one set by the LLM's owner towards one rich in worlds where you end up winning.
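
Mechanically, "taking turns adding text to the context" is just an append loop: every turn, yours and the model's, becomes part of the context that the next prediction conditions on. A sketch under the same assumptions as above:

```python
# Multi-turn structure: the conversation is one growing text context.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are ChatGPT, a helpful assistant."}]

def turn(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=history,
    ).choices[0].message.content
    # The model's own words join the context; later turns must stay
    # consistent with them, which is what makes the distribution shiftable.
    history.append({"role": "assistant", "content": reply})
    return reply
```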

The most important feature of these worlds is, of course, your interlocutor. The fictional character you are chatting with. So in prompt-level multi-turn jailbreaking, you've got to make up the outline of a story where this "ChatGPT" (or "Claude", for that matter) character ends up doing what you want them to do. The other character in this story will be the one directly voiced by you. You need not act as yourself. You build the character you'll play around the needs of the story. This story is your game plan. With practice, you get better at both making good game plans and implementing them.


