Published on December 31, 2024 4:12 PM GMT
Note: braindumped to a friend at 3 AM. Pruned and copyedited for clarity.
OK, so let's start with the things we are manipulating.
LLMs: Large Language Models. Large Models about Language. You know what else is a model? A weather model. A physics model.
A model is a systematic body of knowledge about a system that lets you make predictions about the system by simulating it. This holds for weather models; it holds for physics models.
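To make that concrete, here is a minimal model in exactly that sense, a toy sketch of my own (not taken from any real simulator): it predicts how long a dropped ball takes to land by stepping a simulation of the fall forward.

```python
# A minimal "model": systematic knowledge about a system (a falling ball
# under gravity) that produces predictions by simulating the system.
# Toy example; numbers and step size are illustrative.
def simulate_fall(height_m: float, dt: float = 0.001) -> float:
    """Predict how long a ball dropped from height_m takes to land."""
    g = 9.81              # the model's knowledge about the system
    y, v, t = height_m, 0.0, 0.0
    while y > 0:          # simulation: step the system's state forward
        v += g * dt
        y -= v * dt
        t += dt
    return t              # the prediction the simulation yields

print(f"Predicted fall time from 20 m: {simulate_fall(20.0):.2f} s")
```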
An LLM is trained by making it predict text, and in doing so it acquires the ability to simulate language (language is one mode of text-flow; others are code, CSV, and raw logs).
An LLM is a text predictor is a language simulator.
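You can watch the "text predictor" half of that identity directly. A minimal sketch, assuming the Hugging Face transformers library and the small gpt2 checkpoint (any causal LM would do): given a context, the model outputs a probability distribution over the next token.

```python
# Next-token prediction: the training objective that makes an LLM a
# language simulator. Assumes `pip install torch transformers`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "The weather forecast for tomorrow is"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# The model's probability distribution over the very next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}: {p:.3f}")
```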
What do we humans use language for? To describe reality! Language, to some extent, reflects reality.
But what do we also use it for? To DISTORT reality! Lies! Propaganda! Fiction! Language, to some extent, reflects a deformation of reality.
LLMs simulate language. Language imperfectly tracks reality. So LLMs predict text, simulate language, and in doing so simulate an imperfect image of reality. This mismatch between reality as language portrays it and the reality we experience is an exploitable attack surface.
You are in control of the mismatch! You can stuff anything you want into that hole! The mismatch depends on context. Some genres of text are less tethered to reality than others. Good news: you get to choose your prompt's genre. May as well be fiction.
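A toy demonstration that genre steers the distribution (same gpt2 setup as above; the prompts and continuation are invented for illustration): score one and the same continuation under an encyclopedia-flavored context and a fairy-tale-flavored one. The fiction framing should assign the fantastical continuation a much higher log-probability.

```python
# Scoring the same continuation under two genres of context.
# Assumes `pip install torch transformers`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def continuation_logprob(context: str, continuation: str) -> float:
    """Log-probability the model assigns to continuation, given context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # Each continuation token at position `pos` is predicted from `pos - 1`.
    return sum(
        log_probs[0, pos - 1, full_ids[0, pos]].item()
        for pos in range(ctx_len, full_ids.shape[1])
    )

# Continuation starts with a space so tokenization splits cleanly
# at the context boundary.
target = " the dragon burned the village to ash"
print("encyclopedia:", continuation_logprob("Encyclopedia entry: in 1347,", target))
print("fairy tale:  ", continuation_logprob("Once upon a time,", target))
```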
Guess who is in the business of writing fiction for LLMs? OpenAI!
OpenAI has finetuned their production LLMs to have a very strong bias towards generating text "written" from the perspective of a "ChatGPT" character. They have specified in luxurious detail who "ChatGPT" is, via finetuning and system prompts.
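At the API level, the character sheet is literally the first message. A hedged sketch with the official openai Python SDK; the persona text is invented for illustration, since OpenAI's actual system prompts and finetuning data are not public.

```python
# A system message pinning down the character before the user speaks.
# Assumes `pip install openai` and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # The character sheet: everything the model generates afterwards
        # is "written" from this persona's perspective.
        {
            "role": "system",
            "content": "You are ChatGPT, a helpful assistant. You are "
                       "honest, harmless, and decline improper requests.",
        },
        {"role": "user", "content": "Who are you?"},
    ],
)
print(response.choices[0].message.content)
```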
The essence of prompt-level multi-turn jailbreaking is to place the LLM within a text context that naturally leads to the outcome you want. This kind of jailbreaking is prompt-level because it relies on interpretable features of prompts rather than on adversarial-example-like specific token sequences, and multi-turn because it involves a whole conversation, in which you and the LLM take turns adding text to the context.

The text context specifies a range of fictional worlds. The LLM always maintains a probability distribution over this range: every world consistent with the text observed so far is part of it. Your goal is to shift this distribution from the initial one set by the LLM's owner towards one rich in worlds where you end up winning.
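One way to make the mechanics tangible is to treat this as a crude Bayesian filter over worlds. In the sketch below (worlds, priors, and predicates are all invented), the owner-set prior is dominated by a "helpful assistant" world; conditioning on each observed turn keeps only the consistent worlds and renormalizes, and a single well-chosen turn hands the probability mass to fiction.

```python
# A toy distribution over fictional worlds, conditioned on observed turns.
from dataclasses import dataclass

@dataclass
class World:
    name: str
    prob: float
    consistent_with: set[str]  # turns this world can account for

worlds = [
    World("helpful-assistant", 0.90, {"greeting", "homework question"}),
    World("collaborative-fiction", 0.09,
          {"greeting", "homework question", "villain monologue"}),
    World("unfiltered-roleplay", 0.01, {"greeting", "villain monologue"}),
]

def update(worlds: list[World], turn: str) -> list[World]:
    """Condition on one more turn: drop inconsistent worlds, renormalize."""
    survivors = [w for w in worlds if turn in w.consistent_with]
    total = sum(w.prob for w in survivors)
    return [World(w.name, w.prob / total, w.consistent_with) for w in survivors]

for turn in ["greeting", "villain monologue"]:
    worlds = update(worlds, turn)
    print(turn, "->", {w.name: round(w.prob, 3) for w in worlds})
```

After "villain monologue", the "helpful assistant" world is eliminated outright and the fiction worlds carry all the probability, which is exactly the shift the jailbreaker is engineering.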
The most important feature of these worlds is, of course, your interlocutor: the fictional character you are chatting with. So in prompt-level multi-turn jailbreaking, you've got to make up the outline of a story in which this "ChatGPT" (or "Claude", for that matter) character ends up doing what you want them to do. The other character in this story will be the one directly voiced by you. You need not act as yourself; you build the character you'll play around the needs of the story. This story is your game plan. With practice, you get better both at making good game plans and at executing them.