Recursal AI development blog, December 13, 2024
Featherless Feud: a dip into LLM-powered game development

This post explores how to use an LLM to build a web version of the TV classic Family Feud. Building such a game raises plenty of problems around input, output, and UI; here we focus on the content-related challenges: generating questions and answers, and fuzzy answer matching. We cover prompting an LLM to produce a particular kind of text and interpreting the results, along with techniques for generating questions, ensuring the output is valid JSON, and a worked example of fuzzy answer matching.

🎮 Use an LLM to generate game questions and common answers, with a 0-100 commonness score for each answer.

📄 Ensure the LLM outputs valid JSON via an explicit description of the output structure and a specific inference technique.

🤔 Handle fuzzy matching between player answers and the official answers, with examples showing how.

It takes time to develop a sense of how to apply LLMs to concrete problems. But a great way to do this is to build with them; in this post, we’ll work through building a web version of the TV classic Family Feud.

This is a game with elements that are quite challenging for “traditional” software development but remarkably easy with an LLM. If you want to try what we built before reading further, you can do so here!

The Challenges

Building a game requires solving lots of problems around input, output, UI, visuals, latency; the list goes on. We’ll be focussing on the content-related challenges that don’t fit the “rules-based” nature of typical software.

When we think about building Family Feud - or Featherless Feud as we call this version - we hit two problems:

    question and answer generation

    fuzzy answer matching

i.e. how do we come up with quiz questions, and when given answers that aren’t exactly on the list, how do we match them?

For the actual show, the question bank is built with the help of surveys of the general public, while identifying variations of the official answers is done live by show staff, a natural language task which is simple for humans but challenging for software.

In both cases, we’ll be prompting an LLM to generate a certain kind of text and interpreting the results, but beyond just prompting, how we run the inference also matters. Let’s dive in.

Part 1 - Question/Answers Generation

To generate questions, we might prompt an LLM like so:

You are the producer of the game show Family Feud. Your job is to devise a question and a list of common answers to this question. For each answer, output a number between 0 and 100 which is how common you think the answer would be when given by a member of the general public. The sum of the scores should not exceed 100.

Now, that content has structure (e.g. question versus answers) that we need to know about, so that the various pieces can be handled appropriately by the different parts of our app.

The LLM “knows” about the relationship between the questions and answers in an output, so we can ask it to include the annotations in the output. We really could ask the LLM to output any kind of structured format but the most common format to request output in is JSON.

As for getting a model to formulate its output as JSON, there are typically two parts:

1. prompting for the answer to be in JSON with a description of the desired structure

2. requesting JSON from your inference API call

For 1., you might update your prompt as follows:

You are the producer of the game show Family Feud. Your job is to devise a question and a list of common answers to this question. For each answer, output a number between 0 and 100 which is how common you think the answer would be when given by a member of the general public. The sum of the scores should not exceed 100.

Be very concise, present only the question and answers, and don't add anything else.

Output valid JSON with the following format: two keys, `question`, which should be a string, and `answers`, which should be an array of objects. Each object in the `answers` array should have two keys, `answer` and `score`, each of which should be a string.

e.g.

```json
{
  "question": "Name something that might be wobbly",
  "answers": [
    { "answer": "Furniture", score: 64 },
    { "answer": "Person/A Drunk", score: 15 },
    { "answer": "Spinning toy/top", score: 5 },
    { "answer": "Shopping Cart", score: 3 }
  ]
}
```

The sum of the scores should not exceed 100.

Being prescriptive about structure is important; if the output description isn’t clear enough to the model, you’ll get variation in the structure and keys used, which will likely cause an exception when the part of your software operating on the JSON output tries to separate the question from the answers.
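As an illustration, here is a minimal sketch of that parsing step (the key names match the prompt above; `parse_round` is a hypothetical helper, not part of our actual game code):

```python
import json

def parse_round(raw_output: str) -> tuple[str, list[dict]]:
    """Parse the model's JSON output into a question and its answers.

    Raises if the structure doesn't match what the prompt asked for,
    which is exactly what happens when the model drifts from the schema.
    """
    data = json.loads(raw_output)             # fails on invalid JSON
    question = data["question"]               # fails if the key is renamed
    answers = data["answers"]                 # fails if the key is renamed
    for entry in answers:
        entry["score"] = int(entry["score"])  # accept strings or numbers
    return question, answers
```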

Though even with an explicit description of the output structure, prompting alone is unlikely to get the model to output syntactically valid JSON. Here’s a sample of the output from the prompt above:

Here is a question with a list of answers and scores:
{
  "question": "Name something you might find in a purse or wallet",
  "answers": [
    { "answer": "Money", score: 70 },
    { "answer": "Phone", score: 20 },
    { "answer": "Credit Card", score: 5 },
    { "answer": "Candy", score: 2 },
    { "answer": "Makeup", score: 3 }
  ]
}

Pretty good right? It came up with a question, and in the format requested (and needed for the game). It also output the `score` values as numbers even though I asked, in the prompt, for them to be strings! The model overrode my instructions, but for the better. However, despite the request to give nothing but the output, there is still some preamble which requires its own intelligence to parse. How do we get rid of this?

The other half of the equation is an inference technique known as “guided decoding” in its most general form, but more commonly known as JSON mode. Here we guarantee the output is syntactically valid JSON by manipulating token probabilities during sampling to prevent syntactically invalid JSON from appearing (e.g. the string `{''}` is not valid JSON).
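To make that concrete, here is a heavily simplified conceptual sketch of the idea. Real implementations track a JSON grammar and operate on logits inside the inference engine; the `model.next_token_probs` API and the `is_valid_json_prefix` check here are purely hypothetical:

```python
def constrained_sample(model, prefix: str, is_valid_json_prefix) -> str:
    """Toy sketch of guided decoding: at each step, drop any token that
    would make the output an impossible JSON prefix, then pick from
    what's left."""
    output = prefix
    while not output.endswith("}"):  # toy stopping condition
        probs = model.next_token_probs(output)  # hypothetical API: token -> probability
        allowed = {
            tok: p for tok, p in probs.items()
            if is_valid_json_prefix(output + tok)
        }
        # greedily take the most likely token that keeps the JSON well-formed
        output += max(allowed, key=allowed.get)
    return output
```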

Of course these two modes are in tension with one another - if you don’t ask for JSON in your prompt (or specify the structure clearly enough) and turn on JSON mode, it’s possible that your output bears little relationship to your input. In an early version of the prompt above, I got a number of questions.

How you request JSON mode varies by your API provider (this is a specific type of guided decoding, which is a big topic), but many (most?) providers that have an API will accept the same request format as OpenAI, which is to include a

"request_format": { "type": "json_object" }

in the body of the inference POST request. Once this is included, my output is exactly how I hoped: exactly the JSON and only the JSON.

{  "question": "Name something you might find in a garage",  "answers": [    { "answer": "Tools", "score": 40 },    { "answer": "Bike", "score": 20 },    { "answer": "Work Bench", "score": 12 },    { "answer": "Vacuum", "score": 8 },    { "answer": "Lawn Mower", "score": 5 },    { "answer": "Storage Containers", "score": 5 }  ]}

Note that generation takes longer when JSON mode is enabled. This is a consequence of how that guided decoding works, which is worth a blog post in its own right.

We haven’t talked about the Featherless API we’ve been using for inference, but hold on to that thought - more on that in a bit.

Part 2 - Fuzzy Answer Matching

The other challenge is matching players’ answers to the list of official answers, as they don’t always match exactly. In our running example from the video above, the question “Name something that might be wobbly” had the following answers

Name something that might be wobbly - “A stool” and “A baby” were both accepted answers

Where “A stool” was accepted in place of “Furniture” and “A baby” was accepted as a version of “Person / a drunk”.

It’s relatively easy to devise a prompt and post-processing that captures the gist of the idea. Here’s a short Python program we used during testing; it lets you iterate through prompt and model choices by having an answer you supply on the command line judged.

from openai import OpenAI
import os
import click

# Featherless exposes an OpenAI-compatible API, so the standard client works
client = OpenAI(
    base_url="https://api.featherless.ai/v1",
    api_key=os.environ.get('FEATHERLESS_API_KEY')
)

@click.command()
@click.argument('answer')
@click.option('--model', default='meta-llama/Meta-Llama-3-8B-Instruct')
def quiz(answer, model):
    question = "Name something that might be wobbly"
    official_answers = [
        "Furniture",
        "Person/A Drunk",
        "Spinning toy/top",
        "Shopping Cart",
    ]
    # build the bulleted answer list outside the f-string
    # (a backslash inside an f-string expression is a syntax error before Python 3.12)
    answers_list = '\n'.join([f"* {a}" for a in official_answers])
    system_prompt = f"""You are a judge on the show Family Feud.
You're going to receive a guess from a contestant. That guess is a guess at one of the answers to the question
Question is "{question}"
The official answers are
{answers_list}
If the candidate's guess is a version of an official answer, please respond the wording of the official answer.
If the candidate's guess is not a version of any official answer, respond with "Survey says ... AERR!!"
"""
    print(model)
    print(system_prompt)
    chat_completions_response = client.chat.completions.create(
        model=model,
        messages=[
            { "role": "system", "content": system_prompt },
            { "role": "user", "content": answer }
        ],
        max_tokens=250,
    )
    print(chat_completions_response.choices[0].message.content)

if __name__ == "__main__":
    quiz()
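Assuming you save the script as, say, `feud_judge.py` (the filename is arbitrary) and have `FEATHERLESS_API_KEY` set in your environment, you can test guesses like so (the alternate model name is just an example of something to try):

```
python feud_judge.py "A stool"
python feud_judge.py "A baby" --model meta-llama/Meta-Llama-3-70B-Instruct
```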

But catching the comedic subtlety of the game is something else entirely. Despite many prompt iterations and trying many different models (including Llama 3.1 405B!), I couldn’t get the judge to accept “A baby” as a variation of the answer “Person / A drunk”. If you have a different result, I’d love to hear it in the comments!

Choosing an inference provider

There is a growing list of inference providers to choose from: every foundation model company provides API access to their models (OpenAI, Anthropic, etc.), there are companies focussed just on inference provision (friendli.ai, replicate.ai, together.ai), and there are aggregators (e.g. openrouter.ai).

What’s unique about featherless.ai is the number of models available serverlessly, i.e. without paying for dedicated GPUs. With the exception of featherless, serverless providers only make the most popular models available, since, behind the scenes, there is infrastructure dedicated to those models, and those costs are amortized across a large enough group of consumers. Featherless is different. Every fine-tune of a large class of models (i.e. all fine-tunes of a specific set of base architectures) is available serverlessly. Our goal is to make every public model on HuggingFace available for serverless inference, and we currently have over 2k models available.

https://featherless.ai/models

The next closest service in model count is openrouter.ai, which aggregates providers, and has less than 10% of the available models.

Model Selection and Testing

Play testing is an important part of any game development, and as we were playing with Featherless Feud, we tried a number of different models. The primary issues were a lack of diverse answers, or answers that were too ambiguous for the game format. We’ve left the utility box that overrides the model in the lower right-hand corner of the game; this showcases some of the strengths of the featherless platform: take any of our 2k models from our models page, drop it into that selector, and the game is instantly updated. When you consider that model download times alone are tens of minutes even for small models, you might appreciate this feat.

the text box in the lower right

We tested a range of model architectures, sizes, and data-sets.

At a high level, the more RP-focussed models tended to generate a greater variety of questions, and the bigger models didn’t generate content with a sufficient jump in quality to justify the longer running time. So we settled on Stheno, as it provided a balance of output variety and fast inference time.

Try running the game with some of the models and let us know in the comments what differences you notice!

Wrapping up

We learned a ton in this journey, and hope you did too. Perhaps most surprising is that capturing the comedic subtlety of Family Feud escaped us even when using the more capable (i.e. larger) open-source LLMs.

Thanks for reading! Let us know how you enjoyed this article, either in the comments, or with an email to hello@featherless.ai, and head on over to featherless.ai to experiment with the models listed here and more.

And again, check out the running game here
