How GoDaddy built a category generation system at scale with batch inference for Amazon Bedrock

This post was co-written with Vishal Singh, Data Engineering Leader on the Data & Analytics team at GoDaddy.

Generative AI solutions have the potential to transform businesses by boosting productivity and improving customer experiences, and using large language models (LLMs) in these solutions has become increasingly popular. However, inference of LLMs as single model invocations or API calls doesn’t scale well with many applications in production.

With batch inference, you can run multiple inference requests asynchronously to process a large number of requests efficiently. You can also use batch inference to improve the performance of model inference on large datasets.

This post provides an overview of a custom solution developed by the AWS Generative AI Innovation Center for GoDaddy, a domain registrar, registry, web hosting, and ecommerce company that seeks to make entrepreneurship more accessible by using generative AI to provide personalized business insights to over 21 million customers, insights that were previously available only to large corporations. In this collaboration, the Generative AI Innovation Center team created an accurate and cost-efficient generative AI-based solution using batch inference in Amazon Bedrock, helping GoDaddy improve their existing product categorization system.

Solution overview

GoDaddy wanted to enhance their product categorization system that assigns categories to products based on their names. For example:

Input: Fruit by the Foot Starburst
Output: color -> multi-colored, material -> candy, category -> snacks, product_line -> Fruit by the Foot, …

GoDaddy used an out-of-the-box Meta Llama 2 model to generate product categories for six million products, where each product is identified by a SKU. The generated categories were often incomplete or mislabeled, and categorizing each product individually with an LLM proved costly. Recognizing these limitations, GoDaddy sought a more accurate and cost-efficient approach to product categorization to improve their customer experience.

This solution uses the following components to categorize products more accurately and efficiently:

Amazon Bedrock batch inference to run the LLM-based categorization at scale
Amazon Simple Storage Service (Amazon S3) to store the input product data and the generated outputs
AWS Lambda functions to start the batch job and monitor its progress

The key steps are illustrated in the following figure:

1. A JSONL file containing product data is uploaded to an S3 bucket, triggering the first Lambda function.
2. The Lambda function spins up an Amazon Bedrock batch processing endpoint and passes the S3 file location.
3. Amazon Bedrock batch processes this single JSONL file, where each row contains input parameters and prompts, and generates an output JSONL file with a new model_output value appended to each row, corresponding to the input data.
4. The Amazon Bedrock endpoint performs the following tasks:
   a. It reads the product name data and generates a categorized output, including category, subcategory, season, price range, material, color, product line, gender, and year of first sale.
   b. It writes the output to another S3 location.
5. The second Lambda function performs the following tasks:
   a. It monitors the batch processing job on Amazon Bedrock.
   b. It shuts down the endpoint when processing is complete.
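For illustration, the following is a minimal sketch of what the first Lambda handler could look like. The environment variables, model ID, and job-naming scheme are assumptions for the sketch, not GoDaddy's production code.

import os
import time
import boto3

bedrock = boto3.client("bedrock")  # Amazon Bedrock control-plane client for batch jobs

def lambda_handler(event, context):
    # Triggered by the S3 upload of the product JSONL file
    record = event["Records"][0]["s3"]
    input_uri = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    # Start the Amazon Bedrock batch inference (model invocation) job
    response = bedrock.create_model_invocation_job(
        jobName=f"product-categorization-{int(time.time())}",
        modelId="anthropic.claude-instant-v1",           # assumed model choice
        roleArn=os.environ["BEDROCK_BATCH_ROLE_ARN"],    # assumed env var: IAM role with S3 access
        inputDataConfig={"s3InputDataConfig": {"s3Uri": input_uri, "s3InputFormat": "JSONL"}},
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": os.environ["OUTPUT_S3_URI"]}},  # assumed env var
    )
    return {"jobArn": response["jobArn"]}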

The security measures are inherently integrated into the AWS services employed in this architecture. For detailed information, refer to the Security Best Practices section of this post.

We used a dataset that consisted of 30 labeled data points and 100,000 unlabeled test data points. The labeled data points were generated by llama2-7b and verified by a human subject matter expert (SME). As shown in the following screenshot of the sample ground truth, some fields have N/A or missing values, which isn’t ideal because GoDaddy wants a solution with high coverage for downstream predictive modeling. Higher coverage for each possible field can provide more business insights to their customers.

The distribution of the number of words or tokens per SKU shows only mild outliers, which makes it suitable to bundle many products into a single prompt for categorization and enables potentially more efficient model responses.

The solution delivers a comprehensive framework for generating insights within GoDaddy’s product categorization system. It’s designed to be compatible with a range of LLMs on Amazon Bedrock, features customizable prompt templates, and supports batch and real-time (online) inferences. Additionally, the framework includes evaluation metrics that can be extended to accommodate changes in accuracy requirements.

In the following sections, we look at the key components of the solution in more detail.

Batch inference

We used Amazon Bedrock for batch inference processing. Amazon Bedrock provides the CreateModelInvocationJob API to create a batch job with a unique job name. This API returns a response containing jobArn. Refer to the following code:

Request:
POST /model-invocation-job HTTP/1.1
Content-type: application/json
{
  "clientRequestToken": "string",
  "inputDataConfig": {
    "s3InputDataConfig": {
      "s3Uri": "string",
      "s3InputFormat": "JSONL"
    }
  },
  "jobName": "string",
  "modelId": "string",
  "outputDataConfig": {
    "s3OutputDataConfig": {
      "s3Uri": "string"
    }
  },
  "roleArn": "string",
  "tags": [
    {
      "key": "string",
      "value": "string"
    }
  ]
}

Response:
HTTP/1.1 200
Content-type: application/json
{
  "jobArn": "string"
}

We can monitor the job status using GetModelInvocationJob with the jobArn returned on job creation. The following are valid statuses during the lifecycle of a job:

The following is example code for the GetModelInvocationJob API:

GET /model-invocation-job/jobIdentifier HTTP/1.1

Response:
{
  'ResponseMetadata': {
    'RequestId': '081afa52-189f-4e83-a3f9-aa0918d902f4',
    'HTTPStatusCode': 200,
    'HTTPHeaders': {
      'date': 'Tue, 09 Jan 2024 17:00:16 GMT',
      'content-type': 'application/json',
      'content-length': '690',
      'connection': 'keep-alive',
      'x-amzn-requestid': '081afa52-189f-4e83-a3f9-aa0918d902f4'
    },
    'RetryAttempts': 0
  },
  'jobArn': 'arn:aws:bedrock:<region>:<account-id>:model-invocation-job/<id>',
  'jobName': 'job47',
  'modelId': 'arn:aws:bedrock:<region>::foundation-model/anthropic.claude-instant-v1:2',
  'status': 'Submitted',
  'submitTime': datetime.datetime(2024, 1, 8, 21, 44, 38, 611000, tzinfo=tzlocal()),
  'lastModifiedTime': datetime.datetime(2024, 1, 8, 23, 5, 47, 169000, tzinfo=tzlocal()),
  'inputDataConfig': {'s3InputDataConfig': {'s3Uri': <path to input jsonl file>}},
  'outputDataConfig': {'s3OutputDataConfig': {'s3Uri': <path to output jsonl.out file>}}
}
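The second Lambda function can poll this API until the job reaches a terminal state. The following is a minimal sketch with boto3; the polling interval and the set of terminal statuses checked here are illustrative.

import time
import boto3

bedrock = boto3.client("bedrock")

def wait_for_job(job_arn: str, poll_seconds: int = 60) -> str:
    # Poll the Amazon Bedrock batch job until it reaches a terminal state
    while True:
        status = bedrock.get_model_invocation_job(jobIdentifier=job_arn)["status"]
        if status in ("Completed", "Failed", "Stopped"):
            return status
        time.sleep(poll_seconds)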

When the job is complete, the S3 path specified in s3OutputDataConfig will contain a new folder with an alphanumeric name. The folder contains two files: the output JSONL file (with a .jsonl.out suffix) and a manifest file with run statistics such as the following:

{   "processedRecordCount":<number>,   "successRecordCount":<number>,   "errorRecordCount":<number>,   "inputTokenCount":<number>,   "outputTokenCount":<number>}

We then process the jsonl.out file in Amazon S3. This file is parsed using LangChain’s PydanticOutputParser to generate a .csv file. The PydanticOutputParser requires a schema to be able to parse the JSON generated by the LLM. We created a CCData class that contains the list of categories to be generated for each product as shown in the following code example. Because we enable n-packing, we wrap the schema with a List, as defined in List_of_CCData.

from typing import List, Optional
from pydantic import BaseModel, Field

class CCData(BaseModel):
    product_name: Optional[str] = Field(default=None, description="product name, which will be given as input")
    brand: Optional[str] = Field(default=None, description="Brand of the product inferred from the product name")
    color: Optional[str] = Field(default=None, description="Color of the product inferred from the product name")
    material: Optional[str] = Field(default=None, description="Material of the product inferred from the product name")
    price: Optional[str] = Field(default=None, description="Price of the product inferred from the product name")
    category: Optional[str] = Field(default=None, description="Category of the product inferred from the product name")
    sub_category: Optional[str] = Field(default=None, description="Sub-category of the product inferred from the product name")
    product_line: Optional[str] = Field(default=None, description="Product Line of the product inferred from the product name")
    gender: Optional[str] = Field(default=None, description="Gender of the product inferred from the product name")
    year_of_first_sale: Optional[str] = Field(default=None, description="Year of first sale of the product inferred from the product name")
    season: Optional[str] = Field(default=None, description="Season of the product inferred from the product name")

class List_of_CCData(BaseModel):
    list_of_dict: List[CCData]

We also use OutputFixingParser to handle situations where the initial parsing attempt fails. The following screenshot shows a sample generated .csv file.
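The following is a minimal sketch of how the schema, PydanticOutputParser, and OutputFixingParser can be wired together; the helper function and the fallback LLM argument are illustrative, not the exact production code.

from langchain.output_parsers import OutputFixingParser, PydanticOutputParser

# Base parser enforces the List_of_CCData schema defined above
parser = PydanticOutputParser(pydantic_object=List_of_CCData)

# The schema-derived format instructions are injected into the prompt template
format_instructions = parser.get_format_instructions()

def parse_model_output(raw_text, fallback_llm=None):
    # Parse one model_output string; optionally retry with an LLM-backed fixing parser
    try:
        return parser.parse(raw_text)
    except Exception:
        if fallback_llm is None:
            raise
        fixing_parser = OutputFixingParser.from_llm(parser=parser, llm=fallback_llm)
        return fixing_parser.parse(raw_text)

Each parsed List_of_CCData object can then be flattened into rows of the generated .csv file.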

Prompt engineering

Prompt engineering involves the skillful crafting and refining of input prompts. This process entails choosing the right words, phrases, sentences, punctuation, and separator characters to use LLMs efficiently across diverse applications. Essentially, prompt engineering is about interacting effectively with an LLM. The most effective strategy varies with the specific task and data; in this case, generating category data for GoDaddy SKUs.

Prompts consist of particular inputs from the user that direct LLMs to produce a suitable response or output based on a specified task or instruction. These prompts include several elements, such as the task or instruction itself, the surrounding context, full examples, and the input text that guides LLMs in crafting their responses. The composition of the prompt will vary based on factors like the specific use case, data availability, and the nature of the task at hand. For example, in a Retrieval Augmented Generation (RAG) use case, we provide additional context and add a user-supplied query in the prompt that asks the LLM to focus on contexts that can answer the query. In a metadata generation use case, we can provide the image and ask the LLM to generate a description and keywords describing the image in a specific format.

In this post, we divide the prompt engineering solution into two steps: output generation and format parsing.

Output generation

The following are best practices and considerations for output generation:

<example>
H: <question> The list of product names is:
{few_shot_product_name} </question>
A: <response> The category information generated with absolutely no missing value, in JSON format is:
{few_shot_field} </response>
</example>
\n\nHuman:......\n\nAssistant:
\n\nAssistant: Here are the answer with NO missing, unknown, null, or N/A values (in JSON format):
[INST]
If the list of product names is:
{few_shot_product_name}
[/INST]
Then the answer with NO missing, unknown, null, or N/A values is (in JSON format):
{few_shot_field}
[INST]
If the list of product names is:
{product_name}
[/INST]
Then the answer with NO missing, unknown, null, or N/A values is (in JSON format):
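To illustrate how these pieces combine with n-packing, the following sketch assembles a packed prompt in the Anthropic Claude style. The helper function, the N_PACK constant, and the simplified template are assumptions that mirror, but do not reproduce, the exact templates above.

N_PACK = 5  # number of SKUs bundled into a single request

def build_prompt(product_names, few_shot_product_name, few_shot_field, format_instructions):
    # Bundle up to N_PACK product names into one Human/Assistant-style prompt
    packed_names = "\n".join(product_names[:N_PACK])
    return (
        "\n\nHuman: You are a Product Information Manager, Taxonomist, and Categorization Expert "
        "who follows instruction well.\n"
        f"{format_instructions}\n"
        "<example>\n"
        f"H: <question> The list of product names is:\n{few_shot_product_name} </question>\n"
        f"A: <response> The category information generated with absolutely no missing value, "
        f"in JSON format is:\n{few_shot_field} </response>\n"
        "</example>\n"
        f"The list of product names is:\n{packed_names}"
        "\n\nAssistant: Here is the answer with NO missing, unknown, null, or N/A values (in JSON format):"
    )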

Format parsing

The following are best practices and considerations for format parsing:

You are a Product Information Manager, Taxonomist, and Categorization Expert who follows instruction well.

EVERY category information needs to be filled based on BOTH the product name AND your best guess. If you forget to generate any category information, or leave it as missing or N/A, then an innocent person will die.

few_shot_field='{\"list_of_dict\"' +':[' +', \n'.join([true_df.iloc[i].to_json() for i in range(num_few_shot)]) +']}'
{    "list_of_dict": [{        "some_category": "your_generated_answer",        "another_category": "your_generated_answer",    },    {        <category information for the 2st product name, in json format>    },    {        <category information for the 3st product name, in json format>    },// ... {additional product information, in json format} ...    }]}

Format your output in the JSON format (ensure to escape special character):
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:

{
  "properties": {
    "list_of_dict": {
      "title": "List Of Dict",
      "type": "array",
      "items": {"$ref": "#/definitions/CCData"}
    }
  },
  "required": ["list_of_dict"],
  "definitions": {
    "CCData": {
      "title": "CCData",
      "type": "object",
      "properties": {
        "product_name": {"title": "Product Name", "description": "product name, which will be given as input", "type": "string"},
        "brand": {"title": "Brand", "description": "Brand of the product inferred from the product name", "type": "string"},
        "color": {"title": "Color", "description": "Color of the product inferred from the product name", "type": "string"},
        "material": {"title": "Material", "description": "Material of the product inferred from the product name", "type": "string"},
        "price": {"title": "Price", "description": "Price of the product inferred from the product name", "type": "string"},
        "category": {"title": "Category", "description": "Category of the product inferred from the product name", "type": "string"},
        "sub_category": {"title": "Sub Category", "description": "Sub-category of the product inferred from the product name", "type": "string"},
        "product_line": {"title": "Product Line", "description": "Product Line of the product inferred from the product name", "type": "string"},
        "gender": {"title": "Gender", "description": "Gender of the product inferred from the product name", "type": "string"},
        "year_of_first_sale": {"title": "Year Of First Sale", "description": "Year of first sale of the product inferred from the product name", "type": "string"},
        "season": {"title": "Season", "description": "Season of the product inferred from the product name", "type": "string"}
      }
    }
  }
}

Models and parameters

We used the following prompting parameters:

For Llama 2, the model choices were meta.llama2-13b-chat-v1 or meta.llama2-70b-chat-v1. We used the following LLM parameters:

{    "temperature": 0.1,    "top_p": 0.9,    "max_gen_len": 2048,}

For Anthropic’s Claude, the model choices were anthropic.claude-instant-v1 and anthropic.claude-v2. We used the following LLM parameters:

{   "temperature": 0.1,   "top_k": 250,   "top_p": 1,   "max_tokens_to_sample": 4096,   "stop_sequences": ["\n\nHuman:"],   "anthropic_version": "bedrock-2023-05-31"}

The solution is straightforward to extend to other LLMs hosted on Amazon Bedrock, such as Amazon Titan (switch the model ID to amazon.titan-tg1-large, for example), Jurassic (model ID ai21.j2-ultra), and more.

Evaluations

The framework includes evaluation metrics that can be extended further to accommodate changes in accuracy requirements. Currently, it involves five different metrics:
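As one example, the final content coverage reported in the following results can be computed as the fraction of generated fields that carry a usable (non-missing) value. The following is a minimal sketch of such a computation; the column names and missing-value conventions are illustrative.

import pandas as pd

MISSING_VALUES = {"", "n/a", "na", "none", "null", "unknown", "missing"}

def content_coverage(df: pd.DataFrame, field_columns: list) -> float:
    # Fraction of (row, field) cells that carry a usable, non-missing value
    cells = df[field_columns].astype(str).apply(lambda col: col.str.strip().str.lower())
    return float((~cells.isin(MISSING_VALUES)).to_numpy().mean())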

Results

The following are the approximate sample input and output lengths under some best performing settings:

Quantitative results

The following table summarizes our consolidated quantitative results.

Batch process service | Model | Prompt | Batch latency, 5-packing (test set = 20) | Batch latency, 5-packing (test set = 5k) | GoDaddy requirement @ 5k | Near-real-time latency, 1-packing | Recall on parsing exact match | Final content coverage
Amazon Bedrock batch inference | Llama2-13b | zero-shot | n/a | n/a | 3600s | n/a | n/a | n/a
Amazon Bedrock batch inference | Llama2-13b | 5-shot (template12) | 65.4s | 1704s | 3600s | 72/20 = 3.6s | 92.60% | 53.90%
Amazon Bedrock batch inference | Llama2-70b | zero-shot | n/a | n/a | 3600s | n/a | n/a | n/a
Amazon Bedrock batch inference | Llama2-70b | 5-shot (template13) | 139.6s | 5299s | 3600s | 156/20 = 7.8s | 98.30% | 61.50%
Amazon Bedrock batch inference | Claude-v1 (instant) | zero-shot (template6) | 29s | 723s | 3600s | 44.8/20 = 2.24s | 98.50% | 96.80%
Amazon Bedrock batch inference | Claude-v1 (instant) | 5-shot (template12) | 30.3s | 644s | 3600s | 51/20 = 2.6s | 99% | 84.40%
Amazon Bedrock batch inference | Claude-v2 | zero-shot (template6) | 82.2s | 1706s | 3600s | 104/20 = 5.2s | 99% | 84.40%
Amazon Bedrock batch inference | Claude-v2 | 5-shot (template14) | 49.1s | 1323s | 3600s | 104/20 = 5.2s | 99.40% | 90.10%

The following tables summarize the scaling effect in batch inference.

Batch process service | Model | Prompt | Batch latency, 5-packing (test set = 20) | Batch latency, 5-packing (test set = 5k) | GoDaddy requirement @ 5k | Batch latency, 5-packing (test set = 100k) | Near-real-time latency, 1-packing
Amazon Bedrock batch | Claude-v1 (instant) | zero-shot (template6) | 29s | 723s | 3600s | 5733s | 44.8/20 = 2.24s
Amazon Bedrock batch | Claude-v2 | zero-shot (template6) | 82.2s | 1706s | 3600s | 7689s | 104/20 = 5.2s

Batch process service | Model | Near-real-time latency, 1-packing | Parsing recall on product name (test set = 5k) | Parsing recall on product name (test set = 100k) | Final content coverage (test set = 5k) | Final content coverage (test set = 100k)
Amazon Bedrock batch | Claude-v1 (instant) | 44.8/20 = 2.24s | 98.50% | 98.40% | 96.80% | 96.50%
Amazon Bedrock batch | Claude-v2 | 104/20 = 5.2s | 99% | 98.80% | 84.40% | 97%

The following table summarizes the effect of n-packing. Llama 2 has an output length limit of 2,048 tokens, which fits up to around 20 packed products; Anthropic's Claude has a higher limit. We tested 20 ground truth samples at packing levels of 1, 5, and 10 and selected results across the model and prompt templates. The scaling effect on latency was more pronounced for the Anthropic Claude model family than for Llama 2, and Anthropic's Claude also generalized better than Llama 2 as the packing number in the output increased.

We only tried few-shot prompting with the Llama 2 models, which showed improved accuracy over zero-shot.

Batch process service | Model | Prompt | Latency (test set = 20), npack = 1 | Latency, npack = 5 | Latency, npack = 10 | Final coverage, npack = 1 | Final coverage, npack = 5 | Final coverage, npack = 10
Amazon Bedrock batch inference | Llama2-13b | 5-shot (template12) | 72s | 65.4s | 65s | 95.90% | 93.20% | 88.90%
Amazon Bedrock batch inference | Llama2-70b | 5-shot (template13) | 156s | 139.6s | 150s | 85% | 97.70% | 100%
Amazon Bedrock batch inference | Claude-v1 (instant) | zero-shot (template6) | 45s | 29s | 27s | 99.50% | 99.50% | 99.30%
Amazon Bedrock batch inference | Claude-v1 (instant) | 5-shot (template12) | 51.3s | 30.3s | 27.4s | 99.50% | 99.50% | 100%
Amazon Bedrock batch inference | Claude-v2 | zero-shot (template6) | 104s | 82.2s | 67s | 85% | 97.70% | 94.50%
Amazon Bedrock batch inference | Claude-v2 | 5-shot (template14) | 104s | 49.1s | 43.5s | 97.70% | 100% | 99.80%

Qualitative results

We noted the following qualitative results:

Business takeaways

We had the following key business takeaways:

Technical takeaways

We had the following key technical takeaways:

Next steps for GoDaddy

The following are the recommendations that the GoDaddy team is considering as a part of future steps:

Conclusion

In this post, we shared how the Generative AI Innovation Center team worked with GoDaddy to create a more accurate and cost-efficient generative AI-based solution using batch inference in Amazon Bedrock, helping GoDaddy improve their existing product categorization system. We implemented n-packing techniques and used Anthropic's Claude and Meta Llama 2 models to improve latency. We experimented with different prompts to improve categorization with LLMs and found that the Anthropic Claude model family gave better accuracy and generalizability than the Llama 2 model family. The GoDaddy team will test this solution on a larger dataset and evaluate the categories generated by the recommended approaches.

If you’re interested in working with the AWS Generative AI Innovation Center, please reach out.

Security Best Practices



About the Authors

Vishal Singh is a Data Engineering leader on the Data and Analytics team at GoDaddy. His key focus is building data products and generating insights from them by applying data engineering tools along with generative AI.

Yun Zhou is an Applied Scientist at AWS where he helps with research and development to ensure the success of AWS customers. He works on pioneering solutions for various industries using statistical modeling and machine learning techniques. His interest includes generative models and sequential data modeling.

Meghana Ashok is a Machine Learning Engineer at the Generative AI Innovation Center. She collaborates closely with customers, guiding them in developing secure, cost-efficient, and resilient solutions and infrastructure tailored to their generative AI needs.

Karan Sindwani is an Applied Scientist at AWS where he works with AWS customers across different verticals to accelerate their use of Gen AI and AWS Cloud services to solve their business challenges.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he uses his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.
