How GoDaddy built a category generation system at scale with batch inference for Amazon Bedrock

This post was co-written with Vishal Singh, Data Engineering Leader on the Data & Analytics team at GoDaddy.

Generative AI solutions have the potential to transform businesses by boosting productivity and improving customer experiences, and using large language models (LLMs) in these solutions has become increasingly popular. However, inference of LLMs as single model invocations or API calls doesn’t scale well with many applications in production.

With batch inference, you can run multiple inference requests asynchronously to process a large number of requests efficiently. You can also use batch inference to improve the performance of model inference on large datasets.

This post provides an overview of a custom solution developed by the AWS Generative AI Innovation Center for GoDaddy, a domain registrar, registry, web hosting, and ecommerce company that seeks to make entrepreneurship more accessible by using generative AI to provide personalized business insights to over 21 million customers, insights that were previously available only to large corporations. In this collaboration, the Generative AI Innovation Center team created an accurate and cost-efficient generative AI-based solution using batch inference in Amazon Bedrock, helping GoDaddy improve their existing product categorization system.

Solution overview

GoDaddy wanted to enhance their product categorization system that assigns categories to products based on their names. For example:

Input: Fruit by the Foot Starburst
Output: color -> multi-colored, material -> candy, category -> snacks, product_line -> Fruit by the Foot, …

GoDaddy used an out-of-the-box Meta Llama 2 model to generate product categories for six million products, where each product is identified by a SKU. The generated categories were often incomplete or mislabeled, and categorizing each product individually with an LLM proved costly. Recognizing these limitations, GoDaddy sought a more accurate and cost-efficient approach to product categorization to improve their customer experience.

This solution uses the following components to categorize products more accurately and efficiently:

Amazon Bedrock batch inference to run the LLM-based categorization at scale
Amazon Simple Storage Service (Amazon S3) to store the input product data and the generated outputs
AWS Lambda functions to start the batch job and monitor its progress

The key steps are illustrated in the following figure:

1. A JSONL file containing product data is uploaded to an S3 bucket, triggering the first Lambda function.
2. The Lambda function spins up an Amazon Bedrock batch processing endpoint and passes the S3 file location.
3. Amazon Bedrock batch processes this single JSONL file, where each row contains input parameters and prompts, and generates an output JSONL file with a new model_output value appended to each row, corresponding to the input data.
4. The Amazon Bedrock endpoint performs the following tasks:
   a. It reads the product name data and generates a categorized output, including category, subcategory, season, price range, material, color, product line, gender, and year of first sale.
   b. It writes the output to another S3 location.
5. The second Lambda function performs the following tasks:
   a. It monitors the batch processing job on Amazon Bedrock.
   b. It shuts down the endpoint when processing is complete.
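For illustration, the following is a minimal sketch of what the first Lambda handler could look like. The environment variables, model ID, and job-naming scheme are assumptions for the sketch, not GoDaddy's production code.

import os
import time
import boto3

bedrock = boto3.client("bedrock")  # Amazon Bedrock control-plane client for batch jobs

def lambda_handler(event, context):
    # Triggered by the S3 upload of the product JSONL file
    record = event["Records"][0]["s3"]
    input_uri = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    # Start the Amazon Bedrock batch inference (model invocation) job
    response = bedrock.create_model_invocation_job(
        jobName=f"product-categorization-{int(time.time())}",
        modelId="anthropic.claude-instant-v1",           # assumed model choice
        roleArn=os.environ["BEDROCK_BATCH_ROLE_ARN"],    # assumed env var: IAM role with S3 access
        inputDataConfig={"s3InputDataConfig": {"s3Uri": input_uri, "s3InputFormat": "JSONL"}},
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": os.environ["OUTPUT_S3_URI"]}},  # assumed env var
    )
    return {"jobArn": response["jobArn"]}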

The security measures are inherently integrated into the AWS services employed in this architecture. For detailed information, refer to the Security Best Practices section of this post.

We used a dataset that consisted of 30 labeled data points and 100,000 unlabeled test data points. The labeled data points were generated by llama2-7b and verified by a human subject matter expert (SME). As shown in the following screenshot of the sample ground truth, some fields have N/A or missing values, which isn’t ideal because GoDaddy wants a solution with high coverage for downstream predictive modeling. Higher coverage for each possible field can provide more business insights to their customers.

The distribution of the number of words or tokens per SKU shows only mild outliers, which makes it suitable to bundle many products into a single prompt for categorization and enables potentially more efficient model responses.

The solution delivers a comprehensive framework for generating insights within GoDaddy’s product categorization system. It’s designed to be compatible with a range of LLMs on Amazon Bedrock, features customizable prompt templates, and supports batch and real-time (online) inferences. Additionally, the framework includes evaluation metrics that can be extended to accommodate changes in accuracy requirements.

In the following sections, we look at the key components of the solution in more detail.

Batch inference

We used Amazon Bedrock for batch inference processing. Amazon Bedrock provides the CreateModelInvocationJob API to create a batch job with a unique job name. This API returns a response containing jobArn. Refer to the following code:

Request:
POST /model-invocation-job HTTP/1.1
Content-type: application/json
{
  "clientRequestToken": "string",
  "inputDataConfig": {
    "s3InputDataConfig": {
      "s3Uri": "string",
      "s3InputFormat": "JSONL"
    }
  },
  "jobName": "string",
  "modelId": "string",
  "outputDataConfig": {
    "s3OutputDataConfig": {
      "s3Uri": "string"
    }
  },
  "roleArn": "string",
  "tags": [
    {
      "key": "string",
      "value": "string"
    }
  ]
}

Response:
HTTP/1.1 200
Content-type: application/json
{
  "jobArn": "string"
}

We can monitor the job status using GetModelInvocationJob with the jobArn returned on job creation. The following are valid statuses during the lifecycle of a job:

The following is example code for the GetModelInvocationJob API:

GET /model-invocation-job/jobIdentifier HTTP/1.1

Response:
{
  'ResponseMetadata': {
    'RequestId': '081afa52-189f-4e83-a3f9-aa0918d902f4',
    'HTTPStatusCode': 200,
    'HTTPHeaders': {
      'date': 'Tue, 09 Jan 2024 17:00:16 GMT',
      'content-type': 'application/json',
      'content-length': '690',
      'connection': 'keep-alive',
      'x-amzn-requestid': '081afa52-189f-4e83-a3f9-aa0918d902f4'
    },
    'RetryAttempts': 0
  },
  'jobArn': 'arn:aws:bedrock:<region>:<account-id>:model-invocation-job/<id>',
  'jobName': 'job47',
  'modelId': 'arn:aws:bedrock:<region>::foundation-model/anthropic.claude-instant-v1:2',
  'status': 'Submitted',
  'submitTime': datetime.datetime(2024, 1, 8, 21, 44, 38, 611000, tzinfo=tzlocal()),
  'lastModifiedTime': datetime.datetime(2024, 1, 8, 23, 5, 47, 169000, tzinfo=tzlocal()),
  'inputDataConfig': {'s3InputDataConfig': {'s3Uri': <path to input jsonl file>}},
  'outputDataConfig': {'s3OutputDataConfig': {'s3Uri': <path to output jsonl.out file>}}
}
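The second Lambda function can poll this API until the job reaches a terminal state. The following is a minimal sketch with boto3; the polling interval and the set of terminal statuses checked here are illustrative.

import time
import boto3

bedrock = boto3.client("bedrock")

def wait_for_job(job_arn: str, poll_seconds: int = 60) -> str:
    # Poll the Amazon Bedrock batch job until it reaches a terminal state
    while True:
        status = bedrock.get_model_invocation_job(jobIdentifier=job_arn)["status"]
        if status in ("Completed", "Failed", "Stopped"):
            return status
        time.sleep(poll_seconds)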

When the job is complete, the S3 path specified in s3OutputDataConfig will contain a new folder with an alphanumeric name. The folder contains two files: the output JSONL file (with a .jsonl.out suffix) and a manifest file with run statistics such as the following:

{   "processedRecordCount":<number>,   "successRecordCount":<number>,   "errorRecordCount":<number>,   "inputTokenCount":<number>,   "outputTokenCount":<number>}

We then process the jsonl.out file in Amazon S3. This file is parsed using LangChain’s PydanticOutputParser to generate a .csv file. The PydanticOutputParser requires a schema to be able to parse the JSON generated by the LLM. We created a CCData class that contains the list of categories to be generated for each product as shown in the following code example. Because we enable n-packing, we wrap the schema with a List, as defined in List_of_CCData.

from typing import List, Optional
from pydantic import BaseModel, Field

class CCData(BaseModel):
    product_name: Optional[str] = Field(default=None, description="product name, which will be given as input")
    brand: Optional[str] = Field(default=None, description="Brand of the product inferred from the product name")
    color: Optional[str] = Field(default=None, description="Color of the product inferred from the product name")
    material: Optional[str] = Field(default=None, description="Material of the product inferred from the product name")
    price: Optional[str] = Field(default=None, description="Price of the product inferred from the product name")
    category: Optional[str] = Field(default=None, description="Category of the product inferred from the product name")
    sub_category: Optional[str] = Field(default=None, description="Sub-category of the product inferred from the product name")
    product_line: Optional[str] = Field(default=None, description="Product Line of the product inferred from the product name")
    gender: Optional[str] = Field(default=None, description="Gender of the product inferred from the product name")
    year_of_first_sale: Optional[str] = Field(default=None, description="Year of first sale of the product inferred from the product name")
    season: Optional[str] = Field(default=None, description="Season of the product inferred from the product name")

class List_of_CCData(BaseModel):
    list_of_dict: List[CCData]

We also use OutputFixingParser to handle situations where the initial parsing attempt fails. The following screenshot shows a sample generated .csv file.
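The following is a minimal sketch of how the schema, PydanticOutputParser, and OutputFixingParser can be wired together; the helper function and the fallback LLM argument are illustrative, not the exact production code.

from langchain.output_parsers import OutputFixingParser, PydanticOutputParser

# Base parser enforces the List_of_CCData schema defined above
parser = PydanticOutputParser(pydantic_object=List_of_CCData)

# The schema-derived format instructions are injected into the prompt template
format_instructions = parser.get_format_instructions()

def parse_model_output(raw_text, fallback_llm=None):
    # Parse one model_output string; optionally retry with an LLM-backed fixing parser
    try:
        return parser.parse(raw_text)
    except Exception:
        if fallback_llm is None:
            raise
        fixing_parser = OutputFixingParser.from_llm(parser=parser, llm=fallback_llm)
        return fixing_parser.parse(raw_text)

Each parsed List_of_CCData object can then be flattened into rows of the generated .csv file.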

Prompt engineering

Prompt engineering involves the skillful crafting and refining of input prompts. This process entails choosing the right words, phrases, sentences, punctuation, and separator characters to use LLMs efficiently across diverse applications. Essentially, prompt engineering is about interacting effectively with an LLM. The most effective strategy varies with the specific task and data; in this case, generating category data for GoDaddy SKUs.

Prompts consist of particular inputs from the user that direct LLMs to produce a suitable response or output based on a specified task or instruction. These prompts include several elements, such as the task or instruction itself, the surrounding context, full examples, and the input text that guides LLMs in crafting their responses. The composition of the prompt will vary based on factors like the specific use case, data availability, and the nature of the task at hand. For example, in a Retrieval Augmented Generation (RAG) use case, we provide additional context and add a user-supplied query in the prompt that asks the LLM to focus on contexts that can answer the query. In a metadata generation use case, we can provide the image and ask the LLM to generate a description and keywords describing the image in a specific format.

In this post, we divide the prompt engineering solution into two steps: output generation and format parsing.

Output generation

The following are best practices and considerations for output generation:

<example>
H: <question> The list of product names is:
{few_shot_product_name} </question>
A: <response> The category information generated with absolutely no missing value, in JSON format is:
{few_shot_field} </response>
</example>
\n\nHuman:......\n\nAssistant:
\n\nAssistant: Here are the answer with NO missing, unknown, null, or N/A values (in JSON format):
[INST]
If the list of product names is:
{few_shot_product_name}
[/INST]
Then the answer with NO missing, unknown, null, or N/A values is (in JSON format):
{few_shot_field}
[INST]
If the list of product names is:
{product_name}
[/INST]
Then the answer with NO missing, unknown, null, or N/A values is (in JSON format):
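To illustrate how these pieces combine with n-packing, the following sketch assembles a packed prompt in the Anthropic Claude style. The helper function, the N_PACK constant, and the simplified template are assumptions that mirror, but do not reproduce, the exact templates above.

N_PACK = 5  # number of SKUs bundled into a single request

def build_prompt(product_names, few_shot_product_name, few_shot_field, format_instructions):
    # Bundle up to N_PACK product names into one Human/Assistant-style prompt
    packed_names = "\n".join(product_names[:N_PACK])
    return (
        "\n\nHuman: You are a Product Information Manager, Taxonomist, and Categorization Expert "
        "who follows instruction well.\n"
        f"{format_instructions}\n"
        "<example>\n"
        f"H: <question> The list of product names is:\n{few_shot_product_name} </question>\n"
        f"A: <response> The category information generated with absolutely no missing value, "
        f"in JSON format is:\n{few_shot_field} </response>\n"
        "</example>\n"
        f"The list of product names is:\n{packed_names}"
        "\n\nAssistant: Here is the answer with NO missing, unknown, null, or N/A values (in JSON format):"
    )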

Format parsing

The following are best practices and considerations for format parsing:

You are a Product Information Manager, Taxonomist, and Categorization Expert who follows instruction well.

EVERY category information needs to be filled based on BOTH the product name AND your best guess. If you forget to generate any category information, or leave it as missing or N/A, then an innocent person will die.

few_shot_field='{\"list_of_dict\"' +':[' +', \n'.join([true_df.iloc[i].to_json() for i in range(num_few_shot)]) +']}'
{    "list_of_dict": [{        "some_category": "your_generated_answer",        "another_category": "your_generated_answer",    },    {        <category information for the 2st product name, in json format>    },    {        <category information for the 3st product name, in json format>    },// ... {additional product information, in json format} ...    }]}

Format your output in the JSON format (ensure to escape special character):
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:

{
  "properties": {
    "list_of_dict": {
      "title": "List Of Dict",
      "type": "array",
      "items": {"$ref": "#/definitions/CCData"}
    }
  },
  "required": ["list_of_dict"],
  "definitions": {
    "CCData": {
      "title": "CCData",
      "type": "object",
      "properties": {
        "product_name": {"title": "Product Name", "description": "product name, which will be given as input", "type": "string"},
        "brand": {"title": "Brand", "description": "Brand of the product inferred from the product name", "type": "string"},
        "color": {"title": "Color", "description": "Color of the product inferred from the product name", "type": "string"},
        "material": {"title": "Material", "description": "Material of the product inferred from the product name", "type": "string"},
        "price": {"title": "Price", "description": "Price of the product inferred from the product name", "type": "string"},
        "category": {"title": "Category", "description": "Category of the product inferred from the product name", "type": "string"},
        "sub_category": {"title": "Sub Category", "description": "Sub-category of the product inferred from the product name", "type": "string"},
        "product_line": {"title": "Product Line", "description": "Product Line of the product inferred from the product name", "type": "string"},
        "gender": {"title": "Gender", "description": "Gender of the product inferred from the product name", "type": "string"},
        "year_of_first_sale": {"title": "Year Of First Sale", "description": "Year of first sale of the product inferred from the product name", "type": "string"},
        "season": {"title": "Season", "description": "Season of the product inferred from the product name", "type": "string"}
      }
    }
  }
}

Models and parameters

We used the following prompting parameters:

For Llama 2, the model choices were meta.llama2-13b-chat-v1 or meta.llama2-70b-chat-v1. We used the following LLM parameters:

{    "temperature": 0.1,    "top_p": 0.9,    "max_gen_len": 2048,}

For Anthropic’s Claude, the model choices were anthropic.claude-instant-v1 and anthropic.claude-v2. We used the following LLM parameters:

{   "temperature": 0.1,   "top_k": 250,   "top_p": 1,   "max_tokens_to_sample": 4096,   "stop_sequences": ["\n\nHuman:"],   "anthropic_version": "bedrock-2023-05-31"}

The solution is straightforward to extend to other LLMs hosted on Amazon Bedrock, such as Amazon Titan (switch the model ID to amazon.titan-tg1-large, for example), Jurassic (model ID ai21.j2-ultra), and more.

Evaluations

The framework includes evaluation metrics that can be extended further to accommodate changes in accuracy requirements. Currently, it involves five different metrics:
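As one example, the final content coverage reported in the following results can be computed as the fraction of generated fields that carry a usable (non-missing) value. The following is a minimal sketch of such a computation; the column names and missing-value conventions are illustrative.

import pandas as pd

MISSING_VALUES = {"", "n/a", "na", "none", "null", "unknown", "missing"}

def content_coverage(df: pd.DataFrame, field_columns: list) -> float:
    # Fraction of (row, field) cells that carry a usable, non-missing value
    cells = df[field_columns].astype(str).apply(lambda col: col.str.strip().str.lower())
    return float((~cells.isin(MISSING_VALUES)).to_numpy().mean())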

Results

The following are the approximate sample input and output lengths under some best performing settings:

Quantitative results

The following table summarizes our consolidated quantitative results.

Batch process service | Model | Prompt | Batch latency, 5-packing (test set = 20) | Batch latency, 5-packing (test set = 5k) | GoDaddy requirement @ 5k | Near-real-time latency, 1-packing | Recall on parsing exact match | Final content coverage
Amazon Bedrock batch inference | Llama2-13b | zero-shot | n/a | n/a | 3600s | n/a | n/a | n/a
Amazon Bedrock batch inference | Llama2-13b | 5-shot (template12) | 65.4s | 1704s | 3600s | 72/20 = 3.6s | 92.60% | 53.90%
Amazon Bedrock batch inference | Llama2-70b | zero-shot | n/a | n/a | 3600s | n/a | n/a | n/a
Amazon Bedrock batch inference | Llama2-70b | 5-shot (template13) | 139.6s | 5299s | 3600s | 156/20 = 7.8s | 98.30% | 61.50%
Amazon Bedrock batch inference | Claude-v1 (instant) | zero-shot (template6) | 29s | 723s | 3600s | 44.8/20 = 2.24s | 98.50% | 96.80%
Amazon Bedrock batch inference | Claude-v1 (instant) | 5-shot (template12) | 30.3s | 644s | 3600s | 51/20 = 2.6s | 99% | 84.40%
Amazon Bedrock batch inference | Claude-v2 | zero-shot (template6) | 82.2s | 1706s | 3600s | 104/20 = 5.2s | 99% | 84.40%
Amazon Bedrock batch inference | Claude-v2 | 5-shot (template14) | 49.1s | 1323s | 3600s | 104/20 = 5.2s | 99.40% | 90.10%

The following tables summarize the scaling effect in batch inference.

Batch process service | Model | Prompt | Batch latency, 5-packing (test set = 20) | Batch latency, 5-packing (test set = 5k) | GoDaddy requirement @ 5k | Batch latency, 5-packing (test set = 100k) | Near-real-time latency, 1-packing
Amazon Bedrock batch | Claude-v1 (instant) | zero-shot (template6) | 29s | 723s | 3600s | 5733s | 44.8/20 = 2.24s
Amazon Bedrock batch | Claude-v2 | zero-shot (template6) | 82.2s | 1706s | 3600s | 7689s | 104/20 = 5.2s

Batch process service | Model | Near-real-time latency, 1-packing | Parsing recall on product name (test set = 5k) | Parsing recall on product name (test set = 100k) | Final content coverage (test set = 5k) | Final content coverage (test set = 100k)
Amazon Bedrock batch | Claude-v1 (instant) | 44.8/20 = 2.24s | 98.50% | 98.40% | 96.80% | 96.50%
Amazon Bedrock batch | Claude-v2 | 104/20 = 5.2s | 99% | 98.80% | 84.40% | 97%

The following table summarizes the effect of n-packing. Llama 2 has an output length limit of 2,048 tokens, which fits up to around 20 packed products; Anthropic's Claude has a higher limit. We tested 20 ground truth samples at packing levels of 1, 5, and 10 and selected results across the model and prompt templates. The scaling effect on latency was more pronounced for the Anthropic Claude model family than for Llama 2, and Anthropic's Claude also generalized better than Llama 2 as the packing number in the output increased.

We only tried few-shot prompting with the Llama 2 models, which showed improved accuracy over zero-shot.

Batch process service | Model | Prompt | Latency (test set = 20), npack = 1 | Latency, npack = 5 | Latency, npack = 10 | Final coverage, npack = 1 | Final coverage, npack = 5 | Final coverage, npack = 10
Amazon Bedrock batch inference | Llama2-13b | 5-shot (template12) | 72s | 65.4s | 65s | 95.90% | 93.20% | 88.90%
Amazon Bedrock batch inference | Llama2-70b | 5-shot (template13) | 156s | 139.6s | 150s | 85% | 97.70% | 100%
Amazon Bedrock batch inference | Claude-v1 (instant) | zero-shot (template6) | 45s | 29s | 27s | 99.50% | 99.50% | 99.30%
Amazon Bedrock batch inference | Claude-v1 (instant) | 5-shot (template12) | 51.3s | 30.3s | 27.4s | 99.50% | 99.50% | 100%
Amazon Bedrock batch inference | Claude-v2 | zero-shot (template6) | 104s | 82.2s | 67s | 85% | 97.70% | 94.50%
Amazon Bedrock batch inference | Claude-v2 | 5-shot (template14) | 104s | 49.1s | 43.5s | 97.70% | 100% | 99.80%

Qualitative results

We noted the following qualitative results:

Business takeaways

We had the following key business takeaways:

Technical takeaways

We had the following key technical takeaways:

Next steps for GoDaddy

The following are the recommendations that the GoDaddy team is considering as a part of future steps:

Conclusion

In this post, we shared how the Generative AI Innovation Center team worked with GoDaddy to create a more accurate and cost-efficient generative AI-based solution using batch inference in Amazon Bedrock, helping GoDaddy improve their existing product categorization system. We implemented n-packing techniques and used Anthropic's Claude and Meta Llama 2 models to improve latency. We experimented with different prompts to improve categorization with LLMs and found that the Anthropic Claude model family gave better accuracy and generalizability than the Llama 2 model family. The GoDaddy team will test this solution on a larger dataset and evaluate the categories generated by the recommended approaches.

If you’re interested in working with the AWS Generative AI Innovation Center, please reach out.

Security Best Practices



About the Authors

Vishal Singh is a Data Engineering leader on the Data and Analytics team at GoDaddy. His key focus is building data products and generating insights from them by applying data engineering tools along with generative AI.

Yun Zhou is an Applied Scientist at AWS where he helps with research and development to ensure the success of AWS customers. He works on pioneering solutions for various industries using statistical modeling and machine learning techniques. His interest includes generative models and sequential data modeling.

Meghana Ashok is a Machine Learning Engineer at the Generative AI Innovation Center. She collaborates closely with customers, guiding them in developing secure, cost-efficient, and resilient solutions and infrastructure tailored to their generative AI needs.

Karan Sindwani is an Applied Scientist at AWS where he works with AWS customers across different verticals to accelerate their use of Gen AI and AWS Cloud services to solve their business challenges.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he uses his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.
