AWS Machine Learning Blog
AI judging AI: Scaling unstructured text analysis with Amazon Nova

Faced with massive volumes of customer feedback, traditional manual analysis is time- and labor-intensive. This post describes how to use Amazon Bedrock to deploy large language models (LLMs) as a "jury" system that analyzes and validates LLM-generated text summaries. By deploying several frontier models, such as Claude 3 Sonnet, Amazon Nova Pro, and Llama 3, and applying evaluation metrics such as percentage agreement, Cohen's kappa, Krippendorff's alpha, and Spearman's rho, you can effectively measure the alignment and accuracy of model outputs. This approach not only processes text data at scale, but also reduces single-model bias and errors through multi-model cross-validation, while retaining human oversight to capture subtle contextual nuances, giving organizations a more reliable and scalable solution for analyzing customer feedback.

🎯 **LLMs as a jury improve analysis efficiency and accuracy**: With large volumes of customer feedback, traditional manual analysis is slow. The article proposes deploying an LLM "jury" system on Amazon Bedrock, in which multiple large language models work together to evaluate and validate LLM-generated text summaries, significantly improving the efficiency and reliability of the analysis and mitigating the bias of any single model.

⚖️ **Multi-model evaluation with quantitative metrics**: The system lets users compare and deploy multiple frontier foundation models (FMs) on Amazon Bedrock, such as Anthropic's Claude 3 Sonnet, Amazon Nova Pro, and Meta's Llama 3. With carefully crafted prompts, an LLM generates thematic summaries, which other LLMs then cross-evaluate as "jury" members against specific criteria (such as a 1-3 alignment score). The article details statistical metrics, including percentage agreement, Cohen's kappa, Krippendorff's alpha, and Spearman's rho, for quantifying and comparing agreement among models and between models and human raters.

🛠️ **Technical implementation and deployment walkthrough**: The article provides detailed implementation steps, including preparing data in Amazon SageMaker Studio, building the theme-generation prompt, invoking Amazon Bedrock models to generate summaries, and designing the "LLM jury" prompt and running the models for evaluation. It also lists the prerequisites, such as an AWS account, access to Amazon Bedrock, SageMaker, and Amazon S3, and basic knowledge of Python and Jupyter notebooks, and links to a Jupyter notebook on GitHub for hands-on practice.

💡 **Cost and security considerations and outlook**: For large-scale deployments, the article recommends managing costs with SageMaker managed Spot instances, batch inference, and caching of intermediate results. For sensitive data, it recommends enabling S3 encryption, using least-privilege IAM roles, and using VPC endpoints. The results show inter-model agreement among LLMs of up to 91%, compared with human-to-model agreement of up to 79%, indicating that LLMs perform well for evaluation at scale while human oversight remains essential for capturing subtle context. The technique gives organizations a scalable new path for analyzing text data.

Picture this: Your team just received 10,000 customer feedback responses. The traditional approach? Weeks of manual analysis. But what if AI could not only analyze this feedback but also validate its own work? Welcome to the world of large language model (LLM) jury systems deployed using Amazon Bedrock.

As more organizations embrace generative AI, particularly LLMs, for various applications, a new challenge has emerged: ensuring that the output from these AI models aligns with human perspectives and is accurate and relevant to the business context. Manual analysis of large datasets can be time-consuming and resource intensive, and thus impractical. For example, manually reviewing 2,000 comments can take over 80 hours, depending on comment length, complexity, and the depth of analysis required. LLMs offer a scalable alternative: they can serve as qualitative text annotators, summarizers, and even judges that evaluate text outputs from other AI systems.

This prompts the question, “But how can we deploy such LLM-as-a-judge systems effectively and then use other LLMs to evaluate performance?”

In this post, we highlight how you can deploy multiple generative AI models in Amazon Bedrock to instruct an LLM to create thematic summaries of text responses (such as answers to open-ended survey questions from your customers) and then use multiple LLMs as a jury to review these LLM-generated summaries and assign a rating that judges the content alignment between the summary title and summary description. This setup is often referred to as an LLM jury system. Think of the LLM jury as a panel of AI judges, each bringing its own perspective to evaluate content. Instead of relying on a single model's potentially biased view, multiple models work together to provide a more balanced assessment.

Problem: Analyzing text feedback

Your organization receives thousands of customer feedback responses. Traditional manual analysis can take days or weeks of painstaking, resource-intensive work, depending on the volume of free-text comments you receive. Alternative natural language processing techniques, though likely faster, also require extensive data cleanup and coding know-how to analyze the data effectively. Pre-trained LLMs offer a promising, relatively low-code way to quickly generate thematic summaries from text-based data, because these models have been shown to scale data analysis and reduce manual review time. However, relying on a single pre-trained LLM for both analysis and evaluation raises concerns about biases, such as model hallucinations (that is, producing inaccurate information) or confirmation bias (that is, favoring expected outcomes). Without cross-validation mechanisms, such as comparing outputs from multiple models or benchmarking against human-reviewed data, the risk of unchecked errors increases. Using multiple pre-trained LLMs addresses this concern by providing more robust and comprehensive analyses, allowing for human-in-the-loop oversight, and enhancing reliability over a single-model evaluation. Using LLMs as a jury means deploying multiple generative AI models to independently evaluate or validate each other's outputs.

Solution: Deploy LLM as judges on Amazon Bedrock

You can use Amazon Bedrock to compare various frontier foundation models (FMs) such as Anthropic's Claude 3 Sonnet, Amazon Nova Pro, and Meta's Llama 3. The unified Amazon Web Services (AWS) environment and standardized API calls simplify deploying multiple models for thematic analysis and for judging model outputs. Amazon Bedrock also addresses operational needs with unified security and compliance controls and a consistent deployment environment across all models.
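To make the model-comparison point concrete, the following minimal sketch (not part of the original walkthrough) uses the Amazon Bedrock control-plane API to list the text-generation FMs available in your account, which is a quick way to collect candidate model IDs for the summarizer and jury roles.

import boto3

# Control-plane client for model metadata (distinct from the runtime client used to invoke models)
bedrock = boto3.client('bedrock')

# List text-output foundation models you could rotate through as summarizers or judges
response = bedrock.list_foundation_models(byOutputModality='TEXT')
for model in response['modelSummaries']:
    print(model['providerName'], model['modelId'])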

Our proposed workflow, illustrated in the following diagram, includes these steps:

    1. The preprocessed raw data is prepared in a .txt file and uploaded into Amazon Bedrock.
    2. A thematic generation prompt is crafted and tested, then the data and prompt are run in Amazon SageMaker Studio using a pre-trained LLM of choice.
    3. The LLM-generated summaries are converted into a .txt file, and the summary data is uploaded into SageMaker Studio.
    4. An LLM-as-a-judge prompt is crafted and tested, and the summary data and prompt are run in SageMaker Studio using different pre-trained LLMs.
    5. Human-as-judge scores are then statistically compared against the model scores, using percentage agreement, Cohen's kappa, Krippendorff's alpha, and Spearman's rho.

Prerequisites

To complete the steps, you need to have the following prerequisites:

    An AWS account
    Access to Amazon Bedrock, Amazon SageMaker, and Amazon S3
    Basic knowledge of Python and Jupyter notebooks

Implementation details

In this section, we walk you through the step-by-step implementation.

Try this out for yourself by downloading the Jupyter notebook from GitHub.

    1. Create a SageMaker notebook instance to run the analysis, and then initialize Amazon Bedrock and configure the input and output file locations on Amazon S3. Save the text feedback you'd like to analyze as a .txt file in an S3 bucket. Use the following code:
import boto3
import json

# Initialize our connection to AWS services
bedrock = boto3.client('bedrock')
bedrock_runtime = boto3.client('bedrock-runtime')  # runtime client used to invoke models below
s3_client = boto3.client('s3')

# Configure where we'll store our evidence (data)
bucket = 'my-example-name'
raw_input = 'feedback_dummy_data.txt'
output_themes = 'feedback_analyzed.txt'
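To see the input side end to end, here is a hedged sketch (not from the original post) of a small helper that reads the raw feedback file from S3 and splits it into individual comments; it assumes the .txt file holds one comment per line.

def load_comments(bucket_name, key):
    # Download the raw feedback file and return one comment per non-empty line
    obj = s3_client.get_object(Bucket=bucket_name, Key=key)
    text = obj['Body'].read().decode('utf-8')
    return [line.strip() for line in text.splitlines() if line.strip()]

comments = load_comments(bucket, raw_input)
print(f"Loaded {len(comments)} comments")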
    2. Use Amazon Nova Pro in Amazon Bedrock to generate LLM-based thematic summaries for the feedback you want to analyze. Depending on your use case, you can use any one or several of the models offered by Amazon Bedrock for this step. The prompt provided here is generic and will need to be tuned for your specific use case so that the model of your choice has adequate context on your data to categorize themes appropriately:
def analyze_comment(comment):
    prompt = f"""You must respond ONLY with a valid JSON object.
    Analyze this customer review: "{comment}"
    Respond with this exact JSON structure:
    {{
        "main_theme": "theme here",
        "sub_theme": "sub-theme here",
        "rationale": "rationale here"
    }}
    """

    # Call pre-trained model through Bedrock
    response = bedrock_runtime.invoke_model(
        modelId=MODEL_ID,  # model of choice goes here
        body=json.dumps({
            "prompt": prompt,
            "max_tokens": 1000,
            "temperature": 0.1
        })
    )
    return parse_response(response)
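As a usage sketch (assumptions: parse_response returns the parsed JSON as a Python dict, and comments is the list loaded earlier), you might loop the helper over every comment and persist the generated themes back to S3 for the jury step:

results = []
for comment in comments:
    theme_record = analyze_comment(comment)  # dict with main_theme, sub_theme, rationale
    theme_record['comment'] = comment        # keep the original text alongside the themes
    results.append(theme_record)

# Persist the LLM-generated themes so the jury step can read them later
s3_client.put_object(
    Bucket=bucket,
    Key=output_themes,
    Body=json.dumps(results, ensure_ascii=False).encode('utf-8')
)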
    3. You can now use multiple LLMs as a jury to evaluate the themes generated by the LLM in the previous step. In our example, we use Amazon Nova Pro and Anthropic's Claude 3.5 Sonnet to each analyze the themes for each piece of feedback and provide an alignment score. Here, the alignment score is on a scale of 1-3, where 1 indicates poor alignment (the themes don't capture the main points), 2 indicates partial alignment (the themes capture some but not all key points), and 3 indicates strong alignment (the themes accurately capture the main points):
def evaluate_alignment_nova(comment, theme, subtheme, rationale):
    judge_prompt = f"""Rate theme alignment (1-3):
    Comment: "{comment}"
    Main Theme: {theme}
    Sub-theme: {subtheme}
    Rationale: {rationale}
    """
    # Complete code in attached notebook
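One way to turn the judges' free-text replies into comparable numbers is sketched below. It is not part of the original snippet, and it assumes that each evaluate_alignment_* helper returns the judge model's raw text reply, that evaluate_alignment_claude is a hypothetical Claude 3.5 Sonnet counterpart defined in the notebook, and that results is the list of theme records from the earlier usage sketch.

import re
import pandas as pd

def extract_alignment_score(reply_text):
    # Pull the first standalone 1, 2, or 3 out of a judge's reply
    match = re.search(r"\b([1-3])\b", reply_text)
    return int(match.group(1)) if match else None

# One row per feedback item, one column per judge, ready for the agreement metrics
rows = []
for record in results:
    args = (record['comment'], record['main_theme'], record['sub_theme'], record['rationale'])
    rows.append({
        'nova_pro': extract_alignment_score(evaluate_alignment_nova(*args)),
        'claude_sonnet': extract_alignment_score(evaluate_alignment_claude(*args)),
    })
ratings_df = pd.DataFrame(rows)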
    4. When you have the alignment scores from the LLMs, you can implement the following agreement metrics to compare and contrast the scores. If you also have ratings from human judges, you can quickly add those as another set of scores to discover how closely the human ratings (the gold standard) align with those of the models:
def calculate_agreement_metrics(ratings_df):
    return {
        'Percentage Agreement': calculate_percentage_agreement(ratings_df),
        'Cohens Kappa': calculate_pairwise_cohens_kappa(ratings_df),
        'Krippendorffs Alpha': calculate_krippendorffs_alpha(ratings_df),
        'Spearmans Rho': calculate_spearmans_rho(ratings_df)
    }
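The four helper functions referenced above are defined in the notebook; the sketch below shows one plausible implementation, assuming ratings_df has one column per rater and that the scikit-learn, SciPy, and open-source krippendorff packages are installed. The notebook's actual code may differ.

from itertools import combinations

import numpy as np
import krippendorff
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

def calculate_percentage_agreement(ratings_df):
    # Share of items where every rater gave the identical score
    return (ratings_df.nunique(axis=1) == 1).mean() * 100

def calculate_pairwise_cohens_kappa(ratings_df):
    # Average Cohen's kappa over every pair of rater columns
    kappas = [cohen_kappa_score(ratings_df[a], ratings_df[b])
              for a, b in combinations(ratings_df.columns, 2)]
    return float(np.mean(kappas))

def calculate_krippendorffs_alpha(ratings_df):
    # The krippendorff package expects raters as rows and items as columns
    return krippendorff.alpha(reliability_data=ratings_df.T.values,
                              level_of_measurement='ordinal')

def calculate_spearmans_rho(ratings_df):
    # Average Spearman correlation over every pair of rater columns
    rhos = [spearmanr(ratings_df[a], ratings_df[b]).correlation
            for a, b in combinations(ratings_df.columns, 2)]
    return float(np.mean(rhos))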

We used the following popular agreement metrics to compare alignment and therefore performance across and among models:

    Percentage agreement – Percentage agreement tells us how many times two raters provide the same rating (for example, 1–5) of the same thing, such as two people providing the same 5-star rating of a movie. The more times they agree, the better. This is expressed as a percentage of the total number of cases rated and is calculated by dividing the total agreements by the total number of ratings and multiplying by 100.

    Cohen's kappa – Cohen's kappa is essentially a smarter version of percentage agreement. It's like when two people guess how many of their 5 coworkers will wear blue in the office each day. Sometimes both people guess the same number (for example, 1–5) by chance. Cohen's kappa considers how well the two people agree, beyond any lucky guesses. The coefficients range from −1 to +1, where 1 represents perfect agreement, 0 represents agreement equivalent to chance, and negative values indicate agreement less than chance.

    Spearman's rho – Spearman's rho is like a friendship meter for numbers. It shows how well two sets of numbers "get along" or move together. If one set of numbers goes up and the other set also goes up, they have a positive relationship. If one goes up while the other goes down, they have a negative relationship. Coefficients range from −1 to +1, with values closer to ±1 indicating stronger correlations.

    Krippendorff's alpha – Krippendorff's alpha is a test used to determine how much all raters agree on something. Imagine two people taste-testing different foods at a restaurant and rating the foods on a scale of 1–5. Krippendorff's alpha provides a score to show how much the two people agree on their food ratings, even if they didn't taste every dish in the restaurant. The alpha coefficient ranges from 0–1, where values closer to 1 indicate higher agreement among raters. Generally, an alpha above 0.80 signifies strong agreement, an alpha between 0.67 and 0.80 indicates acceptable agreement, and an alpha below 0.67 suggests low agreement. If calculated with the rationale that the levels (1, 2, and 3) are ordinal, Krippendorff's alpha considers not only agreement but also the magnitude of disagreement. It's less affected by marginal distributions than kappa and provides a more nuanced assessment when ratings are ranked (ordinal). That is, although percentage agreement and kappa treat all disagreements equally, alpha recognizes the difference between minor disagreements (for example, "1" compared to "2") and major disagreements (for example, "1" compared to "3").
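To see how the metrics behave together, here is a toy illustration (not from the post) with two hypothetical judges scoring six summaries on the 1–3 alignment scale. They agree on five items and differ mildly (2 versus 3) on one, so percentage agreement is 5/6, about 83%, while kappa, alpha, and rho additionally account for chance agreement and for how far apart the disagreeing scores are.

import pandas as pd

toy_ratings = pd.DataFrame({
    'judge_a': [3, 3, 2, 1, 3, 2],
    'judge_b': [3, 3, 2, 1, 2, 2],
})
print(calculate_agreement_metrics(toy_ratings))  # uses the helper functions shown earlier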

Success! If you followed along, you have now successfully deployed multiple LLMs to judge thematic analysis output from an LLM.

Additional considerations

To help manage costs when running this solution, consider the following options:

    Use SageMaker managed Spot instances for the notebook workloads
    Batch your inference requests instead of invoking models one comment at a time
    Cache intermediate results so you don't reprocess data you've already analyzed

For sensitive data, consider the following options:

    Enable encryption for your S3 buckets (a minimal sketch follows this list)
    Use least-privilege IAM roles for SageMaker and Amazon Bedrock access
    Use VPC endpoints
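As a minimal sketch of the encryption option above (the KMS key alias is a hypothetical placeholder), you can request SSE-KMS server-side encryption when uploading the feedback file to S3:

with open('feedback_dummy_data.txt', 'rb') as f:
    s3_client.put_object(
        Bucket=bucket,
        Key=raw_input,
        Body=f,
        ServerSideEncryption='aws:kms',
        SSEKMSKeyId='alias/my-feedback-key'  # hypothetical KMS key alias
    )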

Results

In this post, we demonstrated how you can use Amazon Bedrock to seamlessly use multiple LLMs to generate and judge thematic summaries of qualitative data, such as customer feedback. We also showed how to compare human evaluator ratings of text-based summaries from survey response data against ratings from multiple LLMs such as Anthropic's Claude 3 Sonnet, Amazon Nova Pro, and Meta's Llama 3. In recently published research, Amazon scientists found that LLMs reached inter-model agreement of up to 91%, compared with human-to-model agreement of up to 79%. Our findings suggest that although LLMs can provide reliable thematic evaluations at scale, human oversight remains important for identifying subtle contextual nuances that LLMs might miss.

The best part? Through Amazon Bedrock model hosting, you can compare the various models using the same preprocessed data across all of them, so you can choose the one that works best for your context and needs.

Conclusion

With organizations turning to generative AI for analyzing unstructured data, this post provides insight into the value of using multiple LLMs to validate LLM-generated analyses. The strong performance of LLM-as-a-judge models opens opportunities to scale text data analyses, and Amazon Bedrock can help organizations deploy and interact with multiple models within an LLM-as-a-judge framework.


About the Authors

Dr. Sreyoshi Bhaduri is a Senior Research Scientist at Amazon. Currently, she spearheads innovative research in applying generative AI at scale to solve complex supply chain logistics and operations challenges. Her expertise spans applied statistics and natural language processing, with a PhD from Virginia Tech and specialized training in responsible AI from MILA. Sreyoshi is committed to demystifying and democratizing generative AI solutions and bridging the gap between theoretical research and practical applications using AWS technologies.

Dr. Natalie Perez specializes in transformative approaches to customer insights and innovative solutions using generative AI. Previously at AWS, Natalie pioneered large-scale voice of employee research, driving product and programmatic improvements. Natalie is dedicated to revolutionizing how organizations scale, understand, and act on customer needs through the strategic integration of generative AI and human-in-the-loop strategies, driving innovation that puts customers at the heart of product, program, and service development.

John Kitaoka is a Solutions Architect at Amazon Web Services (AWS) and works with government entities, universities, nonprofits, and other public sector organizations to design and scale AI solutions. His work covers a broad range of machine learning (ML) use cases, with a primary interest in inference, responsible AI, and security. In his spare time, he loves woodworking and snowboarding.

Dr. Elizabeth (Liz) Conjar is a Principal Research Scientist at Amazon, where she pioneers at the intersection of HR research, organizational transformation, and AI/ML. Specializing in people analytics, she helps reimagine employees’ work experiences, drive high-velocity organizational change, and develop the next generation of Amazon leaders. Throughout her career, Elizabeth has established herself as a thought leader in translating complex people analytics into actionable strategies. Her work focuses on optimizing employee experiences and accelerating organizational success through data-driven insights and innovative technological solutions.
