AWS Machine Learning Blog | July 18, 06:16
Evaluating generative AI models with Amazon Nova LLM-as-a-Judge on Amazon SageMaker AI

Evaluating the performance of large language models (LLMs) cannot rely on traditional statistical metrics alone. For real-world generative AI applications such as content summarization or intelligent agents, it is essential to understand whether a model's outputs are better than a baseline or an earlier version. To meet growing customer demand, Amazon has introduced the Amazon Nova LLM-as-a-Judge capability, integrated into Amazon SageMaker AI. The capability uses the reasoning abilities of LLMs to provide more flexible, large-scale model evaluation. Nova LLM-as-a-Judge has been rigorously validated for impartiality and accuracy, aligns closely with human preferences, and achieves leading results on key benchmarks such as JudgeBench and PPE, aiming to set a new standard for production-grade evaluation of generative AI.

🎯 **The rise and necessity of LLM-as-a-Judge**: Traditional evaluation methods (such as perplexity or BLEU scores) struggle to capture the nuances of generative AI in real-world applications, especially for tasks that require subjective judgment and contextual understanding. LLM-as-a-Judge, an emerging evaluation paradigm, uses the reasoning capabilities of large language models themselves to provide a more flexible, scalable evaluation solution, meeting enterprises' need to systematically assess model quality in production.

🌟 **Capabilities and advantages of Amazon Nova LLM-as-a-Judge**: Amazon Nova LLM-as-a-Judge, integrated into Amazon SageMaker AI, provides a comprehensive approach to model evaluation. It delivers robust, unbiased assessments of generative AI outputs and supports pairwise comparisons between model iterations, helping users make data-driven decisions about model improvements. The capability has been rigorously validated, scoring 45% accuracy on JudgeBench and 68% on PPE, while closely reflecting human preferences and keeping aggregate bias to roughly 3%, ensuring reliable and impartial evaluations.

⚙️ **Evaluation workflow and metric interpretation**: The evaluation runs in pre-built containers on SageMaker AI; users prepare a JSONL dataset in which each record contains a prompt and two model outputs. After the evaluation completes, SageMaker produces quantitative results, including core preference metrics (such as a_scores, b_scores, and ties), statistical confidence metrics (such as winrate, lower_rate, and upper_rate), and standard error metrics. By analyzing these metrics, users can judge which model performs better and understand the confidence of the results, guiding model optimization and deployment decisions.

🚀 **End-to-end solution and model deployment**: The post walks through implementing a Nova LLM-as-a-Judge evaluation on SageMaker AI. Deploying a Qwen2.5 model to a SageMaker endpoint and invoking Anthropic's Claude 3.7 Sonnet in Amazon Bedrock produces the model outputs to compare. A SageMaker evaluation recipe and pre-built container then run the evaluation job and save the results to Amazon S3. Finally, a visualization tool presents the results, enabling end-to-end model evaluation and optimization without manual annotation.

Evaluating the performance of large language models (LLMs) goes beyond statistical metrics like perplexity or bilingual evaluation understudy (BLEU) scores. For most real-world generative AI scenarios, it’s crucial to understand whether a model is producing better outputs than a baseline or an earlier iteration. This is especially important for applications such as summarization, content generation, or intelligent agents where subjective judgments and nuanced correctness play a central role.

As organizations deepen their deployment of these models in production, we’re experiencing an increasing demand from customers who want to systematically assess model quality beyond traditional evaluation methods. Current approaches like accuracy measurements and rule-based evaluations, although helpful, can’t fully address these nuanced assessment needs, particularly when tasks require subjective judgments, contextual understanding, or alignment with specific business requirements. To bridge this gap, LLM-as-a-judge has emerged as a promising approach, using the reasoning capabilities of LLMs to evaluate other models more flexibly and at scale.

Today, we’re excited to introduce a comprehensive approach to model evaluation through the Amazon Nova LLM-as-a-Judge capability on Amazon SageMaker AI, a fully managed Amazon Web Services (AWS) service to build, train, and deploy machine learning (ML) models at scale. Amazon Nova LLM-as-a-Judge is designed to deliver robust, unbiased assessments of generative AI outputs across model families. Nova LLM-as-a-Judge is available as optimized workflows on SageMaker AI, and with it, you can start evaluating model performance against your specific use cases in minutes. Unlike many evaluators that exhibit architectural bias, Nova LLM-as-a-Judge has been rigorously validated to remain impartial and has achieved leading performance on key judge benchmarks while closely reflecting human preferences. With its exceptional accuracy and minimal bias, it sets a new standard for credible, production-grade LLM evaluation.

The Nova LLM-as-a-Judge capability provides pairwise comparisons between model iterations, so you can make data-driven decisions about model improvements with confidence.

How Nova LLM-as-a-Judge was trained

Nova LLM-as-a-Judge was built through a multistep training process comprising supervised training and reinforcement learning stages that used public datasets annotated with human preferences. For the proprietary component, multiple annotators independently evaluated thousands of examples by comparing pairs of different LLM responses to the same prompt. To verify consistency and fairness, all annotations underwent rigorous quality checks, with final judgments calibrated to reflect broad human consensus rather than an individual viewpoint.

The training data was designed to be both diverse and representative. Prompts spanned a wide range of categories, including real-world knowledge, creativity, coding, mathematics, specialized domains, and toxicity, so the model could evaluate outputs across many real-world scenarios. The training data covered more than 90 languages and was primarily composed of English, Russian, Chinese, German, Japanese, and Italian. Importantly, an internal bias study evaluating over 10,000 human-preference judgments against 75 third-party models confirmed that Amazon Nova LLM-as-a-Judge shows only a 3% aggregate bias relative to human annotations. Although this is a significant achievement in reducing systematic bias, we still recommend occasional spot checks to validate critical comparisons.

In the following figure, you can see how the Nova LLM-as-a-Judge bias compares to human preferences when evaluating Amazon Nova outputs compared to outputs from other models. Here, bias is measured as the difference between the judge’s preference and human preference across thousands of examples. A positive value indicates the judge slightly favors Amazon Nova models, and a negative value indicates the opposite. To quantify the reliability of these estimates, 95% confidence intervals were computed using the standard error for the difference of proportions, assuming independent binomial distributions.
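
The figure itself isn't reproduced here, but the interval construction it describes is straightforward to sketch. The counts in the following example are placeholders, not figures from the study:

import math

def bias_confidence_interval(judge_wins_a, judge_n, human_wins_a, human_n, z=1.96):
    """95% CI for (judge preference rate - human preference rate), using the
    standard error for a difference of two independent binomial proportions."""
    p_judge = judge_wins_a / judge_n
    p_human = human_wins_a / human_n
    bias = p_judge - p_human
    se = math.sqrt(p_judge * (1 - p_judge) / judge_n + p_human * (1 - p_human) / human_n)
    return bias, (bias - z * se, bias + z * se)

# Placeholder counts for illustration only
bias, (lo, hi) = bias_confidence_interval(judge_wins_a=520, judge_n=1000,
                                          human_wins_a=500, human_n=1000)
print(f"bias = {bias:+.3f}, 95% CI = ({lo:+.3f}, {hi:+.3f})")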

Amazon Nova LLM-as-a-Judge achieves advanced performance among evaluation models, demonstrating strong alignment with human judgments across a range of tasks. For example, it scores 45% accuracy on JudgeBench (compared to 42% for Meta J1 8B) and 68% on PPE (versus 60% for Meta J1 8B). The data from Meta’s J1 8B was pulled from Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning.

These results highlight the strength of Amazon Nova LLM-as-a-Judge in chatbot-related evaluations, as shown in the PPE benchmark. Our benchmarking follows current best practices, reporting reconciled results for positionally swapped responses on JudgeBench, CodeUltraFeedback, Eval Bias, and LLMBar, while using single-pass results for PPE.

| Model | Eval Bias | JudgeBench | LLMBar | PPE | CodeUltraFeedback |
|---|---|---|---|---|---|
| Nova LLM-as-a-Judge | 0.76 | 0.45 | 0.67 | 0.68 | 0.64 |
| Meta J1 8B | – | 0.42 | – | 0.60 | – |
| Nova Micro (8B) | 0.56 | 0.37 | 0.55 | 0.60 | – |

(A dash indicates a score that wasn't reported.)
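
The benchmarking note above mentions reconciling judgments for positionally swapped responses, but the post doesn't spell out the reconciliation rule. One common convention, shown here purely as an assumption, is to count a win only when the judge prefers the same underlying response in both orderings and to record a tie otherwise:

def reconcile(judgment_original: str, judgment_swapped: str) -> str:
    """Reconcile two judgments of the same pair, where the second pass
    presents the responses in swapped positions.

    judgment_original: "A", "B", or "tie" with responses in original order.
    judgment_swapped:  "A", "B", or "tie" with responses swapped, so a vote
                       for "A" in this pass refers to the original response B.
    """
    # Map the swapped-pass vote back to the original labels.
    unswap = {"A": "B", "B": "A", "tie": "tie"}
    second = unswap[judgment_swapped]
    if judgment_original == second:
        return judgment_original   # consistent preference across both orderings
    return "tie"                   # positional disagreement counts as a tie

print(reconcile("A", "B"))  # same underlying response preferred twice -> "A"
print(reconcile("A", "A"))  # preference flipped with position -> "tie"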

In this post, we present a streamlined approach to implementing Amazon Nova LLM-as-a-Judge evaluations using SageMaker AI, interpreting the resulting metrics, and applying this process to improve your generative AI applications.

Overview of the evaluation workflow

The evaluation process starts by preparing a dataset in which each example includes a prompt and two alternative model outputs. The JSONL format looks like this:

{   "prompt":"Explain photosynthesis.",   "response_A":"Answer A...",   "response_B":"Answer B..."}{   "prompt":"Summarize the article.",   "response_A":"Answer A...",   "response_B":"Answer B..."}

After preparing this dataset, you use the given SageMaker evaluation recipe, which configures the evaluation strategy, specifies which model to use as the judge, and defines the inference settings such as temperature and top_p.
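
The recipe is a YAML file that ships with the sample repository. The field names in the following sketch are illustrative assumptions meant to show the general shape of such a configuration, not the authoritative schema; refer to the recipe included with the notebook for the exact keys:

import yaml  # requires PyYAML

# Hypothetical recipe structure for illustration only; the real keys come from
# the recipe file in the amazon-nova-samples repository.
recipe = {
    "run": {
        "name": "nova-llm-judge-eval",
        "model_type": "amazon.nova-judge",  # placeholder judge identifier
        "data_s3_path": "s3://<YOUR_BUCKET_NAME>/datasets/byo-datasets-dev/custom-llm-judge/llm_judge.jsonl",
    },
    "evaluation": {
        "task": "llm_judge",
        "strategy": "pairwise",
    },
    "inference": {
        "temperature": 0.0,
        "top_p": 0.9,
        "max_new_tokens": 1024,
    },
}

with open("nova_llm_judge_recipe.yaml", "w") as f:
    yaml.safe_dump(recipe, f, sort_keys=False)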

The evaluation runs inside a SageMaker training job using pre-built Amazon Nova containers. SageMaker AI provisions compute resources, orchestrates the evaluation, and writes the output metrics and visualizations to Amazon Simple Storage Service (Amazon S3).

When it’s complete, you can download and analyze the results, which include preference distributions, win rates, and confidence intervals.

Understanding how Amazon Nova LLM-as-a-Judge works

The Amazon Nova LLM-as-a-Judge uses an evaluation method called binary overall preference judge. The binary overall preference judge is a method where a language model compares two outputs side by side and picks the better one or declares a tie. For each example, it produces a clear preference. When you aggregate these judgments over many samples, you get metrics like win rate and confidence intervals. This approach uses the model’s own reasoning to assess qualities like relevance and clarity in a straightforward, consistent way.

Understanding Amazon Nova LLM-as-a-Judge evaluation metrics

When using the Amazon Nova LLM-as-a-Judge framework to compare outputs from two language models, SageMaker AI produces a comprehensive set of quantitative metrics. You can use these metrics to assess which model performs better and how reliable the evaluation is. The results fall into three main categories: core preference metrics, statistical confidence metrics, and standard error metrics.

The core preference metrics report how often each model’s outputs were preferred by the judge model. The a_scores metric counts the number of examples where Model A was favored, and b_scores counts cases where Model B was chosen as better. The ties metric captures instances in which the judge model rated both responses equally or couldn’t identify a clear preference. The inference_error metric counts cases where the judge couldn’t generate a valid judgment due to malformed data or internal errors.

The statistical confidence metrics quantify how likely it is that the observed preferences reflect true differences in model quality rather than random variation. The winrate reports the proportion of all valid comparisons in which Model B was preferred. The lower_rate and upper_rate define the lower and upper bounds of the 95% confidence interval for this win rate. For example, a winrate of 0.75 with a confidence interval between 0.60 and 0.85 suggests that, even accounting for uncertainty, Model B is consistently favored over Model A. The score field often matches the count of Model B wins but can also be customized for more complex evaluation strategies.

The standard error metrics provide an estimate of the statistical uncertainty in each count. These include a_scores_stderr, b_scores_stderr, ties_stderr, inference_error_stderr, and score_stderr. Smaller standard error values indicate more reliable results. Larger values can point to a need for additional evaluation data or more consistent prompt engineering.

Interpreting these metrics requires attention to both the observed preferences and the confidence intervals.

The following is an example metrics output from an evaluation run:

{  "a_scores": 16.0,  "a_scores_stderr": 0.03,  "b_scores": 10.0,  "b_scores_stderr": 0.09,  "ties": 0.0,  "ties_stderr": 0.0,  "inference_error": 0.0,  "inference_error_stderr": 0.0,  "score": 10.0,  "score_stderr": 0.09,  "winrate": 0.38,  "lower_rate": 0.23,  "upper_rate": 0.56}

In this example, Model A was preferred 16 times, Model B was preferred 10 times, and there were no ties or inference errors. The winrate of 0.38 indicates that Model B was preferred in 38% of cases, with a 95% confidence interval ranging from 23% to 56%. Because the interval includes 0.5, this outcome suggests the evaluation was inconclusive, and additional data might be needed to clarify which model performs better overall.
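
If you want to sanity-check the headline numbers, you can recompute them from the raw counts. The exact interval method the container uses isn't documented in this post, so the Wilson score interval below is an assumption; it lands close to, but not exactly on, the reported bounds:

import math

def winrate_with_ci(a_scores: float, b_scores: float, ties: float, z: float = 1.96):
    """Win rate of Model B over all valid comparisons, with a Wilson score interval."""
    n = a_scores + b_scores + ties
    p = b_scores / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, center - half, center + half

winrate, lower, upper = winrate_with_ci(a_scores=16, b_scores=10, ties=0)
print(f"winrate={winrate:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
# The interval straddles 0.5, so the comparison is inconclusive,
# matching the interpretation above.
print("inconclusive" if lower < 0.5 < upper else "conclusive")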

These metrics, automatically generated as part of the evaluation process, provide a rigorous statistical foundation for comparing models and making data-driven decisions about which one to deploy.

Solution overview

This solution demonstrates how to evaluate generative AI models on Amazon SageMaker AI using the Nova LLM-as-a-Judge capability. The provided Python code guides you through the entire workflow.

First, it prepares a dataset by sampling questions from SQuAD and generating candidate responses from Qwen2.5 and Anthropic’s Claude 3.7. These outputs are saved in a JSONL file containing the prompt and both responses.

We accessed Anthropic’s Claude 3.7 Sonnet in Amazon Bedrock using the bedrock-runtime client. We accessed Qwen2.5 1.5B using a SageMaker hosted Hugging Face endpoint.

Next, a PyTorch Estimator launches an evaluation job using an Amazon Nova LLM-as-a-Judge recipe. The job runs on GPU instances such as ml.g5.12xlarge and produces evaluation metrics, including win rates, confidence intervals, and preference counts. Results are saved to Amazon S3 for analysis.

Finally, a visualization function renders charts and tables, summarizing which model was preferred, how strong the preference was, and how reliable the estimates are. Through this end-to-end approach, you can assess improvements, track regressions, and make data-driven decisions about deploying generative models—all without manual annotation.

Prerequisites

You need to complete the following prerequisites before you can run the notebook:

    Make the following quota increase request for SageMaker AI: for this use case, you need a minimum of one g5.12xlarge instance. On the Service Quotas console, request 1 G5 instance (g5.12xlarge) for training job usage.
    (Optional) Create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role. (You can use JupyterLab in your local setup, too.)
    Create an AWS Identity and Access Management (IAM) role with the managed policies AmazonSageMakerFullAccess, AmazonS3FullAccess, and AmazonBedrockFullAccess to give SageMaker AI and Amazon Bedrock the access required to run the examples. Attach the following trust relationship policy to your IAM role:
{    "Version": "2012-10-17",    "Statement": [        {            "Sid": "",            "Effect": "Allow",            "Principal": {                "Service": [                    "bedrock.amazonaws.com",                    "sagemaker.amazonaws.com"                ]            },            "Action": "sts:AssumeRole"        }    ]}
    Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references training assets:
git clone https://github.com/aws-samples/amazon-nova-samples.git
cd customization/SageMakerTrainingJobs/Amazon-Nova-LLM-As-A-Judge/

Next, run the notebook Amazon-Nova-LLM-as-a-Judge-Sagemaker-AI.ipynb to start using the Amazon Nova LLM-as-a-Judge implementation on Amazon SageMaker AI.

Model setup

To conduct an Amazon Nova LLM-as-a-Judge evaluation, you need to generate outputs from the candidate models you want to compare. In this project, we used two different approaches: deploying a Qwen2.5 1.5B model on Amazon SageMaker and invoking Anthropic's Claude 3.7 Sonnet model in Amazon Bedrock. First, we deployed Qwen2.5 1.5B, an open-weight multilingual language model, on a dedicated SageMaker endpoint using the HuggingFaceModel deployment interface. To deploy the model, invoke the convenience script we provide:

python3 deploy_sm_model.py
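
The deployment script itself isn't reproduced in this post. The following is a minimal sketch of what deploy_sm_model.py might do with the HuggingFaceModel interface; the Hub model ID, framework versions, instance type, and endpoint name are assumptions to adjust against the actual script in the repository:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# Assumed Hub model ID and task; adjust to match the script in the repository.
hub_env = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-1.5B-Instruct",
    "HF_TASK": "text-generation",
}

# Framework versions are assumptions; pick a supported Hugging Face DLC combination.
model = HuggingFaceModel(
    env=hub_env,
    role=role,
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",           # assumption: any GPU instance that fits the model
    endpoint_name="qwen25-llm-judge-demo",   # placeholder endpoint name
)
print(predictor.endpoint_name)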

When it’s deployed, inference can be performed using a helper function wrapping the SageMaker predictor API:

from sagemaker.huggingface import HuggingFacePredictor

# Initialize the predictor once
predictor = HuggingFacePredictor(endpoint_name="qwen25-<endpoint_name_here>")

def generate_with_qwen25(prompt: str, max_tokens: int = 500, temperature: float = 0.9) -> str:
    """
    Sends a prompt to the deployed Qwen2.5 model on SageMaker and returns the generated response.

    Args:
        prompt (str): The input prompt/question to send to the model.
        max_tokens (int): Maximum number of tokens to generate.
        temperature (float): Sampling temperature for generation.

    Returns:
        str: The model-generated text.
    """
    response = predictor.predict({
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_tokens,
            "temperature": temperature
        }
    })
    return response[0]["generated_text"]

answer = generate_with_qwen25("What is the Grotto at Notre Dame?")
print(answer)

In parallel, we integrated Anthropic’s Claude 3.7 Sonnet model in Amazon Bedrock. Amazon Bedrock provides a managed API layer for accessing proprietary foundation models (FMs) without managing infrastructure. The Claude generation function used the bedrock-runtime AWS SDK for Python (Boto3) client, which accepted a user prompt and returned the model’s text completion:

import json
import boto3

# Initialize Bedrock client once
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Claude 3.7 Sonnet model ID via Bedrock
MODEL_ID = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

def generate_with_claude4(prompt: str, max_tokens: int = 512, temperature: float = 0.7, top_p: float = 0.9) -> str:
    """
    Sends a prompt to Claude 3.7 Sonnet via Amazon Bedrock and returns the generated response.

    Args:
        prompt (str): The user message or input prompt.
        max_tokens (int): Maximum number of tokens to generate.
        temperature (float): Sampling temperature for generation.
        top_p (float): Top-p nucleus sampling.

    Returns:
        str: The text content generated by Claude.
    """
    payload = {
        "anthropic_version": "bedrock-2023-05-31",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p
    }
    response = bedrock.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps(payload),
        contentType="application/json",
        accept="application/json"
    )
    response_body = json.loads(response['body'].read())
    return response_body["content"][0]["text"]

answer = generate_with_claude4("What is the Grotto at Notre Dame?")
print(answer)

When you have written and tested both functions, you can move on to creating the evaluation data for the Nova LLM-as-a-Judge.

Prepare the dataset

To create a realistic evaluation dataset for comparing the Qwen and Claude models, we used the Stanford Question Answering Dataset (SQuAD), a widely adopted benchmark in natural language understanding distributed under the CC BY-SA 4.0 license. SQuAD consists of thousands of crowd-sourced question-answer pairs covering a diverse range of Wikipedia articles. By sampling from this dataset, we made sure that our evaluation prompts reflected high-quality, factual question-answering tasks representative of real-world applications.

We began by loading a small subset of examples to keep the workflow fast and reproducible. Specifically, we used the Hugging Face datasets library to download and load the first 20 examples from the SQuAD training split:

from datasets import load_dataset

squad = load_dataset("squad", split="train[:20]")

This command retrieves a slice of the full dataset, containing 20 entries with structured fields including context, question, and answers. To verify the contents and inspect an example, we printed out a sample question and its ground truth answer:

print(squad[3]["question"])
print(squad[3]["answers"]["text"][0])

For the evaluation set, we selected the first six questions from this subset:

questions = [squad[i]["question"] for i in range(6)]

Generate the Amazon Nova LLM-as-a-Judge evaluation dataset

After preparing a set of evaluation questions from SQuAD, we generated outputs from both models and assembled them into a structured dataset to be used by the Amazon Nova LLM-as-a-Judge workflow. This dataset serves as the core input for SageMaker AI evaluation recipes. To do this, we iterated over each question prompt and invoked the two generation functions defined earlier:

For each prompt, the workflow attempted to generate a response from each model. If a generation call failed due to an API error, timeout, or other issue, the system captured the exception and stored a clear error message indicating the failure. This made sure that the evaluation process could proceed gracefully even in the presence of transient errors:

import json

output_path = "llm_judge.jsonl"

with open(output_path, "w") as f:
    for q in questions:
        try:
            response_a = generate_with_qwen25(q)
        except Exception as e:
            response_a = f"[Qwen2.5 generation failed: {e}]"

        try:
            response_b = generate_with_claude4(q)
        except Exception as e:
            response_b = f"[Claude 3.7 generation failed: {e}]"

        row = {
            "prompt": q,
            "response_A": response_a,
            "response_B": response_b
        }
        f.write(json.dumps(row) + "\n")

print(f"JSONL file created at: {output_path}")

This workflow produced a JSON Lines file named llm_judge.jsonl. Each line contains a single evaluation record structured as follows:

{  "prompt": "What is the capital of France?",  "response_A": "The capital of France is Paris.",  "response_B": "Paris is the capital city of France."}

Then, upload this llm_judge.jsonl to an S3 bucket that you’ve predefined:

upload_to_s3(
    "llm_judge.jsonl",
    "s3://<YOUR_BUCKET_NAME>/datasets/byo-datasets-dev/custom-llm-judge/llm_judge.jsonl"
)
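
The upload_to_s3 helper is defined in the accompanying notebook rather than in this post; a minimal equivalent using Boto3, written here as an assumption of what it does, looks like this:

import boto3
from urllib.parse import urlparse

def upload_to_s3(local_path: str, s3_uri: str) -> None:
    """Upload a local file to the bucket and key encoded in an s3:// URI."""
    parsed = urlparse(s3_uri)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    boto3.client("s3").upload_file(local_path, bucket, key)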

Launching the Nova LLM-as-a-Judge evaluation job

After preparing the dataset and creating the evaluation recipe, the final step is to launch the SageMaker training job that performs the Amazon Nova LLM-as-a-Judge evaluation. In this workflow, the training job acts as a fully managed, self-contained process that loads the model, processes the dataset, and generates evaluation metrics in your designated Amazon S3 location.

We use the PyTorch estimator class from the SageMaker Python SDK to encapsulate the configuration for the evaluation run. The estimator defines the compute resources, the container image, the evaluation recipe, and the output paths for storing results:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    output_path=output_s3_uri,
    base_job_name=job_name,
    role=role,
    instance_type=instance_type,
    training_recipe=recipe_path,
    sagemaker_session=sagemaker_session,
    image_uri=image_uri,
    disable_profiler=True,
    debugger_hook_config=False,
)

When the estimator is configured, you initiate the evaluation job using the fit() method. This call submits the job to the SageMaker control plane, provisions the compute cluster, and begins processing the evaluation dataset:

estimator.fit(inputs={"train": evalInput})
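
The evalInput object (along with output_s3_uri, recipe_path, and image_uri in the estimator) is defined in earlier notebook cells that aren't shown here. A minimal sketch, assuming the JSONL dataset was uploaded to the S3 prefix used earlier, is a TrainingInput pointing at that location:

from sagemaker.inputs import TrainingInput

# Points the evaluation job at the S3 prefix holding llm_judge.jsonl (path assumed from the upload step).
evalInput = TrainingInput(
    s3_data="s3://<YOUR_BUCKET_NAME>/datasets/byo-datasets-dev/custom-llm-judge/",
    distribution="FullyReplicated",
    s3_data_type="S3Prefix",
)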

Results from the Amazon Nova LLM-as-a-Judge evaluation job

The following graphic illustrates the results of the Amazon Nova LLM-as-a-Judge evaluation job.

To help practitioners quickly interpret the outcome of a Nova LLM-as-a-Judge evaluation, we created a convenience function that produces a single, comprehensive visualization summarizing key metrics. This function, plot_nova_judge_results, uses Matplotlib and Seaborn to render an image with six panels, each highlighting a different perspective of the evaluation outcome.

This function takes the evaluation metrics dictionary produced when the evaluation job is complete and generates a set of panels covering the preference counts, the win rate with its confidence interval, and the associated uncertainty estimates.

Because the function outputs a standard Matplotlib figure, you can quickly save the image, display it in Jupyter notebooks, or embed it in other documentation.
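
The six-panel plot_nova_judge_results function ships with the sample repository and isn't reproduced here. The following simplified two-panel sketch, assuming only Matplotlib and the metrics dictionary shown earlier, conveys the idea:

import matplotlib.pyplot as plt

def plot_judge_summary(metrics: dict, save_path: str = "judge_summary.png"):
    """Simplified summary plot: preference counts and Model B win rate with its 95% CI."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Panel 1: raw preference counts
    labels = ["Model A", "Model B", "Ties"]
    counts = [metrics["a_scores"], metrics["b_scores"], metrics["ties"]]
    ax1.bar(labels, counts)
    ax1.set_title("Judge preference counts")
    ax1.set_ylabel("Count")

    # Panel 2: win rate with its confidence interval
    winrate = metrics["winrate"]
    err_low = winrate - metrics["lower_rate"]
    err_high = metrics["upper_rate"] - winrate
    ax2.errorbar([0], [winrate], yerr=[[err_low], [err_high]], fmt="o", capsize=6)
    ax2.axhline(0.5, linestyle="--", linewidth=1)  # 0.5 marks "no preference"
    ax2.set_xlim(-1, 1)
    ax2.set_ylim(0, 1)
    ax2.set_xticks([])
    ax2.set_title("Model B win rate (95% CI)")

    fig.tight_layout()
    fig.savefig(save_path)
    return fig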

Clean up

Complete the following steps to clean up your resources:

    Delete your Qwen2.5 1.5B endpoint:
import boto3

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=<region>)

# Delete endpoint
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    If you’re using a SageMaker Studio JupyterLab notebook, shut down the JupyterLab notebook instance.

How you can use this evaluation framework

The Amazon Nova LLM-as-a-Judge workflow offers a reliable, repeatable way to compare two language models on your own data. You can integrate this into model selection pipelines to decide which version performs best, or you can schedule it as part of continuous evaluation to catch regressions over time.

For teams building agentic or domain-specific systems, this approach provides richer insight than automated metrics alone. Because the entire process runs on SageMaker training jobs, it scales quickly and produces clear visual reports that can be shared with stakeholders.

Conclusion

This post demonstrates how Nova LLM-as-a-Judge—a specialized evaluation model available through Amazon SageMaker AI—can be used to systematically measure the relative performance of generative AI systems. The walkthrough shows how to prepare evaluation datasets, launch SageMaker AI training jobs with Nova LLM-as-a-Judge recipes, and interpret the resulting metrics, including win rates and preference distributions. The fully managed SageMaker AI solution simplifies this process, so you can run scalable, repeatable model evaluations that align with human preferences.

We recommend starting your LLM evaluation journey by exploring the official Amazon Nova documentation and examples. The AWS AI/ML community offers extensive resources, including workshops and technical guidance, to support your implementation journey.



About the authors

Surya Kari is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the Llama family, and Qwen, focusing on their fine-tuning and optimization. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.

Joel Carlson is a Senior Applied Scientist on the Amazon AGI foundation modeling team. He primarily works on developing novel approaches for improving the LLM-as-a-Judge capability of the Nova family of models.

Saurabh Sahu is an applied scientist in the Amazon AGI Foundation modeling team. He obtained his PhD in Electrical Engineering from University of Maryland College Park in 2019. He has a background in multi-modal machine learning working on speech recognition, sentiment analysis and audio/video understanding. Currently, his work focuses on developing recipes to improve the performance of LLM-as-a-judge models for various tasks.

Morteza Ziyadi is an Applied Science Manager at Amazon AGI, where he leads several projects on post-training recipes and (Multimodal) large language models in the Amazon AGI Foundation modeling team. Before joining Amazon AGI, he spent four years at Microsoft Cloud and AI, where he led projects focused on developing natural language-to-code generation models for various products. He has also served as an adjunct faculty at Northeastern University. He earned his PhD from the University of Southern California (USC) in 2017 and has since been actively involved as a workshop organizer, and reviewer for numerous NLP, Computer Vision and machine learning conferences.

Pradeep Natarajan is a Senior Principal Scientist in Amazon AGI Foundation modeling team working on post-training recipes and Multimodal large language models. He has 20+ years of experience in developing and launching multiple large-scale machine learning systems. He has a PhD in Computer Science from University of Southern California.

Michael Cai is a Software Engineer on the Amazon AGI Customization Team supporting the development of evaluation solutions. He obtained his MS in Computer Science from New York University in 2024. In his spare time he enjoys 3d printing and exploring innovative tech.
