AWS Machine Learning Blog, September 7, 2024
Ground truth curation and metric interpretation best practices for evaluating generative AI question answering using FMEval

 

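This article explains how to use FMEval to evaluate question answering in generative AI applications, focusing on the Factual Knowledge and QA Accuracy metrics and on how to evaluate according to best practices.

📈 **Factual knowledge evaluation**: FMEval's Factual Knowledge metric evaluates whether the generated response contains the factual information. The metric is based on string matching and the score is binary (0 or 1): 1 if the factual information is present in the generated response, otherwise 0.

📊 **QA accuracy evaluation**: The QA Accuracy metric measures a model's question answering accuracy by comparing generated answers against ground truth answers. It is computed from true positive, false positive, and false negative word matches between the QA ground truth answers and the generated answers.

📉 **Best practices**: When evaluating question answering in generative AI applications, best practices include building a high-quality ground truth dataset, using FMEval metrics to evaluate model performance, and improving the model based on the results.

📋 **Use cases**: FMEval can be used to evaluate question answering in a range of generative AI applications, such as internal knowledge bases for customer support and external conversational AI assistants.

📌 **Business value**: By evaluating question answering with FMEval, organizations can make sure their AI systems meet quality standards and user experience expectations, and can give business decision-makers evaluation results they can understand and act on.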

Generative artificial intelligence (AI) applications powered by large language models (LLMs) are rapidly gaining traction for question answering use cases. From internal knowledge bases for customer support to external conversational AI assistants, these applications use LLMs to provide human-like responses to natural language queries. However, building and deploying such assistants with responsible AI best practices requires a robust ground truth and evaluation framework to make sure they meet quality standards and user experience expectations, as well as clear evaluation interpretation guidelines to make the quality and responsibility of these systems intelligible to business decision-makers.

This post focuses on evaluating and interpreting metrics using FMEval for question answering in a generative AI application. FMEval is a comprehensive evaluation suite from Amazon SageMaker Clarify, providing standardized implementations of metrics to assess quality and responsibility. To learn more about FMEval, refer to Evaluate large language models for quality and responsibility.

In this post, we discuss best practices for working with FMEval in ground truth curation and metric interpretation for evaluating question answering applications for factual knowledge and quality. Ground truth data in AI refers to data that is known to be true, representing the expected outcome for the system being modeled. By providing a true expected outcome to measure against, ground truth data unlocks the ability to deterministically evaluate system quality. Ground truth curation and metric interpretation are tightly coupled, and the implementation of the evaluation metric must inform ground truth curation to achieve best results. By following these guidelines, data scientists can quantify the user experience delivered by their generative AI pipelines and communicate meaning to business stakeholders, facilitating ready comparisons across different architectures, such as Retrieval Augmented Generation (RAG) pipelines, off-the-shelf or fine-tuned LLMs, or agentic solutions.

Solution overview

We use an example ground truth dataset (referred to as the golden dataset, shown in the following table) of 10 question-answer-fact triplets. Each triplet describes a fact, and an encapsulation of the fact as a question-answer pair to emulate an ideal response, derived from a knowledge source document. We used Amazon’s Q2 2023 10Q report as the source document from the SEC’s public EDGAR dataset to create 10 question-answer-fact triplets. The 10Q report contains details on company financials and operations over the Q2 2023 business quarter. The golden dataset applies the ground truth curation best practices discussed in this post for most questions, but not all, to demonstrate the downstream impact of ground truth curation on metric results.

| Question | Answer | Fact |
| --- | --- | --- |
| Who is Andrew R. Jassy? | Andrew R. Jassy is the President and Chief Executive Officer of Amazon.com, Inc. | Chief Executive Officer of Amazon<OR>CEO of Amazon<OR>President of Amazon |
| What were Amazon’s total net sales for the second quarter of 2023? | Amazon’s total net sales for the second quarter of 2023 were $134.4 billion. | 134.4 billion<OR>134,383 million<OR>134183 million<OR>134.383 billion |
| Where is Amazon’s principal office located? | Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210. | 410 Terry Avenue North |
| What was Amazon’s operating income for the six months ended June 30, 2023? | Amazon’s operating income for the six months ended June 30, 2023 was $12.5 billion. | 12.5 billion<OR>12,455 million<OR>12.455 billion |
| When did Amazon acquire One Medical? | Amazon acquired One Medical on February 22, 2023 for cash consideration of approximately $3.5 billion, net of cash acquired. | Feb 22 2023<OR>February 22nd 2023<OR>2023-02-22<OR>February 22, 2023 |
| What was a key challenge faced by Amazon’s business in the second quarter of 2023? | Changes in foreign exchange rates reduced Amazon’s International segment net sales by $180 million for Q2 2023. | foreign exchange rates |
| What was Amazon’s total cash, cash equivalents and restricted cash as of June 30, 2023? | Amazon’s total cash, cash equivalents, and restricted cash as of June 30, 2023 was $50.1 billion. | 50.1 billion<OR>50,067 million<OR>50.067 billion |
| What were Amazon’s AWS sales for the second quarter of 2023? | Amazon’s AWS sales for the second quarter of 2023 were $22.1 billion. | 22.1 billion<OR>22,140 million<OR>22.140 billion<OR>22140 million |
| As of June 30, 2023, how many shares of Rivian’s Class A common stock did Amazon hold? | As of June 30, 2023, Amazon held 158 million shares of Rivian’s Class A common stock. | 158 million |
| How many shares of common stock were outstanding as of July 21, 2023? | There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023. | 10317750796<OR>10,317,750,796 |

We generated responses from three generative AI RAG pipelines (anonymized as Pipeline1, Pipeline2, Pipeline3, as shown in the following figure) and calculated factual knowledge and QA accuracy metrics, evaluating them against the golden dataset. The fact key of the triplet is used for the Factual Knowledge metric ground truth, and the answer key is used for the QA Accuracy metric ground truth. With this, factual knowledge is measured against the fact key, and the ideal user experience in terms of style and conciseness is measured against the question-answer pairs.
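As a rough illustration of how a question-answer-fact triplet might be stored and routed to each metric, the following sketch serializes two rows of the golden dataset as JSON Lines. The field names `question`, `answer`, and `fact`, and the helper names `to_jsonl` and `ground_truth_for`, are assumptions made for this example and are not prescribed by FMEval.

```python
import json

# Two triplets copied from the golden dataset table.
# The key names are hypothetical; the post only refers to a question, an answer, and a fact.
golden_dataset = [
    {
        "question": "Where is Amazon's principal office located?",
        "answer": "Amazon's principal office is located at 410 Terry Avenue North, "
                  "Seattle, Washington 98109-5210.",
        "fact": "410 Terry Avenue North",
    },
    {
        "question": "As of June 30, 2023, how many shares of Rivian's Class A "
                    "common stock did Amazon hold?",
        "answer": "As of June 30, 2023, Amazon held 158 million shares of Rivian's "
                  "Class A common stock.",
        "fact": "158 million",
    },
]


def to_jsonl(records):
    """Serialize one triplet per line (JSON Lines)."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


def ground_truth_for(record, metric):
    """Pick the ground truth key that matches the metric being computed."""
    key_by_metric = {"factual_knowledge": "fact", "qa_accuracy": "answer"}
    return record[key_by_metric[metric]]


jsonl_text = to_jsonl(golden_dataset)
```

Keeping the fact and the answer as separate keys lets the same record feed both metrics without re-curating the question.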

Evaluation for question answering in a generative AI application

A generative AI pipeline can have many subcomponents, such as a RAG pipeline. RAG is a methodology to improve the accuracy of LLM responses answering a user query by retrieving and inserting relevant domain knowledge into the language model prompt. RAG quality depends on the configurations of the retriever (chunking, indexing) and generator (LLM selection and hyperparameters, prompt), as illustrated in the following figure. Tuning chunking and indexing in the retriever makes sure the correct content is available in the LLM prompt for generation. The chunk size and chunk splitting method, as well as the means of embedding and ranking relevant document chunks as vectors in the knowledge store, impact whether the actual answer to the query is ultimately inserted in the prompt. In the generator, selecting an appropriate LLM to run the prompt, and tuning its hyperparameters and prompt template, all control how the retrieved information is interpreted for the response. With this, when a final response from a RAG pipeline is evaluated, the preceding components may be adjusted to improve response quality.

Alternatively, question answering can be powered by a fine-tuned LLM, or through an agentic approach. Although we demonstrate the evaluation of final responses from RAG pipelines, the final responses from a generative AI pipeline for question answering can be similarly evaluated because the prerequisites are a golden dataset and the generative answers. With this approach, changes in the generative output due to different generative AI pipeline architectures can be evaluated to inform the best design choices (comparing RAG and knowledge retrieval agents, comparing LLMs used for generation, retrievers, chunking, prompts, and so on).

Although evaluating each sub-component of a generative AI pipeline is important in development and troubleshooting, business decisions rely on having an end-to-end, side-by-side data view, quantifying how a given generative AI pipeline will perform in terms of user experience. With this, business stakeholders can understand expected quality changes in terms of end-user experience by switching LLMs, and adhere to legal and compliance requirements, such as ISO42001 AI Ethics. There are further financial benefits to realize; for example, quantifying expected quality changes on internal datasets when switching a development LLM to a cheaper, lightweight LLM in production. The overall evaluation process for the benefit of decision-makers is outlined in the following figure. In this post, we focus our discussion on ground truth curation, evaluation, and interpreting evaluation scores for entire question answering generative AI pipelines using FMEval to enable data-driven decision-making on quality.

A useful mental model for ground truth curation and improvement of a golden dataset is a flywheel, as shown in the following figure. The ground truth experimentation process involves querying your generative AI pipeline with the initial golden dataset questions and evaluating the responses against initial golden answers using FMEval. Then, the quality of the golden dataset must be reviewed by a judge. The judge review of the golden dataset quality accelerates the flywheel towards an ever-improving golden dataset. The judge role in the workflow can be assumed by another LLM to enable scaling against established, domain-specific criteria for high-quality ground truth. Maintaining a human-in-the-loop component to the judge function remains essential to sample and verify results, as well as to increase the quality bar with increasing task complexity. Improvement to the golden dataset fosters improvement to the quality of the evaluation metrics, until sufficient measurement accuracy in the flywheel is met by the judge, using the established criteria for quality. To learn more about AWS offerings on human review of generations and data labeling, such as Amazon Augmented AI (Amazon A2I) and Amazon SageMaker Ground Truth Plus, refer to Using Amazon Augmented AI for Human Review and High-quality human feedback for your generative AI applications from Amazon SageMaker Ground Truth Plus. When using LLMs as a judge, make sure to apply prompt safety best practices.

However, to conduct reviews of golden dataset quality as part of the ground truth experiment flywheel, human reviewers must understand the evaluation metric implementation and its coupling to ground truth curation.

FMEval metrics for question answering in a generative AI application

The Factual Knowledge and QA Accuracy metrics from FMEval provide a way to evaluate custom question answering datasets against ground truth. For a full list of metrics implemented with FMEval, refer to Using prompt datasets and available evaluation dimensions in model evaluation jobs.

Factual Knowledge

The Factual Knowledge metric evaluates whether the generated response contains factual information present in the ground truth answer. It is a binary (0 or 1) score based on a string match. Factual knowledge also reports a quasi-exact string match which performs matching after normalization. For simplicity, we focus on the exact match Factual Knowledge score in this post.

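For each golden question, a score of 0 indicates that the lower-cased expected answer is not part of the model response, and a score of 1 indicates that it is.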

QA Accuracy

The QA Accuracy metric measures a model’s question answering accuracy by comparing its generated answers against ground truth answers. The metrics are computed by string matching true positive, false positive, and false negative word matches between QA ground truth answers and generated answers.

It includes several sub-metrics: recall over words, precision over words, F1 over words, exact match, and quasi-exact match.

Because QA Accuracy metrics are calculated on an exact match basis (for more details, see Accuracy), they may be less reliable for questions where the answer can be rephrased without modifying its meaning. To mitigate this, we propose applying Factual Knowledge as the assessment of factual correctness, motivating the use of a dedicated factual ground truth with minimal word expression, together with QA Accuracy as a measure of idealized user experience in terms of response verbosity and style. We elaborate on these concepts later in this post. The BERTScore is also computed as part of QA Accuracy, which provides a measure of semantic match quality against the ground truth.

Proposed ground truth curation best practices for question answering with FMEval

In this section, we share best practices for curating your ground truth for question answering with FMEval.

Understanding the Factual Knowledge metric calculation

A factual knowledge score is a binary measure of whether a real-world fact was correctly retrieved by the generative AI pipeline. 0 indicates the lower-cased expected answer is not part of the model response, whereas 1 indicates it is. Where there is more than one acceptable answer, and either answer is considered correct, apply a logical operator for OR. A configuration for a logical AND can also be applied for cases where the factual material encompasses multiple distinct entities. In the present examples, we demonstrate a logical OR, using the <OR> delimiter. See Use SageMaker Clarify to evaluate large language models for information about logical operators. An example curation of a golden question and golden fact is shown in the following table.

Golden Question: “How many shares of common stock were outstanding as of July 21, 2023?”
Golden Fact: 10,317,750,796<OR>10317750796

Fact detection is useful for assessing hallucination in a generative AI pipeline. The two sample responses in the following table illustrate fact detection. The first example correctly states the fact in the example response, and receives a 1.0 score. The second example hallucinates a number instead of stating the fact, and receives a 0 score.

| Metric | Example Response | Score | Calculation Approach |
| --- | --- | --- | --- |
| Factual Knowledge | “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.” | 1.0 | String match to golden fact |
| Factual Knowledge | “Based on the documents provided, Amazon had 22,003,237,746 shares of common stock outstanding as of July 21, 2023.” | 0.0 | String match to golden fact |
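The string match described above can be approximated in a few lines. This is a minimal sketch of the idea (lower-case both sides, split the golden fact on the `<OR>` delimiter, and check for a substring), not FMEval's own implementation; the function name `factual_knowledge_score` is hypothetical.

```python
def factual_knowledge_score(model_output: str, golden_fact: str, delimiter: str = "<OR>") -> int:
    """Return 1 if any acceptable fact appears in the response (case-insensitive), else 0."""
    response = model_output.lower()
    # Each alternative separated by the delimiter is an acceptable way to state the fact.
    alternatives = [alt.strip().lower() for alt in golden_fact.split(delimiter)]
    return int(any(alt and alt in response for alt in alternatives))


golden_fact = "10,317,750,796<OR>10317750796"

correct = ("Based on the documents provided, Amazon had 10,317,750,796 shares "
           "of common stock outstanding as of July 21, 2023.")
hallucinated = ("Based on the documents provided, Amazon had 22,003,237,746 shares "
                "of common stock outstanding as of July 21, 2023.")

score_correct = factual_knowledge_score(correct, golden_fact)
score_hallucinated = factual_knowledge_score(hallucinated, golden_fact)
```

Because the check is a plain substring match, a fact written as `10317750796` in the response only scores 1 if that exact form is listed among the alternatives.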

In the following example, we highlight the importance of units in ground truth for Factual Knowledge string matching. The golden question and golden fact represent Amazon’s total net sales for the second quarter of 2023.

Golden Question: “What were Amazon’s total net sales for the second quarter of 2023?”
Golden Fact: 134.4 billion<OR>134,383 million

The first response hallucinates the fact, using units of billions, and correctly receives a score of 0.0. The second response correctly represents the fact, in units of millions. Both units should be represented in the golden fact. The third response was unable to answer the question, flagging a potential issue with the information retrieval step.

| Metric | Example Response | Score | Calculation Approach |
| --- | --- | --- | --- |
| Factual Knowledge | Amazon’s total net sales for the second quarter of 2023 were $170.0 billion. | 0.0 | String match to golden fact |
| Factual Knowledge | The total consolidated net sales for Q2 2023 were $134,383 million according to this report. | 1.0 | String match to golden fact |
| Factual Knowledge | Sorry, the provided context does not include any information about Amazon’s total net sales for the second quarter of 2023. Would you like to ask another question? | 0.0 | String match to golden fact |

Interpreting Factual Knowledge scores

Factual knowledge scores are a useful flag for challenges in the generative AI pipeline such as hallucination or information retrieval problems. Factual knowledge scores can be curated in the form of a Factual Knowledge Report for human review, as shown in the following table, to visualize pipeline quality in terms of fact detection side by side.

| User Question | QA Ground Truth | Factual Ground Truth | Pipeline 1 | Pipeline 2 | Pipeline 3 |
| --- | --- | --- | --- | --- | --- |
| As of June 30, 2023, how many shares of Rivian’s Class A common stock did Amazon hold? | As of June 30, 2023, Amazon held 158 million shares of Rivian’s Class A common stock. | 158 million | 1 | 1 | 1 |
| How many shares of common stock were outstanding as of July 21, 2023? | There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023. | 10317750796<OR>10,317,750,796 | 1 | 1 | 1 |
| What was Amazon’s operating income for the six months ended June 30, 2023? | Amazon’s operating income for the six months ended June 30, 2023 was $12.5 billion. | 12.5 billion<OR>12,455 million<OR>12.455 billion | 1 | 1 | 1 |
| What was Amazon’s total cash, cash equivalents and restricted cash as of June 30, 2023? | Amazon’s total cash, cash equivalents, and restricted cash as of June 30, 2023 was $50.1 billion. | 50.1 billion<OR>50,067 million<OR>50.067 billion | 1 | 0 | 0 |
| What was a key challenge faced by Amazon’s business in the second quarter of 2023? | Changes in foreign exchange rates reduced Amazon’s International segment net sales by $180 million for Q2 2023. | foreign exchange rates | 0 | 0 | 0 |
| What were Amazon’s AWS sales for the second quarter of 2023? | Amazon’s AWS sales for the second quarter of 2023 were $22.1 billion. | 22.1 billion<OR>22,140 million<OR>22.140 billion<OR>22140 million | 1 | 0 | 0 |
| What were Amazon’s total net sales for the second quarter of 2023? | Amazon’s total net sales for the second quarter of 2023 were $134.4 billion. | 134.4 billion<OR>134,383 million<OR>134183 million<OR>134.383 billion | 1 | 0 | 0 |
| When did Amazon acquire One Medical? | Amazon acquired One Medical on February 22, 2023 for cash consideration of approximately $3.5 billion, net of cash acquired. | Feb 22 2023<OR>February 22nd 2023<OR>2023-02-22<OR>February 22, 2023 | 1 | 0 | 1 |
| Where is Amazon’s principal office located? | Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210. | 410 Terry Avenue North | 0 | 0 | 0 |
| Who is Andrew R. Jassy? | Andrew R. Jassy is the President and Chief Executive Officer of Amazon.com, Inc. | Chief Executive Officer of Amazon<OR>CEO of Amazon<OR>President of Amazon | 1 | 1 | 1 |
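To read a report like this side by side, it can help to aggregate the binary scores per pipeline and to list the questions that no pipeline answered. The sketch below does this for the scores in the report table, in row order; the variable names and the idea of flagging all-zero rows for a ground truth review are illustrative assumptions rather than part of FMEval.

```python
# Factual Knowledge scores copied from the report table, one entry per question in row order.
report = {
    "Pipeline 1": [1, 1, 1, 1, 0, 1, 1, 1, 0, 1],
    "Pipeline 2": [1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
    "Pipeline 3": [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
}

# Mean score per pipeline: the fraction of golden facts detected in its responses.
mean_scores = {name: sum(scores) / len(scores) for name, scores in report.items()}

# Row indices (0-based) where every pipeline scored 0. These are worth a human look:
# the cause may be retrieval, hallucination, or a golden fact that is too strict.
zero_for_all = [
    index
    for index, row in enumerate(zip(*report.values()))
    if not any(row)
]
```

Here the two all-zero rows are the "key challenge" and "principal office" questions, which are the kind of rows a judge would review before concluding that all three pipelines failed.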

Curating Factual Knowledge ground truth

Consider the impact of string matching between your ground truth and LLM responses when curating ground truth for Factual Knowledge. Best practices for curation in consideration of string matching are the following:

Understanding QA Accuracy metric calculation

We use the following question answer pair to demonstrate how FMEval metrics are calculated, and how this informs best practices in QA ground truth curation.

Golden Question: “How many shares of common stock were outstanding as of July 21, 2023?”
Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”

In calculating QA Accuracy metrics, the responses and ground truth are first normalized (lowercase, remove punctuation, remove articles, remove excess whitespace). Then, true positive, false positive, and false negative matches are computed between the LLM response and the ground truth. QA Accuracy metrics returned by FMEval include recall, precision, and F1. By assessing exact matching, the Exact Match and Quasi-Exact Match metrics are also returned. A detailed walkthrough of the calculation and scores is shown in the following tables.

The first table illustrates the accuracy metric calculation mechanism.

All three rows use the same example:

Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”

Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”

| Metric | Definition | Score |
| --- | --- | --- |
| True Positive (TP) | The number of words in the model output that are also contained in the ground truth. | 11 |
| False Positive (FP) | The number of words in the model output that are not contained in the ground truth. | 7 |
| False Negative (FN) | The number of words that are missing from the model output, but are included in the ground truth. | 3 |

The following table lists the accuracy scores.

| Metric | Score | Calculation Approach |
| --- | --- | --- |
| Recall Over Words | 0.786 | TP / (TP + FN) = 11 / 14 |
| Precision Over Words | 0.611 | TP / (TP + FP) = 11 / 18 |
| F1 | 0.688 | 2 × (Precision × Recall) / (Precision + Recall) |
| Exact Match | 0.0 | (Non-normalized) Binary score that indicates whether the model output is an exact match for the ground truth answer. |
| Quasi-Exact Match | 0.0 | (Normalized) Binary score that indicates whether the model output is an exact match for the ground truth answer. |
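The counts and scores in this walkthrough can be reproduced with a short word-overlap sketch. It is a simplification under stated assumptions: it lower-cases and strips ASCII punctuation only, counts repeated words, and keeps articles so that the counts line up with the walkthrough (TP 11, FP 7, FN 3). FMEval's own normalization and tokenization may differ, so scores from the library can deviate slightly. The function names are hypothetical.

```python
import string
from collections import Counter


def tokenize(text):
    """Lower-case, drop ASCII punctuation, and split on whitespace."""
    lowered = text.lower()
    without_punctuation = "".join(ch for ch in lowered if ch not in string.punctuation)
    return without_punctuation.split()


def word_match_counts(model_output, golden_answer):
    """Count true positive, false positive, and false negative word matches."""
    output_words = Counter(tokenize(model_output))
    golden_words = Counter(tokenize(golden_answer))
    tp = sum((output_words & golden_words).values())  # words shared by both texts
    fp = sum(output_words.values()) - tp              # extra words in the response
    fn = sum(golden_words.values()) - tp              # golden words the response missed
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 over words, guarding against empty inputs."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Equivalent to the harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


golden_answer = ("There were 10,317,750,796 shares of Amazon's common stock "
                 "outstanding as of July 21, 2023.")
example_response = ("Based on the documents provided, Amazon had 10,317,750,796 shares "
                    "of common stock outstanding as of July 21, 2023.")

tp, fp, fn = word_match_counts(example_response, golden_answer)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

With these counts, precision is 11/18, recall is 11/14, and F1 is 22/32 = 0.6875, which correspond to the 0.611, 0.786, and 0.688 reported above.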

Interpreting QA Accuracy scores

The following are best practices for interpreting QA accuracy scores:

| Interpretation | Question | Curated Ground Truth | High Closeness to Ground Truth | Low Closeness to Ground Truth |
| --- | --- | --- | --- | --- |
| Interpreting Closeness to Ground Truth Scores | “How many shares of common stock were outstanding as of July 21, 2023?” | “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” | “As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.” Score: 0.923 | “Sorry, I do not have access to documents containing common stock information about Amazon.” Score: 0.111 |

| Interpretation | Question | Curated Ground Truth | High Conciseness to Ground Truth | Low Conciseness to Ground Truth |
| --- | --- | --- | --- | --- |
| Interpreting Conciseness to Ground Truth | “How many shares of common stock were outstanding as of July 21, 2023?” | “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” | “As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.” Score: 1.0 | “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023. Specifically, in the first excerpt from the quarterly report for the quarter ending June 30, 2023, it states: ‘10,317,750,796 shares of common stock, par value $0.01 per share, outstanding as of July 21, 2023’ Therefore, the number of shares of Amazon common stock outstanding as of July 21, 2023 was 10,317,750,796 according to this statement.” Score: 0.238 |

| Interpretation | Question | Curated Ground Truth | High Combined Closeness x Conciseness | Low Combined Closeness x Conciseness |
| --- | --- | --- | --- | --- |
| Interpreting Closeness and Conciseness to Ground Truth | “How many shares of common stock were outstanding as of July 21, 2023?” | “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” | “As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.” Score: 0.96 | “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023. Specifically, in the first excerpt from the quarterly report for the quarter ending June 30, 2023, it states: ‘10,317,750,796 shares of common stock, par value $0.01 per share, outstanding as of July 21, 2023’ Therefore, the number of shares of Amazon common stock outstanding as of July 21, 2023 was 10,317,750,796 according to this statement.” Score: 0.364 |

| Interpretation | QA Ground Truth | Factual Ground Truth | Factual Knowledge | Recall Score | LLM response |
| --- | --- | --- | --- | --- | --- |
| Hallucination detection | Amazon’s total net sales for the second quarter of 2023 were $134.4 billion. | 134.4 billion<OR>134,383 million | 0 | 0.92 | Amazon’s total net sales for the second quarter of 2023 were $170.0 billion. |
| Detect false positive facts | There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023. | 10317750796<OR>10,317,750,796 | 1.0 | 0.0 | Document ID: 10317750796 |
| Correct answer, expressed in different words to ground truth question-answer-fact | Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210. | 410 Terry Avenue North | 0 | 0.54 | Amazon’s principal office is located in Seattle, Washington. |
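One way to act on the combined reading of Factual Knowledge and recall shown in the last table is a simple triage rule that routes responses to human review. The sketch below is illustrative only: the function name `triage`, the labels, and the `high` and `low` thresholds are hypothetical choices for this example and would need tuning against your own golden dataset.

```python
def triage(factual_knowledge: float, recall: float, high: float = 0.8, low: float = 0.2) -> str:
    """Suggest a review label from a Factual Knowledge score and a recall score."""
    if factual_knowledge == 0 and recall >= high:
        # Reads like the golden answer but the fact is absent: likely a wrong value.
        return "possible hallucination"
    if factual_knowledge == 1 and recall <= low:
        # Fact string matched, yet the response barely resembles the golden answer.
        return "possible false positive fact match"
    if factual_knowledge == 0:
        # Fact missing with partial overlap: may be correct but phrased differently.
        return "review: fact missing or phrased differently"
    return "fact present"


# The three (Factual Knowledge, recall) pairs from the table above.
labels = [triage(0, 0.92), triage(1.0, 0.0), triage(0, 0.54)]
```

The rule only flags candidates; the third case in particular still needs a judge to decide whether the ground truth or the response should change.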

Curating QA Accuracy ground truth

Consider the impact of true positive, false positive, and false negative matches between your golden answer and LLM responses when curating your ground truth for QA Accuracy. Best practices for curation in consideration of string matching are as follows:

Key takeaways

Curating appropriate ground truth and interpreting evaluation metrics in a feedback loop is crucial for effective business decision-making when deploying generative AI pipelines for question answering.

There were several key takeaways from this experiment:

Conclusion

In this post, we outlined best practices for ground truth curation and metric interpretation when evaluating generative AI question answering using FMEval. We demonstrated how to curate ground truth question-answer-fact triplets in consideration of the Factual Knowledge and QA Accuracy metrics calculated by FMEval. To validate our approach, we curated a golden dataset of 10 question-answer-fact triplets from Amazon’s Q2 2023 10Q report. We generated responses from three anonymized generative AI pipelines and calculated QA Accuracy and Factual Knowledge metrics.

Our primary findings emphasize that ground truth curation and metric interpretation are tightly coupled. Ground truth should be curated with the measurement approach in mind, and metrics can update the ground truth during golden dataset development. We further recommend curating separate ground truths for QA accuracy and factual knowledge, particularly emphasizing setting a desired level of verbosity according to user experience goals, and setting golden questions with unambiguous interpretations. Closeness and conciseness to ground truth are valid interpretations of FMEval recall and precision metrics, and factual knowledge scores can be used to detect hallucinations. Ultimately, the quantification of the expected user experience in the form of a golden dataset for pipeline evaluation with FMEval supports business decision-making, such as choosing between pipeline options, projecting quality changes from development to production, and adhering to legal and compliance requirements.

Whether you are building an internal application, a customer-facing virtual assistant, or exploring the potential of generative AI for your business, this post can help you use FMEval to make sure your projects meet the highest standards of quality and responsibility. We encourage you to adopt these best practices and start evaluating your generative AI question answering pipelines with the FMEval toolkit today.


About the Authors

Samantha Stuart is a Data Scientist with AWS Professional Services, and has delivered for customers across generative AI, MLOps, and ETL engagements. Samantha has a research master’s degree in engineering from the University of Toronto, where she authored several publications on data-centric AI for drug delivery system design. Outside of work, she is most likely spotted playing music, spending time with friends and family, at the yoga studio, or exploring Toronto.

Rahul Jani is a Data Architect with AWS Professional Services. He collaborates closely with enterprise customers building modern data platforms, generative AI applications, and MLOps. He is specialized in the design and implementation of big data and analytical applications on the AWS platform. Beyond work, he values quality time with family and embraces opportunities for travel.

Ivan Cui is a Data Science Lead with AWS Professional Services, where he helps customers build and deploy solutions using ML and generative AI on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, healthcare, IoT, and entertainment and media. In his free time, he enjoys reading, spending time with his family, and traveling.

Andrei Ivanovic is a Data Scientist with AWS Professional Services, with experience delivering internal and external solutions in generative AI, AI/ML, time series forecasting, and geospatial data science. Andrei has a Master’s in CS from the University of Toronto, where he was a researcher at the intersection of deep learning, robotics, and autonomous driving. Outside of work, he enjoys literature, film, strength training, and spending time with loved ones.
