Ground truth curation and metric interpretation best practices for evaluating generative AI question answering using FMEval

Generative artificial intelligence (AI) applications powered by large language models (LLMs) are rapidly gaining traction for question answering use cases. From internal knowledge bases for customer support to external conversational AI assistants, these applications use LLMs to provide human-like responses to natural language queries. However, building and deploying such assistants with responsible AI best practices requires a robust ground truth and evaluation framework to make sure they meet quality standards and user experience expectations, as well as clear evaluation interpretation guidelines to make the quality and responsibility of these systems intelligible to business decision-makers.

This post focuses on evaluating and interpreting metrics using FMEval for question answering in a generative AI application. FMEval is a comprehensive evaluation suite from Amazon SageMaker Clarify, providing standardized implementations of metrics to assess quality and responsibility. To learn more about FMEval, refer to Evaluate large language models for quality and responsibility.

In this post, we discuss best practices for working with FMEval in ground truth curation and metric interpretation for evaluating question answering applications for factual knowledge and quality. Ground truth data in AI refers to data that is known to be true, representing the expected outcome for the system being modeled. By providing a true expected outcome to measure against, ground truth data unlocks the ability to deterministically evaluate system quality. Ground truth curation and metric interpretation are tightly coupled, and the implementation of the evaluation metric must inform ground truth curation to achieve best results. By following these guidelines, data scientists can quantify the user experience delivered by their generative AI pipelines and communicate meaning to business stakeholders, facilitating ready comparisons across different architectures, such as Retrieval Augmented Generation (RAG) pipelines, off-the-shelf or fine-tuned LLMs, or agentic solutions.

Solution overview

We use an example ground truth dataset (referred to as the golden dataset, shown in the following table) of 10 question-answer-fact triplets. Each triplet describes a fact, and an encapsulation of the fact as a question-answer pair to emulate an ideal response, derived from a knowledge source document. We used Amazon’s Q2 2023 10Q report as the source document from the SEC’s public EDGAR dataset to create 10 question-answer-fact triplets. The 10Q report contains details on company financials and operations over the Q2 2023 business quarter. The golden dataset applies the ground truth curation best practices discussed in this post for most questions, but not all, to demonstrate the downstream impact of ground truth curation on metric results.

Question	Answer	Fact
Who is Andrew R. Jassy?	Andrew R. Jassy is the President and Chief Executive Officer of Amazon.com, Inc.	Chief Executive Officer of Amazon<OR>CEO of Amazon<OR>President of Amazon
What were Amazon’s total net sales for the second quarter of 2023?	Amazon’s total net sales for the second quarter of 2023 were $134.4 billion.	134.4 billion<OR>134,383 million<OR>134183 million<OR>134.383 billion
Where is Amazon’s principal office located?	Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210.	410 Terry Avenue North
What was Amazon’s operating income for the six months ended June 30, 2023?	Amazon’s operating income for the six months ended June 30, 2023 was $12.5 billion.	12.5 billion<OR>12,455 million<OR>12.455 billion
When did Amazon acquire One Medical?	Amazon acquired One Medical on February 22, 2023 for cash consideration of approximately $3.5 billion, net of cash acquired.	Feb 22 2023<OR>February 22nd 2023<OR>2023-02-22<OR>February 22, 2023
What was a key challenge faced by Amazon’s business in the second quarter of 2023?	Changes in foreign exchange rates reduced Amazon’s International segment net sales by $180 million for Q2 2023.	foreign exchange rates
What was Amazon’s total cash, cash equivalents and restricted cash as of June 30, 2023?	Amazon’s total cash, cash equivalents, and restricted cash as of June 30, 2023 was $50.1 billion.	50.1 billion<OR>50,067 million<OR>50.067 billion
What were Amazon’s AWS sales for the second quarter of 2023?	Amazon’s AWS sales for the second quarter of 2023 were $22.1 billion.	22.1 billion<OR>22,140 million<OR>22.140 billion<OR>22140 million
As of June 30, 2023, how many shares of Rivian’s Class A common stock did Amazon hold?	As of June 30, 2023, Amazon held 158 million shares of Rivian’s Class A common stock.	158 million
How many shares of common stock were outstanding as of July 21, 2023?	There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.	10317750796<OR>10,317,750,796

We generated responses from three generative AI RAG pipelines (anonymized as Pipeline1, Pipeline2, Pipeline3, as shown in the following figure) and calculated factual knowledge and QA accuracy metrics, evaluating them against the golden dataset. The fact key of the triplet is used for the Factual Knowledge metric ground truth, and the answer key is used for the QA Accuracy metric ground truth. With this, factual knowledge is measured against the fact key, and the ideal user experience in terms of style and conciseness is measured against the question-answer pairs.

Evaluation for question answering in a generative AI application

A generative AI pipeline can have many subcomponents, such as a RAG pipeline. RAG is a methodology to improve the accuracy of LLM responses answering a user query by retrieving and inserting relevant domain knowledge into the language model prompt. RAG quality depends on the configurations of the retriever (chunking, indexing) and generator (LLM selection and hyperparameters, prompt), as illustrated in the following figure. Tuning chunking and indexing in the retriever makes sure the correct content is available in the LLM prompt for generation. The chunk size and chunk splitting method, as well as the means of embedding and ranking relevant document chunks as vectors in the knowledge store, impact whether the actual answer to the query is ultimately inserted in the prompt. In the generator, selecting an appropriate LLM to run the prompt and tuning its hyperparameters and prompt template together control how the retrieved information is interpreted for the response. With this, when a final response from a RAG pipeline is evaluated, the preceding components may be adjusted to improve response quality.

Alternatively, question answering can be powered by a fine-tuned LLM, or through an agentic approach. Although we demonstrate the evaluation of final responses from RAG pipelines, the final responses from a generative AI pipeline for question answering can be similarly evaluated because the prerequisites are a golden dataset and the generative answers. With this approach, changes in the generative output due to different generative AI pipeline architectures can be evaluated to inform the best design choices (comparing RAG and knowledge retrieval agents, comparing LLMs used for generation, retrievers, chunking, prompts, and so on).

Although evaluating each sub-component of a generative AI pipeline is important in development and troubleshooting, business decisions rely on having an end-to-end, side-by-side data view, quantifying how a given generative AI pipeline will perform in terms of user experience. With this, business stakeholders can understand expected quality changes in terms of end-user experience by switching LLMs, and adhere to legal and compliance requirements, such as ISO42001 AI Ethics. There are further financial benefits to realize; for example, quantifying expected quality changes on internal datasets when switching a development LLM to a cheaper, lightweight LLM in production. The overall evaluation process for the benefit of decision-makers is outlined in the following figure. In this post, we focus our discussion on ground truth curation, evaluation, and interpreting evaluation scores for entire question answering generative AI pipelines using FMEval to enable data-driven decision-making on quality.

A useful mental model for ground truth curation and improvement of a golden dataset is a flywheel, as shown in the following figure. The ground truth experimentation process involves querying your generative AI pipeline with the initial golden dataset questions and evaluating the responses against initial golden answers using FMEval. Then, the quality of the golden dataset must be reviewed by a judge. The judge review of the golden dataset quality accelerates the flywheel towards an ever-improving golden dataset. The judge role in the workflow can be assumed by another LLM to enable scaling against established, domain-specific criteria for high-quality ground truth. Maintaining a human-in-the-loop component to the judge function remains essential to sample and verify results, as well as to increase the quality bar with increasing task complexity. Improvement to the golden dataset fosters improvement to the quality of the evaluation metrics, until sufficient measurement accuracy in the flywheel is met by the judge, using the established criteria for quality. To learn more about AWS offerings on human review of generations and data labeling, such as Amazon Augmented AI (Amazon A2I) and Amazon SageMaker Ground Truth Plus, refer to Using Amazon Augmented AI for Human Review and High-quality human feedback for your generative AI applications from Amazon SageMaker Ground Truth Plus. When using LLMs as a judge, make sure to apply prompt safety best practices.

However, to conduct reviews of golden dataset quality as part of the ground truth experiment flywheel, human reviewers must understand the evaluation metric implementation and its coupling to ground truth curation.

FMEval metrics for question answering in a generative AI application

The Factual Knowledge and QA Accuracy metrics from FMEval provide a way to evaluate custom question answering datasets against ground truth. For a full list of metrics implemented with FMEval, refer to Using prompt datasets and available evaluation dimensions in model evaluation jobs.

Factual Knowledge

The Factual Knowledge metric evaluates whether the generated response contains factual information present in the ground truth answer. It is a binary (0 or 1) score based on a string match. Factual knowledge also reports a quasi-exact string match which performs matching after normalization. For simplicity, we focus on the exact match Factual Knowledge score in this post.

For each golden question:

0 indicates the lowercased factual ground truth is not present in the model response

1 indicates the lowercased factual ground truth is present in the response

QA Accuracy

The QA Accuracy metric measures a model’s question answering accuracy by comparing its generated answers against ground truth answers. The metrics are computed from true positive, false positive, and false negative word matches between the QA ground truth answers and the generated answers.

It includes several sub-metrics:

Recall Over Words

Precision Over Words

F1 Over Words

Exact Match

Quasi Exact Match

Because QA Accuracy metrics are calculated on an exact match basis (for more details, see Accuracy), they may be less reliable for questions where the answer can be rephrased without modifying its meaning. To mitigate this, we propose applying Factual Knowledge as the assessment of factual correctness, motivating the use of a dedicated factual ground truth with minimal word expression, together with QA Accuracy as a measure of idealized user experience in terms of response verbosity and style. We elaborate on these concepts later in this post. The BERTScore is also computed as part of QA Accuracy, which provides a measure of semantic match quality against the ground truth.

Proposed ground truth curation best practices for question answering with FMEval

In this section, we share best practices for curating your ground truth for question answering with FMEval.

Understanding the Factual Knowledge metric calculation

A factual knowledge score is a binary measure of whether a real-world fact was correctly retrieved by the generative AI pipeline. 0 indicates the lower-cased expected answer is not part of the model response, whereas 1 indicates it is. Where there is more than one acceptable answer, and either answer is considered correct, apply a logical operator for OR. A configuration for a logical AND can also be applied for cases where the factual material encompasses multiple distinct entities. In the present examples, we demonstrate a logical OR, using the <OR> delimiter. See Use SageMaker Clarify to evaluate large language models for information about logical operators. An example curation of a golden question and golden fact is shown in the following table.

Golden Question	“How many shares of common stock were outstanding as of July 21, 2023?”
Golden Fact	10,317,750,796<OR>10317750796

Fact detection is useful for assessing hallucination in a generative AI pipeline. The two sample responses in the following table illustrate fact detection. The first example correctly states the fact in the example response, and receives a 1.0 score. The second example hallucinates a number instead of stating the fact, and receives a 0 score.

Metric	Example Response	Score	Calculation Approach
Factual Knowledge	“Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”	1.0	String match to golden fact
Factual Knowledge	“Based on the documents provided, Amazon had 22,003,237,746 shares of common stock outstanding as of July 21, 2023.”	0.0	String match to golden fact

In the following example, we highlight the importance of units in ground truth for Factual Knowledge string matching. The golden question and golden fact represent Amazon’s total net sales for the second quarter of 2023.

Golden Question	“What were Amazon’s total net sales for the second quarter of 2023?”
Golden Fact	134.4 billion<OR>134,383 million

The first response hallucinates the fact, using units of billions, and correctly receives a score of 0.0. The second response correctly represents the fact, in units of millions. Both units should be represented in the golden fact. The third response was unable to answer the question, flagging a potential issue with the information retrieval step.

Metric	Example Response	Score	Calculation Approach
Factual Knowledge	Amazon’s total net sales for the second quarter of 2023 were $170.0 billion.	0.0	String match to golden fact
Factual Knowledge	The total consolidated net sales for Q2 2023 were $134,383 million according to this report.	1.0	String match to golden fact
Factual Knowledge	Sorry, the provided context does not include any information about Amazon’s total net sales for the second quarter of 2023. Would you like to ask another question?	0.0	String match to golden fact
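The string matching described above can be sketched in a few lines of Python. This is a minimal illustration rather than FMEval's implementation: the function name `factual_knowledge_score` is hypothetical, and stripping whitespace around each `<OR>` variant is an assumption made for this sketch.

```python
def factual_knowledge_score(response: str, golden_fact: str, delimiter: str = "<OR>") -> int:
    """Return 1 if any lowercased fact variant appears in the lowercased response, else 0.

    Illustrative sketch only. Stripping whitespace around each variant is an
    assumption of this example, not a documented FMEval behavior.
    """
    variants = [variant.strip().lower() for variant in golden_fact.split(delimiter)]
    response_lowercased = response.lower()
    return int(any(variant and variant in response_lowercased for variant in variants))


golden_fact = "134.4 billion<OR>134,383 million"
responses = [
    "Amazon's total net sales for the second quarter of 2023 were $170.0 billion.",
    "The total consolidated net sales for Q2 2023 were $134,383 million according to this report.",
    "Sorry, the provided context does not include any information about Amazon's total net sales "
    "for the second quarter of 2023. Would you like to ask another question?",
]

scores = [factual_knowledge_score(response, golden_fact) for response in responses]
print(scores)  # [0, 1, 0]
```

If the golden fact listed only `134.4 billion`, the second response would score 0 despite being correct, which is why both unit representations belong in the golden fact.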

Interpreting Factual Knowledge scores

Factual knowledge scores are a useful flag for challenges in the generative AI pipeline such as hallucination or information retrieval problems. Factual knowledge scores can be curated in the form of a Factual Knowledge Report for human review, as shown in the following table, to visualize pipeline quality in terms of fact detection side by side.

User Question	QA Ground Truth	Factual Ground Truth	Pipeline 1	Pipeline 2	Pipeline 3
As of June 30, 2023, how many shares of Rivian’s Class A common stock did Amazon hold?	As of June 30, 2023, Amazon held 158 million shares of Rivian’s Class A common stock.	158 million	1	1	1
How many shares of common stock were outstanding as of July 21, 2023?	There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.	10317750796<OR>10,317,750,796	1	1	1
What was Amazon’s operating income for the six months ended June 30, 2023?	Amazon’s operating income for the six months ended June 30, 2023 was $12.5 billion.	12.5 billion<OR>12,455 million<OR>12.455 billion	1	1	1
What was Amazon’s total cash, cash equivalents and restricted cash as of June 30, 2023?	Amazon’s total cash, cash equivalents, and restricted cash as of June 30, 2023 was $50.1 billion.	50.1 billion<OR>50,067 million<OR>50.067 billion	1	0	0
What was a key challenge faced by Amazon’s business in the second quarter of 2023?	Changes in foreign exchange rates reduced Amazon’s International segment net sales by $180 million for Q2 2023.	foreign exchange rates	0	0	0
What were Amazon’s AWS sales for the second quarter of 2023?	Amazon’s AWS sales for the second quarter of 2023 were $22.1 billion.	22.1 billion<OR>22,140 million<OR>22.140 billion<OR>22140 million	1	0	0
What were Amazon’s total net sales for the second quarter of 2023?	Amazon’s total net sales for the second quarter of 2023 were $134.4 billion.	134.4 billion<OR>134,383 million<OR>134183 million<OR>134.383 billion	1	0	0
When did Amazon acquire One Medical?	Amazon acquired One Medical on February 22, 2023 for cash consideration of approximately $3.5 billion, net of cash acquired.	Feb 22 2023<OR>February 22nd 2023<OR>2023-02-22<OR>February 22, 2023	1	0	1
Where is Amazon’s principal office located?	Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210.	410 Terry Avenue North	0	0	0
Who is Andrew R. Jassy?	Andrew R. Jassy is the President and Chief Executive Officer of Amazon.com, Inc.	Chief Executive Officer of Amazon<OR>CEO of Amazon<OR>President of Amazon	1	1	1
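As a rough sketch of how such a report can be summarized, the following Python snippet averages the Factual Knowledge scores per pipeline and flags questions that score zero across every pipeline. The scores are copied from the preceding table; the shortened question labels and variable names are illustrative choices, not part of FMEval.

```python
# Factual Knowledge scores per question for (Pipeline 1, Pipeline 2, Pipeline 3),
# in the same row order as the report. Labels are abbreviated for this sketch.
report = [
    ("Rivian Class A shares held", (1, 1, 1)),
    ("Common shares outstanding", (1, 1, 1)),
    ("Operating income, six months", (1, 1, 1)),
    ("Total cash, cash equivalents and restricted cash", (1, 0, 0)),
    ("Key challenge in Q2 2023", (0, 0, 0)),
    ("AWS sales", (1, 0, 0)),
    ("Total net sales", (1, 0, 0)),
    ("One Medical acquisition date", (1, 0, 1)),
    ("Principal office location", (0, 0, 0)),
    ("Who is Andrew R. Jassy", (1, 1, 1)),
]
pipelines = ("Pipeline 1", "Pipeline 2", "Pipeline 3")

# Mean Factual Knowledge score per pipeline across the golden dataset.
mean_by_pipeline = {
    name: sum(scores[index] for _, scores in report) / len(report)
    for index, name in enumerate(pipelines)
}

# Questions no pipeline answered: candidates for ground truth review.
zero_across_pipelines = [label for label, scores in report if not any(scores)]

print(mean_by_pipeline)       # {'Pipeline 1': 0.8, 'Pipeline 2': 0.4, 'Pipeline 3': 0.5}
print(zero_across_pipelines)  # ['Key challenge in Q2 2023', 'Principal office location']
```

Questions that score zero for every pipeline are worth a second look by the judge, because the cause may lie in the golden triplet rather than in the pipelines.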

Curating Factual Knowledge ground truth

Consider the impact of string matching between your ground truth and LLM responses when curating ground truth for Factual Knowledge. Best practices for curation in consideration of string matching are the following:

Use a minimal version of the QA Accuracy ground truth for a factual ground truth containing the most important facts

Zero factual knowledge scores across the benchmark can indicate a poorly formed golden question-answer-fact triplet

Generate many variations of fact representation in terms of units and punctuation

Avoid false positive matches

Understanding QA Accuracy metric calculation

We use the following question answer pair to demonstrate how FMEval metrics are calculated, and how this informs best practices in QA ground truth curation.

Golden Question	“How many shares of common stock were outstanding as of July 21, 2023?”
Golden Answer	“There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”

In calculating QA Accuracy metrics, the responses and ground truth are first normalized (lowercased, with punctuation, articles, and excess whitespace removed). Then, true positive, false positive, and false negative matches are computed between the LLM response and the ground truth. QA Accuracy metrics returned by FMEval include recall, precision, and F1. By assessing exact matching, the Exact Match and Quasi-Exact Match metrics are also returned. A detailed walkthrough of the calculation and scores is shown in the following tables.

The first table illustrates the accuracy metric calculation mechanism.

Metric	Definition	Example	Score
True Positive (TP)	The number of words in the model output that are also contained in the ground truth.	Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”	11
False Positive (FP)	The number of words in the model output that are not contained in the ground truth.	Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”	7
False Negative (FN)	The number of words that are missing from the model output, but are included in the ground truth.	Golden Answer: “There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.” Example Response: “Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023.”	3

The following table lists the accuracy scores.

Metric	Score	Calculation Approach
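Recall Over Words	0.786	TP / (TP + FN) = 11 / (11 + 3)
Precision Over Words	0.611	TP / (TP + FP) = 11 / (11 + 7)
F1	0.688	2 × (Precision × Recall) / (Precision + Recall)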
Exact Match	0.0	(Non-normalized) Binary score that indicates whether the model output is an exact match for the ground truth answer.
Quasi-Exact Match	0.0	(Normalized) Binary score that indicates whether the model output is an exact match for the ground truth answer.
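The recall, precision, and F1 scores follow from the TP, FP, and FN counts in the walkthrough above using the standard definitions. A short worked calculation in Python, with the counts taken from the first table:

```python
# Word match counts from the walkthrough table.
true_positives, false_positives, false_negatives = 11, 7, 3

# Standard definitions of recall, precision, and F1 in terms of match counts.
recall = true_positives / (true_positives + false_negatives)       # 11 / 14
precision = true_positives / (true_positives + false_positives)    # 11 / 18
f1 = 2 * true_positives / (2 * true_positives + false_positives + false_negatives)  # 22 / 32

print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.4f}")
# recall=0.786 precision=0.611 f1=0.6875 (reported as 0.688)
```

The F1 expression used here is algebraically the same as the harmonic mean of precision and recall.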

Interpreting QA Accuracy scores

The following are best practices for interpreting QA accuracy scores:

Interpret recall as closeness to ground truth

The higher the recall score, the more ground truth is included in the model response. If the entire ground truth is included in the model response, recall will be perfect (1.0), and if no ground truth is included in the model response, recall will be zero (0.0). Low recall in response to a golden question can indicate a problem with information retrieval, as shown in the example in the following table. A high recall score, however, doesn’t unilaterally indicate a correct response. Hallucinations of facts can present as a single deviated word between model response and ground truth, while still yielding a high true positive rate in word matching. For such cases, you can complement QA Accuracy scores with Factual Knowledge assessments of golden questions in FMEval (we provide examples later in this post).

Interpretation	Question	Curated Ground Truth	High Closeness to Ground Truth		Low Closeness to Ground Truth
Interpreting Closeness to Ground Truth Scores	“How many shares of common stock were outstanding as of July 21, 2023?”	“There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”	“As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.”	0.923	“Sorry, I do not have access to documents containing common stock information about Amazon.”	0.111

Interpret precision as conciseness to ground truth

Interpretation	Question	Curated Ground Truth	High Conciseness to Ground Truth		Low Conciseness to Ground Truth
Interpreting Conciseness to Ground Truth	“How many shares of common stock were outstanding as of July 21, 2023?”	“There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”	“As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.”	1.0	“Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023. Specifically, in the first excerpt from the quarterly report for the quarter ending June 30, 2023, it states: ‘10,317,750,796 shares of common stock, par value $0.01 per share, outstanding as of July 21, 2023’ Therefore, the number of shares of Amazon common stock outstanding as of July 21, 2023 was 10,317,750,796 according to this statement.”	0.238

Interpret F1 score as combined closeness and conciseness to ground truth

Interpretation	Question	Curated Ground Truth	High Combined Closeness x Conciseness		Low Combined Closeness x Conciseness
Interpreting Closeness and Conciseness to Ground Truth	“How many shares of common stock were outstanding as of July 21, 2023?”	“There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.”	“As of July 21, 2023, there were 10,317,750,796 shares of common stock outstanding.”	0.96	“Based on the documents provided, Amazon had 10,317,750,796 shares of common stock outstanding as of July 21, 2023. Specifically, in the first excerpt from the quarterly report for the quarter ending June 30, 2023, it states: ‘10,317,750,796 shares of common stock, par value $0.01 per share, outstanding as of July 21, 2023’ Therefore, the number of shares of Amazon common stock outstanding as of July 21, 2023 was 10,317,750,796 according to this statement.”	0.364

Combine factual knowledge with recall for detection of hallucinated facts and false fact matches

High recall with zero factual knowledge suggests a hallucinated fact. Zero recall with positive factual knowledge suggests an accidental match between the factual ground truth and an unrelated entity such as a document ID, phone number, or date. Low recall and zero factual knowledge may also suggest a correct answer that has been expressed with alternative language to the QA ground truth. Improved ground truth curation (increased question specificity, more ground truth fact variants) can remediate this problem. The BERTScore can also provide semantic context on match quality.

Interpretation	QA Ground Truth	Factual Ground Truth	Factual Knowledge	Recall Score	LLM response
Hallucination detection	Amazon’s total net sales for the second quarter of 2023 were $134.4 billion.	134.4 billion<OR>134,383 million	0	0.92	Amazon’s total net sales for the second quarter of 2023 were $170.0 billion.
Detect false positive facts	There were 10,317,750,796 shares of Amazon’s common stock outstanding as of July 21, 2023.	10317750796<OR>10,317,750,796	1.0	0.0	Document ID: 10317750796
Correct answer, expressed in different words to ground truth question-answer-fact	Amazon’s principal office is located at 410 Terry Avenue North, Seattle, Washington 98109-5210.	410 Terry Avenue North	0	0.54	Amazon’s principal office is located in Seattle, Washington.
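The three interpretations in the preceding table can be expressed as a simple triage rule for flagging responses for human review. The sketch below is one possible encoding: the function name `triage` and the `high_recall` threshold of 0.8 are hypothetical, and an appropriate threshold should be determined by experimentation on your own pipelines.

```python
def triage(factual_knowledge: int, recall: float, high_recall: float = 0.8) -> str:
    """Suggest a review flag from a Factual Knowledge score and a recall score.

    The 0.8 threshold is an illustrative assumption, not an FMEval default.
    """
    if factual_knowledge == 0 and recall >= high_recall:
        return "possible hallucinated fact"
    if factual_knowledge == 1 and recall == 0.0:
        return "possible false positive fact match"
    if factual_knowledge == 0:
        return "possible alternative wording or retrieval problem"
    return "no flag"


# (Factual Knowledge, Recall Score) pairs from the three rows of the table.
for factual_knowledge, recall in [(0, 0.92), (1, 0.0), (0, 0.54)]:
    print(factual_knowledge, recall, "->", triage(factual_knowledge, recall))
```

The flags are prompts for review rather than verdicts; a flagged row may call for a change to the golden triplet as often as a change to the pipeline.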

Curating QA Accuracy ground truth

Consider the impact of true positive, false positive, and false negative matches between your golden answer and LLM responses when curating your ground truth for QA Accuracy. Best practices for curation in consideration of string matching are as follows:

Use LLMs to generate initial golden questions and answers

Human review golden answers for proximity to desired output

Compare LLM accuracy using recall

Compare verbosity using precision

Experiment to determine recall acceptability thresholds for generative AI pipelines

Interpret QA accuracy metrics in conjunction with other metrics to pass judgement on accuracy

Key takeaways

Curating appropriate ground truth and interpreting evaluation metrics in a feedback loop is crucial for effective business decision-making when deploying generative AI pipelines for question answering.

There were several key takeaways from this experiment:

Ground truth curation and metric interpretation are a cyclical process

Low-scoring evaluations can indicate problems with ground truth curation in addition to generative AI pipeline quality

Balance recall, precision, and F1 scores

Design ground truth verbosity to the level desired in your user experience

Use recall and factual knowledge for setting accuracy thresholds

Curate distinct QA and factual ground truths

Golden questions should be unambiguous

Automate, but verify, with LLMs

Conclusion

In this post, we outlined best practices for ground truth curation and metric interpretation when evaluating generative AI question answering using FMEval. We demonstrated how to curate ground truth question-answer-fact triplets in consideration of the Factual Knowledge and QA Accuracy metrics calculated by FMEval. To validate our approach, we curated a golden dataset of 10 question-answer-fact triplets from Amazon’s Q2 2023 10Q report. We generated responses from three anonymized generative AI pipelines and calculated QA Accuracy and Factual Knowledge metrics.

Our primary findings emphasize that ground truth curation and metric interpretation are tightly coupled. Ground truth should be curated with the measurement approach in mind, and metrics can update the ground truth during golden dataset development. We further recommend curating separate ground truths for QA accuracy and factual knowledge, particularly emphasizing setting a desired level of verbosity according to user experience goals, and setting golden questions with unambiguous interpretations. Closeness and conciseness to ground truth are valid interpretations of FMEval recall and precision metrics, and factual knowledge scores can be used to detect hallucinations. Ultimately, the quantification of the expected user experience in the form of a golden dataset for pipeline evaluation with FMEval supports business decision-making, such as choosing between pipeline options, projecting quality changes from development to production, and adhering to legal and compliance requirements.

Whether you are building an internal application, a customer-facing virtual assistant, or exploring the potential of generative AI for your business, this post can help you use FMEval to make sure your projects meet the highest standards of quality and responsibility. We encourage you to adopt these best practices and start evaluating your generative AI question answering pipelines with the FMEval toolkit today.

About the Authors

Samantha Stuart is a Data Scientist with AWS Professional Services, and has delivered for customers across generative AI, MLOps, and ETL engagements. Samantha has a research master’s degree in engineering from the University of Toronto, where she authored several publications on data-centric AI for drug delivery system design. Outside of work, she is most likely spotted playing music, spending time with friends and family, at the yoga studio, or exploring Toronto.

Rahul Jani is a Data Architect with AWS Professional Services. He collaborates closely with enterprise customers building modern data platforms, generative AI applications, and MLOps. He is specialized in the design and implementation of big data and analytical applications on the AWS platform. Beyond work, he values quality time with family and embraces opportunities for travel.

Ivan Cui is a Data Science Lead with AWS Professional Services, where he helps customers build and deploy solutions using ML and generative AI on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, healthcare, IoT, and entertainment and media. In his free time, he enjoys reading, spending time with his family, and traveling.

Andrei Ivanovic is a Data Scientist with AWS Professional Services, with experience delivering internal and external solutions in generative AI, AI/ML, time series forecasting, and geospatial data science. Andrei has a Master’s in CS from the University of Toronto, where he was a researcher at the intersection of deep learning, robotics, and autonomous driving. Outside of work, he enjoys literature, film, strength training, and spending time with loved ones.