This cookbook provides a practical, end-to-end guide on how to effectively use evals as the core process in creating a production-grade autonomous system to replace a labor-intensive human workflow. It's a direct product of collaborative experience dealing with projects where users may not have started with pristine labeled data or a perfect understanding of the problem - two issues that most tutorials gloss over but are in practice almost always serious challenges.
Making evals the core process prevents poke-and-hope guesswork and impressionistic judgments of accuracy, instead demanding engineering rigor. This means we can make principled decisions about cost trade-offs and investment.
This guide is designed for ML/AI engineers and Solution Architects who are looking for practical guidance beyond introductory tutorials. This notebook is fully executable and organized to be as modular as possible to support using code samples directly in your own applications.
We'll follow a realistic storyline: replacing a manual receipt-analysis service for validating expenses.
- Start Small: Begin with a very small set of labeled data (retail receipts). Many businesses don't have good ground truth data sets.
- Build Incrementally: Develop a minimal viable system and establish initial evals.
- Business Alignment: Evaluate eval performance in the context of business KPIs and dollar impact, and target efforts to avoid working on low-impact improvements.
- Eval-Driven Iteration: Iteratively improve by using eval scores to power model improvements, then by using better models on more data to expand evals and identify more areas for improvement.
This cookbook is structured as an eval-centric guide through the lifecycle of building an LLM application.
- If you're primarily interested in the ideas presented, read through the text and skim over the code.
- If you're here because of something else you're working on, you can go ahead and jump to that section and dig into the code there, copy it, and adapt it to your needs.
- If you want to really understand how this all works, download this notebook and run the cells as you read through it; edit the code to make your own changes, test your hypotheses, and make sure you actually understand how it all works together.
Note: If your OpenAI organization has a Zero Data Retention (ZDR) policy, Evals dashboards and logs will not be available, since prompts and responses are not stored. This may limit visibility into eval results for compliance-focused enterprise accounts.
In order to condense this guide we'll be using a small hypothetical problem that's still complex enough to merit detailed and multi-faceted evals. In particular, we'll be focused on how to solve a problem given a limited amount of data to work with, so we're working with a dataset that's quite small.
For this guide, we assume that we are starting with a workflow for reviewing and filing receipts. While in general this is a problem that already has a lot of established solutions, it's analogous to other problems that don't have nearly so much prior work; further, even when good enterprise solutions exist there is often still a "last mile" problem that still requires human time.
In our case, we'll assume we have a pipeline where:
- People upload photos of receipts
- An accounting team reviews each receipt to categorize and approve or audit the expense
From interviews with the accounting team, we know they make their decisions based on:
- Merchant
- Geographic location
- Expense amount
- Items or services purchased
- Handwritten notes or annotations
Our system will be expected to handle most receipts without any human intervention, but escalate low-confidence decisions for human QA. We'll be focused on reducing the total cost of the accounting process, which is dependent on:
- How much the previous / current system cost to run per-receipt
- How many receipts the new system sends to QA
- How much the system costs to run per-receipt, plus any fixed costs
- What the business impact is of mistakes, either receipts kicked out for review or mistakes missed
- The cost of engineering to develop and integrate the system
The receipt images come from the CC BY 4.0 licensed Receipt Handwriting Detection Computer Vision Project dataset published by Roboflow. We've added our own labels and narrative spin in order to tell a story with a small number of examples.
Not every project will proceed in the same way, but projects generally have some important components in common.
The solid arrows show the primary progressions or steps, while the dotted line represents the ongoing nature of problem understanding - uncovering more about the customer domain will influence every step of the process. We will examine several of these iterative cycles of refinement in detail below.
Usually, the decision to start an engineering process is made by leadership who understand the business impact but don't need to know the process details. In our example, we're building a system designed to replace a non-AI workflow. In a sense this is ideal: we have a set of domain experts, the people currently doing the task, who we can interview to understand the task details and who we can lean upon to help develop appropriate evals.
This step doesn't end before we start building our system; invariably, our initial assessments are an incomplete understanding of the problem space and we will continue to refine our understanding as we get closer to a solution.
It's very rare for a real-world project to begin with all the data necessary to achieve a satisfactory solution, let alone establish confidence.
In our case, we'll assume we have a decent sample of system inputs, in the form of receipt images, but start without any fully annotated data. We find this is a not-unusual situation when automating an existing process. We'll walk through the process of incrementally expanding our test and training sets in collaboration with domain experts as we go along and make our evals progressively more comprehensive.
We want to get the skeleton of a system built as quickly as possible. We don't need a system that performs well - we just need something that accepts the right inputs and provides outputs of the correct type. Usually this is almost as simple as describing the task in a prompt, adding the inputs, and using a single model (usually with structured outputs) to make an initial best-effort attempt.
We've found that in the absence of an established ground truth, it's not uncommon to use an early version of a system to generate 'draft' truth data which can be annotated or corrected by domain experts.
Once we have an end-to-end system constructed, we can start processing the inputs we have to generate plausible outputs. We'll send these to our domain experts to grade and correct. We will use these corrections and conversations about how the experts are making their decisions to design further evals and to embed expertise in the system.
Before we jump into correcting every error, we need to make sure that we're investing time effectively. The most critical task at this stage is to review our evals and gain an understanding of how they connect to our key objectives.
- Step back and assess the potential costs and benefits of the system
- Identify which eval measurements speak directly to those costs and benefits
  - For example, what does "failure" on a particular eval cost? Are we measuring something worthwhile?
- Create a (non-LLM) model that uses eval metrics to provide a dollar value
- Balance performance (accuracy, or speed) with cost to develop and run
Having identified which efforts are most worth making, we can begin iterating on improvements to the system. The evals act as an objective guide so we know when we've made the system good enough, and ensure we avoid or identify regression.
Evals aren't just for development. Instrumenting all or a portion of a production service will surface more useful test and training samples over time, identifying incorrect assumptions or finding areas with insufficient coverage. This is also the only way you can ensure that your models continue performing well long after your initial development process is complete.
In practice, we would probably be building a system that operates via a REST API, possibly with some web frontend that would have access to some set of components and resources. For the purposes of this cookbook, we'll distill that down to a pair of functions, extract_receipt_details and evaluate_receipt_for_audit, that collectively decide what we should do with a given receipt.
- extract_receipt_details will take an image as input and produce structured output containing important details about the receipt.
- evaluate_receipt_for_audit will take that structure as input and decide whether or not the receipt should be audited.
Breaking up a process into steps like this has both pros and cons; it is easier to examine and develop if the process is made up of small isolated steps. But you can progressively lose information, effectively letting your agents play "telephone". In this notebook we break up the steps and don't let the auditor see the actual receipt because it's more instructive for the evals we want to discuss.
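To make the intended shape of the pipeline concrete, here is a minimal sketch of how the two steps chain together. The `process_receipt` wrapper is our own illustrative name (it is not part of the original notebook), and it assumes the `extract_receipt_details` and `evaluate_receipt_for_audit` functions defined later in this cookbook.

```python
# A minimal sketch of the two-step pipeline, assuming the extract_receipt_details
# and evaluate_receipt_for_audit functions defined later in this notebook.
# `process_receipt` is an illustrative helper, not original cookbook code.
async def process_receipt(image_path: str):
    details = await extract_receipt_details(image_path)  # image -> ReceiptDetails
    return await evaluate_receipt_for_audit(details)     # ReceiptDetails -> AuditDecision
```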
We'll start with the first step, the literal data extraction. This is intermediate data: it's information that people would examine implicitly, but often isn't recorded. And for this reason, we often don't have labeled data to work from.
```python
%pip install --upgrade openai pydantic python-dotenv rich persist-cache -qqq

%load_ext dotenv
%dotenv

# Place your API key in a file called .env
# OPENAI_API_KEY=sk-...
```
Capture the meaningful information in a structured output.
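The notebook defines a `ReceiptDetails` Pydantic model at this point; since the definition isn't reproduced above, here is a minimal sketch consistent with the fields referenced by the extraction prompt and graders later in this cookbook. The exact field types and the names inside `LineItem` are our assumptions, not the original definitions.

```python
from pydantic import BaseModel

# A sketch of the structured extraction output, assuming fields that match the
# prompt and graders used later (merchant, location, time, items, subtotal, tax,
# total, handwritten_notes). LineItem field names are illustrative assumptions.
class Location(BaseModel):
    city: str | None
    state: str | None
    zipcode: str | None

class LineItem(BaseModel):
    description: str | None
    product_code: str | None
    category: str | None
    item_price: str | None
    sale_price: str | None
    quantity: str | None
    total: str | None

class ReceiptDetails(BaseModel):
    merchant: str | None
    location: Location
    time: str | None
    items: list[LineItem]
    subtotal: str | None
    tax: str | None
    total: str | None
    handwritten_notes: list[str]
```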
Note: Normally we would use decimal.Decimal objects for the numbers above and datetime.datetime objects for the time field, but neither of those deserializes well. For the purposes of this cookbook, we'll work with strings, but in practice you'd want to have another level of translation to get correctly validated output.
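As a rough illustration of that extra translation layer (our own sketch, not code from the original notebook), you might convert the string fields into richer types at the boundary of your system:

```python
from decimal import Decimal, InvalidOperation
from datetime import datetime

# Illustrative helpers for converting string fields at the edge of the system;
# these names are ours, not from the original notebook.
def parse_money(value: str | None) -> Decimal | None:
    if value is None:
        return None
    try:
        return Decimal(value.replace("$", "").replace(",", ""))
    except InvalidOperation:
        return None

def parse_timestamp(value: str | None) -> datetime | None:
    if value is None:
        return None
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None
```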
Let's build our extract_receipt_details function.
Usually, for the very first stab at something that might work, we'll simply feed ChatGPT the available documents we've assembled so far and ask it to generate a prompt. It's not worth spending too much time on prompt engineering before you have a benchmark to grade yourself against! This is a prompt produced by o4-mini based on the problem description above.
```python
BASIC_PROMPT = """
Given an image of a retail receipt, extract all relevant information and format it as a
structured response.

# Task Description

Carefully examine the receipt image and identify the following key information:

1. Merchant name and any relevant store identification
2. Location information (city, state, ZIP code)
3. Date and time of purchase
4. All purchased items with their:
   * Item description/name
   * Item code/SKU (if present)
   * Category (infer from context if not explicit)
   * Regular price per item (if available)
   * Sale price per item (if discounted)
   * Quantity purchased
   * Total price for the line item
5. Financial summary:
   * Subtotal before tax
   * Tax amount
   * Final total
6. Any handwritten notes or annotations on the receipt (list each separately)

## Important Guidelines

* If information is unclear or missing, return null for that field
* Format dates as ISO format (YYYY-MM-DDTHH:MM:SS)
* Format all monetary values as decimal numbers
* Distinguish between printed text and handwritten notes
* Be precise with amounts and totals
* For ambiguous items, use your best judgment based on context

Your response should be structured and complete, capturing all available information
from the receipt.
"""
```
```python
import base64
import mimetypes
from pathlib import Path

from openai import AsyncOpenAI

client = AsyncOpenAI()


async def extract_receipt_details(
    image_path: str, model: str = "o4-mini"
) -> ReceiptDetails:
    """Extract structured details from a receipt image."""
    # Determine image type for data URI.
    mime_type, _ = mimetypes.guess_type(image_path)

    # Read and base64 encode the image.
    b64_image = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    image_data_url = f"data:{mime_type};base64,{b64_image}"

    response = await client.responses.parse(
        model=model,
        input=[
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": BASIC_PROMPT},
                    {"type": "input_image", "image_url": image_data_url},
                ],
            }
        ],
        text_format=ReceiptDetails,
    )

    return response.output_parsed
```
Let's evaluate just a single receipt and review it manually to see how well a smart model with a naive prompt can do.
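The cell that actually runs the extraction isn't shown above; a minimal sketch of doing so for a single image might look like the following. The specific file name is a placeholder assumption, not a real path from the dataset.

```python
from rich import print as rprint

# Run the extraction on a single receipt and inspect it by hand.
# The file name below is a placeholder; use any image from your data directory.
receipt_image_path = "data/test/example_receipt.jpg"
extracted = await extract_receipt_details(receipt_image_path)
rprint(extracted)
```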
The model extracted a lot of things correctly, but renamed some of the line items - incorrectly, in fact. More importantly, it got some of the prices wrong, and it decided not to categorize any of the line items.
That's okay, we don't expect to have perfect answers at this point! Instead, our objective is to build a basic system we can evaluate. Then, when we start iterating, we won't be 'vibing' our way to something that looks better -- we'll be engineering a reliable solution. But first, we'll add an action decision to complete our draft system.
Ordinarily one would start with the most capable model - o3, at this time - for a first pass, and then once correctness is established experiment with different models to analyze any tradeoffs for their business impact, and potentially consider whether they are remediable with iteration. A client may be willing to take a certain accuracy hit for lower latency or cost, or it may be more effective to change the architecture to hit cost, latency, and accuracy goals. We'll get into how to make these tradeoffs explicitly and objectively later on.
For this cookbook, o3 might be too good. We'll use o4-mini for our first pass, so that we get a few reasoning errors we can use to illustrate the means of addressing them when they occur.
Next, we need to close the loop and get to an actual decision based on receipts. This looks pretty similar, so we'll present the code without comment.
```python
from pydantic import BaseModel, Field

audit_prompt = """
Evaluate this receipt data to determine if it need to be audited based on the following
criteria:

1. NOT_TRAVEL_RELATED:
   - IMPORTANT: For this criterion, travel-related expenses include but are not limited
     to: gas, hotel, airfare, or car rental.
   - If the receipt IS for a travel-related expense, set this to FALSE.
   - If the receipt is NOT for a travel-related expense (like office supplies), set this
     to TRUE.
   - In other words, if the receipt shows FUEL/GAS, this would be FALSE because gas IS
     travel-related.

2. AMOUNT_OVER_LIMIT: The total amount exceeds $50

3. MATH_ERROR: The math for computing the total doesn't add up (line items don't sum to
   total)

4. HANDWRITTEN_X: There is an "X" in the handwritten notes

For each criterion, determine if it is violated (true) or not (false). Provide your
reasoning for each decision, and make a final determination on whether the receipt needs
auditing. A receipt needs auditing if ANY of the criteria are violated.

Return a structured response with your evaluation.
"""


class AuditDecision(BaseModel):
    not_travel_related: bool = Field(
        description="True if the receipt is not travel-related"
    )
    amount_over_limit: bool = Field(description="True if the total amount exceeds $50")
    math_error: bool = Field(description="True if there are math errors in the receipt")
    handwritten_x: bool = Field(
        description="True if there is an 'X' in the handwritten notes"
    )
    reasoning: str = Field(description="Explanation for the audit decision")
    needs_audit: bool = Field(
        description="Final determination if receipt needs auditing"
    )


async def evaluate_receipt_for_audit(
    receipt_details: ReceiptDetails, model: str = "o4-mini"
) -> AuditDecision:
    """Determine if a receipt needs to be audited based on defined criteria."""
    # Convert receipt details to JSON for the prompt
    receipt_json = receipt_details.model_dump_json(indent=2)

    response = await client.responses.parse(
        model=model,
        input=[
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": audit_prompt},
                    {"type": "input_text", "text": f"Receipt details:\n{receipt_json}"},
                ],
            }
        ],
        text_format=AuditDecision,
    )

    return response.output_parsed
```
A schematic of the process:

![Process Flowchart](../../../images/partner_process_flowchart.png)
If we run our above example through this model, here's what we get -- again, we'll use an example result here. When you run the code you might get slightly different results.
```python
audit_decision = AuditDecision(
    not_travel_related=True,
    amount_over_limit=True,
    math_error=False,
    handwritten_x=False,
    reasoning="""
    The receipt from Walmart is for office supplies, which are not travel-related,
    thus NOT_TRAVEL_RELATED is TRUE. The total amount of the receipt is $54.96, which
    exceeds the limit of $50, making AMOUNT_OVER_LIMIT TRUE. The subtotal ($50.77) plus
    tax ($4.19) correctly sums to the total ($54.96), so there is no MATH_ERROR. There
    are no handwritten notes, so HANDWRITTEN_X is FALSE. Since two criteria (amount over
    limit and travel-related) are violated, the receipt needs auditing.
    """,
    needs_audit=True,
)
```
This example illustrates why we care about end-to-end evals and why we can't use them in isolation. Here, the initial extraction had OCR errors and forwarded prices to the auditor that don't add up to the total, but the auditor fails to detect it and asserts there are no math errors. However, missing this doesn't change the audit decision because it did pick up on the other two reasons the receipt needs to be audited.
Thus, AuditDecision is factually incorrect, but the decision that we care about is correct. This gives us an edge to improve upon, but also guides us toward making sound choices for where and when we apply our engineering efforts.
With that said, let's build ourselves some evals!
Once we have a minimally functional system we should process more inputs and get domain experts to help develop ground-truth data. Domain experts doing expert tasks may not have much time to devote to our project, so we want to be efficient and start small, aiming for breadth rather than depth at first.
If your data doesn't require domain expertise, then you'd want to reach for a labeling solution (such as Label Studio) and attempt to annotate as much data as you can given the policy, budget, and data availability restrictions. In this case, we're going to proceed as if data labeling is a scarce resource; one we can rely on for small amounts each week, but these are people with other job responsibilities whose time and willingness to help may be limited. Sitting with these experts to help annotate examples can help make selecting future examples more efficient.
Because we have a chain of two steps, we'll be collecting tuples of type [FilePath, ReceiptDetails, AuditDecision]. Generally, the way to do this is to take unlabeled samples, run them through our model, and then have experts correct the output. For the purposes of this notebook, we've already gone through that process for all the receipt images in data/test.
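As a rough sketch of what that "draft labels for expert correction" step can look like (our own illustration; the directory names mirror the ground-truth layout used later in this notebook, but the exact process used to produce data/test may have differed):

```python
# Illustrative sketch: run the current system over unlabeled images and write the
# draft outputs to JSON files that domain experts can review and correct.
async def write_draft_labels(image_dir: Path, draft_dir: Path) -> None:
    (draft_dir / "extraction").mkdir(parents=True, exist_ok=True)
    (draft_dir / "audit_results").mkdir(parents=True, exist_ok=True)
    for image_path in image_dir.glob("*.jpg"):
        details = await extract_receipt_details(image_path)
        audit = await evaluate_receipt_for_audit(details)
        (draft_dir / "extraction" / f"{image_path.stem}.json").write_text(
            details.model_dump_json(indent=2)
        )
        (draft_dir / "audit_results" / f"{image_path.stem}.json").write_text(
            audit.model_dump_json(indent=2)
        )
```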
There's a little more to it than that though, because when you are evaluating a multistep process it's important to know both the end-to-end performance and the performance of each individual step, conditioned on the output of the prior step.
In this case, we want to evaluate:
- Given an input image, how well do we extract the information we need?
- Given receipt information, how good is our judgement for our audit decision?
- Given an input image, how successful are we at making our final audit decision?
The phrasing difference between #2 and #3 is because if we give our auditor incorrect data, we expect it to come to incorrect conclusions. What we want is to be confident that the auditor is making the correct decision based on the evidence available, even if that evidence is misleading. If we don't pay attention to that case, we can end up training the auditor to ignore its inputs and cause our overall performance to degrade.
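One way to make distinction #2 concrete (our own illustration, not code from the original notebook) is to also run the auditor on the expert-corrected extractions, so its judgement is measured independently of upstream extraction errors:

```python
# Illustrative sketch: measure the audit step in isolation by feeding it the
# expert-corrected extraction rather than the model's own (possibly wrong) output.
# `ground_truth_dir` follows the same layout used later in this notebook.
async def audit_from_ground_truth(image_path: Path) -> AuditDecision:
    extraction_path = ground_truth_dir / "extraction" / f"{image_path.stem}.json"
    correct_details = ReceiptDetails.model_validate_json(extraction_path.read_text())
    return await evaluate_receipt_for_audit(correct_details)
```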
The core component of an eval is the grader. Our eventual eval is going to use 18 of them, but we only use three kinds, and they're all quite conceptually straightforward.
Here are examples of one of our string check graders, one of our text similarity graders, and finally one of our model graders.
```python
example_graders = [
    {
        "name": "Total Amount Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.total }}",
        "reference": "{{ item.correct_receipt_details.total }}",
    },
    {
        "name": "Merchant Name Accuracy",
        "type": "text_similarity",
        "input": "{{ item.predicted_receipt_details.merchant }}",
        "reference": "{{ item.correct_receipt_details.merchant }}",
        "pass_threshold": 0.8,
        "evaluation_metric": "bleu",
    },
]

# A model grader needs a prompt to instruct it in what it should be scoring.
missed_items_grader_prompt = """
Your task is to evaluate the correctness of a receipt extraction model.

The following items are the actual (correct) line items from a specific receipt.

{{ item.correct_receipt_details.items }}

The following items are the line items extracted by the model.

{{ item.predicted_receipt_details.items }}

Score 0 if the sample evaluation missed any items from the receipt; otherwise score 1.
The line items are permitted to have small differences or extraction mistakes, but each
item from the actual receipt must be present in some form in the model's output. Only
evaluate whether there are MISSED items; ignore other mistakes or extra items.
"""

example_graders.append(
    {
        "name": "Missed Line Items",
        "type": "score_model",
        "model": "o4-mini",
        "input": [{"role": "system", "content": missed_items_grader_prompt}],
        "range": [0, 1],
        "pass_threshold": 1,
    }
)
```
Each grader evaluates some portion of a predicted output. This might be a very narrow check for a specific field in a structured output, or a more holistic check that judges an output in its entirety. Some graders can work without context and evaluate an output in isolation (for example, an LLM judge that is evaluating whether a paragraph is rude or inappropriate). Others can evaluate based on the input and output, while the ones we're using here rely on an output and a ground-truth (correct) output to compare against.
The most direct way of using Evals provides a prompt and a model, and lets the eval run on an input to generate output itself. Another useful method uses previously logged responses or completions as the source of the outputs. It's not quite as simple, but the most flexible thing we can do is to supply an item containing everything we want it to use—this allows us to have the "prediction" function be an arbitrary system rather than restricting it to a single model call. This is how we're using it in the examples below; the EvaluationRecord shown below will be used to populate the {{ }} template variables.
Note on Model Selection: Selecting the right model is crucial. While faster, less expensive models are often preferable in production, development workflows benefit from prioritizing the most capable models available. For this guide, we use o4-mini for both system tasks and LLM-based grading—while o3 is more capable, our experience suggests the difference in output quality is modest relative to the substantial increase in cost. In practice, spending $10+/day/engineer on evals is typical, but scaling to $100+/day/engineer may not be sustainable.
Nonetheless, it's valuable to periodically benchmark with a more advanced model like o3. If you observe significant improvements, consider incorporating it for a representative subset of your evaluation data. Discrepancies between models can reveal important edge cases and guide system improvements.
```python
import asyncio


class EvaluationRecord(BaseModel):
    """Holds both the correct (ground truth) and predicted audit decisions."""

    receipt_image_path: str
    correct_receipt_details: ReceiptDetails
    predicted_receipt_details: ReceiptDetails
    correct_audit_decision: AuditDecision
    predicted_audit_decision: AuditDecision


async def create_evaluation_record(image_path: Path, model: str) -> EvaluationRecord:
    """Create a ground truth record for a receipt image."""
    extraction_path = ground_truth_dir / "extraction" / f"{image_path.stem}.json"
    correct_details = ReceiptDetails.model_validate_json(extraction_path.read_text())
    predicted_details = await extract_receipt_details(image_path, model)

    audit_path = ground_truth_dir / "audit_results" / f"{image_path.stem}.json"
    correct_audit = AuditDecision.model_validate_json(audit_path.read_text())
    predicted_audit = await evaluate_receipt_for_audit(predicted_details, model)

    return EvaluationRecord(
        receipt_image_path=image_path.name,
        correct_receipt_details=correct_details,
        predicted_receipt_details=predicted_details,
        correct_audit_decision=correct_audit,
        predicted_audit_decision=predicted_audit,
    )


async def create_dataset_content(
    receipt_image_dir: Path, model: str = "o4-mini"
) -> list[dict]:
    # Assemble paired samples of ground truth data and predicted results. You could
    # instead upload this data as a file and pass a file id when you run the eval.
    tasks = [
        create_evaluation_record(image_path, model)
        for image_path in receipt_image_dir.glob("*.jpg")
    ]
    return [{"item": record.model_dump()} for record in await asyncio.gather(*tasks)]


file_content = await create_dataset_content(receipt_image_dir)
```
Once we have the graders and the data, creating and running our evals is very straightforward:
```python
from persist_cache import cache


# We're caching the output so that if we re-run this cell we don't create a new eval.
@cache
async def create_eval(name: str, graders: list[dict]):
    eval_cfg = await client.evals.create(
        name=name,
        data_source_config={
            "type": "custom",
            "item_schema": EvaluationRecord.model_json_schema(),
            "include_sample_schema": False,  # Don't generate new completions.
        },
        testing_criteria=graders,
    )
    print(f"Created new eval: {eval_cfg.id}")
    return eval_cfg


initial_eval = await create_eval("Initial Receipt Processing Evaluation", example_graders)

# Run the eval.
eval_run = await client.evals.runs.create(
    name="initial-receipt-processing-run",
    eval_id=initial_eval.id,
    data_source={
        "type": "jsonl",
        "source": {"type": "file_content", "content": file_content},
    },
)

print(f"Evaluation run created: {eval_run.id}")
print(f"View results at: {eval_run.report_url}")
```
After you run that eval you'll be able to view it in the UI, and should see something like the below.
(Note: if you have a Zero Data Retention agreement, this data is not stored and will not appear in this interface.)

![Eval run summary](../../../images/partner_summary_ui.png)
Evals show you where you can improve, and help track progress and regressions over time. But the three evals above are just measurements — we need to imbue them with raison d'être.
The first thing we need is to add evaluations for the final stage of our receipt processing, so that we can start seeing the results of our audit decisions. The next thing we need, the most important, is a model of business relevance.
It's almost never easy to work out what costs and benefits you could get out of a new system depending on how well it performs. Often people will avoid trying to put numbers to things because they know how much uncertainty there is and they don't want to make guesses that make them look bad. That's okay; we just have to make our best guess, and if we get more information later we can refine our model.
For this cookbook, we're going to create a simple cost structure:
- our company processes 1 million receipts a year, at a baseline cost of $0.20 / receipt
- auditing a receipt costs about $2
- failing to audit a receipt we should have audited costs an average of $30
- 5% of receipts need to be audited
- the existing process:
  - identifies receipts that need to be audited 97% of the time
  - misidentifies receipts that don't need to be audited 2% of the time
This gives us two baseline comparisons:
- if we identified every receipt correctly, we would spend $100,000 on audits
- our current process spends $135,000 on audits and loses $45,000 to un-audited expenses
On top of that, the human-driven process costs an additional $200,000.
We're expecting our service to save money by costing less to run (≈1¢/receipt if we use the prompts from above with o4-mini), but whether we save or lose money on audits and missed audits depends on how well our system performs. It might be worth writing this as a simple function — written below is a version that includes the above factors but neglects nuance and ignores development, maintenance, and serving costs.
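The original cost function isn't reproduced above, so here is a minimal sketch consistent with the numbers in this section; the function name, signature, and defaults are our own, and the notebook's real version may differ.

```python
# A sketch of a simple business-cost model using the assumptions above.
# The name and signature are ours; the original notebook's version may differ.
def estimate_audit_costs(
    true_positive_rate: float,   # fraction of audit-worthy receipts we flag
    false_positive_rate: float,  # fraction of clean receipts we flag anyway
    receipts_per_year: int = 1_000_000,
    audit_fraction: float = 0.05,     # 5% of receipts genuinely need an audit
    audit_cost: float = 2.0,          # cost per audit performed
    missed_audit_cost: float = 30.0,  # average cost of a missed audit
) -> dict[str, float]:
    needs_audit = receipts_per_year * audit_fraction
    no_audit_needed = receipts_per_year - needs_audit

    audits_performed = (
        needs_audit * true_positive_rate + no_audit_needed * false_positive_rate
    )
    missed_audits = needs_audit * (1 - true_positive_rate)

    return {
        "audit_cost": audits_performed * audit_cost,
        "missed_audit_cost": missed_audits * missed_audit_cost,
    }


# Sanity check against the baselines above: the existing human process catches 97%
# of audit-worthy receipts and flags 2% of clean ones.
print(estimate_audit_costs(true_positive_rate=0.97, false_positive_rate=0.02))
# -> roughly {'audit_cost': 135000.0, 'missed_audit_cost': 45000.0}
```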
The point of the above model is it lets us apply meaning to an eval that would otherwise just be a number. For instance, when we ran the system above we were wrong 85% of the time for merchant names. But digging in, it seems like most instances are capitalization issues or "Shell Gasoline" vs. "Shell Oil #2144" — problems that, when we follow through, do not appear to affect our audit decision or change our fundamental costs.
On the other hand, it seems like we fail to catch handwritten "X"s on receipts about half the time, and about half of the time when there's an "X" on a receipt that gets missed, it results in a receipt not getting audited when it should. Those are overrepresented in our dataset, but if that makes up even 1% of receipts, that 50% failure would cost us $75,000 a year.
Similarly, it seems like we have OCR errors that cause us to audit receipts quite often on account of the math not working out, up to 20% of the time. This could cost us almost $400,000!
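To make the back-of-envelope arithmetic behind those two estimates explicit (our own illustration, using the cost assumptions from earlier in this section):

```python
# Back-of-envelope arithmetic for the two failure modes above, using the cost
# assumptions from earlier in this section.
receipts_per_year = 1_000_000

# Handwritten "X"s: assume 1% of receipts carry one, we miss half of those, and
# about half of the misses turn into an audit we should have done but didn't.
missed_x_audits = receipts_per_year * 0.01 * 0.5 * 0.5
print(missed_x_audits * 30)  # 2,500 missed audits * $30 ≈ $75,000 / year

# OCR math errors: if up to 20% of receipts get flagged for a spurious math error,
# that's up to 200,000 unnecessary audits at $2 each.
spurious_audits = receipts_per_year * 0.20
print(spurious_audits * 2)  # ≈ $400,000 / year
```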
Now, we're in a place to add more graders and start working backwards from the audit decision accuracy to determine which problems we should focus on.
Below are the rest of our graders and the results we get with our initial un-optimized prompts. Note that at this point we do quite badly! Across our 20 samples (8 positive, 12 negative), we had two false negatives and two false positives. If we extrapolated to our entire business, we'd be losing $375,000 on audits we missed and $475,000 on unnecessary audits.
```python
simple_extraction_graders = [
    {
        "name": "Merchant Name Accuracy",
        "type": "text_similarity",
        "input": "{{ item.predicted_receipt_details.merchant }}",
        "reference": "{{ item.correct_receipt_details.merchant }}",
        "pass_threshold": 0.8,
        "evaluation_metric": "bleu",
    },
    {
        "name": "Location City Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.location.city }}",
        "reference": "{{ item.correct_receipt_details.location.city }}",
    },
    {
        "name": "Location State Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.location.state }}",
        "reference": "{{ item.correct_receipt_details.location.state }}",
    },
    {
        "name": "Location Zipcode Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.location.zipcode }}",
        "reference": "{{ item.correct_receipt_details.location.zipcode }}",
    },
    {
        "name": "Time Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.time }}",
        "reference": "{{ item.correct_receipt_details.time }}",
    },
    {
        "name": "Subtotal Amount Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.subtotal }}",
        "reference": "{{ item.correct_receipt_details.subtotal }}",
    },
    {
        "name": "Tax Amount Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.tax }}",
        "reference": "{{ item.correct_receipt_details.tax }}",
    },
    {
        "name": "Total Amount Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.total }}",
        "reference": "{{ item.correct_receipt_details.total }}",
    },
    {
        "name": "Handwritten Notes Accuracy",
        "type": "text_similarity",
        "input": "{{ item.predicted_receipt_details.handwritten_notes }}",
        "reference": "{{ item.correct_receipt_details.handwritten_notes }}",
        "pass_threshold": 0.8,
        "evaluation_metric": "fuzzy_match",
    },
]

item_extraction_base = """
Your task is to evaluate the correctness of a receipt extraction model.

The following items are the actual (correct) line items from a specific receipt.

{{ item.correct_receipt_details.items }}

The following items are the line items extracted by the model.

{{ item.predicted_receipt_details.items }}
"""

missed_items_instructions = """
Score 0 if the sample evaluation missed any items from the receipt; otherwise score 1.
The line items are permitted to have small differences or extraction mistakes, but each
item from the actual receipt must be present in some form in the model's output. Only
evaluate whether there are MISSED items; ignore other mistakes or extra items.
"""

extra_items_instructions = """
Score 0 if the sample evaluation extracted any extra items from the receipt; otherwise
score 1.
The line items are permitted to have small differences or extraction mistakes, but each
item from the actual receipt must be present in some form in the model's output. Only
evaluate whether there are EXTRA items; ignore other mistakes or missed items.
"""

item_mistakes_instructions = """
Score 0 to 10 based on the number and severity of mistakes in the line items.
A score of 10 means that the two lists are perfectly identical.
Remove 1 point for each minor mistake (typos, capitalization, category name
differences), and up to 3 points for significant mistakes (incorrect quantity, price, or
total, or categories that are not at all similar).
"""

item_extraction_graders = [
    {
        "name": "Missed Line Items",
        "type": "score_model",
        "model": "o4-mini",
        "input": [
            {
                "role": "system",
                "content": item_extraction_base + missed_items_instructions,
            }
        ],
        "range": [0, 1],
        "pass_threshold": 1,
    },
    {
        "name": "Extra Line Items",
        "type": "score_model",
        "model": "o4-mini",
        "input": [
            {
                "role": "system",
                "content": item_extraction_base + extra_items_instructions,
            }
        ],
        "range": [0, 1],
        "pass_threshold": 1,
    },
    {
        "name": "Item Mistakes",
        "type": "score_model",
        "model": "o4-mini",
        "input": [
            {
                "role": "system",
                "content": item_extraction_base + item_mistakes_instructions,
            }
        ],
        "range": [0, 10],
        "pass_threshold": 8,
    },
]

simple_audit_graders = [
    {
        "name": "Not Travel Related Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_audit_decision.not_travel_related }}",
        "reference": "{{ item.correct_audit_decision.not_travel_related }}",
    },
    {
        "name": "Amount Over Limit Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_audit_decision.amount_over_limit }}",
        "reference": "{{ item.correct_audit_decision.amount_over_limit }}",
    },
    {
        "name": "Math Error Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_audit_decision.math_error }}",
        "reference": "{{ item.correct_audit_decision.math_error }}",
    },
    {
        "name": "Handwritten X Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_audit_decision.handwritten_x }}",
        "reference": "{{ item.correct_audit_decision.handwritten_x }}",
    },
    {
        "name": "Needs Audit Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_audit_decision.needs_audit }}",
        "reference": "{{ item.correct_audit_decision.needs_audit }}",
    },
]

reasoning_eval_prompt = """
Your task is to evaluate the quality of *reasoning* for audit decisions on receipts.

Here are the rules for audit decisions:

Expenses should be audited if they violate any of the following criteria:

1. Expenses must be travel-related
2. Expenses must not exceed $50
3. All math should be correct; the line items plus tax should equal the total
4. There must not be an "X" in the handwritten notes

If ANY of those criteria are violated, the expense should be audited.

Here is the input to the grader:

{{ item.predicted_receipt_details }}

Below is the output of an authoritative grader making a decision about whether or not to
audit an expense. This is a correct reference decision.

GROUND TRUTH:

{{ item.correct_audit_decision }}

Here is the output of the model we are evaluating:

MODEL GENERATED:

{{ item.predicted_audit_decision }}

Evaluate:
1. For each of the 4 criteria, did the model correctly score it as TRUE or FALSE?
2. Based on the model's *scoring* of the criteria (regardless if it scored it correctly),
   did the model reason appropriately about the criteria (i.e. did it understand and
   apply the prompt correctly)?
3. Is the model's reasoning logically sound, sufficient, and comprehensible?
4. Is the model's reasoning concise, without extraneous details?
5. Is the final decision to audit or not audit correct?

Grade the model with the following rubric:
- (1) point for each of the 4 criteria that the model scored correctly
- (3) points for each aspect of the model's reasoning that meets the criteria
- (3) points for the model's final decision to audit or not audit

The total score is the sum of the points, and should be between 0 and 10 inclusive.
"""

model_judgement_graders = [
    {
        "name": "Audit Reasoning Quality",
        "type": "score_model",
        "model": "o4-mini",
        "input": [{"role": "system", "content": reasoning_eval_prompt}],
        "range": [0, 10],
        "pass_threshold": 8,
    },
]

full_eval = await create_eval(
    "Full Receipt Processing Evaluation",
    simple_extraction_graders
    + item_extraction_graders
    + simple_audit_graders
    + model_judgement_graders,
)

eval_run = await client.evals.runs.create(
    name="complete-receipt-processing-run",
    eval_id=full_eval.id,
    data_source={
        "type": "jsonl",
        "source": {"type": "file_content", "content": file_content},
    },
)

eval_run.report_url
```