This cookbook provides a practical, end-to-end guide on how to effectively use evals as the core process in creating a production-grade autonomous system to replace a labor-intensive human workflow. It's a direct product of collaborative experience dealing with projects where users may not have started with pristine labeled data or a perfect understanding of the problem - two issues that most tutorials gloss over but are in practice almost always serious challenges.
Making evals the core process prevents poke-and-hope guesswork and impressionistic judgments of accuracy, instead demanding engineering rigor. This means we can make principled decisions about cost trade-offs and investment.
This guide is designed for ML/AI engineers and Solution Architects who are looking for practical guidance beyond introductory tutorials. This notebook is fully executable and organized to be as modular as possible to support using code samples directly in your own applications.
We'll follow a realistic storyline: replacing a manual receipt-analysis service for validating expenses.
- Start Small: Begin with a very small set of labeled data (retail receipts). Many businesses don't have good ground truth data sets.
- Build Incrementally: Develop a minimal viable system and establish initial evals.
- Business Alignment: Evaluate eval performance in the context of business KPIs and dollar impact, and target efforts to avoid working on low-impact improvements.
- Eval-Driven Iteration: Iteratively improve by using eval scores to power model improvements, then by using better models on more data to expand evals and identify more areas for improvement.
This cookbook is structured as an eval-centric guide through the lifecycle of building an LLM application.
- If you're primarily interested in the ideas presented, read through the text and skim over the code.
- If you're here because of something else you're working on, you can go ahead and jump to that section and dig into the code there, copy it, and adapt it to your needs.
- If you want to really understand how this all works, download this notebook and run the cells as you read through it; edit the code to make your own changes, test your hypotheses, and make sure you actually understand how it all works together.
Note: If your OpenAI organization has a Zero Data Retention (ZDR) policy, Evals dashboards and logs will not be available, since prompts and responses are not stored. This may limit visibility into eval results for compliance-focused enterprise accounts.
In order to condense this guide we'll be using a small hypothetical problem that's still complex enough to merit detailed and multi-faceted evals. In particular, we'll be focused on how to solve a problem given a limited amount of data to work with, so we're working with a dataset that's quite small.
For this guide, we assume that we are starting with a workflow for reviewing and filing receipts. While in general this is a problem that already has a lot of established solutions, it's analogous to other problems that don't have nearly so much prior work; further, even when good enterprise solutions exist there is often still a "last mile" problem that still requires human time.
In our case, we'll assume we have a pipeline where:
- People upload photos of receipts
- An accounting team reviews each receipt to categorize and approve or audit the expense
From interviews with the accounting team, we know they make their decisions based on:
- Merchant
- Geographic location
- Expense amount
- Items or services purchased
- Handwritten notes or annotations
Our system will be expected to handle most receipts without any human intervention, but escalate low-confidence decisions for human QA. We'll be focused on reducing the total cost of the accounting process, which is dependent on:
- How much the previous / current system cost to run per-receipt
- How many receipts the new system sends to QA
- How much the system costs to run per-receipt, plus any fixed costs
- What the business impact is of mistakes, either receipts kicked out for review or mistakes missed
- The cost of engineering to develop and integrate the system
The receipt images come from the CC BY 4.0 licensed Receipt Handwriting Detection Computer Vision Project dataset published by Roboflow. We've added our own labels and narrative spin in order to tell a story with a small number of examples.
Not every project will proceed in the same way, but projects generally have some important components in common.
The solid arrows show the primary progressions or steps, while the dotted line represents the ongoing nature of problem understanding - uncovering more about the customer domain will influence every step of the process. We will examine several of these iterative cycles of refinement in detail below.
Usually, the decision to start an engineering process is made by leadership who understand the business impact but don't need to know the process details. In our example, we're building a system designed to replace a non-AI workflow. In a sense this is ideal: we have a set of domain experts, the people currently doing the task, who we can interview to understand the task details and who we can lean upon to help develop appropriate evals.
This step doesn't end before we start building our system; invariably, our initial assessments are an incomplete understanding of the problem space and we will continue to refine our understanding as we get closer to a solution.
It's very rare for a real-world project to begin with all the data necessary to achieve a satisfactory solution, let alone establish confidence.
In our case, we'll assume we have a decent sample of system inputs, in the form of receipt images, but start without any fully annotated data. We find this is a not-unusual situation when automating an existing process. We'll walk through the process of incrementally expanding our test and training sets in collaboration with domain experts as we go along and make our evals progressively more comprehensive.
We want to get the skeleton of a system built as quickly as possible. We don't need a system that performs well - we just need something that accepts the right inputs and provides outputs of the correct type. Usually this is almost as simple as describing the task in a prompt, adding the inputs, and using a single model (usually with structured outputs) to make an initial best-effort attempt.
We've found that in the absence of an established ground truth, it's not uncommon to use an early version of a system to generate 'draft' truth data which can be annotated or corrected by domain experts.
Once we have an end-to-end system constructed, we can start processing the inputs we have to generate plausible outputs. We'll send these to our domain experts to grade and correct. We will use these corrections and conversations about how the experts are making their decisions to design further evals and to embed expertise in the system.
Before we jump into correcting every error, we need to make sure that we're investing time effectively. The most critical task at this stage is to review our evals and gain an understanding of how they connect to our key objectives.
- Step back and assess the potential costs and benefits of the system
- Identify which eval measurements speak directly to those costs and benefits
  - For example, what does "failure" on a particular eval cost? Are we measuring something worthwhile?
- Create a (non-LLM) model that uses eval metrics to provide a dollar value
- Balance performance (accuracy, or speed) with cost to develop and run
Having identified which efforts are most worth making, we can begin iterating on improvements to the system. The evals act as an objective guide so we know when we've made the system good enough, and ensure we avoid or identify regression.
Evals aren't just for development. Instrumenting all or a portion of a production service will surface more useful test and training samples over time, identifying incorrect assumptions or finding areas with insufficient coverage. This is also the only way you can ensure that your models continue performing well long after your initial development process is complete.
In practice, we would probably be building a system that operates via a REST API, possibly with some web frontend that would have access to some set of components and resources. For the purposes of this cookbook, we'll distill that down to a pair of functions, extract_receipt_details and evaluate_receipt_for_audit, that collectively decide what we should do with a given receipt.
- extract_receipt_details will take an image as input and produce structured output containing important details about the receipt.
- evaluate_receipt_for_audit will take that structure as input and decide whether or not the receipt should be audited.
Breaking up a process into steps like this has both pros and cons; it is easier to examine and develop if the process is made up of small isolated steps. But you can progressively lose information, effectively letting your agents play "telephone". In this notebook we break up the steps and don't let the auditor see the actual receipt because it's more instructive for the evals we want to discuss.
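To make the intended shape of the pipeline concrete, here is a minimal sketch of how the two steps chain together. The `process_receipt` wrapper is our own illustrative name (it is not part of the original notebook), and it assumes the `extract_receipt_details` and `evaluate_receipt_for_audit` functions defined later in this cookbook.

```python
# A minimal sketch of the two-step pipeline, assuming the extract_receipt_details
# and evaluate_receipt_for_audit functions defined later in this notebook.
# `process_receipt` is an illustrative helper, not original cookbook code.
async def process_receipt(image_path: str):
    details = await extract_receipt_details(image_path)  # image -> ReceiptDetails
    return await evaluate_receipt_for_audit(details)     # ReceiptDetails -> AuditDecision
```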
We'll start with the first step, the literal data extraction. This is intermediate data: it's information that people would examine implicitly, but often isn't recorded. And for this reason, we often don't have labeled data to work from.
```python
%pip install --upgrade openai pydantic python-dotenv rich persist-cache -qqq

%load_ext dotenv
%dotenv

# Place your API key in a file called .env
# OPENAI_API_KEY=sk-...
```
Capture the meaningful information in a structured output.
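The notebook defines a `ReceiptDetails` Pydantic model at this point; since the definition isn't reproduced above, here is a minimal sketch consistent with the fields referenced by the extraction prompt and graders later in this cookbook. The exact field types and the names inside `LineItem` are our assumptions, not the original definitions.

```python
from pydantic import BaseModel

# A sketch of the structured extraction output, assuming fields that match the
# prompt and graders used later (merchant, location, time, items, subtotal, tax,
# total, handwritten_notes). LineItem field names are illustrative assumptions.
class Location(BaseModel):
    city: str | None
    state: str | None
    zipcode: str | None

class LineItem(BaseModel):
    description: str | None
    product_code: str | None
    category: str | None
    item_price: str | None
    sale_price: str | None
    quantity: str | None
    total: str | None

class ReceiptDetails(BaseModel):
    merchant: str | None
    location: Location
    time: str | None
    items: list[LineItem]
    subtotal: str | None
    tax: str | None
    total: str | None
    handwritten_notes: list[str]
```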
Note: Normally we would use decimal.Decimal objects for the numbers above and datetime.datetime objects for the time field, but neither of those deserializes well. For the purposes of this cookbook, we'll work with strings, but in practice you'd want to have another level of translation to get correctly validated output.
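As a rough illustration of that extra translation layer (our own sketch, not code from the original notebook), you might convert the string fields into richer types at the boundary of your system:

```python
from decimal import Decimal, InvalidOperation
from datetime import datetime

# Illustrative helpers for converting string fields at the edge of the system;
# these names are ours, not from the original notebook.
def parse_money(value: str | None) -> Decimal | None:
    if value is None:
        return None
    try:
        return Decimal(value.replace("$", "").replace(",", ""))
    except InvalidOperation:
        return None

def parse_timestamp(value: str | None) -> datetime | None:
    if value is None:
        return None
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None
```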
Let's build our extract_receipt_details function.
Usually, for the very first stab at something that might work, we'll simply feed ChatGPT the available documents we've assembled so far and ask it to generate a prompt. It's not worth spending too much time on prompt engineering before you have a benchmark to grade yourself against! This is a prompt produced by o4-mini based on the problem description above.
```python
BASIC_PROMPT = """
Given an image of a retail receipt, extract all relevant information and format it as a
structured response.

# Task Description

Carefully examine the receipt image and identify the following key information:

1. Merchant name and any relevant store identification
2. Location information (city, state, ZIP code)
3. Date and time of purchase
4. All purchased items with their:
   * Item description/name
   * Item code/SKU (if present)
   * Category (infer from context if not explicit)
   * Regular price per item (if available)
   * Sale price per item (if discounted)
   * Quantity purchased
   * Total price for the line item
5. Financial summary:
   * Subtotal before tax
   * Tax amount
   * Final total
6. Any handwritten notes or annotations on the receipt (list each separately)

## Important Guidelines

* If information is unclear or missing, return null for that field
* Format dates as ISO format (YYYY-MM-DDTHH:MM:SS)
* Format all monetary values as decimal numbers
* Distinguish between printed text and handwritten notes
* Be precise with amounts and totals
* For ambiguous items, use your best judgment based on context

Your response should be structured and complete, capturing all available information
from the receipt.
"""
```
```python
import base64
import mimetypes
from pathlib import Path

from openai import AsyncOpenAI

client = AsyncOpenAI()


async def extract_receipt_details(
    image_path: str, model: str = "o4-mini"
) -> ReceiptDetails:
    """Extract structured details from a receipt image."""
    # Determine image type for data URI.
    mime_type, _ = mimetypes.guess_type(image_path)

    # Read and base64 encode the image.
    b64_image = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    image_data_url = f"data:{mime_type};base64,{b64_image}"

    response = await client.responses.parse(
        model=model,
        input=[
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": BASIC_PROMPT},
                    {"type": "input_image", "image_url": image_data_url},
                ],
            }
        ],
        text_format=ReceiptDetails,
    )

    return response.output_parsed
```
Let's evaluate just a single receipt and review it manually to see how well a smart model with a naive prompt can do.
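The cell that actually runs the extraction isn't shown above; a minimal sketch of doing so for a single image might look like the following. The specific file name is a placeholder assumption, not a real path from the dataset.

```python
from rich import print as rprint

# Run the extraction on a single receipt and inspect it by hand.
# The file name below is a placeholder; use any image from your data directory.
receipt_image_path = "data/test/example_receipt.jpg"
extracted = await extract_receipt_details(receipt_image_path)
rprint(extracted)
```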
The model extracted a lot of things correctly, but renamed some of the line items - incorrectly, in fact. More importantly, it got some of the prices wrong, and it decided not to categorize any of the line items.
That's okay, we don't expect to have perfect answers at this point! Instead, our objective is to build a basic system we can evaluate. Then, when we start iterating, we won't be 'vibing' our way to something that looks better -- we'll be engineering a reliable solution. But first, we'll add an action decision to complete our draft system.
Ordinarily one would start with the most capable model - o3, at this time - for a first pass, and then once correctness is established experiment with different models to analyze any tradeoffs for their business impact, and potentially consider whether they are remediable with iteration. A client may be willing to take a certain accuracy hit for lower latency or cost, or it may be more effective to change the architecture to hit cost, latency, and accuracy goals. We'll get into how to make these tradeoffs explicitly and objectively later on.
For this cookbook, o3 might be too good. We'll use o4-mini for our first pass, so that we get a few reasoning errors we can use to illustrate the means of addressing them when they occur.
Next, we need to close the loop and get to an actual decision based on receipts. This looks pretty similar, so we'll present the code without comment.
```python
from pydantic import BaseModel, Field

audit_prompt = """
Evaluate this receipt data to determine if it need to be audited based on the following
criteria:

1. NOT_TRAVEL_RELATED:
   - IMPORTANT: For this criterion, travel-related expenses include but are not limited
     to: gas, hotel, airfare, or car rental.
   - If the receipt IS for a travel-related expense, set this to FALSE.
   - If the receipt is NOT for a travel-related expense (like office supplies), set this
     to TRUE.
   - In other words, if the receipt shows FUEL/GAS, this would be FALSE because gas IS
     travel-related.

2. AMOUNT_OVER_LIMIT: The total amount exceeds $50

3. MATH_ERROR: The math for computing the total doesn't add up (line items don't sum to
   total)

4. HANDWRITTEN_X: There is an "X" in the handwritten notes

For each criterion, determine if it is violated (true) or not (false). Provide your
reasoning for each decision, and make a final determination on whether the receipt needs
auditing. A receipt needs auditing if ANY of the criteria are violated.

Return a structured response with your evaluation.
"""


class AuditDecision(BaseModel):
    not_travel_related: bool = Field(
        description="True if the receipt is not travel-related"
    )
    amount_over_limit: bool = Field(description="True if the total amount exceeds $50")
    math_error: bool = Field(description="True if there are math errors in the receipt")
    handwritten_x: bool = Field(
        description="True if there is an 'X' in the handwritten notes"
    )
    reasoning: str = Field(description="Explanation for the audit decision")
    needs_audit: bool = Field(
        description="Final determination if receipt needs auditing"
    )


async def evaluate_receipt_for_audit(
    receipt_details: ReceiptDetails, model: str = "o4-mini"
) -> AuditDecision:
    """Determine if a receipt needs to be audited based on defined criteria."""
    # Convert receipt details to JSON for the prompt
    receipt_json = receipt_details.model_dump_json(indent=2)

    response = await client.responses.parse(
        model=model,
        input=[
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": audit_prompt},
                    {"type": "input_text", "text": f"Receipt details:\n{receipt_json}"},
                ],
            }
        ],
        text_format=AuditDecision,
    )

    return response.output_parsed
```
A schematic of the process:

![Process Flowchart](../../../images/partner_process_flowchart.png)
If we run our above example through this model, here's what we get -- again, we'll use an example result here. When you run the code you might get slightly different results.
```python
audit_decision = AuditDecision(
    not_travel_related=True,
    amount_over_limit=True,
    math_error=False,
    handwritten_x=False,
    reasoning="""
    The receipt from Walmart is for office supplies, which are not travel-related,
    thus NOT_TRAVEL_RELATED is TRUE. The total amount of the receipt is $54.96, which
    exceeds the limit of $50, making AMOUNT_OVER_LIMIT TRUE. The subtotal ($50.77) plus
    tax ($4.19) correctly sums to the total ($54.96), so there is no MATH_ERROR. There
    are no handwritten notes, so HANDWRITTEN_X is FALSE. Since two criteria (amount over
    limit and travel-related) are violated, the receipt needs auditing.
    """,
    needs_audit=True,
)
```
This example illustrates why we care about end-to-end evals and why we can't use them in isolation. Here, the initial extraction had OCR errors and forwarded prices to the auditor that don't add up to the total, but the auditor fails to detect it and asserts there are no math errors. However, missing this doesn't change the audit decision because it did pick up on the other two reasons the receipt needs to be audited.
Thus, AuditDecision is factually incorrect, but the decision that we care about is correct. This gives us an edge to improve upon, but also guides us toward making sound choices for where and when we apply our engineering efforts.
With that said, let's build ourselves some evals!
Once we have a minimally functional system we should process more inputs and get domain experts to help develop ground-truth data. Domain experts doing expert tasks may not have much time to devote to our project, so we want to be efficient and start small, aiming for breadth rather than depth at first.
If your data doesn't require domain expertise, then you'd want to reach for a labeling solution (such as Label Studio) and attempt to annotate as much data as you can given the policy, budget, and data availability restrictions. In this case, we're going to proceed as if data labeling is a scarce resource; one we can rely on for small amounts each week, but these are people with other job responsibilities whose time and willingness to help may be limited. Sitting with these experts to help annotate examples can help make selecting future examples more efficient.
Because we have a chain of two steps, we'll be collecting tuples of type [FilePath, ReceiptDetails, AuditDecision]. Generally, the way to do this is to take unlabeled samples, run them through our model, and then have experts correct the output. For the purposes of this notebook, we've already gone through that process for all the receipt images in data/test.
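As a rough sketch of what that "draft labels for expert correction" step can look like (our own illustration; the directory names mirror the ground-truth layout used later in this notebook, but the exact process used to produce data/test may have differed):

```python
# Illustrative sketch: run the current system over unlabeled images and write the
# draft outputs to JSON files that domain experts can review and correct.
async def write_draft_labels(image_dir: Path, draft_dir: Path) -> None:
    (draft_dir / "extraction").mkdir(parents=True, exist_ok=True)
    (draft_dir / "audit_results").mkdir(parents=True, exist_ok=True)
    for image_path in image_dir.glob("*.jpg"):
        details = await extract_receipt_details(image_path)
        audit = await evaluate_receipt_for_audit(details)
        (draft_dir / "extraction" / f"{image_path.stem}.json").write_text(
            details.model_dump_json(indent=2)
        )
        (draft_dir / "audit_results" / f"{image_path.stem}.json").write_text(
            audit.model_dump_json(indent=2)
        )
```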
There's a little more to it than that though, because when you are evaluating a multistep process it's important to know both the end-to-end performance and the performance of each individual step, conditioned on the output of the prior step.
In this case, we want to evaluate:
- Given an input image, how well do we extract the information we need?
- Given receipt information, how good is our judgement for our audit decision?
- Given an input image, how successful are we at making our final audit decision?
The phrasing difference between #2 and #3 is because if we give our auditor incorrect data, we expect it to come to incorrect conclusions. What we want is to be confident that the auditor is making the correct decision based on the evidence available, even if that evidence is misleading. If we don't pay attention to that case, we can end up training the auditor to ignore its inputs and cause our overall performance to degrade.
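One way to make distinction #2 concrete (our own illustration, not code from the original notebook) is to also run the auditor on the expert-corrected extractions, so its judgement is measured independently of upstream extraction errors:

```python
# Illustrative sketch: measure the audit step in isolation by feeding it the
# expert-corrected extraction rather than the model's own (possibly wrong) output.
# `ground_truth_dir` follows the same layout used later in this notebook.
async def audit_from_ground_truth(image_path: Path) -> AuditDecision:
    extraction_path = ground_truth_dir / "extraction" / f"{image_path.stem}.json"
    correct_details = ReceiptDetails.model_validate_json(extraction_path.read_text())
    return await evaluate_receipt_for_audit(correct_details)
```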
The core component of an eval is the grader. Our eventual eval is going to use 18 of them, but we only use three kinds, and they're all quite conceptually straightforward.
Here are examples of one of our string check graders, one of our text similarity graders, and finally one of our model graders.
```python
example_graders = [
    {
        "name": "Total Amount Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.total }}",
        "reference": "{{ item.correct_receipt_details.total }}",
    },
    {
        "name": "Merchant Name Accuracy",
        "type": "text_similarity",
        "input": "{{ item.predicted_receipt_details.merchant }}",
        "reference": "{{ item.correct_receipt_details.merchant }}",
        "pass_threshold": 0.8,
        "evaluation_metric": "bleu",
    },
]

# A model grader needs a prompt to instruct it in what it should be scoring.
missed_items_grader_prompt = """
Your task is to evaluate the correctness of a receipt extraction model.

The following items are the actual (correct) line items from a specific receipt.

{{ item.correct_receipt_details.items }}

The following items are the line items extracted by the model.

{{ item.predicted_receipt_details.items }}

Score 0 if the sample evaluation missed any items from the receipt; otherwise score 1.
The line items are permitted to have small differences or extraction mistakes, but each
item from the actual receipt must be present in some form in the model's output. Only
evaluate whether there are MISSED items; ignore other mistakes or extra items.
"""

example_graders.append(
    {
        "name": "Missed Line Items",
        "type": "score_model",
        "model": "o4-mini",
        "input": [{"role": "system", "content": missed_items_grader_prompt}],
        "range": [0, 1],
        "pass_threshold": 1,
    }
)
```
Each grader evaluates some portion of a predicted output. This might be a very narrow check for a specific field in a structured output, or a more holistic check that judges an output in its entirety. Some graders can work without context and evaluate an output in isolation (for example, an LLM judge that is evaluating whether a paragraph is rude or inappropriate). Others can evaluate based on the input and output, while the ones we're using here rely on an output and a ground-truth (correct) output to compare against.
The most direct way of using Evals provides a prompt and a model, and lets the eval run on an input to generate output itself. Another useful method uses previously logged responses or completions as the source of the outputs. It's not quite as simple, but the most flexible thing we can do is to supply an item containing everything we want it to use—this allows us to have the "prediction" function be an arbitrary system rather than restricting it to a single model call. This is how we're using it in the examples below; the EvaluationRecord shown below will be used to populate the {{ }} template variables.
Note on Model Selection: Selecting the right model is crucial. While faster, less expensive models are often preferable in production, development workflows benefit from prioritizing the most capable models available. For this guide, we use o4-mini for both system tasks and LLM-based grading—while o3 is more capable, our experience suggests the difference in output quality is modest relative to the substantial increase in cost. In practice, spending $10+/day/engineer on evals is typical, but scaling to $100+/day/engineer may not be sustainable.
Nonetheless, it's valuable to periodically benchmark with a more advanced model like o3. If you observe significant improvements, consider incorporating it for a representative subset of your evaluation data. Discrepancies between models can reveal important edge cases and guide system improvements.
```python
import asyncio


class EvaluationRecord(BaseModel):
    """Holds both the correct (ground truth) and predicted audit decisions."""

    receipt_image_path: str
    correct_receipt_details: ReceiptDetails
    predicted_receipt_details: ReceiptDetails
    correct_audit_decision: AuditDecision
    predicted_audit_decision: AuditDecision


async def create_evaluation_record(image_path: Path, model: str) -> EvaluationRecord:
    """Create a ground truth record for a receipt image."""
    extraction_path = ground_truth_dir / "extraction" / f"{image_path.stem}.json"
    correct_details = ReceiptDetails.model_validate_json(extraction_path.read_text())
    predicted_details = await extract_receipt_details(image_path, model)

    audit_path = ground_truth_dir / "audit_results" / f"{image_path.stem}.json"
    correct_audit = AuditDecision.model_validate_json(audit_path.read_text())
    predicted_audit = await evaluate_receipt_for_audit(predicted_details, model)

    return EvaluationRecord(
        receipt_image_path=image_path.name,
        correct_receipt_details=correct_details,
        predicted_receipt_details=predicted_details,
        correct_audit_decision=correct_audit,
        predicted_audit_decision=predicted_audit,
    )


async def create_dataset_content(
    receipt_image_dir: Path, model: str = "o4-mini"
) -> list[dict]:
    # Assemble paired samples of ground truth data and predicted results. You could
    # instead upload this data as a file and pass a file id when you run the eval.
    tasks = [
        create_evaluation_record(image_path, model)
        for image_path in receipt_image_dir.glob("*.jpg")
    ]
    return [{"item": record.model_dump()} for record in await asyncio.gather(*tasks)]


file_content = await create_dataset_content(receipt_image_dir)
```
Once we have the graders and the data, creating and running our evals is very straightforward:
```python
from persist_cache import cache


# We're caching the output so that if we re-run this cell we don't create a new eval.
@cache
async def create_eval(name: str, graders: list[dict]):
    eval_cfg = await client.evals.create(
        name=name,
        data_source_config={
            "type": "custom",
            "item_schema": EvaluationRecord.model_json_schema(),
            "include_sample_schema": False,  # Don't generate new completions.
        },
        testing_criteria=graders,
    )
    print(f"Created new eval: {eval_cfg.id}")
    return eval_cfg


initial_eval = await create_eval("Initial Receipt Processing Evaluation", example_graders)

# Run the eval.
eval_run = await client.evals.runs.create(
    name="initial-receipt-processing-run",
    eval_id=initial_eval.id,
    data_source={
        "type": "jsonl",
        "source": {"type": "file_content", "content": file_content},
    },
)

print(f"Evaluation run created: {eval_run.id}")
print(f"View results at: {eval_run.report_url}")
```
After you run that eval you'll be able to view it in the UI, and should see something like the below.
(Note: if you have a Zero Data Retention agreement, this data is not stored and will not appear in this interface.)

![Eval run summary](../../../images/partner_summary_ui.png)
Evals show you where you can improve, and help track progress and regressions over time. But the three evals above are just measurements — we need to imbue them with raison d'être.
The first thing we need is to add evaluations for the final stage of our receipt processing, so that we can start seeing the results of our audit decisions. The next thing we need, the most important, is a model of business relevance.
It's almost never easy to work out what costs and benefits you could get out of a new system depending on how well it performs. Often people will avoid trying to put numbers to things because they know how much uncertainty there is and they don't want to make guesses that make them look bad. That's okay; we just have to make our best guess, and if we get more information later we can refine our model.
For this cookbook, we're going to create a simple cost structure:
- our company processes 1 million receipts a year, at a baseline cost of $0.20 / receipt
- auditing a receipt costs about $2
- failing to audit a receipt we should have audited costs an average of $30
- 5% of receipts need to be audited
- the existing process:
  - identifies receipts that need to be audited 97% of the time
  - misidentifies receipts that don't need to be audited 2% of the time
This gives us two baseline comparisons:
- if we identified every receipt correctly, we would spend $100,000 on audits
- our current process spends $135,000 on audits and loses $45,000 to un-audited expenses
On top of that, the human-driven process costs an additional $200,000.
We're expecting our service to save money by costing less to run (≈1¢/receipt if we use the prompts from above with o4-mini), but whether we save or lose money on audits and missed audits depends on how well our system performs. It might be worth writing this as a simple function — written below is a version that includes the above factors but neglects nuance and ignores development, maintenance, and serving costs.
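The original cost function isn't reproduced above, so here is a minimal sketch consistent with the numbers in this section; the function name, signature, and defaults are our own, and the notebook's real version may differ.

```python
# A sketch of a simple business-cost model using the assumptions above.
# The name and signature are ours; the original notebook's version may differ.
def estimate_audit_costs(
    true_positive_rate: float,   # fraction of audit-worthy receipts we flag
    false_positive_rate: float,  # fraction of clean receipts we flag anyway
    receipts_per_year: int = 1_000_000,
    audit_fraction: float = 0.05,     # 5% of receipts genuinely need an audit
    audit_cost: float = 2.0,          # cost per audit performed
    missed_audit_cost: float = 30.0,  # average cost of a missed audit
) -> dict[str, float]:
    needs_audit = receipts_per_year * audit_fraction
    no_audit_needed = receipts_per_year - needs_audit

    audits_performed = (
        needs_audit * true_positive_rate + no_audit_needed * false_positive_rate
    )
    missed_audits = needs_audit * (1 - true_positive_rate)

    return {
        "audit_cost": audits_performed * audit_cost,
        "missed_audit_cost": missed_audits * missed_audit_cost,
    }


# Sanity check against the baselines above: the existing human process catches 97%
# of audit-worthy receipts and flags 2% of clean ones.
print(estimate_audit_costs(true_positive_rate=0.97, false_positive_rate=0.02))
# -> roughly {'audit_cost': 135000.0, 'missed_audit_cost': 45000.0}
```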
The point of the above model is it lets us apply meaning to an eval that would otherwise just be a number. For instance, when we ran the system above we were wrong 85% of the time for merchant names. But digging in, it seems like most instances are capitalization issues or "Shell Gasoline" vs. "Shell Oil #2144" — problems that, when we follow through, do not appear to affect our audit decision or change our fundamental costs.
On the other hand, it seems like we fail to catch handwritten "X"s on receipts about half the time, and about half of the time when there's an "X" on a receipt that gets missed, it results in a receipt not getting audited when it should. Those are overrepresented in our dataset, but if that makes up even 1% of receipts, that 50% failure would cost us $75,000 a year.
Similarly, it seems like we have OCR errors that cause us to audit receipts quite often on account of the math not working out, up to 20% of the time. This could cost us almost $400,000!
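To make the back-of-envelope arithmetic behind those two estimates explicit (our own illustration, using the cost assumptions from earlier in this section):

```python
# Back-of-envelope arithmetic for the two failure modes above, using the cost
# assumptions from earlier in this section.
receipts_per_year = 1_000_000

# Handwritten "X"s: assume 1% of receipts carry one, we miss half of those, and
# about half of the misses turn into an audit we should have done but didn't.
missed_x_audits = receipts_per_year * 0.01 * 0.5 * 0.5
print(missed_x_audits * 30)  # 2,500 missed audits * $30 ≈ $75,000 / year

# OCR math errors: if up to 20% of receipts get flagged for a spurious math error,
# that's up to 200,000 unnecessary audits at $2 each.
spurious_audits = receipts_per_year * 0.20
print(spurious_audits * 2)  # ≈ $400,000 / year
```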
Now, we're in a place to add more graders and start working backwards from the audit decision accuracy to determine which problems we should focus on.
Below are the rest of our graders and the results we get with our initial un-optimized prompts. Note that at this point we do quite badly! Across our 20 samples (8 positive, 12 negative), we had two false negatives and two false positives. If we extrapolated to our entire business, we'd be losing $375,000 on audits we missed and $475,000 on unnecessary audits.
```python
simple_extraction_graders = [
    {
        "name": "Merchant Name Accuracy",
        "type": "text_similarity",
        "input": "{{ item.predicted_receipt_details.merchant }}",
        "reference": "{{ item.correct_receipt_details.merchant }}",
        "pass_threshold": 0.8,
        "evaluation_metric": "bleu",
    },
    {
        "name": "Location City Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.location.city }}",
        "reference": "{{ item.correct_receipt_details.location.city }}",
    },
    {
        "name": "Location State Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.location.state }}",
        "reference": "{{ item.correct_receipt_details.location.state }}",
    },
    {
        "name": "Location Zipcode Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.location.zipcode }}",
        "reference": "{{ item.correct_receipt_details.location.zipcode }}",
    },
    {
        "name": "Time Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.time }}",
        "reference": "{{ item.correct_receipt_details.time }}",
    },
    {
        "name": "Subtotal Amount Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.subtotal }}",
        "reference": "{{ item.correct_receipt_details.subtotal }}",
    },
    {
        "name": "Tax Amount Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.tax }}",
        "reference": "{{ item.correct_receipt_details.tax }}",
    },
    {
        "name": "Total Amount Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_receipt_details.total }}",
        "reference": "{{ item.correct_receipt_details.total }}",
    },
    {
        "name": "Handwritten Notes Accuracy",
        "type": "text_similarity",
        "input": "{{ item.predicted_receipt_details.handwritten_notes }}",
        "reference": "{{ item.correct_receipt_details.handwritten_notes }}",
        "pass_threshold": 0.8,
        "evaluation_metric": "fuzzy_match",
    },
]

item_extraction_base = """
Your task is to evaluate the correctness of a receipt extraction model.

The following items are the actual (correct) line items from a specific receipt.

{{ item.correct_receipt_details.items }}

The following items are the line items extracted by the model.

{{ item.predicted_receipt_details.items }}
"""

missed_items_instructions = """
Score 0 if the sample evaluation missed any items from the receipt; otherwise score 1.
The line items are permitted to have small differences or extraction mistakes, but each
item from the actual receipt must be present in some form in the model's output. Only
evaluate whether there are MISSED items; ignore other mistakes or extra items.
"""

extra_items_instructions = """
Score 0 if the sample evaluation extracted any extra items from the receipt; otherwise
score 1.
The line items are permitted to have small differences or extraction mistakes, but each
item from the actual receipt must be present in some form in the model's output. Only
evaluate whether there are EXTRA items; ignore other mistakes or missed items.
"""

item_mistakes_instructions = """
Score 0 to 10 based on the number and severity of mistakes in the line items.
A score of 10 means that the two lists are perfectly identical.
Remove 1 point for each minor mistake (typos, capitalization, category name
differences), and up to 3 points for significant mistakes (incorrect quantity, price, or
total, or categories that are not at all similar).
"""

item_extraction_graders = [
    {
        "name": "Missed Line Items",
        "type": "score_model",
        "model": "o4-mini",
        "input": [
            {
                "role": "system",
                "content": item_extraction_base + missed_items_instructions,
            }
        ],
        "range": [0, 1],
        "pass_threshold": 1,
    },
    {
        "name": "Extra Line Items",
        "type": "score_model",
        "model": "o4-mini",
        "input": [
            {
                "role": "system",
                "content": item_extraction_base + extra_items_instructions,
            }
        ],
        "range": [0, 1],
        "pass_threshold": 1,
    },
    {
        "name": "Item Mistakes",
        "type": "score_model",
        "model": "o4-mini",
        "input": [
            {
                "role": "system",
                "content": item_extraction_base + item_mistakes_instructions,
            }
        ],
        "range": [0, 10],
        "pass_threshold": 8,
    },
]

simple_audit_graders = [
    {
        "name": "Not Travel Related Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_audit_decision.not_travel_related }}",
        "reference": "{{ item.correct_audit_decision.not_travel_related }}",
    },
    {
        "name": "Amount Over Limit Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_audit_decision.amount_over_limit }}",
        "reference": "{{ item.correct_audit_decision.amount_over_limit }}",
    },
    {
        "name": "Math Error Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_audit_decision.math_error }}",
        "reference": "{{ item.correct_audit_decision.math_error }}",
    },
    {
        "name": "Handwritten X Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_audit_decision.handwritten_x }}",
        "reference": "{{ item.correct_audit_decision.handwritten_x }}",
    },
    {
        "name": "Needs Audit Accuracy",
        "type": "string_check",
        "operation": "eq",
        "input": "{{ item.predicted_audit_decision.needs_audit }}",
        "reference": "{{ item.correct_audit_decision.needs_audit }}",
    },
]

reasoning_eval_prompt = """
Your task is to evaluate the quality of *reasoning* for audit decisions on receipts.

Here are the rules for audit decisions:

Expenses should be audited if they violate any of the following criteria:

1. Expenses must be travel-related
2. Expenses must not exceed $50
3. All math should be correct; the line items plus tax should equal the total
4. There must not be an "X" in the handwritten notes

If ANY of those criteria are violated, the expense should be audited.

Here is the input to the grader:

{{ item.predicted_receipt_details }}

Below is the output of an authoritative grader making a decision about whether or not to
audit an expense. This is a correct reference decision.

GROUND TRUTH:

{{ item.correct_audit_decision }}

Here is the output of the model we are evaluating:

MODEL GENERATED:

{{ item.predicted_audit_decision }}

Evaluate:
1. For each of the 4 criteria, did the model correctly score it as TRUE or FALSE?
2. Based on the model's *scoring* of the criteria (regardless if it scored it correctly),
   did the model reason appropriately about the criteria (i.e. did it understand and
   apply the prompt correctly)?
3. Is the model's reasoning logically sound, sufficient, and comprehensible?
4. Is the model's reasoning concise, without extraneous details?
5. Is the final decision to audit or not audit correct?

Grade the model with the following rubric:
- (1) point for each of the 4 criteria that the model scored correctly
- (3) points for each aspect of the model's reasoning that meets the criteria
- (3) points for the model's final decision to audit or not audit

The total score is the sum of the points, and should be between 0 and 10 inclusive.
"""

model_judgement_graders = [
    {
        "name": "Audit Reasoning Quality",
        "type": "score_model",
        "model": "o4-mini",
        "input": [{"role": "system", "content": reasoning_eval_prompt}],
        "range": [0, 10],
        "pass_threshold": 8,
    },
]

full_eval = await create_eval(
    "Full Receipt Processing Evaluation",
    simple_extraction_graders
    + item_extraction_graders
    + simple_audit_graders
    + model_judgement_graders,
)

eval_run = await client.evals.runs.create(
    name="complete-receipt-processing-run",
    eval_id=full_eval.id,
    data_source={
        "type": "jsonl",
        "source": {"type": "file_content", "content": file_content},
    },
)

eval_run.report_url
```