Nanonets
Identifying the Best OCR API: Benchmarking OCR APIs on Real-World Documents

Despite the rapid progress of large language models (LLMs) and vision-language models (VLMs), OCR remains essential for business-critical applications that demand high data accuracy. Experiments show that LLMs can be unreliable at text extraction and may even hallucinate. Compared with LLMs, OCR runs efficiently on low-power devices and provides confidence scores and bounding boxes for detecting uncertainty. This article also benchmarks nine popular OCR APIs, spanning commercial solutions, open-source OCR engines, and document processing frameworks, with code examples, aiming to give developers and enterprises an objective, data-driven comparison for choosing the tool that best fits their needs.

💡 LLMs have accuracy limitations in text extraction. In our experiment, different LLMs made different errors when extracting passenger names from a PDF, including failing to read the file at all, missing names, or fabricating information.

💰 LLMs require substantial compute, which makes them costly and impractical for large-scale document processing, especially in enterprise and edge deployments. OCR, by contrast, is optimized to run efficiently on low-power devices and delivers consistent results.

✅ OCR APIs provide confidence scores and bounding boxes, letting developers detect uncertainty in the extracted text. This is critical for businesses that cannot tolerate incorrect or hallucinated data, which is why OCR APIs remain widely used in document automation workflows, AI-driven data extraction, and enterprise applications.

📊 The article benchmarks nine popular OCR APIs on datasets that span a variety of document types and challenges, including common business documents, receipts, low-resolution images, handwritten text, blurred or distorted text, rotated or skewed text, tabular data, and dense text.

With the rapid advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs), many believe OCR has become obsolete. If LLMs can "see" and "read" documents, why not use them directly for text extraction?

The answer lies in reliability. Can you always be 100% sure of the veracity of the text an LLM extracts from a document or image? We put this to the test with a simple experiment. We asked colleagues to use any LLM of their choice to extract a list of ten passenger names from a sample PDF flight ticket.

The results were quite interesting: Claude 3 Opus couldn't read the PDF at all, Claude 3.5 Sonnet missed one passenger's name, the output from ChatGPT o3-mini was completely made up (100% hallucination), and ChatPDF missed half of the passengers on the list.

Only NotebookLM and Deepseek got the list of names completely right!

While LLMs can interpret and summarize documents, they lack the precision and structured output required for critical business applications where 100% data accuracy is crucial. Additionally, LLMs require significant computational resources, making them costly and impractical for large-scale document processing, especially in enterprise and edge deployments.

OCR, on the other hand, is optimized for efficiency, running on low-power devices while delivering consistent results. When accuracy is non-negotiable, whether in financial records, legal contracts, or regulatory compliance, OCR remains the most dependable solution.

Unlike LLMs, OCR APIs provide confidence scores and bounding boxes, allowing developers to detect uncertainties in extracted text. This level of control is crucial for businesses that cannot afford incorrect or hallucinated data. That’s why OCR APIs continue to be widely used in document automation workflows, AI-driven data extraction, and enterprise applications.
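As a minimal sketch of how these signals can be consumed, here is how per-block confidence scores and bounding boxes can be read from a Google Cloud Vision response (other OCR APIs expose similar fields); the 0.8 threshold and the input path are illustrative assumptions:

from google.cloud import vision

client = vision.ImageAnnotatorClient()  # uses default credentials

# Hypothetical input image; replace with your own file
with open("sample.jpg", "rb") as f:
    image = vision.Image(content=f.read())

# document_text_detection returns a full_text_annotation with
# per-block confidence scores and bounding boxes
response = client.document_text_detection(image=image)
for page in response.full_text_annotation.pages:
    for block in page.blocks:
        # Route low-confidence regions to human review instead of
        # silently accepting potentially wrong text
        if block.confidence < 0.8:
            corners = [(v.x, v.y) for v in block.bounding_box.vertices]
            print(f"Low-confidence block ({block.confidence:.2f}) at {corners}")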

To assess the state of OCR in 2025, we benchmarked nine of the most popular OCR APIs, covering commercial solutions, open-source OCR engines, and document processing frameworks. Our goal is to provide an objective, data-driven comparison that helps developers and enterprises choose the best tool for their needs.


Methodology

Dataset Selection:

To ensure a comprehensive evaluation of OCR APIs and models in real-world scenarios, we selected datasets that cover a diverse range of document types and challenges commonly encountered in practice: common business documents, receipts, low-resolution images, handwritten text, blurred or distorted text, rotated or skewed text, tabular data, and dense text. The following datasets cover these challenges:

    STROIE: scanned receipt images with varied layouts (link to dataset)
    FUNSD: scanned business documents with form and tabular layouts (link to dataset)

These datasets provide a comprehensive testbed for evaluating OCR performance across practical, real-life scenarios.

Model Selection

To evaluate OCR performance across different scenarios, we include a mix of commercial APIs, open-source OCR models, and document processing frameworks. This ensures a balanced comparison between proprietary solutions and freely available alternatives. The models used in our benchmark are:

    Google Cloud Vision AI
    Azure AI Document Intelligence
    Amazon Textract
    Surya
    PaddleOCR
    RapidOCR
    Extractous
    Marker
    Unstructured-IO

To demonstrate how each OCR API processes an image, we provide code snippets for running OCR using both commercial APIs and open-source frameworks. These examples show how to load an image, apply OCR, and extract the text, offering a practical guide for implementation and comparison. Below are the code snippets for each model:

    Google Cloud Vision AI: The first step is to set up a new Google Cloud project. In the Google Cloud Console, navigate to APIs & Services → Library, search for Vision API, and click Enable. Go to APIs & Services → Credentials, click Create Credentials → Service Account, name it (e.g., vision-ocr-service), and click Create & Continue. Assign the Owner (or Editor) role and click Done. Now, in Service Accounts, select the account, go to Keys → Add Key → Create New Key, choose JSON, and download the .json file.

Required Packages:

pip install google-cloud-vision
from google.cloud import vision
from google.oauth2 import service_account

# Path to the service-account JSON key file downloaded earlier
credentials = service_account.Credentials.from_service_account_file(
    "/content/ocr-nanonets-cea4ddeb1dd2.json"
)
client = vision.ImageAnnotatorClient(credentials=credentials)

def detect_text(image_path):
    """Detects text in an image using Google Cloud Vision AI."""
    with open(image_path, "rb") as image_file:
        content = image_file.read()
    image = vision.Image(content=content)
    response = client.text_detection(image=image)
    # Surface API errors before reading the annotations
    if response.error.message:
        raise Exception(f"Error: {response.error.message}")
    texts = response.text_annotations
    if texts:
        return texts[0].description
    return "No text detected."

# Replace with your image path
image_path = "/content/drive/MyDrive/OCR_datasets/STROIE/test_data/img/X00016469670.jpg"
print(detect_text(image_path))
    Azure AI Document Intelligence: Create an Azure Account (Azure Portal) to get $200 free credits for 30 days. In the Azure Portal, go to Create a Resource, search for Azure AI Document Intelligence (Form Recognizer), and click Create. Choose a Subscription, Resource Group, Region (nearest to you), set Pricing Tier to Free (if available) or Standard, then click Review + Create → Create. Once created, go to the Azure AI Document Intelligence resource, navigate to Keys and Endpoint, and copy the API Key and Endpoint.

Required Packages:

pip install azure-ai-documentintelligence
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

# Replace with your Azure endpoint and API key
AZURE_ENDPOINT = "https://your-region.api.cognitive.microsoft.com/"
AZURE_KEY = "your-api-key"

client = DocumentIntelligenceClient(AZURE_ENDPOINT, AzureKeyCredential(AZURE_KEY))

def extract_text(image_path):
    """Extracts text from an image using Azure AI Document Intelligence."""
    with open(image_path, "rb") as image_file:
        image_data = image_file.read()
    # The prebuilt-read model performs OCR on images and documents
    poller = client.begin_analyze_document("prebuilt-read", document=image_data)
    result = poller.result()
    extracted_text = []
    for page in result.pages:
        for line in page.lines:
            extracted_text.append(line.content)
    print("Detected text:")
    print("\n".join(extracted_text))

image_path = "path/to/your/image.jpg"  # replace with your image path
extract_text(image_path)
    Amazon Textract: Create an AWS Account (AWS Sign-Up) to access Amazon Textract's free-tier (1,000 pages/month for 3 months). In the AWS Management Console, go to IAM (Identity & Access Management) → Users → Create User, name it (e.g., textract-user), and select Programmatic Access. Under Permissions, attach AmazonTextractFullAccess and AmazonS3ReadOnlyAccess (if using S3). Click Create User and copy the Access Key ID and Secret Access Key.

Required Packages:

pip install boto3

Set Environment Variables:

export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_REGION="your-region"
import boto3

textract = boto3.client("textract", region_name="us-east-1")

def extract_text(image_path):
    """Extracts text from an image using Amazon Textract."""
    with open(image_path, "rb") as image_file:
        image_bytes = image_file.read()
    response = textract.detect_document_text(Document={"Bytes": image_bytes})
    extracted_text = []
    # LINE blocks contain the recognized text, one entry per line
    for item in response["Blocks"]:
        if item["BlockType"] == "LINE":
            extracted_text.append(item["Text"])
    print("Detected text:")
    print("\n".join(extracted_text))

image_path = "path/to/your/image.jpg"  # replace with your image path
extract_text(image_path)
    Surya: Use pip install surya-ocr to install the required packages. Then create a Python file with the following code and run it in the terminal.

from PIL import Image
from surya.recognition import RecognitionPredictor
from surya.detection import DetectionPredictor

image_path = "path/to/your/image.jpg"  # replace with your image path
image = Image.open(image_path)
langs = ["en"]
# Detection finds text regions; recognition transcribes them
recognition_predictor = RecognitionPredictor()
detection_predictor = DetectionPredictor()
predictions = recognition_predictor([image], [langs], detection_predictor)
    PaddleOCR: Use pip install paddleocr paddlepaddle to install the required packages. Then create a Python file with the following code and run it in the terminal.

from paddleocr import PaddleOCR

image_path = "path/to/your/image.jpg"  # replace with your image path
# use_angle_cls enables the angle classifier for rotated text
ocr = PaddleOCR(use_angle_cls=True, lang="en")
result = ocr.ocr(image_path, cls=True)
    RapidOCR: Use pip install rapidocr_onnxruntime to install the required packages. Then create a Python file with the following code and run it in the terminal.

from rapidocr_onnxruntime import RapidOCR

engine = RapidOCR()
image_path = "path/to/your/image.jpg"  # replace with your image path
result, elapse = engine(image_path)  # returns recognized text and timing info
    Extractous: Use pip install extractous to install the Python bindings and sudo apt install tesseract-ocr tesseract-ocr-deu to install the Tesseract OCR engine it relies on. Then create a Python file with the following code and run it in the terminal.

from extractous import Extractor, TesseractOcrConfig

image_path = "path/to/your/image.jpg"  # replace with your image path
extractor = Extractor().set_ocr_config(TesseractOcrConfig().set_language("en"))
result, metadata = extractor.extract_file_to_string(image_path)
print(result)
    Marker: Use pip install marker-pdf to install the required packages. Then run the following command in the terminal.

marker_single image_path --output_dir saving_directory --output_format json
    Unstructured-IO: Use pip install "unstructured[image]" to install the required packages. Then create a Python file with the following code and run it in the terminal.

from unstructured.partition.auto import partition

image_path = "path/to/your/image.jpg"  # replace with your image path
elements = partition(filename=image_path)
print("\n\n".join(str(el) for el in elements))

Evaluation Metrics

To assess the effectiveness of each OCR model, we evaluate both accuracy and performance using the following metrics:

    Word Error Rate (WER): the proportion of words inserted, deleted, or substituted relative to the ground-truth transcription.
    Character Error Rate (CER): the same measure computed at the character level.
    ROUGE Score: the content overlap between the extracted text and the ground truth, independent of word position.

For performance evaluation, we measure:

    Latency per image: the time each model takes to run OCR on a single image.

Cost Evaluation:

    For commercial APIs, we report the usage cost per 1,000 images processed; for open-source models, we report memory usage per image as a proxy for cost.
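As a reference for how the accuracy metrics can be computed, here is a minimal sketch using the open-source jiwer package (pip install jiwer); the sample strings are made up, and our benchmark scripts may use a different implementation:

import jiwer

reference = "TOTAL AMOUNT DUE 45.80"   # ground-truth transcription
hypothesis = "TOTAL AMOUNT DUE 45.B0"  # OCR output with one bad character

wer = jiwer.wer(reference, hypothesis)  # word-level edits / reference words
cer = jiwer.cer(reference, hypothesis)  # character-level edits / reference chars
print(f"WER: {wer:.2%}, CER: {cer:.2%}")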

Benchmarking Results

Since the datasets used—STROIE (different receipt images) and FUNSD (business documents with tabular layouts)—contain diverse layout styles, the extracted text varies across models based on their ability to preserve structure. This variation affects the Word Error Rate (WER) and Character Error Rate (CER), as these metrics depend on the position of words and characters in the output.

A high error rate indicates that a model struggles to maintain the chronological order of text, especially in complex layouts and tabular formats.

1. Word Error Rate

The WER of each model on the FUNSD and STROIE datasets is presented below. These results highlight how well each model preserves word order across different document layouts.

Word Error Rate on the FUNSD dataset
Word Error Rate on the STROIE dataset

2. Character Error Rate

The CER of each model on the FUNSD and STROIE datasets is presented below. These results indicate how accurately each model captures character-level text while handling different document layouts.

Character Error Rate on the FUNSD dataset
Character Error Rate on the STROIE dataset

Why are the WER and CER metrics for Surya and Marker so high on the STROIE dataset?

STROIE’s intricate layouts make OCR difficult. Surya tries to fill gaps by inserting extra words, leading to high WER and CER, even after post-processing. Marker, which uses Surya for OCR and outputs markdown text, inherits these issues. The markdown formatting further misaligns text, worsening the error rates.

Variation in Model Performance Across Datasets

OCR models perform differently based on dataset structure. Google Cloud Vision AI and Azure AI Document Intelligence handle diverse layouts better, while open-source models like RapidOCR and Surya struggle with structured formats, leading to more errors.

Since the models struggle to preserve layouts, leading to high WER and CER, we use another metric—ROUGE Score—to assess text similarity between the model's output and the ground truth. Unlike WER and CER, ROUGE focuses on content similarity rather than word position. This means that regardless of layout preservation, a high ROUGE score indicates that the extracted text closely matches the reference, while a low score suggests significant content discrepancies.
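As a reference, here is a minimal sketch of computing a ROUGE score with Google's rouge-score package (pip install rouge-score); the specific ROUGE variant used in our benchmark is an assumption here:

from rouge_score import rouge_scorer

# ROUGE-1 measures unigram overlap between the OCR output and the
# ground truth, ignoring where the words appear in the layout
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
scores = scorer.score("ground truth text", "text extracted by the ocr model")
print(f"ROUGE-1 F1: {scores['rouge1'].fmeasure:.3f}")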

3. ROUGE Score

The ROUGE score of each model on the FUNSD and STROIE datasets is presented below. These results reflect the content similarity between the extracted text and the ground truth, regardless of layout preservation.

ROUGE Score on the FUNSD dataset
ROUGE Score on the STROIE dataset

The ROUGE scores reveal that Google Cloud Vision AI consistently outperforms other models across both FUNSD (75.0%) and STROIE (87.8%), indicating superior text extraction. Surya and Marker, which rely on the same backend, show comparable performance, though Marker slightly lags on STROIE (70.3%). Extractous and Unstructured-IO score the lowest in both datasets, suggesting weaker text coherence. PaddleOCR and Azure AI Document Intelligence achieve balanced results, making them competitive alternatives. The overall trend highlights the strength of commercial APIs, while open-source models exhibit mixed performance.

If you’d like to run the models yourself and compute the evaluation scores, you can use this GitHub repository. The repository includes scripts for running the OCR APIs on the datasets and calculating WER, CER, and ROUGE scores.

4. Latency per image

Latency per image for each model is presented below. This measures the time taken by each model to perform OCR on one image, providing insights into their efficiency and processing speed.
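As a minimal sketch, per-image latency can be measured by wrapping any of the OCR calls shown earlier in a timer; here we assume the detect_text helper from the Google Cloud Vision snippet, and in practice you would average over many images:

import time

start = time.perf_counter()
detect_text(image_path)  # any OCR call from the snippets above
elapsed = time.perf_counter() - start
print(f"Latency: {elapsed:.2f} s per image")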

Latency per image or Inference time per image

The latency analysis shows that Google Cloud Vision AI, Amazon Textract, and Extractous maintain a good balance between speed and accuracy. Surya and PaddleOCR exhibit notably higher inference times, suggesting potential inefficiencies. Open-source models like RapidOCR and Marker vary in performance, with some offering competitive speeds while others lag behind. Azure AI Document Intelligence shows moderate latency, making it a viable choice depending on the use case.

5. Cost or memory usage per image

For commercial APIs, we present the usage cost (cost per 1000 images processed). For open-source models, the metric indicates memory consumption as a proxy for cost, providing insights into their resource efficiency.
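As one possible approach (the benchmark's exact measurement method is not shown here), process memory around an OCR call can be sampled with the psutil package (pip install psutil):

import os
import psutil

process = psutil.Process(os.getpid())
rss_before = process.memory_info().rss  # resident set size before OCR

result, elapse = engine(image_path)  # e.g., the RapidOCR engine from earlier

rss_after = process.memory_info().rss
print(f"Memory delta: {(rss_after - rss_before) / 1024**2:.1f} MiB")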

OCR API                        | Cost per 1,000 Pages
-------------------------------|---------------------
Google Cloud Vision AI         | $1.50
Amazon Textract                | $1.50
Azure AI Document Intelligence | $0.50
Memory usage per image

Among open-source models, Marker and Unstructured-IO have significantly higher memory consumption, which may impact deployment in resource-constrained environments. Surya and Extractous strike a balance between performance and memory efficiency. PaddleOCR and RapidOCR are the most lightweight options, making them ideal for low-memory scenarios.

Conclusion

Based on the evaluation across accuracy (ROUGE score), latency, and cost, no single model dominates in all aspects. However, some models stand out in specific areas:

Best Model Overall

Considering the balance between accuracy, speed, and efficiency, Google Cloud Vision AI emerges as the best overall performer. It provides strong accuracy with competitive inference time. Among open-source models, PaddleOCR and RapidOCR offer the best trade-off between accuracy, speed, and memory efficiency.

Leaderboard of Best OCR APIs based on different performance metrics:

Metric                                   | Best Model                     | Score / Value
-----------------------------------------|--------------------------------|------------------------
Highest Accuracy (ROUGE Score)           | Google Cloud Vision AI         | Best ROUGE score
Best Layout Handling (Lowest WER & CER)  | Google Cloud Vision AI         | Lowest WER & CER
Fastest OCR (Lowest Latency)             | Extractous                     | Lowest processing time
Most Memory-Efficient                    | RapidOCR                       | Least memory usage
Most Cost-Effective Commercial API       | Azure AI Document Intelligence | Lowest cost per page

LLM vs. Dedicated OCR: A Case Study

To understand how OCR models compare to Large Language Models (LLMs) in text extraction, we tested a challenging image using both LLaMa 3.2 11B Vision and RapidOCR, a small but dedicated OCR model.

Results:

    LLaMa 3.2 11B Vision
        Struggled with faint text, failing to reconstruct certain words.
        Misinterpreted some characters and added hallucinated words.
        Took significantly longer to process the image.
        Used a lot of compute resources.
    RapidOCR
        Accurately extracted most of the text despite the difficult conditions.
        Ran efficiently on very low compute resources.

Is OCR Still Relevant Today?

With the rise of multimodal LLMs capable of interpreting images and text, some believe OCR may become obsolete. However, the reality is more nuanced.

If you or your end customers need to be 100% sure of the data you're extracting from documents or images, OCR is still your best bet for now! Confidence scores and bounding boxes from OCR APIs can be used to infer when the output is not reliable.

With LLMs you can never be 100% sure of the veracity of the text output because of hallucinations and the lack of confidence scores.

Who Still Needs OCR?

Why Should Enterprises Still Care About OCR When Everyone Wants LLMs?

    Accuracy and Reliability: LLMs generate hallucinations, while OCR ensures precise text extraction, making it critical for legal, financial, and government applications.
    Speed and Cost Efficiency: OCR is lightweight and works on edge devices, while LLMs require high compute resources and incur cloud inference costs.
    The future is not OCR vs. LLMs, it is OCR and LLMs: OCR extracts clean text, and LLMs then process and interpret it for insights (see the sketch after this list). AI-powered OCR models will continue to improve, integrating LLM reasoning for better post-processing.
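Here is a minimal sketch of that OCR-plus-LLM pattern: a lightweight OCR engine extracts the raw text, and an LLM interprets it afterwards. The model name, prompt, and input file are illustrative assumptions, not the setup used in this benchmark:

from rapidocr_onnxruntime import RapidOCR
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY to be set

# Step 1: OCR extracts clean text cheaply and deterministically
engine = RapidOCR()
result, _ = engine("invoice.jpg")  # hypothetical input image
ocr_text = "\n".join(line[1] for line in (result or []))

# Step 2: the LLM reasons over the already-extracted text
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Extract the invoice total from this OCR text:\n{ocr_text}",
    }],
)
print(response.choices[0].message.content)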

Final Thoughts

While LLMs have expanded the possibilities of text extraction from images, OCR remains indispensable for structured, high-accuracy text retrieval and will always be crucial for reliable document processing. Rather than replacing OCR, LLMs will complement it, bringing better understanding, context, and automation to extracted data.
