Optical Character Recognition (OCR) has revolutionized the way we interact with textual data, enabling machines to read and interpret text from images, scanned documents, and handwritten notes. From digitizing books and automating data entry to real-time text translation in augmented reality, OCR applications are incredibly diverse and impactful. Some of its applications include:
- Document Digitization: Converts physical documents into editable and searchable digital formats.
- Invoice Scanning: Extracts details like amounts, dates, and vendor names for automated processing.
- Data Entry Automation: Speeds up workflows by extracting text from forms and receipts.
- Real-Time Translation: Translates foreign text from images or video streams in augmented reality.
- License Plate Recognition: Identifies vehicles in traffic systems and parking management.
- Accessibility Tools: Converts text to speech for visually impaired individuals.
- Archiving and Preservation: Digitizes historical documents for storage and research.
In this post, we take OCR a step further by building a custom OCR model for recognizing text in the Wingdings font—a symbolic font with unique characters often used in creative and technical contexts. While traditional OCR models are trained for standard text, this custom model bridges the gap for niche applications, unlocking possibilities for translating symbolic text into readable English, whether for accessibility, design, or archival purposes. Through this, we demonstrate the power of OCR to adapt and cater to specialized use cases in the modern world.
For developers and managers looking to streamline document workflows such as OCR extraction and beyond, tools like the Nanonets PDF AI offer valuable integration options. Coupled with state-of-the-art LLM capabilities, these tools can significantly enhance your workflows, ensuring efficient data handling. Additionally, tools like Nanonets’ PDF Summarizer can further automate processes by summarizing lengthy documents.
Is There a Need for Custom OCR in the Age of Vision-Language Models?
Vision-language models, such as Flamingo and Qwen2-VL, have revolutionized how machines understand images and text by bridging the gap between the two modalities. They can process and reason about images and associated text in a more generalized manner.
Despite their impressive capabilities, there remains a need for custom OCR systems in specific scenarios, primarily due to:
- Accuracy for Specific Languages or Scripts: Many vision-language models focus on widely-used languages. Custom OCR can address low-resource or regional languages, including Indic scripts, calligraphy, or underrepresented dialects.
- Lightweight and Resource-Constrained Environments: Custom OCR models can be optimized for edge devices with limited computational power, such as embedded systems or mobile applications. Vision-language models, in contrast, are often too resource-intensive for such use cases. For real-time or high-volume applications, such as invoice processing or automated document analysis, custom OCR solutions can be tailored for speed and accuracy.
- Data Privacy and Security: Certain industries, such as healthcare or finance, require OCR solutions that operate offline or within private infrastructures to meet strict data privacy regulations. Custom OCR ensures compliance, whereas cloud-based vision-language models might introduce security concerns.
- Cost-Effectiveness: Deploying and fine-tuning massive vision-language models can be cost-prohibitive for small-scale businesses or specific projects. Custom OCR can be a more affordable and focused alternative.
Build a Custom OCR Model for Wingdings
To explore the potential of custom OCR systems, we will build an OCR engine specifically for the Wingdings font.

Below are the steps and components we will follow:
- Generate a custom dataset of Wingdings font images paired with their corresponding labels in English words.
- Create a custom OCR model capable of recognizing symbols in the Wingdings font. We will use the Vision Transformer for Scene Text Recognition (ViTSTR), a state-of-the-art architecture designed for scene text recognition. Unlike traditional CNN-based models, ViTSTR leverages the transformer architecture, which excels at capturing long-range dependencies in images, making it ideal for recognizing complex text structures, including the intricate patterns of the Wingdings font.
- Train the model on the custom dataset of Wingdings symbols.
- Test the model on unseen data to evaluate its accuracy.
For this project, we will utilize Google Colab for training, leveraging its 16 GB T4 GPU for faster computation.
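Before training, it is worth confirming that the GPU runtime is actually active. The snippet below is a minimal sanity check (not part of the original notebook) using PyTorch:

import torch

# Confirm that Colab's GPU runtime (e.g. a T4) is visible to PyTorch.
if torch.cuda.is_available():
    print("GPU available:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected; training will fall back to the CPU.")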
Creating a Wingdings Dataset
What is Wingdings?
Wingdings is a symbolic font developed by Microsoft that consists of a collection of icons, shapes, and pictograms instead of traditional alphanumeric characters. Introduced in 1990, Wingdings maps keyboard inputs to graphical symbols, such as arrows, smiley faces, checkmarks, and other decorative icons. It is often used for design purposes, visual communication, or as a playful font in digital content.

Due to its symbolic nature, interpreting Wingdings text programmatically poses a challenge, making it an interesting use case for custom OCR systems.
Dataset Creation
Since no existing dataset is available for Optical Character Recognition (OCR) in Wingdings font, we created one from scratch. The process involves generating images of words in the Wingdings font and mapping them to their corresponding English words.
To achieve this, we used the Wingdings Translator to convert English words into their Wingdings representations. For each converted word, an image was manually generated and stored in a folder named "wingdings_word_images".
Additionally, we created a "metadata.csv" file to maintain a structured record of the dataset. This file contains two columns:
- Image Path: Specifies the file path for each image in the dataset.
- English Word: Lists the corresponding English word for each Wingdings representation.
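Since the images were produced manually via the translator, the generation step itself is not part of the dataset. As a rough illustration, the sketch below shows how such image-label pairs could instead be generated programmatically with Pillow, assuming a local copy of the Wingdings TrueType font (wingdings.ttf) and a plain word list (words.txt); both files are assumptions rather than artifacts from the original project:

import csv
import os
from PIL import Image, ImageDraw, ImageFont

WORDS_FILE = "words.txt"      # hypothetical list with one English word per line
FONT_PATH = "wingdings.ttf"   # hypothetical local copy of the Wingdings font
OUTPUT_DIR = "wingdings_word_images"

os.makedirs(OUTPUT_DIR, exist_ok=True)
font = ImageFont.truetype(FONT_PATH, 48)

with open(WORDS_FILE) as f:
    words = [w.strip() for w in f if w.strip()]

rows = []
for word in words:
    # Wingdings re-maps ordinary characters to symbols, so rendering the English
    # word with the Wingdings font yields its symbolic representation.
    left, top, right, bottom = font.getbbox(word)
    img = Image.new("RGB", (right - left + 40, bottom - top + 40), "white")
    ImageDraw.Draw(img).text((20 - left, 20 - top), word, font=font, fill="black")
    path = os.path.join(OUTPUT_DIR, f"{word}.png")
    img.save(path)
    rows.append({"image_path": path, "english_word_label": word})

with open("metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["image_path", "english_word_label"])
    writer.writeheader()
    writer.writerows(rows)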
The dataset can be downloaded from this link.
Preprocessing the Dataset
The images in the dataset vary in size due to the manual creation process. To ensure uniformity and compatibility with OCR models, we preprocess the images by resizing and padding them.
import pandas as pd
import numpy as np
from PIL import Image
import os
from tqdm import tqdm

def pad_image(image, target_size=(224, 224)):
    """Pad image to target size while maintaining aspect ratio"""
    if image.mode != 'RGB':
        image = image.convert('RGB')

    # Get current size
    width, height = image.size

    # Calculate padding
    aspect_ratio = width / height
    if aspect_ratio > 1:
        # Width is larger
        new_width = target_size[0]
        new_height = int(new_width / aspect_ratio)
    else:
        # Height is larger
        new_height = target_size[1]
        new_width = int(new_height * aspect_ratio)

    # Resize image maintaining aspect ratio
    image = image.resize((new_width, new_height), Image.Resampling.LANCZOS)

    # Create new image with padding
    new_image = Image.new('RGB', target_size, (255, 255, 255))

    # Paste resized image in center
    paste_x = (target_size[0] - new_width) // 2
    paste_y = (target_size[1] - new_height) // 2
    new_image.paste(image, (paste_x, paste_y))

    return new_image

# Read the metadata
df = pd.read_csv('metadata.csv')

# Create output directory for processed images
processed_dir = 'processed_images'
os.makedirs(processed_dir, exist_ok=True)

# Process each image
new_paths = []
for idx, row in tqdm(df.iterrows(), total=len(df), desc="Processing images"):
    # Load image
    img_path = row['image_path']
    img = Image.open(img_path)

    # Pad image
    processed_img = pad_image(img)

    # Save processed image
    new_path = os.path.join(processed_dir, f'processed_{os.path.basename(img_path)}')
    processed_img.save(new_path)
    new_paths.append(new_path)

# Update dataframe with new paths
df['processed_image_path'] = new_paths
df.to_csv('processed_metadata.csv', index=False)

print("Image preprocessing completed!")
print(f"Total images processed: {len(df)}")
First, each image is resized to a fixed height while maintaining its aspect ratio to preserve the visual structure of the Wingdings characters. Next, we apply padding to make all images the same dimensions, typically a square shape, to fit the input requirements of neural networks. The padding is added symmetrically around the resized image, with the background color matching the original image's background.
Splitting the Dataset
The dataset is divided into three subsets: training (70%), validation (dev) (15%), and testing (15%). The training set is used to teach the model, the validation set helps fine-tune hyperparameters and monitor overfitting, and the test set evaluates the model’s performance on unseen data. This random split ensures each subset is diverse and representative, promoting effective generalization.
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the processed metadata
df = pd.read_csv('processed_metadata.csv')

# First split: train and temporary
train_df, temp_df = train_test_split(df, train_size=0.7, random_state=42)

# Second split: validation and test from temporary
val_df, test_df = train_test_split(temp_df, train_size=0.5, random_state=42)

# Save splits to CSV
train_df.to_csv('train.csv', index=False)
val_df.to_csv('val.csv', index=False)
test_df.to_csv('test.csv', index=False)

print("Data split statistics:")
print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
print(f"Test samples: {len(test_df)}")
Visualizing the Dataset
To better understand the dataset, we visualize samples from each split. Specifically, we display 5 examples from the training set, 5 from the validation set, and 5 from the test set. Each visualization includes the Wingdings text as an image alongside its corresponding label in English. This step provides a clear overview of the data distribution across the splits and ensures the correctness of the dataset mappings.
import matplotlib.pyplot as plt
from PIL import Image
import pandas as pd

def plot_samples(df, num_samples=5, title="Sample Images"):
    # Set larger font sizes
    plt.rcParams.update({
        'font.size': 14,          # Base font size
        'axes.titlesize': 16,     # Subplot title font size
        'figure.titlesize': 20    # Main title font size
    })

    fig, axes = plt.subplots(1, num_samples, figsize=(20, 4))
    fig.suptitle(title, fontsize=20, y=1.05)

    # Randomly sample images
    sample_df = df.sample(n=num_samples)

    for idx, (_, row) in enumerate(sample_df.iterrows()):
        img = Image.open(row['processed_image_path'])
        axes[idx].imshow(img)
        axes[idx].set_title(f"Label: {row['english_word_label']}", fontsize=16, pad=10)
        axes[idx].axis('off')

    plt.tight_layout()
    plt.show()

# Load splits
train_df = pd.read_csv('train.csv')
val_df = pd.read_csv('val.csv')
test_df = pd.read_csv('test.csv')

# Plot samples from each split
plot_samples(train_df, title="Training Samples")
plot_samples(val_df, title="Validation Samples")
plot_samples(test_df, title="Test Samples")
Samples from the data are visualised below:

Train an OCR Model
First we need to import the required libraries and dependencies:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image
import pandas as pd
from tqdm import tqdm
Model Training with ViTSTR
We use a Vision Encoder-Decoder model, specifically ViTSTR (Vision Transformer for Scene Text Recognition). We fine-tune it for our Wingdings OCR task. The encoder processes the Wingdings text images using a ViT (Vision Transformer) backbone, while the decoder generates the corresponding English word labels.

During training, the model learns to map pixel-level information from the images to meaningful English text. The training and validation losses are monitored to assess model performance and ensure it generalizes well. After training, the fine-tuned model is saved for inference on unseen Wingdings text images. We use pre-trained components from Hugging Face for our OCR pipeline and fine-tune them: the ViTImageProcessor prepares images for the Vision Transformer (ViT) encoder, the bert-base-uncased tokenizer processes English text labels for the decoder, and the VisionEncoderDecoderModel, which combines a ViT encoder with a GPT-2 decoder and was pre-trained for image captioning, is fine-tuned to learn the Wingdings-to-English mapping.
class WingdingsDataset(Dataset):
    def __init__(self, csv_path, processor, tokenizer):
        self.df = pd.read_csv(csv_path)
        self.processor = processor
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(row['processed_image_path'])
        label = row['english_word_label']

        # Process image
        pixel_values = self.processor(image, return_tensors="pt").pixel_values

        # Process label
        encoding = self.tokenizer(
            label,
            padding="max_length",
            max_length=16,
            truncation=True,
            return_tensors="pt"
        )

        return {
            'pixel_values': pixel_values.squeeze(),
            'labels': encoding.input_ids.squeeze(),
            'text': label
        }

def train_epoch(model, dataloader, optimizer, device):
    model.train()
    total_loss = 0
    progress_bar = tqdm(dataloader, desc="Training")

    for batch in progress_bar:
        pixel_values = batch['pixel_values'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(pixel_values=pixel_values, labels=labels)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        progress_bar.set_postfix({'loss': loss.item()})

    return total_loss / len(dataloader)

def validate(model, dataloader, device):
    model.eval()
    total_loss = 0

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Validating"):
            pixel_values = batch['pixel_values'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(pixel_values=pixel_values, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()

    return total_loss / len(dataloader)

# Initialize models and tokenizers
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

# Create datasets
train_dataset = WingdingsDataset('train.csv', processor, tokenizer)
val_dataset = WingdingsDataset('val.csv', processor, tokenizer)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

# Setup training
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
num_epochs = 20  # (change according to need)

# Training loop
for epoch in range(num_epochs):
    print(f"\nEpoch {epoch+1}/{num_epochs}")

    train_loss = train_epoch(model, train_loader, optimizer, device)
    val_loss = validate(model, val_loader, device)

    print(f"Training Loss: {train_loss:.4f}")
    print(f"Validation Loss: {val_loss:.4f}")

# Save the model
model.save_pretrained('wingdings_ocr_model')
print("\nTraining completed and model saved!")
The training is carried out for 20 epochs in Google Colab. Although this gives a fair result, the number of epochs is a hyperparameter and can be increased to attain better accuracy. Dropout, image augmentation, and batch normalization are a few more techniques one can experiment with to ensure the model is not overfitting (a minimal augmentation sketch follows the training log below). The training stats and the loss and accuracy curves for the train and validation sets on the first and last epochs are given below:
Epoch 1/20
Training: 100%|██████████| 22/22 [00:36<00:00, 1.64s/it, loss=1.13]
Validating: 100%|██████████| 5/5 [00:02<00:00, 1.71it/s]
Training Loss: 2.2776
Validation Loss: 1.0183
...
Epoch 20/20
Training: 100%|██████████| 22/22 [00:35<00:00, 1.61s/it, loss=0.0316]
Validating: 100%|██████████| 5/5 [00:02<00:00, 1.73it/s]
Training Loss: 0.0246
Validation Loss: 0.5970

Training completed and model saved!
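As one example of the augmentation mentioned above, a mild torchvision transform pipeline could be applied to the PIL image inside WingdingsDataset.__getitem__ (training split only) before the processor is called. This is a minimal sketch under that assumption, not part of the original training script:

from torchvision import transforms

# Mild geometric and photometric jitter; kept small so the symbols stay legible.
train_augment = transforms.Compose([
    transforms.RandomRotation(degrees=5, fill=(255, 255, 255)),
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05), fill=(255, 255, 255)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

# Inside __getitem__, before calling self.processor:
# image = train_augment(image)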

Using the Saved Model
Once the model has been trained and saved, you can easily load it for inference on new Wingdings images. The test.csv file created during preprocessing is used to create the test_dataset. Here’s the code to load the saved model and prepare the test data:
# Load the trained model
model = VisionEncoderDecoderModel.from_pretrained('wingdings_ocr_model')
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Create test dataset and dataloader
test_dataset = WingdingsDataset('test.csv', processor, tokenizer)
test_loader = DataLoader(test_dataset, batch_size=32)
Model Evaluation
After training, we evaluate the model on the test split to measure how well it performs on unseen data. To gain further insight, we randomly select 10 samples from the test split and, for each one, display the true label (English word) alongside the model's prediction and check whether they match.
import seaborn as sns
import matplotlib.pyplot as plt
from PIL import Image
from sklearn.metrics import accuracy_score

def plot_prediction_samples(image_paths, true_labels, pred_labels, num_samples=10):
    # Set figure size and font sizes
    plt.rcParams.update({
        'font.size': 14,
        'axes.titlesize': 18,
        'figure.titlesize': 22
    })

    # Calculate grid dimensions
    num_rows = 2
    num_cols = 5
    num_samples = min(num_samples, len(image_paths))

    # Create figure
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(20, 8))
    fig.suptitle('Sample Predictions from Test Set', fontsize=22, y=1.05)

    # Flatten axes for easier indexing
    axes_flat = axes.flatten()

    for i in range(num_samples):
        ax = axes_flat[i]

        # Load and display image
        img = Image.open(image_paths[i])
        ax.imshow(img)

        # Create label text
        true_text = f"True: {true_labels[i]}"
        pred_text = f"Pred: {pred_labels[i]}"

        # Set color based on correctness
        color = 'green' if true_labels[i] == pred_labels[i] else 'red'

        # Add text above image
        ax.set_title(f"{true_text}\n{pred_text}",
                     fontsize=14,
                     color=color,
                     pad=10,
                     bbox=dict(facecolor='white',
                               alpha=0.8,
                               edgecolor='none',
                               pad=3))

        # Remove axes
        ax.axis('off')

    # Remove any empty subplots
    for i in range(num_samples, num_rows * num_cols):
        fig.delaxes(axes_flat[i])

    plt.tight_layout()
    plt.show()

# Evaluation
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()

predictions = []
ground_truth = []
image_paths = []

with torch.no_grad():
    for batch in tqdm(test_loader, desc="Evaluating"):
        pixel_values = batch['pixel_values'].to(device)
        texts = batch['text']

        outputs = model.generate(pixel_values)
        pred_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)

        predictions.extend(pred_texts)
        ground_truth.extend(texts)

image_paths.extend([row['processed_image_path'] for _, row in test_dataset.df.iterrows()])

# Calculate and print accuracy
accuracy = accuracy_score(ground_truth, predictions)
print(f"\nTest Accuracy: {accuracy:.4f}")

# Display sample predictions in grid
print("\nDisplaying sample predictions:")
plot_prediction_samples(image_paths, ground_truth, predictions)
The evaluation gives the following output:

Analysing the output given by the model, we find that the predictions match the reference labels fairly well. Although the last prediction is correct, it is displayed in red because of extra spaces in the generated text.
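If such mismatches are only due to incidental whitespace or casing, a simple normalization step before scoring avoids penalizing otherwise correct predictions. This is a small optional sketch, not part of the original evaluation code:

from sklearn.metrics import accuracy_score

def normalize(text):
    # Collapse whitespace and lowercase, so "check " and "check" compare equal.
    return " ".join(text.lower().split())

norm_accuracy = accuracy_score(
    [normalize(t) for t in ground_truth],
    [normalize(p) for p in predictions],
)
print(f"Whitespace/case-insensitive accuracy: {norm_accuracy:.4f}")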
All the code and the dataset used above can be found in this GitHub repository, and the end-to-end training can be found in the following Colab notebook.
Discussion
Looking at the outputs, it is clear that the model performs well: the predicted labels are accurate, and the visual comparison with the true labels demonstrates its strong capability in recognizing the correct classes.
The model's excellent performance could be attributed to the robust architecture of the Vision Transformer for Scene Text Recognition (ViTSTR). ViTSTR stands out due to its ability to seamlessly combine the power of Vision Transformers (ViT) with language models for text recognition tasks.
A comparison could be made by experimenting with different ViT architecture sizes, such as varying the number of layers, embedding dimensions, or the number of attention heads. Models like ViT-Base, ViT-Large, and ViT-Huge can be tested, along with alternative architectures like:
- DeiT (Data-efficient Image Transformer)
- Swin Transformer
By evaluating these models of different scales, we can identify which architecture is the most efficient in terms of performance and computational resources. This will help determine the optimal model size that balances accuracy and efficiency for the given task.
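As a rough sketch of how such a comparison could be set up with the Hugging Face API, different vision encoders can be paired with the same GPT-2 decoder via from_encoder_decoder_pretrained; the checkpoint names below are plausible Hub identifiers rather than ones used in this post, and each combination would then be fine-tuned with the same training loop as above:

from transformers import AutoImageProcessor, VisionEncoderDecoderModel

# Candidate encoder checkpoints to compare (assumed Hub identifiers).
encoder_ckpts = {
    "vit-base": "google/vit-base-patch16-224-in21k",
    "deit-base": "facebook/deit-base-distilled-patch16-224",
    "swin-base": "microsoft/swin-base-patch4-window7-224-in22k",
}

candidates = {}
for name, ckpt in encoder_ckpts.items():
    # Pair each vision encoder with a GPT-2 decoder; cross-attention layers are
    # added automatically and learned during fine-tuning.
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(ckpt, "gpt2")
    # Before training, model.config.decoder_start_token_id and pad_token_id must
    # be set to match whichever tokenizer is used for the labels.
    processor = AutoImageProcessor.from_pretrained(ckpt)
    candidates[name] = (model, processor)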
For tasks like extracting information from documents, tools such as Nanonets’ Chat with PDF have evaluated and used several state-of-the-art LLMs along with custom in-house trained models, and can offer a reliable way to interact with content, ensuring accurate data extraction without risk of misrepresentation.