Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis

cs.AI updates on arXiv.org, 15 hours ago

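This paper studies the application of multimodal vision-language models (VLMs) to OCR for outdoor advertising, comparing VLMs with a CNN model on recognizing advertisement text in complex scenes. The results show that VLMs perform well at holistic scene understanding, while the CNN model remains competitive at recognizing cropped text, which provides a useful reference for edge deployment.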

arXiv:2507.11730v1 Announce Type: cross Abstract: Outdoor advertisements remain a critical medium for modern marketing, yet accurately verifying billboard text visibility under real-world conditions is still challenging. Traditional Optical Character Recognition (OCR) pipelines excel at cropped text recognition but often struggle with complex outdoor scenes, varying fonts, and weather-induced visual noise. Recently, multimodal Vision-Language Models (VLMs) have emerged as promising alternatives, offering end-to-end scene understanding with no explicit detection step. This work systematically benchmarks representative VLMs (including Qwen 2.5 VL 3B, InternVL3, and SmolVLM2) against a compact CNN-based OCR baseline (PaddleOCRv4) across two public datasets (ICDAR 2015 and SVT), augmented with synthetic weather distortions to simulate realistic degradation. Our results reveal that while selected VLMs excel at holistic scene reasoning, lightweight CNN pipelines still achieve competitive accuracy for cropped text at a fraction of the computational cost, an important consideration for edge deployment. To foster future research, we release our weather-augmented benchmark and evaluation code publicly.
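
To make the evaluation setup in the abstract more concrete, here is a minimal sketch of the two ingredients it mentions: a synthetic weather distortion and a text-accuracy score. The abstract does not say which distortions or metrics the authors use, so everything below is an assumption for illustration only: the fog model (a uniform blend toward white), the Gaussian noise, the character error rate metric, and all function names (`add_fog`, `add_noise`, `character_error_rate`) are hypothetical and are not taken from the released benchmark code.

```python
import random


def add_fog(image, intensity=0.5, fog_level=255):
    """Blend each grayscale pixel toward a uniform fog value.

    Assumed fog model for illustration; `image` is a list of rows of
    ints in [0, 255], and `intensity` in [0, 1] controls fog density.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be in [0, 1]")
    return [
        [round((1 - intensity) * p + intensity * fog_level) for p in row]
        for row in image
    ]


def add_noise(image, sigma=10.0, seed=0):
    """Add seeded Gaussian noise and clamp the result to [0, 255]."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
        for row in image
    ]


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(prediction, reference):
    """Edit distance normalized by reference length (lower is better)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(prediction, reference) / len(reference)
```

In a benchmark loop of this kind, each image would be distorted at several intensities, passed to the OCR model or VLM under test, and the prediction scored against the ground-truth string. For example, `character_error_rate("SA1E", "SALE")` gives 0.25, since one of the four reference characters is wrong.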


Related tags

OCR, VLMs, outdoor advertising, vision-language models, CNN
