Published on June 15, 2025 11:20 AM GMT
SOTA multimodal LLMs read text from images (e.g. signs, screenshots, book pages) remarkably well.
Are they effectively running an internal OCR pipeline, or do they learn to "read" purely as a byproduct of pretraining (e.g. contrastive learning on image-text pairs)?
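For concreteness, here's a minimal NumPy sketch of the kind of contrastive objective mentioned above (CLIP-style symmetric InfoNCE). This is a toy illustration under my own naming, not any particular model's actual training code: matched image/text embedding pairs are scored against all mismatched pairs in the batch, and nothing in the loss ever supervises character-level reading directly.

```python
import numpy as np

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Illustrative sketch only; `clip_style_loss` is a hypothetical name.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # logits[i, j] = sim(image_i, text_j) / temperature.
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Matched pairs sit on the diagonal; treat them as class targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy usage: perfectly matched embeddings give a near-zero loss.
rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 64))
loss = clip_style_loss(emb, emb.copy())
```

Note that the only signal here is image-caption alignment, so any "reading" ability would have to emerge indirectly, from captions that happen to transcribe text visible in the image.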
Discuss