MarkTechPost@AI August 7, 2024
PleIAs Released OCRonos-Vintage: A 124 Million Parameter Model Trained on 18 Billion Tokens for Superior OCR Correction in Cultural Heritage Archives

PleIAs releases OCRonos-Vintage, a model for OCR correction in cultural heritage archives with several advantages

🎯 OCRonos-Vintage is a pre-trained model designed specifically for OCR correction, with 124 million parameters trained on 18 billion tokens from cultural heritage archives. It effectively corrects OCR errors in historical documents and performs strongly in this niche application; despite its relatively small size, it illustrates the trend toward building highly specialized models.

💻 The model was trained on Jean Zay's new H100 cluster using llm.c, a new pre-training library developed by Andrej Karpathy for pedagogical purposes; its advanced data preprocessing pipeline and efficient performance made the training process smooth and efficient.

💰 Specialized pre-training offers several advantages: it is cost-effective, since models with 100-300 million parameters like OCRonos-Vintage can be deployed on most CPU infrastructure and achieve higher throughput in GPU environments, making them well suited to processing large volumes of data; it allows greater customization, with the model architecture and tokenizer designed for the specific task and data; and it gives full control over the data used, avoiding data liability issues.

🌟 With OCRonos-Vintage, PleIAs demonstrates that specialized pre-training can deliver exceptional performance while maintaining efficiency and cost-effectiveness, setting a precedent for developing specialized AI models across a range of future applications.

PleIAs recently announced the release of OCRonos-Vintage, a specialized pre-trained model designed specifically for Optical Character Recognition (OCR) correction. This innovative model represents a significant milestone in OCR technology, particularly in its application to cultural heritage archives.

OCRonos-Vintage is a 124 million-parameter model trained exclusively on 18 billion tokens from cultural heritage archives. This specialized training is intended to enhance the model’s performance in correcting OCR errors in historical documents, and OCRonos-Vintage has demonstrated exceptional efficacy in this niche despite its small size relative to other models. Its development highlights the growing trend of creating highly specialized models tailored to specific tasks instead of relying solely on large, generalist models.
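
For readers who want to try the released checkpoint, here is a minimal inference sketch in Python. It assumes the model is published on the Hugging Face Hub under the id PleIAs/OCRonos-Vintage and follows a GPT-2-style causal architecture; the "### Text ###"/"### Correction ###" prompt delimiters reflect the convention shown on the model card and should be verified there.

```python
# Minimal OCR-correction inference sketch. Assumptions: the Hub id
# "PleIAs/OCRonos-Vintage" and the "### Text ###"/"### Correction ###"
# prompt delimiters; check the model card before relying on either.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/OCRonos-Vintage"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

noisy = "Tlie shi arrivcd at tbe port 0f Boston iu 1852."
prompt = f"### Text ###\n{noisy}\n\n### Correction ###\n"

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens (the corrected text).
corrected = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(corrected)
```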

The training of OCRonos-Vintage was conducted using the new H100 cluster on Jean Zay, supported by a compute grant. The model was trained with llm.c, a new pre-training library developed by Andrej Karpathy. Created for pedagogical purposes, this library has proven highly effective for training models from scratch. The combination of advanced data preprocessing pipelines and the efficient performance of llm.c allowed the training process to proceed smoothly and efficiently.
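
llm.c consumes pre-tokenized corpora as flat binary shards rather than raw text, so a preprocessing pipeline like the one described above reduces to tokenizing the archives and writing token ids in the expected layout. The sketch below mirrors the header-plus-uint16 format used by llm.c's reference data scripts, with GPT-2 BPE tokens via tiktoken; the magic number and layout are assumptions to verify against the repository, and PleIAs' own pipeline may substitute a custom tokenizer.

```python
import numpy as np
import tiktoken  # GPT-2 BPE, matching llm.c's reference data scripts

def write_llmc_shard(filename, tokens):
    """Write tokens in llm.c's assumed data-shard layout: a 256-int32
    header followed by the token ids as uint16 (verify against the
    repository's dev/data scripts before use)."""
    header = np.zeros(256, dtype=np.int32)
    header[0] = 20240520      # magic number identifying llm.c data files
    header[1] = 1             # format version (uint16 GPT-2 tokens)
    header[2] = len(tokens)   # number of tokens in this shard
    with open(filename, "wb") as f:
        f.write(header.tobytes())
        f.write(np.asarray(tokens, dtype=np.uint16).tobytes())

enc = tiktoken.get_encoding("gpt2")
docs = ["First archive document...", "Second archive document..."]

tokens = []
for doc in docs:
    tokens.append(enc.eot_token)            # document separator
    tokens.extend(enc.encode_ordinary(doc))
write_llmc_shard("archives_train.bin", tokens)
```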

Specialized pre-training, as exemplified by OCRonos-Vintage, is becoming increasingly viable and attractive for several reasons. One of the primary advantages is cost efficiency. Models with 100-300 million parameters, like OCRonos-Vintage, can be deployed on most CPU infrastructures without extensive adaptation or quantization. In GPU environments, these models offer significantly higher throughput. This efficiency is particularly important for processing large volumes of data, such as the vast cultural heritage archives targeted by OCRonos-Vintage.
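
To make the cost argument concrete, the weights-only memory footprint of a model this size fits comfortably on commodity CPU servers. A back-of-the-envelope calculation (weights only, ignoring activations and any KV cache):

```python
# Weights-only memory footprint of a 124M-parameter model at common precisions.
params = 124_000_000
for precision, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{precision}: {params * bytes_per_param / 1e6:.0f} MB")
# fp32: 496 MB, fp16/bf16: 248 MB, int8: 124 MB
```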

Another key benefit of specialized pre-training is the increased customization it allows. The architecture and tokenizer of a model can be designed specifically for the target task and data. For OCR correction, a tokenizer trained on a small sample of noisy data can outperform more generalist tokenizers. This approach makes it possible to optimize the model for specific requirements, such as handling long contexts or improving comprehension in non-English languages. The potential for fast inference and strong performance, even with letter- or byte-level tokenization, makes specialized models highly adaptable and efficient.
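
As an illustration of this kind of customization, the sketch below trains a small byte-level BPE tokenizer with the Hugging Face tokenizers library. The input lines are a hypothetical stand-in for a sample of uncorrected archive text; a real run would use a representative corpus and a vocabulary size tuned to the task.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Hypothetical stand-in for a sample of noisy, uncorrected OCR output.
noisy_ocr_lines = [
    "Tlie Parliament assembled on tbe 4th of Fcbruary.",
    "Thc ship arrivcd at thc port of Ncw York iu 1852.",
]

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=8192,                       # small, task-specific vocabulary
    special_tokens=["<|endoftext|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(noisy_ocr_lines, trainer=trainer)
tokenizer.save("ocr_bpe_tokenizer.json")
```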

Specialized pre-training offers full control over the data used. In regulated environments, deploying or fine-tuning existing models can raise concerns about data liabilities. Specialized models like OCRonos-Vintage, trained end-to-end on selected datasets, avoid these issues. All training data for OCRonos-Vintage comes from cultural heritage archives in the public domain, ensuring compliance with data use regulations and promoting transparency.
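
In practice, this control is exercised when the corpus is assembled. A minimal sketch, assuming a hypothetical record schema with license and year metadata fields:

```python
# Hypothetical records; the "license" and "year" fields stand in for
# whatever provenance metadata a real archive exposes.
records = [
    {"text": "An 1887 newspaper page...", "license": "public-domain", "year": 1887},
    {"text": "A 1995 magazine article...", "license": "unknown", "year": 1995},
]

# Keep only documents whose provenance is unambiguous.
training_corpus = [r["text"] for r in records if r["license"] == "public-domain"]
```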

As PleIAs continues experimenting with and iterating on other tasks, such as summarization and classification, the insights gained from OCRonos-Vintage will likely inform the development of future specialized models. The broader implication is that small, efficient models can achieve remarkable performance in reasoning-intensive tasks, challenging the conventional emphasis on large parameter counts for logical consistency.

In conclusion, PleIAs’ launch of OCRonos-Vintage marks a significant milestone in the evolution of specialized AI models. By focusing on specific tasks and optimizing models accordingly, PleIAs demonstrates that specialized pre-training can deliver exceptional performance while maintaining efficiency and cost-effectiveness. This approach advances the field of OCR correction and sets a precedent for developing specialized AI models across various applications.


Check out the Model and Details. All credit for this research goes to the researchers of this project.

