MarkTechPost@AI, September 12, 2024
Jina AI Released Reader-LM-0.5B and Reader-LM-1.5B: Revolutionizing HTML-to-Markdown Conversion with Multilingual, Long-Context, and Highly Efficient Small Language Models for Web Data Processing

Jina AI has released Reader-LM-0.5B and Reader-LM-1.5B, models built to convert HTML into Markdown efficiently. They tackle the messy problem of processing complex web content, offering multilingual support and strong efficiency.

💻 Recognizing the shortcomings of traditional HTML-to-Markdown conversion methods, Jina AI proposed solving the problem with a language model and developed the Reader-LM series, which handles the task more efficiently with far fewer parameters than large language models, keeping running costs low.

📄 The Reader-LM models are trained specifically to convert raw HTML into Markdown, with multilingual support and the ability to process context lengths of up to 256K tokens. Reader-LM-0.5B has 494 million parameters and Reader-LM-1.5B has 1.54 billion; each has its own strengths when handling HTML structures of varying complexity.

📊 The Reader-LM models have been rigorously evaluated and excel at generating accurate, clean Markdown: Reader-LM-1.5B, for example, achieves a ROUGE-L score of 0.72, outperforming many large language models.

🎓 The Reader-LM models were trained in two stages on carefully prepared, high-quality pairs of HTML and Markdown, using innovative mechanisms to optimize the conversion process. They can be deployed in a variety of environments and fit practical scenarios such as automated data extraction and cleaning.

The release of Reader-LM-0.5B and Reader-LM-1.5B by Jina AI marks a significant milestone in small language model (SLM) technology. These models are designed to solve a unique and specific challenge: converting raw, noisy HTML from the open web into clean markdown format. While seemingly straightforward, this task poses complex challenges, particularly in handling the vast noise in modern web content such as headers, footers, and sidebars. The Reader-LM series aims to address this challenge efficiently, focusing on cost-effectiveness and performance.

Background and Purpose

In April 2024, Jina AI introduced Jina Reader, an API that converts any URL into markdown suitable for large language models (LLMs). The API relies on tools like Mozilla’s Readability package to extract the main content from a webpage, then applies regex and the Turndown library to convert the cleaned HTML into markdown. However, this method faced issues such as incorrect content filtering and difficulty converting complex HTML structures. As user feedback poured in, Jina AI realized that patching the existing pipeline with ever more regex patterns and heuristics was not a sustainable solution.
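For illustration, a rough Python analogue of that heuristic pipeline might look like the sketch below. The original stack uses the JavaScript Readability and Turndown libraries, so the readability-lxml and markdownify packages here stand in as Python equivalents, and the regex cleanup rule is hypothetical.

```python
# A minimal Python sketch of the heuristic pipeline (the original relies on
# Mozilla's Readability and Turndown in JavaScript; readability-lxml and
# markdownify are rough Python analogues).
import re

from readability import Document      # pip install readability-lxml
from markdownify import markdownify   # pip install markdownify


def html_to_markdown(raw_html: str) -> str:
    # Extract the main article content, dropping headers, footers, sidebars.
    main_content = Document(raw_html).summary()
    # Hypothetical regex cleanup pass, e.g. collapsing runs of blank lines.
    main_content = re.sub(r"\n{3,}", "\n\n", main_content)
    # Convert the cleaned HTML fragment to markdown with ATX-style headings.
    return markdownify(main_content, heading_style="ATX")
```

Every edge case that slipped past Readability demanded another regex or heuristic patch, which is exactly the maintenance burden Jina AI wanted to escape.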

To overcome these limitations, Jina AI asked an important question: Could this problem be solved end-to-end using a language model? Instead of relying on manually curated rules, a language model could handle the task of HTML-to-markdown conversion more efficiently, especially with fewer than a billion parameters, making it feasible to run on the edge.

Introduction of Reader-LM Models

Jina AI released two small language models: Reader-LM-0.5B and Reader-LM-1.5B. These models are trained specifically to convert raw HTML into markdown, and both are multilingual with support for up to 256K tokens of context length. This ability to handle large contexts is critical, as HTML content from modern websites often contains more noise than ever before, with inline CSS, JavaScript, and other elements inflating the token count significantly.

While large language models are known for their high computational requirements, small language models like Reader-LM are designed to offer efficient performance without expensive infrastructure. Reader-LM-0.5B and Reader-LM-1.5B outperform many larger models in the specific task of HTML-to-markdown conversion while being just a fraction of their size.

Architecture and Specifications

The Reader-LM models are designed to handle long-context inputs and perform selective copying from HTML to markdown. This task is simpler than typical LLM functions such as text generation or code writing. This selective-copy behavior focuses primarily on identifying relevant content, skipping over unnecessary elements like sidebars and headers, and formatting the remaining content in markdown syntax.
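To make the selective-copy behavior concrete, here is a small hypothetical input/output pair (both strings are invented for illustration): the model keeps the article body and renders it as markdown while discarding the surrounding page chrome.

```python
# Hypothetical example of selective copying: noisy HTML in, clean markdown out.
noisy_html = """
<header><nav>Home | About | Subscribe</nav></header>
<aside>Advertisement: 50% off today!</aside>
<article><h1>Release Notes</h1><p>Reader-LM converts raw HTML to markdown.</p></article>
<footer>&copy; 2024 Example Corp</footer>
"""

# The target output keeps only the main content, formatted as markdown.
expected_markdown = "# Release Notes\n\nReader-LM converts raw HTML to markdown."
```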

Model Specifications

Both models support a context length of up to 256K tokens, which is crucial for processing the often lengthy and noisy HTML found on the web. Their ability to handle multilingual content makes them versatile tools for global applications.

Performance and Benchmarking

The performance of Reader-LM-0.5B and Reader-LM-1.5B has been rigorously evaluated against several large language models, including GPT-4o, Gemini-1.5-Flash, LLaMA-3.1-70B, and Qwen2-7B-Instruct. The models were tested using metrics such as ROUGE-L (commonly used for summarization and question-answering tasks), Token Error Rate (TER, which measures the rate of hallucinated content), and Word Error Rate (WER, which measures mismatches between the generated markdown and the expected output).
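As a rough illustration of how such scores are computed, the sketch below evaluates a generated markdown string against a reference using the rouge-score and jiwer packages. The strings are invented, and Jina AI's TER metric is defined specifically for this benchmark, so it is not reproduced here.

```python
# Hedged sketch: scoring generated markdown against a reference with
# ROUGE-L and WER.  pip install rouge-score jiwer
from rouge_score import rouge_scorer
import jiwer

reference = "# Title\n\nFirst paragraph of the article."
generated = "# Title\n\nFirst paragraph of the article!"

# ROUGE-L F-measure: longest-common-subsequence overlap with the reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, generated)["rougeL"].fmeasure

# WER: word-level edit distance between reference and hypothesis.
wer = jiwer.wer(reference, generated)

print(f"ROUGE-L: {rouge_l:.2f}, WER: {wer:.2f}")
```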

In these evaluations, the Reader-LM models outperformed many larger models at generating clean, accurate markdown from HTML. For example, Reader-LM-1.5B achieved a ROUGE-L score of 0.72, a WER of 1.87, and a TER of 0.19, significantly better than GPT-4o and the other models tested. Reader-LM-0.5B, while smaller, also delivered competitive results, especially on structure preservation, which is vital for converting HTML into markdown.

Training and Development

Training Reader-LM models required preparing high-quality data pairs of raw HTML and corresponding markdown. Jina AI generated this data using its existing Jina Reader API, supplemented by synthetic HTML generated by GPT-4o for training purposes. The final training dataset contained approximately 2.5 billion tokens.

The models were trained in two stages, progressing from shorter, simpler HTML inputs to longer and more complex documents.

Despite the complexity of HTML-to-markdown conversion, the models were optimized to handle this task effectively without unnecessary computational overhead. They leverage techniques like contrastive search to prevent token degeneration and repetitive loops during markdown generation.
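In Hugging Face transformers, contrastive search is enabled by passing a positive penalty_alpha together with a small top_k to generate. The following self-contained sketch demonstrates the decoding mode; it uses GPT-2 purely for brevity, and the hyperparameter values are illustrative rather than Reader-LM's published settings.

```python
# Demo of contrastive search decoding in Hugging Face transformers.
# A penalty_alpha > 0 combined with a small top_k activates the
# contrastive search decoder, which penalizes degenerate repetition.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Convert the following HTML to markdown:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    penalty_alpha=0.6,  # degeneration penalty weight
    top_k=4,            # candidate pool size per decoding step
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))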

Real-World Applications

Reader-LM is designed for practical use in both individual and enterprise environments. The models can be easily tested using Google Colab, and production environments can leverage platforms like Azure and AWS, where the models will soon be available. Reader-LM is licensed under CC BY-NC 4.0, with commercial usage options available for companies seeking on-premises solutions.
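A minimal inference sketch with transformers might look like the following. The checkpoint id and chat-style prompt follow the pattern of Jina AI's published examples, but treat both as assumptions and consult the model card before relying on them.

```python
# Hedged sketch: running Reader-LM with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "jinaai/reader-lm-0.5b"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# The raw HTML goes in as the user message of a chat-style prompt.
html = "<html><body><h1>Hello</h1><p>World.</p></body></html>"
messages = [{"role": "user", "content": html}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, i.e. the markdown.
markdown = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(markdown)
```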

The models are ideal for automating data extraction and cleaning from the open web in production environments. By converting raw HTML into clean markdown, Reader-LM enables efficient data processing, making it easier for downstream LLMs to summarize, reason, and generate insights from web content. Additionally, its multilingual capabilities broaden its applicability to various industries and regions.

Conclusion

The release of Reader-LM-0.5B and Reader-LM-1.5B represents a leap forward in small language model technology, specifically tailored for HTML-to-markdown conversion. These models address a critical need for efficient, cost-effective data extraction from the noisy and often overwhelming web content that characterizes the modern internet. With their compact size, long-context support, and multilingual capabilities, Reader-LM models offer a powerful tool for developers and enterprises looking to optimize their data workflows.


Check out the Reader-LM-0.5B, Reader-LM-1.5B, and the Colab Notebook. All credit for this research goes to the researchers of this project.

