MarkTechPost@AI, November 10, 2024
HtmlRAG: Enhancing RAG Systems with Richer Semantic and Structural Information through HTML

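HtmlRAG is a new method that uses knowledge in HTML format within retrieval-augmented generation (RAG) systems to preserve richer semantic and structural information than plain text. Conventional RAG systems convert HTML documents to plain text, which loses information. HtmlRAG instead keeps the HTML structure, takes advantage of the context window capacity of LLMs, and processes HTML documents with a two-step pruning mechanism that controls token length while retaining key information. Experimental results show that HtmlRAG outperforms baseline methods on multiple datasets, demonstrating the effectiveness of HTML as a format for knowledge retrieval and offering a new direction for RAG systems.

🤔 HtmlRAG proposes a new way of handling knowledge in RAG systems: it uses HTML rather than plain text as the format of retrieved knowledge, preserving richer semantic and structural information and avoiding the loss caused by plain-text conversion.

🛠️ To address the excessive token length and noise of raw HTML documents, HtmlRAG adopts a two-stage pruning mechanism: it first parses the HTML documents into a DOM tree with Beautiful Soup, then converts that tree into a more optimized “block tree” structure, which is further processed by an embedding model and a generative model.

📊 Experimental results show that HtmlRAG outperforms baseline methods on all six datasets and does especially well on complex web content such as tables, demonstrating its effectiveness at preserving structural information and improving LLM performance.

💡 HtmlRAG points to a new direction for RAG systems, encouraging future researchers to explore more HTML-based knowledge retrieval and processing methods to improve how LLMs acquire and apply knowledge.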

Retrieval-augmented generation (RAG) has been shown to improve the knowledge capabilities of LLMs and reduce hallucination. The Web is a major source of external knowledge for RAG, including in many commercial systems such as ChatGPT. However, current RAG implementations face a fundamental challenge in how they process that knowledge. The conventional method of converting HTML documents into plain text before feeding them to LLMs results in a substantial loss of structural and semantic information. This limitation becomes evident with complex web content like tables, where the conversion disrupts the original layout and discards HTML tags that carry important contextual information.

Existing work on enhancing RAG systems has focused on individual components and frameworks. Traditional RAG pipelines combine query rewriters, retrievers, re-rankers, refiners, and readers, as implemented in frameworks like LangChain and LlamaIndex. Post-retrieval processing has been explored through chunking-based and abstractive refiners that optimize the content sent to LLMs. Moreover, research in structured data understanding has shown that HTML and Excel tables carry richer information than plain text. However, these solutions fall short on HTML content: traditional chunking methods cannot effectively handle HTML structure, and abstractive refiners struggle with long HTML content and incur high computational costs.

Researchers from the Gaoling School of Artificial Intelligence, Renmin University of China, and Baichuan Intelligent Technology, China have proposed HtmlRAG, a method that uses HTML instead of plain text as the format of retrieved knowledge in RAG systems, preserving semantic and structural information that plain text lacks. The method builds on recent advances in LLMs’ context window capabilities and on the versatility of HTML as a format that can accommodate various document types like LaTeX, PDF, and Word with minimal information loss. The researchers also identified significant challenges in implementing this approach, particularly the extensive token length of raw HTML documents and the noise from CSS styles, JavaScript, and comments, which together account for over 90% of the tokens.
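To make the noise problem concrete, here is a minimal cleaning sketch using Beautiful Soup. The function name `clean_html`, the sample page, and the exact set of elements removed are illustrative assumptions rather than the authors' implementation; the sketch only shows how CSS, JavaScript, and comments can be stripped while structural tags such as tables survive.

```python
from bs4 import BeautifulSoup, Comment


def clean_html(raw_html: str) -> str:
    """Strip content that carries no knowledge for the LLM (illustrative helper)."""
    soup = BeautifulSoup(raw_html, "html.parser")

    # Remove <script> and <style> elements together with their contents.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Remove HTML comments.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Remove inline style attributes but keep the tags themselves.
    for tag in soup.find_all(True):
        tag.attrs.pop("style", None)

    return str(soup)


# Hypothetical sample page: mostly styling and script noise around one table.
raw = (
    "<html><head><style>p{color:red}</style><script>track()</script></head>"
    "<body><!-- ad slot --><table style='width:100%'>"
    "<tr><th>Model</th><td>BGE</td></tr></table></body></html>"
)
cleaned = clean_html(raw)
print(cleaned)
print(f"{len(raw)} -> {len(cleaned)} characters")
```

The table markup is left untouched, which is the point of keeping HTML: the header and cell relationship remains visible to the model after the noise is gone.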

HtmlRAG implements a two-step pruning mechanism to process retrieved HTML documents efficiently. First, the system concatenates all retrieved HTML documents and parses them into a single DOM tree using Beautiful Soup. Because traditional DOM trees are too fine-grained to process cheaply, the researchers developed an optimized “block tree” structure whose granularity is controlled by a maxWords parameter. The block tree construction recursively merges fragmented child nodes into their parent nodes, creating larger blocks while respecting the word limit. Pruning then operates in two phases: an embedding model first prunes the cleaned HTML, and a generative model then refines the result further.
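The block tree idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the `Node` type stands in for an already parsed DOM, `build_blocks`, `prune`, and `budget_words` are hypothetical names, and a word-overlap score replaces the embedding model used in the first pruning phase. The second, generative phase is not shown.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed DOM element."""
    tag: str
    text: str = ""  # text directly inside this element
    children: list[Node] = field(default_factory=list)


def flat_text(node: Node) -> str:
    return " ".join([node.text] + [flat_text(c) for c in node.children])


def word_count(node: Node) -> int:
    return len(flat_text(node).split())


def build_blocks(node: Node, max_words: int) -> list[Node]:
    """A subtree that fits in max_words becomes one block; otherwise recurse.

    Simplification: text sitting directly in a node that gets split is dropped.
    """
    if not node.children or word_count(node) <= max_words:
        return [node]
    blocks: list[Node] = []
    for child in node.children:
        blocks.extend(build_blocks(child, max_words))
    return blocks


def prune(blocks: list[Node], query: str, budget_words: int) -> list[Node]:
    """Keep the best-scoring blocks that fit the budget, in document order."""
    query_words = set(query.lower().split())

    def score(block: Node) -> int:
        # Assumption: word overlap replaces embedding similarity here.
        return len(query_words & set(flat_text(block).lower().split()))

    ranked = sorted(range(len(blocks)), key=lambda i: score(blocks[i]), reverse=True)
    kept, used = set(), 0
    for i in ranked:
        size = word_count(blocks[i])
        if used + size <= budget_words:
            kept.add(i)
            used += size
    return [blocks[i] for i in sorted(kept)]


def render(node: Node) -> str:
    inner = node.text + "".join(render(c) for c in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


doc = Node("html", children=[
    Node("div", children=[
        Node("h1", "HtmlRAG overview"),
        Node("p", "HTML keeps tables and headings intact"),
    ]),
    Node("div", children=[
        Node("p", "Subscribe to our newsletter today"),
        Node("p", "Follow us on social media"),
    ]),
])

blocks = build_blocks(doc, max_words=10)
kept = prune(blocks, query="tables in HTML", budget_words=8)
print([render(b) for b in kept])
```

A larger `max_words` yields fewer, coarser blocks that are cheaper to score, while a smaller value yields finer blocks that can be pruned more precisely; in this toy document, `max_words=10` gives two blocks and `max_words=5` gives four.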

HtmlRAG outperforms baseline methods across six datasets on all evaluation metrics. Chunking-based refiners that follow LangChain’s approach make only limited use of structural information compared to HtmlRAG. Among re-rankers, dense retrievers outperformed the sparse retriever BM25, with the encoder-based BGE showing better results than the decoder-based e5-mistral. The abstractive refiners also show notable limitations: LongLLMLingua struggles to optimize HTML documents and loses structural information in plain-text conversion, while JinaAI-reader, despite generating refined Markdown from HTML input, is held back by token-by-token decoding and high computational demands for long sequences.

In conclusion, researchers have introduced an approach called HtmlRAG that uses HTML as the format of retrieved knowledge in RAG systems to preserve rich semantic and structured information not present in plain text. The implemented HTML cleaning and pruning techniques effectively manage token length while preserving essential structural and semantic information. HtmlRAG’s superior performance compared to traditional plain-text-based post-retrieval processes validates the effectiveness of utilizing HTML format for knowledge retrieval. The researchers provide an immediate practical solution and establish a promising new direction for future developments in RAG systems, encouraging further innovations in HTML-based knowledge retrieval and processing methods.


Check out the Paper. All credit for this research goes to the researchers of this project.


The post HtmlRAG: Enhancing RAG Systems with Richer Semantic and Structural Information through HTML appeared first on MarkTechPost.
