MarkTechPost@AI, 11 hours ago
Google AI Releases LangExtract: An Open Source Python Library that Extracts Structured Data from Unstructured Text Documents

 

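Google AI has released LangExtract, an open-source Python library that addresses the technical challenge of extracting meaningful, traceable information from unstructured text. The library uses large language models such as Gemini, driven by natural language instructions and a few examples, to perform declarative, traceable data extraction, and it can enforce custom output schemas so that the data is immediately usable. LangExtract applies to healthcare, finance, law, scientific research, and other domains, and it provides interactive visual reports for auditing and analysis. It processes large volumes of text efficiently and integrates easily into existing Python workflows.

✨ **Declarative and traceable extraction**: LangExtract lets users define custom extraction tasks with natural language instructions and high-quality "few-shot" examples, specifying exactly which entities, relationships, or facts to extract. Every extracted item can be traced back to the source text, which supports validation, auditing, and end-to-end tracking.

🌍 **Cross-domain versatility**: The library is not limited to tech demos. It is useful in critical real-world domains such as healthcare (clinical notes, medical reports), finance (summaries, risk documents), law (contracts), research literature, and even the arts (analyzing Shakespeare). Examples include automatically extracting medications, dosages, and administration details from clinical documents, or extracting character relationships and emotions from plays.

🔧 **Schema enforcement with LLMs**: Using Gemini and other large language models, LangExtract can enforce custom output schemas (such as JSON), so results are not only accurate but directly usable in downstream databases, analytics, or AI pipelines. It addresses traditional LLM weaknesses around hallucination and schema drift by grounding outputs in both the user's instructions and the actual source text.

📈 **Scalability and visualization**: LangExtract handles long documents efficiently through chunking, parallel processing, and result aggregation. It can generate interactive HTML reports in which users see each extracted entity and its position in the original document, enabling auditing and error analysis. It runs in Google Colab, Jupyter, or as standalone HTML files, which speeds up development and research feedback loops.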

In today’s data-driven world, valuable insights are often buried in unstructured text—be it clinical notes, lengthy legal contracts, or customer feedback threads. Extracting meaningful, traceable information from these documents is both a technical and practical challenge. Google AI’s new open-source Python library, LangExtract, is designed to address this gap directly, using LLMs like Gemini to deliver powerful, automated extraction with traceability and transparency at its core.

Key Innovations of LangExtract

1. Declarative and Traceable Extraction

LangExtract lets users define custom extraction tasks using natural language instructions and high-quality “few-shot” examples. This empowers developers and analysts to specify exactly which entities, relationships, or facts to extract, and in what structure. Crucially, every extracted piece of information is tied directly back to its source text—enabling validation, auditing, and end-to-end traceability.
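To make "tied directly back to its source text" concrete, here is a minimal sketch of source grounding in plain Python. It is not LangExtract's implementation: the `GroundedExtraction` class and `ground` function are hypothetical names introduced for illustration, and the sketch assumes the extracted text appears verbatim in the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedExtraction:
    """Hypothetical container: an extracted span plus its location in the source."""
    extraction_class: str
    extraction_text: str
    start: Optional[int]  # character offset in the source, or None if not found
    end: Optional[int]


def ground(source: str, extraction_class: str, extraction_text: str) -> GroundedExtraction:
    """Locate extraction_text verbatim in source and record its character span."""
    start = source.find(extraction_text)
    if start == -1:
        # Text that does not occur verbatim cannot be traced back, so flag it.
        return GroundedExtraction(extraction_class, extraction_text, None, None)
    return GroundedExtraction(
        extraction_class, extraction_text, start, start + len(extraction_text)
    )


source = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
found = ground(source, "character", "Lady Juliet")
missing = ground(source, "character", "Mercutio")

print(found.start, found.end)            # span of the grounded extraction
print(source[found.start:found.end])     # slicing the source recovers the text
print(missing.start)                     # None marks an ungrounded extraction
```

An extraction whose span is `None` is a candidate hallucination: the model produced text that a reviewer cannot locate in the document.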

2. Domain Versatility

The library works not just in tech demos but in critical real-world domains—including health (clinical notes, medical reports), finance (summaries, risk documents), law (contracts), research literature, and even the arts (analyzing Shakespeare). Original use cases include automatic extraction of medications, dosages, and administration details from clinical documents, as well as relationships and emotions from plays or literature.

3. Schema Enforcement with LLMs

Powered by Gemini and compatible with other LLMs, LangExtract enables enforcement of custom output schemas (like JSON), so results aren’t just accurate—they’re immediately usable in downstream databases, analytics, or AI pipelines. It solves traditional LLM weaknesses around hallucination and schema drift by grounding outputs to both user instructions and actual source text.
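As a rough illustration of what checking output against a schema and the source text can involve, the sketch below validates a JSON list of extractions. The `SCHEMA` dictionary and `validate` function are assumptions made for this example, not part of LangExtract, and the JSON field names simply mirror the ones used in the example workflow later in this article.

```python
import json
from typing import List

# Hypothetical schema: allowed classes and the attribute keys each must carry.
SCHEMA = {
    "character": {"emotional_state"},
    "emotion": {"feeling"},
    "relationship": {"type"},
}


def validate(raw_json: str, source: str) -> List[str]:
    """Return a list of problems; an empty list means the output conforms."""
    problems = []
    for i, item in enumerate(json.loads(raw_json)):
        cls = item.get("extraction_class")
        if cls not in SCHEMA:
            problems.append(f"item {i}: unknown class {cls!r}")
            continue
        missing = SCHEMA[cls] - set(item.get("attributes", {}))
        if missing:
            problems.append(f"item {i}: missing attributes {sorted(missing)}")
        text = item.get("extraction_text")
        if not text or text not in source:
            # Grounding check: the extracted text must occur verbatim in the source.
            problems.append(f"item {i}: text not found in source")
    return problems


source = "ROMEO. But soft! What light through yonder window breaks?"
good = json.dumps([
    {"extraction_class": "character", "extraction_text": "ROMEO",
     "attributes": {"emotional_state": "wonder"}},
])
bad = json.dumps([
    {"extraction_class": "place", "extraction_text": "Verona", "attributes": {}},
    {"extraction_class": "emotion", "extraction_text": "But soft!", "attributes": {}},
])

print(validate(good, source))
print(validate(bad, source))
```

A check like this catches schema drift (unknown classes, missing attributes) and ungrounded text after the fact. Enforcing the schema during generation, as the article describes, is a stronger guarantee.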

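4. Scalability and Visualization

LangExtract handles long documents by chunking the text, processing the chunks in parallel, and aggregating the results. It can also generate an interactive HTML report that shows each extracted entity at its position in the original document, which supports auditing and error analysis. The report works in Google Colab, Jupyter, or as a standalone HTML file.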

5. Installation and Usage

Install easily with pip:

    pip install langextract

Example Workflow (Extracting Character Info from Shakespeare):

    import langextract as lx
    import textwrap

    # 1. Define your prompt
    prompt = textwrap.dedent("""\
        Extract characters, emotions, and relationships in order of appearance.
        Use exact text for extractions. Do not paraphrase or overlap entities.
        Provide meaningful attributes for each entity to add context.""")

    # 2. Give a high-quality example
    examples = [
        lx.data.ExampleData(
            text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
            extractions=[
                lx.data.Extraction(extraction_class="character", extraction_text="ROMEO", attributes={"emotional_state": "wonder"}),
                lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!", attributes={"feeling": "gentle awe"}),
                lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun", attributes={"type": "metaphor"}),
            ],
        )
    ]

    # 3. Extract from new text (requires a configured Gemini API key)
    input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
    result = lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-pro",
    )

    # 4. Save and visualize results
    # output_dir="." writes the JSONL next to the script so the path passed to
    # visualize() below resolves; without it the file goes to a default output folder.
    lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
    html_content = lx.visualize("extraction_results.jsonl")
    with open("visualization.html", "w") as f:
        # In Jupyter/Colab, visualize() may return an HTML display object instead of a string.
        f.write(html_content.data if hasattr(html_content, "data") else html_content)

This results in structured, source-anchored JSON outputs, plus an interactive HTML visualization for easy review and demonstration.

Specialized & Real-World Applications

The team even provides a demonstration called RadExtract for structuring radiology reports—highlighting not just what was extracted, but exactly where the information appeared in the original input.

How LangExtract Compares

| Feature | Traditional Approaches | LangExtract Approach |
| --- | --- | --- |
| Schema Consistency | Often manual/error-prone | Enforced via instructions & few-shot examples |
| Result Traceability | Minimal | All output linked to input text |
| Scaling to Long Texts | Windowed, lossy | Chunked + parallel extraction, then aggregation |
| Visualization | Custom, usually absent | Built-in, interactive HTML reports |
| Deployment | Rigid, model-specific | Gemini-first, open to other LLMs & on-premises |
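
The "chunked + parallel extraction, then aggregation" pattern from the comparison above can be sketched in plain Python. This is a generic illustration under stated assumptions, not LangExtract's internals: `chunk_text`, `extract_from_chunk`, and `extract_long` are hypothetical names, and a regular expression stands in for the model call.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def chunk_text(text: str, max_chars: int) -> List[Tuple[int, str]]:
    """Split text into (offset, chunk) pieces of at most max_chars characters."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so no word is split across chunks.
            space = text.rfind(" ", start, end)
            if space > start:
                end = space + 1
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def extract_from_chunk(chunk: str) -> List[Tuple[str, int]]:
    """Stand-in for a model call: find the names Romeo and Juliet in a chunk."""
    return [(m.group(), m.start()) for m in re.finditer(r"Romeo|Juliet", chunk)]


def extract_long(text: str, max_chars: int = 40, max_workers: int = 4) -> List[Tuple[str, int]]:
    """Chunk the text, extract from chunks in parallel, then merge the results."""
    chunks = chunk_text(text, max_chars)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = list(pool.map(lambda c: extract_from_chunk(c[1]), chunks))
    merged = []
    for (offset, _), hits in zip(chunks, per_chunk):
        # Shift chunk-local positions back to offsets in the full document.
        merged.extend((name, offset + local) for name, local in hits)
    return sorted(merged, key=lambda hit: hit[1])


text = "Juliet waited by the window all night. Romeo came at dawn."
print(extract_long(text, max_chars=40))
```

Keeping each chunk's starting offset is what lets the merged results still point at positions in the original document, so chunking does not have to cost traceability.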

In Summary

LangExtract presents a new era for extracting structured, actionable data from text—delivering:



The post Google AI Releases LangExtract: An Open Source Python Library that Extracts Structured Data from Unstructured Text Documents appeared first on MarkTechPost.
