MarkTechPost@AI — March 25, 13:31
A Coding Implementation of Extracting Structured Data Using LangSmith, Pydantic, LangChain, and Claude 3.7 Sonnet

This tutorial unlocks structured data extraction with LangChain and Claude 3.7 Sonnet, transforming raw text into actionable insights. It focuses on tracing LLM tool calls with LangSmith, which enables real-time debugging and performance monitoring of the extraction system. We use Pydantic schemas for precise data formatting and LangChain’s flexible prompting to guide Claude, refining results through examples rather than complex training runs. Along the way, the tutorial showcases LangSmith’s capabilities for building robust extraction pipelines for diverse applications, from document processing to automated data entry.

First, we need to install the necessary packages. We’ll use langchain-core and langchain_anthropic to interface with the Claude model.

!pip install --upgrade langchain-core
!pip install langchain_anthropic

If you’re using LangSmith for tracing and debugging, you can set up environment variables:

LANGSMITH_TRACING=True
LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
LANGSMITH_API_KEY="Your API KEY"
LANGSMITH_PROJECT="extraction_api"
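If you prefer to keep everything in the notebook, the same variables can be set from Python before LangChain is imported. A minimal sketch — the API key below is a placeholder, and environment variables are plain strings, so tracing is enabled with the lowercase "true":

```python
import os

# Placeholder values; substitute your real LangSmith API key.
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGSMITH_API_KEY"] = "Your API KEY"
os.environ["LANGSMITH_PROJECT"] = "extraction_api"

print(os.environ["LANGSMITH_PROJECT"])  # extraction_api
```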

Next, we must define the schema for the information we want to extract. We’ll use Pydantic models to create a structured representation of a person.

from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )
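Before involving the model, the schema can be sanity-checked locally. The snippet below redefines Person so it stands alone, and assumes pydantic v2 (where serialization is done via model_dump):

```python
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )


# Every field is Optional with a None default, so partial extractions validate cleanly.
p = Person(name="Alan Smith", hair_color="blond")
print(p.model_dump())
```

Because missing attributes simply default to None, the model can return partial information without breaking validation — exactly the behavior the prompt will ask for.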

Now, we’ll define a prompt template that instructs Claude on how to perform the extraction task:

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        ("human", "{text}"),
    ]
)

This template provides clear instructions to the model about its task and how to handle missing information.
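To make the substitution concrete, here is a stdlib-only mimic of what invoking the template renders. The real ChatPromptTemplate produces message objects rather than plain tuples, so this is only an illustration of the {text} placeholder being filled in:

```python
system_message = (
    "You are an expert extraction algorithm. "
    "Only extract relevant information from the text. "
    "If you do not know the value of an attribute asked to extract, "
    "return null for the attribute's value."
)

# The {text} placeholder in the human message is filled at invoke time.
messages = [
    ("system", system_message),
    ("human", "{text}".format(text="Alan Smith is 6 feet tall and has blond hair.")),
]
print(messages)
```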

Next, we’ll initialize the Claude model that will perform our information extraction:

import getpass
import os

if not os.environ.get("ANTHROPIC_API_KEY"):
    os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Enter API key for Anthropic: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("claude-3-7-sonnet-20250219", model_provider="anthropic")

Now, we’ll configure our LLM to return structured output according to our schema:

structured_llm = llm.with_structured_output(schema=Person)

This key step tells the model to format its responses according to our Person schema.
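Behind the scenes, with_structured_output supplies the schema to the model as a tool definition, and the provider returns the fields as JSON arguments that LangChain validates. A rough, stdlib-only sketch of that parsing step, using an invented payload purely for illustration:

```python
import json

# Hypothetical raw tool-call arguments, shaped like a provider response.
raw_arguments = '{"name": "Alan Smith", "hair_color": "blond", "height_in_meters": "1.83"}'

parsed = json.loads(raw_arguments)

# Keep only the fields the Person schema declares; absent keys fall back to None,
# mirroring the Optional[...] defaults in the schema.
schema_fields = ("name", "hair_color", "height_in_meters")
person = {field: parsed.get(field) for field in schema_fields}
print(person)
```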

Let’s test our extraction system with a simple example:

text = "Alan Smith is 6 feet tall and has blond hair."
prompt = prompt_template.invoke({"text": text})
result = structured_llm.invoke(prompt)
print(result)

Now, let’s try a more complex example:

from typing import List


class Data(BaseModel):
    """Container for extracted information about people."""

    people: List[Person] = Field(
        default_factory=list, description="List of people mentioned in the text"
    )


structured_llm = llm.with_structured_output(schema=Data)

text = "My name is Jeff, my hair is black and I am 6 feet tall. Anna has the same color hair as me."
prompt = prompt_template.invoke({"text": text})
result = structured_llm.invoke(prompt)
print(result)

# Next example
text = "The solar system is large, (it was discovered by Nicolaus Copernicus), but earth has only 1 moon."
prompt = prompt_template.invoke({"text": text})
result = structured_llm.invoke(prompt)
print(result)
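The Data container can also be exercised locally to see the nested shape the model must produce. A trimmed redefinition (field descriptions omitted), again assuming pydantic v2:

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    name: Optional[str] = None
    hair_color: Optional[str] = None
    height_in_meters: Optional[str] = None


class Data(BaseModel):
    people: List[Person] = Field(default_factory=list)


# A text mentioning no people should yield an empty list rather than an error.
print(Data().people)  # []

# Two people sharing an attribute still appear as separate entries.
extracted = Data(
    people=[Person(name="Jeff", hair_color="black"), Person(name="Anna", hair_color="black")]
)
print([p.name for p in extracted.people])  # ['Jeff', 'Anna']
```

This is why the second prompt above works: the model can attribute Jeff’s hair color to Anna while still emitting two distinct Person entries, and the Copernicus example can come back as an empty people list.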

In conclusion, this tutorial demonstrates building a structured information extraction system with LangChain and Claude that transforms unstructured text into organized data about people. The approach uses Pydantic schemas, custom prompts, and example-driven improvement without requiring specialized training pipelines. The system’s power comes from its flexibility, domain adaptability, and utilization of advanced LLM reasoning capabilities.


Here is the Colab Notebook.

