In the previous article we looked at how to interact with the model. The next problem is how to load data and search it in a vector database. Here too, LangChain provides a set of tools.
Document Loaders
Loading PDFs
pip install langchain-community pymupdf
```python
from dotenv import load_dotenv

load_dotenv('../.env')

from langchain_community.document_loaders import PyMuPDFLoader


def load_pdf():
    loader = PyMuPDFLoader('../data/deepseek-v3-1-4.pdf')
    pages = loader.load_and_split()  # one Document per page
    print(pages[0].page_content)
    return pages  # returned so the splitting example below can reuse it


if __name__ == '__main__':
    load_pdf()
```
Output: the text of the PDF's first page is printed.
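Each item returned by the loader is a `Document` that carries metadata alongside the text. A quick way to inspect it is shown below; the exact metadata keys are an assumption (PDF loaders typically record things like the source path and page number) and may vary by version:

```python
pages = load_pdf()
# A Document exposes .page_content (the text) and .metadata (a dict);
# for PyMuPDFLoader the metadata usually includes 'source' and 'page'
print(pages[0].metadata)
```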
Loading CSV
```python
from langchain_community.document_loaders import CSVLoader


def load_csv():
    loader = CSVLoader('../data/test.csv')
    data = loader.load()  # one Document per CSV row
    for record in data[:2]:
        print(record)
```
Output: the first two rows, each wrapped as a `Document`.
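If the file is not a plain comma-separated CSV, `CSVLoader` accepts a `csv_args` dict that is forwarded to Python's `csv.DictReader`. A sketch, assuming a semicolon-delimited file with no header row (the column names here are made up):

```python
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(
    '../data/test.csv',
    csv_args={
        'delimiter': ';',                 # assumed dialect
        'fieldnames': ['name', 'score'],  # hypothetical column names
    },
)
data = loader.load()
```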
More loaders are described in the official documentation.
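For instance, web pages can be loaded through the same interface. A minimal sketch with the community `WebBaseLoader` (the URL is a placeholder; the loader needs `beautifulsoup4` installed to parse HTML):

```python
# pip install beautifulsoup4
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader('https://example.com')
docs = loader.load()  # HTML stripped down to plain-text Documents
print(docs[0].page_content[:200])
```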
Document Splitting
pip install --upgrade langchain-text-splitters
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter


def split_doc():
    pages = load_pdf()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=200,        # maximum characters per chunk
        chunk_overlap=100,     # characters shared between adjacent chunks
        length_function=len,
        add_start_index=True,  # record each chunk's start offset in metadata
    )
    paragraphs = text_splitter.create_documents([pages[0].page_content])
    for para in paragraphs:
        print(para.page_content)
        print('-------')
```
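To get a feel for what `chunk_size` and `chunk_overlap` actually do, it helps to run the splitter on a short throwaway string. A sketch (the sentence and the numbers are chosen only to make the overlap visible):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=20,    # each chunk holds at most 20 characters
    chunk_overlap=8,  # the tail of one chunk reappears at the head of the next
)
text = 'LangChain splits long documents into overlapping chunks for retrieval.'
for chunk in splitter.split_text(text):
    print(repr(chunk))
```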
Writing to the Vector Database and Retrieving
```python
# requires: pip install faiss-cpu langchain-openai
from dotenv import load_dotenv

load_dotenv('../.env')

from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyMuPDFLoader

# Load the document
loader = PyMuPDFLoader("../data/deepseek-v3-1-4.pdf")
pages = loader.load_and_split()

# Split it into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)
texts = text_splitter.create_documents(
    [page.page_content for page in pages[:4]]
)

# Embed the chunks and write them to the vector store
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
db = FAISS.from_documents(texts, embeddings)

# Retrieve the top-3 results
retriever = db.as_retriever(search_kwargs={"k": 3})
docs = retriever.invoke("deepseek-v3代码能力怎么样")  # "How good is deepseek-v3 at coding?"
for doc in docs:
    print(doc.page_content)
    print('===============')
```
Output: the three retrieved chunks, separated by `===============`.
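Re-embedding everything on each run costs API calls; the FAISS wrapper can persist the index to disk and load it back. A sketch (newer LangChain versions require explicitly opting in to pickle deserialization when loading):

```python
# Writes index.faiss and index.pkl into the given folder
db.save_local('faiss_index')

# Reload later without re-embedding; the flag acknowledges that the
# .pkl part is restored with pickle, which is only safe for files you trust
db = FAISS.load_local(
    'faiss_index',
    embeddings,
    allow_dangerous_deserialization=True,
)
```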
Note that what LangChain provides is only a wrapper interface over vector databases; see: python.langchain.com/docs/integr…
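Because the interface is uniform, swapping the backing store is mostly a one-line change. A sketch replacing FAISS with Chroma (assuming the `langchain-chroma` package; `texts` and `embeddings` are the objects built above):

```python
# pip install langchain-chroma
from langchain_chroma import Chroma

# Only the store class changes; splitting, embedding and retrieval stay the same
db = Chroma.from_documents(texts, embeddings)
retriever = db.as_retriever(search_kwargs={'k': 3})
docs = retriever.invoke('deepseek-v3代码能力怎么样')
```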
There is not much else that needs special explanation here; following the official documentation is enough.