MarkTechPost@AI March 20
A Coding Implementation to Build a Document Search Agent (DocSearchAgent) with Hugging Face, ChromaDB, and Langchain

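This article shows how to build a semantic document search engine with Hugging Face embedding models, the ChromaDB vector database, and Sentence Transformers. The engine finds documents by meaning rather than by simple keyword matching. The implementation processes and embeds text documents, stores the embeddings efficiently, and retrieves the documents most semantically similar to any query. The article walks through installing the required libraries, loading a dataset, splitting documents into smaller chunks, creating text embeddings, and setting up ChromaDB. It then implements basic, interactive, and filtered search, ending with a document search engine that offers interactive search, metadata filtering, and relevance ranking.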

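💡**Using Hugging Face embedding models**: The core of the article is converting text into vector representations with Hugging Face embedding models, which enables semantic search beyond the limits of traditional keyword search.

📚**ChromaDB as the vector database**: The article uses ChromaDB to store and retrieve document embeddings efficiently, which is key to building a fast, scalable semantic search engine.

🔍**Interactive and filtered search**: Beyond basic semantic search, the article adds an interactive search interface and metadata filtering, which improve the user experience and make results more precise.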

In today’s information-rich world, finding relevant documents quickly is crucial. Traditional keyword-based search systems often fall short when dealing with semantic meaning. This tutorial demonstrates how to build a powerful document search engine using:

    Hugging Face’s embedding models to convert text into rich vector representations
    Chroma DB as our vector database for efficient similarity search
    Sentence transformers for high-quality text embeddings

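This implementation enables semantic search capabilities – finding documents based on meaning rather than just keyword matching. By the end of this tutorial, you’ll have a working document search engine that can process and embed text documents, store the embeddings efficiently, and retrieve the documents most semantically similar to any query.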

Please follow the detailed steps mentioned below in sequence to implement DocSearchAgent.

First, we need to install the necessary libraries. 

!pip install chromadb sentence-transformers langchain datasets

Let’s start by importing the libraries we’ll use:

import os
import time

import numpy as np
import pandas as pd
from datasets import load_dataset
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter

For this tutorial, we’ll use a subset of Wikipedia articles from the Hugging Face datasets library. This gives us a diverse set of documents to work with.

# Load the first 1,000 articles of the English Wikipedia dump
dataset = load_dataset("wikipedia", "20220301.en", split="train[:1000]")
print(f"Loaded {len(dataset)} Wikipedia articles")

# Keep only the fields needed for search
documents = []
for i, article in enumerate(dataset):
    doc = {
        "id": f"doc{i}",
        "title": article["title"],
        "text": article["text"],
        "url": article["url"]
    }
    documents.append(doc)

df = pd.DataFrame(documents)
df.head(3)

Now, let’s split our documents into smaller chunks for more granular searching:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

chunks = []
chunk_ids = []
chunk_sources = []

for i, doc in enumerate(documents):
    doc_chunks = text_splitter.split_text(doc["text"])
    chunks.extend(doc_chunks)
    # One ID per chunk: document index i, chunk index j
    chunk_ids.extend([f"chunk{i}_{j}" for j in range(len(doc_chunks))])
    # Record the source article title once per chunk
    chunk_sources.extend([doc["title"]] * len(doc_chunks))

print(f"Created {len(chunks)} chunks from {len(documents)} documents")
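To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to split on separators such as paragraph breaks. The helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return pieces


# Hypothetical toy input: the alphabet, cut into 10-character windows sharing 4 characters.
demo_pieces = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
for piece in demo_pieces:
    print(piece)
```

Each window repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.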

We’ll use a pre-trained sentence transformer model from Hugging Face to create our embeddings:

model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(model_name)

# Embed one sentence to check that the model loads and to see the vector size
sample_text = "This is a sample text to test our embedding model."
sample_embedding = embedding_model.encode(sample_text)
print(f"Embedding dimension: {len(sample_embedding)}")
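Embeddings are useful for search because texts with similar meaning tend to map to vectors that point in similar directions. The sketch below compares vectors with cosine similarity using only the standard library. The three-dimensional vectors are made-up stand-ins for real model output, and `cosine_similarity` is a hypothetical helper, not part of sentence-transformers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
toy_query = [0.9, 0.1, 0.0]
toy_docs = {
    "climate report": [0.8, 0.2, 0.1],
    "cooking recipe": [0.0, 0.3, 0.9],
}

toy_scores = {name: cosine_similarity(toy_query, vec) for name, vec in toy_docs.items()}
best_match = max(toy_scores, key=toy_scores.get)
print(best_match)
```

The query vector points almost the same way as the "climate report" vector, so that entry scores highest even though no keywords are compared.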

Now, let’s set up Chroma DB, a lightweight vector database perfect for our search engine:

chroma_client = chromadb.Client()

# Let Chroma embed documents and queries with the same model
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)

collection = chroma_client.create_collection(
    name="document_search",
    embedding_function=embedding_function
)

# Add chunks in batches to keep each insert small
batch_size = 100
for i in range(0, len(chunks), batch_size):
    end_idx = min(i + batch_size, len(chunks))

    batch_ids = chunk_ids[i:end_idx]
    batch_chunks = chunks[i:end_idx]
    batch_sources = chunk_sources[i:end_idx]

    collection.add(
        ids=batch_ids,
        documents=batch_chunks,
        metadatas=[{"source": source} for source in batch_sources]
    )

    print(f"Added batch {i//batch_size + 1}/{(len(chunks)-1)//batch_size + 1} to the collection")

print(f"Total documents in collection: {collection.count()}")
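The batching loop relies on two pieces of arithmetic: the slice `[i:end_idx]` and the ceiling division `(n - 1) // batch_size + 1` used in the progress message. A small standalone sketch, using a hypothetical `batch_ranges` helper and made-up sizes, shows the ranges this produces when the item count is not a multiple of the batch size.

```python
def batch_ranges(total: int, batch_size: int) -> list[tuple[int, int]]:
    """Return (start, end) index pairs covering range(total) in batch_size steps."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [(start, min(start + batch_size, total)) for start in range(0, total, batch_size)]


# Hypothetical sizes: 250 items in batches of 100 leaves a short final batch.
demo_ranges = batch_ranges(total=250, batch_size=100)
print(demo_ranges)

# Same ceiling division as the progress message in the loop above.
expected_batches = (250 - 1) // 100 + 1
print(len(demo_ranges) == expected_batches)
```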

Now comes the exciting part – searching through our documents:

def search_documents(query, n_results=5):
    """
    Search for documents similar to the query.

    Args:
        query (str): The search query
        n_results (int): Number of results to return

    Returns:
        dict: Search results
    """
    start_time = time.time()

    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )

    end_time = time.time()
    search_time = end_time - start_time

    print(f"Search completed in {search_time:.4f} seconds")
    return results


queries = [
    "What are the effects of climate change?",
    "History of artificial intelligence",
    "Space exploration missions"
]

for query in queries:
    print(f"\nQuery: {query}")
    results = search_documents(query)

    # Results are nested per query; index 0 is our single query
    for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
        print(f"\nResult {i+1} from {metadata['source']}:")
        print(f"{doc[:200]}...")
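`collection.query` returns a dictionary whose values are nested lists, one inner list per query text, which is why the loop above indexes `[0]`. The sketch below flattens a hand-written dictionary of that shape into per-hit records. The sample values are invented and `flatten_results` is a hypothetical helper, not part of ChromaDB.

```python
def flatten_results(results: dict, query_index: int = 0) -> list[dict]:
    """Turn the parallel lists for one query into a list of per-hit records."""
    return [
        {"id": hit_id, "text": text, "source": meta["source"], "distance": dist}
        for hit_id, text, meta, dist in zip(
            results["ids"][query_index],
            results["documents"][query_index],
            results["metadatas"][query_index],
            results["distances"][query_index],
        )
    ]


# Hand-written stand-in with the same nesting as a single-query result.
sample_results = {
    "ids": [["chunk0_0", "chunk7_3"]],
    "documents": [["First example passage.", "Second example passage."]],
    "metadatas": [[{"source": "Article A"}, {"source": "Article B"}]],
    "distances": [[0.31, 0.58]],
}

sample_hits = flatten_results(sample_results)
for hit in sample_hits:
    print(hit["source"], hit["distance"])
```

Working with one record per hit can be easier than zipping several parallel lists each time results are displayed.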

Let’s create a simple function to provide a better user experience:

def interactive_search():
    """
    Interactive search interface for the document search engine.
    """
    while True:
        query = input("\nEnter your search query (or 'quit' to exit): ")

        if query.lower() == 'quit':
            print("Exiting search interface...")
            break

        n_results = int(input("How many results would you like? "))

        results = search_documents(query, n_results)

        print(f"\nFound {len(results['documents'][0])} results for '{query}':")

        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        )):
            # Rough relevance score: a smaller distance means a closer match
            relevance = 1 - distance

            print(f"\n--- Result {i+1} ---")
            print(f"Source: {metadata['source']}")
            print(f"Relevance: {relevance:.2f}")
            print(f"Excerpt: {doc[:300]}...")
            print("-" * 50)


interactive_search()
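How `1 - distance` should be read depends on the distance metric the collection uses. Assuming the collection uses a squared-L2 distance and the embeddings are unit-length (both are assumptions here, so check your own configuration), distance and cosine similarity are related by d = 2 - 2·cos, which means `1 - distance` can drop below zero. The standard-library sketch below checks that identity on made-up unit vectors; `l2_squared` and `cosine_from_l2_squared` are hypothetical helpers.

```python
import math


def l2_squared(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_l2_squared(distance: float) -> float:
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos, so cos = 1 - d/2."""
    return 1.0 - distance / 2.0


# Two made-up unit vectors that are 60 degrees apart.
unit_a = [1.0, 0.0]
unit_b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]

demo_distance = l2_squared(unit_a, unit_b)
demo_cosine = cosine_from_l2_squared(demo_distance)
print(round(demo_distance, 6), round(demo_cosine, 6))
```

Under these assumptions, two vectors 60 degrees apart have a squared distance of about 1, so `1 - distance` would report roughly 0 while the cosine similarity is about 0.5. Treat the printed relevance as a ranking aid rather than a calibrated score.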

Let’s add the ability to filter our search results by metadata:

def filtered_search(query, filter_source=None, n_results=5):
    """
    Search with optional filtering by source.

    Args:
        query (str): The search query
        filter_source (str): Optional source to filter by
        n_results (int): Number of results to return

    Returns:
        dict: Search results
    """
    # Only build a where clause when a source filter was given
    where_clause = {"source": filter_source} if filter_source else None

    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        where=where_clause
    )

    return results


unique_sources = list(set(chunk_sources))
print(f"Available sources for filtering: {len(unique_sources)}")
print(unique_sources[:5])

if len(unique_sources) > 0:
    filter_source = unique_sources[0]
    query = "main concepts and principles"

    print(f"\nFiltered search for '{query}' in source '{filter_source}':")
    results = filtered_search(query, filter_source=filter_source)

    for i, doc in enumerate(results['documents'][0]):
        print(f"\nResult {i+1}:")
        print(f"{doc[:200]}...")

In conclusion, we demonstrated how to build a semantic document search engine using Hugging Face embedding models and ChromaDB. By transforming text into vector representations, the system retrieves documents based on meaning rather than just keywords. The implementation processes Wikipedia articles, chunks them for granularity, embeds them using sentence transformers, and stores them in a vector database for efficient retrieval. The final product features interactive searching, metadata filtering, and relevance ranking.



The post A Coding Implementation to Build a Document Search Agent (DocSearchAgent) with Hugging Face, ChromaDB, and Langchain appeared first on MarkTechPost.
