MarkTechPost@AI · March 20, 13:11
Building a Retrieval-Augmented Generation (RAG) System with FAISS and Open-Source LLMs

This article describes how to use the RAG (Retrieval-Augmented Generation) architecture to extend the capabilities of large language models (LLMs) and mitigate their hallucination problem. It walks through the steps of building a RAG system with FAISS as the vector database, Sentence Transformers for creating high-quality embeddings, and an open-source LLM from Hugging Face. By combining an LLM's generative ability with the factual accuracy of a retrieval system, RAG can answer questions grounded in specific documents with improved accuracy and relevance, making it well suited to applications such as domain-expert assistants and customer-support systems.

💡 RAG (Retrieval-Augmented Generation) is an AI architecture that combines information retrieval with text generation. It aims to reduce the hallucinations common in purely generative approaches and to provide up-to-date information without retraining the model.

📚 A RAG system's workflow typically includes: converting the user query into an embedding vector; retrieving similar documents or passages from a knowledge base via vector similarity; supplying the retrieved content to the language model as context; and having the language model generate a response informed by both its parameters and the retrieved information.

🧮 Vector databases are database systems purpose-built to store, manage, and efficiently search vector embeddings. They are essential for machine learning applications, particularly those involving natural language processing and image recognition, and they support fast similarity search, a variety of distance metrics, and scaling to billions of vectors.

Retrieval-augmented generation (RAG) has emerged as a powerful paradigm for enhancing the capabilities of large language models (LLMs). By combining LLMs’ creative generation abilities with retrieval systems’ factual accuracy, RAG offers a solution to one of LLMs’ most persistent challenges: hallucination.

In this tutorial, we’ll build a complete RAG system using:

- FAISS (Facebook AI Similarity Search) as our vector database for efficient similarity search
- Sentence Transformers for creating high-quality embeddings
- An open-source LLM from Hugging Face (a lightweight model that runs on CPU)

By the end of this tutorial, you’ll have a functioning RAG system that can answer questions based on your documents with improved accuracy and relevance. This approach is valuable for building domain-specific assistants, customer support systems, or any application where grounding LLM responses in specific documents is important.

Let us get started.

Step 1: Setting Up Our Environment

First, we need to install all the required libraries. For this tutorial, we’ll use Google Colab.

# Install required packages
!pip install -q transformers==4.34.0
!pip install -q sentence-transformers==2.2.2
!pip install -q faiss-cpu==1.7.4
!pip install -q accelerate==0.23.0
!pip install -q einops==0.7.0
!pip install -q langchain==0.0.312
!pip install -q langchain_community
!pip install -q pypdf==3.15.1

Let’s also check if we have access to a GPU, which will speed up our model inference:

import torch

# Check if GPU is available
print(f"GPU available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
else:
    print("Running on CPU. We'll use a CPU-compatible model.")

Step 2: Creating Our Knowledge Base

For this tutorial, we’ll create a simple knowledge base about AI concepts. In a real-world scenario, you would instead load PDF documents, web pages, or database records (we sketch PDF loading right after the code below).

import os
import tempfile

# Create a temporary directory for our documents
docs_dir = tempfile.mkdtemp()
print(f"Created temporary directory at {docs_dir}")

# Create sample documents about AI concepts
documents = {
    "vector_databases.txt": """
    Vector databases are specialized database systems designed to store, manage, and search vector embeddings efficiently.
    They are crucial for machine learning applications, particularly those involving natural language processing and image recognition.

    Key features of vector databases include:
    1. Fast similarity search using algorithms like HNSW, IVF, or exact search
    2. Support for various distance metrics (cosine, euclidean, dot product)
    3. Scalability for handling billions of vectors
    4. Often support for metadata filtering alongside vector search

    Popular vector databases include FAISS (Facebook AI Similarity Search), Pinecone, Weaviate, Milvus, and Chroma.
    FAISS specifically was developed by Facebook AI Research and is an open-source library for efficient similarity search.
    """,

    "embeddings.txt": """
    Embeddings are dense vector representations of data in a continuous vector space.
    They capture semantic meaning and relationships between entities by positioning similar items closer together in the vector space.

    Types of embeddings include:
    1. Word embeddings (Word2Vec, GloVe)
    2. Sentence embeddings (Universal Sentence Encoder, SBERT)
    3. Document embeddings
    4. Image embeddings
    5. Audio embeddings

    Embeddings are created through various techniques, including neural networks trained on specific tasks.
    Modern embedding models like those from OpenAI, Cohere, or Sentence Transformers can capture nuanced semantic relationships.

    The dimensionality of embeddings typically ranges from 100 to 1536 dimensions, with higher dimensions often capturing more information but requiring more storage and computation.
    """,

    "rag_systems.txt": """
    Retrieval-Augmented Generation (RAG) is an AI architecture that combines information retrieval with text generation.

    The RAG process typically works as follows:
    1. User query is converted into an embedding vector
    2. Similar documents or passages are retrieved from a knowledge base using vector similarity
    3. Retrieved content is provided as context to the language model
    4. The language model generates a response informed by both its parameters and the retrieved information

    Benefits of RAG include:
    1. Reduced hallucination compared to pure generative approaches
    2. Up-to-date information without model retraining
    3. Attribution of information sources
    4. Lower computation costs than increasing model size

    RAG systems can be enhanced through techniques like reranking, query reformulation, and hybrid search approaches.
    """
}

# Write documents to files
for filename, content in documents.items():
    with open(os.path.join(docs_dir, filename), 'w') as f:
        f.write(content)

print(f"Created {len(documents)} documents in {docs_dir}")
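As a side note, if your source material lives in PDFs rather than plain-text files, a loader such as PyPDFLoader from langchain_community (backed by the pypdf package installed in Step 1) produces the same Document objects, so the rest of the pipeline is unchanged. The snippet below is a minimal sketch, and the file name my_report.pdf is only a placeholder:

from langchain_community.document_loaders import PyPDFLoader

# Hypothetical path -- replace with your own PDF file
pdf_loader = PyPDFLoader("my_report.pdf")

# Each page becomes a Document with page-level metadata
pdf_documents = pdf_loader.load()
print(f"Loaded {len(pdf_documents)} pages from the PDF")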

Step 3: Loading and Processing Documents

Now, let’s load these documents and process them for our RAG system:

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize a list to store our documents
all_documents = []

# Load each text file
for filename in documents.keys():
    file_path = os.path.join(docs_dir, filename)
    loader = TextLoader(file_path)
    loaded_docs = loader.load()
    all_documents.extend(loaded_docs)

print(f"Loaded {len(all_documents)} documents")

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)
document_chunks = text_splitter.split_documents(all_documents)
print(f"Created {len(document_chunks)} document chunks")

# Let's look at a sample chunk
print("\nSample chunk content:")
print(document_chunks[0].page_content)
print(f"Source: {document_chunks[0].metadata}")

Step 4: Creating Embeddings

Now, let’s convert our document chunks into vector embeddings:

from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize the embedding model
model_name = "sentence-transformers/all-MiniLM-L6-v2"  # A good balance of speed and quality
embedding_model = SentenceTransformer(model_name)

print(f"Loaded embedding model: {model_name}")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

# Create embeddings for all document chunks
texts = [doc.page_content for doc in document_chunks]
embeddings = embedding_model.encode(texts)

print(f"Created {len(embeddings)} embeddings with shape {embeddings.shape}")
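As an optional sanity check (not part of the tutorial's pipeline), you can confirm that semantically related texts land close together in the embedding space. The sketch below uses the cos_sim utility from sentence_transformers on three illustrative sentences:

from sentence_transformers import util

# Two related sentences and one unrelated sentence (illustrative examples)
sample_texts = [
    "FAISS enables fast similarity search over vectors.",
    "Vector databases support efficient nearest-neighbor lookups.",
    "The weather in Paris is mild in spring.",
]
sample_embeddings = embedding_model.encode(sample_texts)

# The first two sentences should score higher with each other than with the third
similarity_matrix = util.cos_sim(sample_embeddings, sample_embeddings)
print(similarity_matrix)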

Step 5: Building the FAISS Index

Now we’ll build our FAISS index with these embeddings:

import faiss

# Get the dimensionality of our embeddings
dimension = embeddings.shape[1]

# Create a FAISS index - we'll use a simple Flat L2 index for demonstration
# For larger datasets, consider using indexes like IVF or HNSW for better performance
index = faiss.IndexFlatL2(dimension)  # L2 is Euclidean distance

# Add our vectors to the index
index.add(embeddings.astype(np.float32))  # FAISS requires float32

print(f"Created FAISS index with {index.ntotal} vectors")

# Create a mapping from index position to document chunk for retrieval
index_to_doc_chunk = {i: doc for i, doc in enumerate(document_chunks)}
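The index above lives only in memory, so it would have to be rebuilt (and the corpus re-embedded) on every run. If you want to reuse it across sessions, FAISS can serialize an index to disk with write_index and load it back with read_index; the file name below is just an illustrative choice:

# Persist the index so it can be reloaded without re-embedding the corpus
faiss.write_index(index, "ai_concepts.index")  # hypothetical file name

# Later, or in another session, reload it
reloaded_index = faiss.read_index("ai_concepts.index")
print(f"Reloaded index with {reloaded_index.ntotal} vectors")

Note that the index stores only the vectors; the index_to_doc_chunk mapping must be saved separately (for example with pickle) if you want to resume retrieval later.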

Step 6: Loading a Language Model

Now let’s load an open-source language model from Hugging Face. We’ll use a smaller model that works well on CPU:

from transformers import AutoTokenizer, AutoModelForCausalLM

# We'll use a smaller model that works on CPU
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # Use float32 for CPU compatibility
    device_map="auto"  # Will use CPU if GPU is not available
)

print(f"Successfully loaded {model_id}")
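Before wiring the model into the retrieval pipeline, a quick generation helps confirm it loaded correctly. This optional smoke test is a sketch (not part of the original tutorial) and uses the same <|system|>/<|user|>/<|assistant|> prompt format that the RAG prompt below relies on:

# Optional smoke test: generate a short reply to confirm the model works
test_prompt = "<|system|>\nYou are a helpful assistant.\n<|user|>\nSay hello in one sentence.\n<|assistant|>\n"
test_ids = tokenizer(test_prompt, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    test_output = model.generate(test_ids, max_new_tokens=30, do_sample=False)

print(tokenizer.decode(test_output[0], skip_special_tokens=True))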

Step 7: Creating Our RAG Pipeline

Let’s create a function that combines retrieval and generation:

def rag_response(query, index, embedding_model, llm_model, llm_tokenizer, index_to_doc_map, top_k=3):
    """
    Generate a response using the RAG pattern.

    Args:
        query: The user's question
        index: FAISS index
        embedding_model: Model to create embeddings
        llm_model: Language model for generation
        llm_tokenizer: Tokenizer for the language model
        index_to_doc_map: Mapping from index positions to document chunks
        top_k: Number of documents to retrieve

    Returns:
        response: The generated response
        sources: The source documents used
    """
    # Step 1: Convert query to embedding
    query_embedding = embedding_model.encode([query])
    query_embedding = query_embedding.astype(np.float32)  # Convert to float32 for FAISS

    # Step 2: Search for similar documents
    distances, indices = index.search(query_embedding, top_k)

    # Step 3: Retrieve the actual document chunks
    retrieved_docs = [index_to_doc_map[idx] for idx in indices[0]]

    # Create context from retrieved documents
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])

    # Step 4: Create prompt for the LLM (TinyLlama format)
    prompt = f"""<|system|>
You are a helpful AI assistant. Answer the question based only on the provided context.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."

Context:
{context}
<|user|>
{query}
<|assistant|>"""

    # Step 5: Generate response from LLM
    input_ids = llm_tokenizer(prompt, return_tensors="pt").input_ids.to(llm_model.device)

    generation_config = {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.95,
        "do_sample": True
    }

    # Generate the output
    with torch.no_grad():
        output = llm_model.generate(
            input_ids=input_ids,
            **generation_config
        )

    # Decode the output
    generated_text = llm_tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract the assistant's response (remove the prompt)
    response = generated_text.split("<|assistant|>")[-1].strip()

    # Return both the response and the sources
    sources = [(doc.page_content, doc.metadata) for doc in retrieved_docs]

    return response, sources

Step 8: Testing Our RAG System

Let’s test our system with some questions:

# Define some test questions
test_questions = [
    "What is FAISS and what is it used for?",
    "How do embeddings capture semantic meaning?",
    "What are the benefits of RAG systems?",
    "How does vector search work?"
]

# Test our RAG pipeline
for question in test_questions:
    print(f"\n\n{'='*50}")
    print(f"Question: {question}")
    print(f"{'='*50}\n")

    response, sources = rag_response(
        query=question,
        index=index,
        embedding_model=embedding_model,
        llm_model=model,
        llm_tokenizer=tokenizer,
        index_to_doc_map=index_to_doc_chunk,
        top_k=2  # Retrieve top 2 most relevant chunks
    )

    print(f"Response: {response}\n")
    print("Sources:")
    for i, (content, metadata) in enumerate(sources):
        print(f"\nSource {i+1}:")
        print(f"Metadata: {metadata}")
        print(f"Content snippet: {content[:100]}...")

Output:

Step 9: Evaluating and Improving Our RAG System

Let’s implement a simple evaluation function to assess the performance of our RAG system:

def evaluate_rag_response(question, response, retrieved_sources, ground_truth_sources=None):
    """
    Simple evaluation of RAG response quality

    Args:
        question: The query
        response: Generated response
        retrieved_sources: Sources used for generation
        ground_truth_sources: (Optional) Known correct sources

    Returns:
        evaluation metrics
    """
    # Basic metrics
    response_length = len(response.split())
    num_sources = len(retrieved_sources)

    # Simple relevance score - we'd use better methods in production
    source_relevance = []
    for content, _ in retrieved_sources:
        # Count overlapping words between question and source
        q_words = set(question.lower().split())
        s_words = set(content.lower().split())
        overlap = len(q_words.intersection(s_words))
        source_relevance.append(overlap / len(q_words) if q_words else 0)

    avg_relevance = sum(source_relevance) / len(source_relevance) if source_relevance else 0

    return {
        "response_length": response_length,
        "num_sources": num_sources,
        "source_relevance_scores": source_relevance,
        "avg_relevance": avg_relevance
    }

# Evaluate one of our previous responses
question = test_questions[0]
response, sources = rag_response(
    query=question,
    index=index,
    embedding_model=embedding_model,
    llm_model=model,
    llm_tokenizer=tokenizer,
    index_to_doc_map=index_to_doc_chunk,
    top_k=2
)

# Run evaluation
eval_results = evaluate_rag_response(question, response, sources)

print(f"\nEvaluation results for question: '{question}'")
for metric, value in eval_results.items():
    print(f"{metric}: {value}")
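Word overlap is a coarse relevance signal: it misses synonyms and paraphrases entirely. One simple drop-in improvement (a sketch, not part of the original tutorial) is to score each retrieved source by the cosine similarity between its embedding and the question embedding, reusing the embedding_model from Step 4:

from sentence_transformers import util

def embedding_relevance(question, retrieved_sources, embedder):
    """Score each retrieved (content, metadata) source by cosine similarity to the question embedding."""
    question_emb = embedder.encode([question])
    source_embs = embedder.encode([content for content, _ in retrieved_sources])
    # util.cos_sim returns a (1, num_sources) matrix of cosine similarities
    scores = util.cos_sim(question_emb, source_embs)[0]
    return [float(s) for s in scores]

# Example usage with the sources retrieved above
print(embedding_relevance(question, sources, embedding_model))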

Step 10: Advanced RAG Techniques – Query Expansion

Let’s implement query expansion to improve retrieval:

# Here's the implementation of the expand_query function:
def expand_query(original_query, llm_model, llm_tokenizer):
    """
    Generate multiple search queries from an original query to improve retrieval

    Args:
        original_query: The user's original question
        llm_model: The language model for generating variations
        llm_tokenizer: Tokenizer for the language model

    Returns:
        List of query variations including the original
    """
    # Create a prompt for query expansion
    prompt = f"""<|system|>
You are a helpful assistant. Generate two alternative versions of the given search query.
The goal is to create variations that might help retrieve relevant information.
Only list the alternative queries, one per line. Do not include any explanations, numbering, or other text.
<|user|>
Generate alternative versions of this search query: "{original_query}"
<|assistant|>"""

    # Generate variations
    input_ids = llm_tokenizer(prompt, return_tensors="pt").input_ids.to(llm_model.device)

    with torch.no_grad():
        output = llm_model.generate(
            input_ids=input_ids,
            max_new_tokens=100,
            temperature=0.7,
            do_sample=True
        )

    # Decode the output
    generated_text = llm_tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract the generated variations
    response_part = generated_text.split("<|assistant|>")[-1].strip()

    # Split response by lines to get individual variations
    variations = [line.strip() for line in response_part.split('\n') if line.strip()]

    # Ensure we have at least some variations
    if not variations:
        variations = [original_query]

    # Add the original query and return the list with duplicates removed
    all_queries = [original_query] + variations
    return list(dict.fromkeys(all_queries))  # Remove duplicates while preserving order

Step 11: Using Query Expansion in Our RAG Pipeline

Let’s test the expand_query function and use the resulting query variations to retrieve documents and generate a response:

# Example usage of expand_query function
test_query = "How does FAISS help with vector search?"

# Generate query variations
expanded_queries = expand_query(
    original_query=test_query,
    llm_model=model,
    llm_tokenizer=tokenizer
)

print(f"Original Query: {test_query}")
print("Expanded Queries:")
for i, query in enumerate(expanded_queries):
    print(f"  {i+1}. {query}")

# Enhanced RAG with query expansion
all_retrieved_docs = []
all_scores = {}

# Retrieve documents for each query variation
for query in expanded_queries:
    # Get query embedding
    query_embedding = embedding_model.encode([query]).astype(np.float32)

    # Search in FAISS index
    distances, indices = index.search(query_embedding, 3)

    # Track document scores across queries (using 1/(1+distance) as score)
    for idx, dist in zip(indices[0], distances[0]):
        score = 1.0 / (1.0 + dist)
        if idx in all_scores:
            # Take max score if document retrieved by multiple query variations
            all_scores[idx] = max(all_scores[idx], score)
        else:
            all_scores[idx] = score

# Get top documents based on scores
top_indices = sorted(all_scores.keys(), key=lambda idx: all_scores[idx], reverse=True)[:3]
expanded_retrieved_docs = [index_to_doc_chunk[idx] for idx in top_indices]

print("\nRetrieved documents using query expansion:")
for i, doc in enumerate(expanded_retrieved_docs):
    print(f"\nResult {i+1}:")
    print(f"Source: {doc.metadata['source']}")
    print(f"Content snippet: {doc.page_content[:150]}...")

# Now use these documents with the LLM to generate a response
context = "\n\n".join([doc.page_content for doc in expanded_retrieved_docs])

# Create prompt for the LLM
prompt = f"""<|system|>
You are a helpful AI assistant. Answer the question based only on the provided context.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."

Context:
{context}
<|user|>
{test_query}
<|assistant|>"""

# Generate response
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.95,
        do_sample=True
    )

# Extract response
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
response = generated_text.split("<|assistant|>")[-1].strip()

print("\nFinal RAG Response with Query Expansion:")
print(response)

Output:

FAISS can handle a wide range of vector types, including text, image, and audio, and can be integrated with popular machine learning frameworks such as TensorFlow, PyTorch, and Sklearn.

Conclusion

In this tutorial, we have built a complete RAG system using FAISS as our vector database and an open-source LLM. We implemented document processing, embedding generation, and vector indexing, wired these components into a retrieval-and-generation pipeline, and improved retrieval quality with query expansion.

Further, we can consider enhancements such as reranking retrieved chunks, richer query reformulation, and hybrid keyword-plus-vector search.

Useful resources:


Here is the Colab Notebook.

The post Building a Retrieval-Augmented Generation (RAG) System with FAISS and Open-Source LLMs appeared first on MarkTechPost.
