MarkTechPost@AI May 18, 11:20
How to Build a Powerful and Intelligent Question-Answering System by Using Tavily Search API, Chroma, Google Gemini LLMs, and the LangChain Framework

This article shows how to build a powerful question-answering system with the LangChain framework, the Tavily Search API, the Chroma database, and Google Gemini LLMs. The system combines Tavily's real-time web search, Chroma's semantic document caching, and the contextual response generation provided by the Gemini model. These tools are integrated through LangChain's modular components such as RunnableLambda, ChatPromptTemplate, and ConversationBufferMemory. The system also introduces a hybrid retrieval mechanism that checks cached embeddings first and only then performs a fresh web search. Retrieved documents are intelligently formatted, summarized, and passed to a structured LLM prompt, with attention to source attribution, user history, and confidence scoring. This pipeline suits advanced use cases such as research assistance, domain-specific summarization, and intelligent agents.

🌐 **Real-time web search integration**: The system uses the Tavily Search API for live web searches to ensure access to the latest information, combined with a Chroma database for semantic document caching that improves retrieval efficiency.

🧠 **Hybrid retrieval mechanism**: The system first looks for cached embeddings in the Chroma database and only calls Tavily for a fresh web search when no relevant cached results are found, streamlining the retrieval flow.

🗣️ **Intelligent document processing**: Retrieved documents are intelligently formatted and summarized before being passed to a structured LLM prompt, with attention to source attribution, user history, and confidence scoring, which improves the quality and trustworthiness of the output.

🛠️ **Use of the LangChain framework**: LangChain's modular components (such as RunnableLambda, ChatPromptTemplate, and ConversationBufferMemory) allow these tools to be integrated flexibly, making the system easier to extend and maintain.

In this tutorial, we demonstrate how to build a powerful and intelligent question-answering system by combining the strengths of Tavily Search API, Chroma, Google Gemini LLMs, and the LangChain framework. The pipeline leverages real-time web search using Tavily, semantic document caching with Chroma vector store, and contextual response generation through the Gemini model. These tools are integrated through LangChain’s modular components, such as RunnableLambda, ChatPromptTemplate, ConversationBufferMemory, and GoogleGenerativeAIEmbeddings. It goes beyond simple Q&A by introducing a hybrid retrieval mechanism that checks for cached embeddings before invoking fresh web searches. The retrieved documents are intelligently formatted, summarized, and passed through a structured LLM prompt, with attention to source attribution, user history, and confidence scoring. Key functions such as advanced prompt engineering, sentiment and entity analysis, and dynamic vector store updates make this pipeline suitable for advanced use cases like research assistance, domain-specific summarization, and intelligent agents.

!pip install -qU langchain-community tavily-python langchain-google-genai streamlit matplotlib pandas tiktoken chromadb langchain_core pydantic langchain

We install and upgrade a comprehensive set of libraries required to build an advanced AI search assistant. It includes tools for retrieval (tavily-python, chromadb), LLM integration (langchain-google-genai, langchain), data handling (pandas, pydantic), visualization (matplotlib, streamlit), and tokenization (tiktoken). These components form the core foundation for constructing a real-time, context-aware QA system.

import os
import getpass
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import json
import time
from typing import List, Dict, Any, Optional
from datetime import datetime

We import essential Python libraries used throughout the notebook. It includes standard libraries for environment variables, secure input, time tracking, and data types (os, getpass, time, typing, datetime). Additionally, it brings in core data science tools like pandas, matplotlib, and numpy for data handling, visualization, and numerical computations, as well as json for parsing structured data.

if "TAVILY_API_KEY" not in os.environ:    os.environ["TAVILY_API_KEY"] = getpass.getpass("Enter Tavily API key: ")   if "GOOGLE_API_KEY" not in os.environ:    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter Google API key: ")import logginglogging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')logger = logging.getLogger(__name__)

We securely initialize API keys for Tavily and Google Gemini by prompting users only if they’re not already set in the environment, ensuring safe and repeatable access to external services. It also configures a standardized logging setup using Python’s logging module, which helps monitor execution flow and capture debug or error messages throughout the notebook.

from langchain_community.retrievers import TavilySearchAPIRetriever
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.memory import ConversationBufferMemory

We import key components from the LangChain ecosystem and its integrations. It brings in the TavilySearchAPIRetriever for real-time web search, Chroma for vector storage, and GoogleGenerativeAI modules for chat and embedding models. Core LangChain modules like ChatPromptTemplate, RunnableLambda, ConversationBufferMemory, and output parsers enable flexible prompt construction, memory handling, and pipeline execution.

class SearchQueryError(Exception):
    """Exception raised for errors in the search query."""
    pass

def format_docs(docs):
    formatted_content = []
    for i, doc in enumerate(docs):
        metadata = doc.metadata
        source = metadata.get('source', 'Unknown source')
        title = metadata.get('title', 'Untitled')
        score = metadata.get('score', 0)

        formatted_content.append(
            f"Document {i+1} [Score: {score:.2f}]:\n"
            f"Title: {title}\n"
            f"Source: {source}\n"
            f"Content: {doc.page_content}\n"
        )
    return "\n\n".join(formatted_content)

We define two essential components for search and document handling. The SearchQueryError class creates a custom exception to manage invalid or failed search queries gracefully. The format_docs function processes a list of retrieved documents by extracting metadata such as title, source, and relevance score and formatting them into a clean, readable string.
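
As a quick illustration, here is a hypothetical smoke test (the titles, sources, and scores below are invented) that builds two in-memory Document objects and prints the string format_docs produces for them:

sample_docs = [
    Document(
        page_content="Breath of the Wild launched alongside the Nintendo Switch in March 2017.",
        metadata={"title": "Launch overview", "source": "https://example.com/botw-launch", "score": 0.91},
    ),
    Document(
        page_content="The game received widespread critical acclaim for its open-ended design.",
        metadata={"title": "Critical reception", "source": "https://example.com/botw-reviews", "score": 0.87},
    ),
]

# Each document is rendered with its index, score, title, source, and content,
# separated by blank lines so the LLM can later cite "Document 1", "Document 2", etc.
print(format_docs(sample_docs))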

class SearchResultsParser:
    def parse(self, text):
        try:
            if isinstance(text, str):
                import re
                import json
                json_match = re.search(r'{.*}', text, re.DOTALL)
                if json_match:
                    json_str = json_match.group(0)
                    return json.loads(json_str)
                return {"answer": text, "sources": [], "confidence": 0.5}
            elif hasattr(text, 'content'):
                return {"answer": text.content, "sources": [], "confidence": 0.5}
            else:
                return {"answer": str(text), "sources": [], "confidence": 0.5}
        except Exception as e:
            logger.warning(f"Failed to parse JSON: {e}")
            return {"answer": str(text), "sources": [], "confidence": 0.5}

The SearchResultsParser class provides a robust method for extracting structured information from LLM responses. It attempts to parse a JSON-like string from the model output, falling back to a plain-text response format if parsing fails. It gracefully handles both string outputs and message objects, ensuring consistent downstream processing. In case of errors, it logs a warning and returns a fallback response containing the raw answer, empty sources, and a default confidence score, enhancing the system’s fault tolerance.
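
To sanity-check the parser's two main paths, a small hypothetical test (the input strings are made up) can be run directly:

parser = SearchResultsParser()

# A response that embeds JSON: the regex extracts the object and json.loads() it.
print(parser.parse('Model output: {"answer": "2017", "sources": ["Document 1"], "confidence": 0.9}'))

# A plain-text response: no JSON is found, so the default structure is returned.
print(parser.parse("Breath of the Wild was released in 2017."))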

class EnhancedTavilyRetriever:
    def __init__(self, api_key=None, max_results=5, search_depth="advanced", include_domains=None, exclude_domains=None):
        self.api_key = api_key
        self.max_results = max_results
        self.search_depth = search_depth
        self.include_domains = include_domains or []
        self.exclude_domains = exclude_domains or []
        self.retriever = self._create_retriever()
        self.previous_searches = []

    def _create_retriever(self):
        try:
            return TavilySearchAPIRetriever(
                api_key=self.api_key,
                k=self.max_results,
                search_depth=self.search_depth,
                include_domains=self.include_domains,
                exclude_domains=self.exclude_domains
            )
        except Exception as e:
            logger.error(f"Failed to create Tavily retriever: {e}")
            raise

    def invoke(self, query, **kwargs):
        if not query or not query.strip():
            raise SearchQueryError("Empty search query")

        try:
            start_time = time.time()
            results = self.retriever.invoke(query, **kwargs)
            end_time = time.time()

            search_record = {
                "timestamp": datetime.now().isoformat(),
                "query": query,
                "num_results": len(results),
                "response_time": end_time - start_time
            }
            self.previous_searches.append(search_record)

            return results
        except Exception as e:
            logger.error(f"Search failed: {e}")
            raise SearchQueryError(f"Failed to perform search: {str(e)}")

    def get_search_history(self):
        return self.previous_searches

The EnhancedTavilyRetriever class is a custom wrapper around the TavilySearchAPIRetriever, adding greater flexibility, control, and traceability to search operations. It supports advanced features like limiting search depth, domain inclusion/exclusion filters, and configurable result counts. The invoke method performs web searches and tracks each query’s metadata (timestamp, response time, and result count), storing it for later analysis.
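
A rough usage sketch (this performs a live Tavily call, so it assumes TAVILY_API_KEY is set; the query is arbitrary):

news_retriever = EnhancedTavilyRetriever(max_results=3, search_depth="basic")

docs = news_retriever.invoke("latest LangChain release notes")
print(f"Retrieved {len(docs)} documents")

# Every call is recorded, so timing and result counts can be inspected afterwards.
print(news_retriever.get_search_history()[-1])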

class SearchCache:
    def __init__(self):
        self.embedding_function = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
        self.vector_store = None
        self.text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

    def add_documents(self, documents):
        if not documents:
            return

        try:
            if self.vector_store is None:
                self.vector_store = Chroma.from_documents(
                    documents=documents,
                    embedding=self.embedding_function
                )
            else:
                self.vector_store.add_documents(documents)
        except Exception as e:
            logger.error(f"Failed to add documents to cache: {e}")

    def search(self, query, k=3):
        if self.vector_store is None:
            return []

        try:
            return self.vector_store.similarity_search(query, k=k)
        except Exception as e:
            logger.error(f"Vector search failed: {e}")
            return []

The SearchCache class implements a semantic caching layer that stores and retrieves documents using vector embeddings for efficient similarity search. It uses GoogleGenerativeAIEmbeddings to convert documents into dense vectors and stores them in a Chroma vector database. The add_documents method initializes or updates the vector store, while the search method enables fast retrieval of the most relevant cached documents based on semantic similarity. This reduces redundant API calls and improves response times for repeated or related queries, serving as a lightweight hybrid memory layer in the AI assistant pipeline.
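
A minimal sketch of the caching behaviour, assuming GOOGLE_API_KEY is set since adding documents calls the Gemini embedding API (the document content below is invented):

cache = SearchCache()

cache.add_documents([
    Document(
        page_content="Chroma stores embeddings locally for fast similarity search.",
        metadata={"title": "Chroma notes", "source": "https://example.com/chroma"},
    ),
])

# Semantically similar queries are now answered from the cache instead of the web.
for doc in cache.search("how does Chroma cache embeddings?", k=1):
    print(doc.metadata.get("title"), "->", doc.page_content[:60])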

search_cache = SearchCache()
enhanced_retriever = EnhancedTavilyRetriever(max_results=5)
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

system_template = """You are a research assistant that provides accurate answers based on the search results provided.
Follow these guidelines:
1. Only use the context provided to answer the question
2. If the context doesn't contain the answer, say "I don't have sufficient information to answer this question."
3. Cite your sources by referencing the document numbers
4. Don't make up information
5. Keep the answer concise but complete

Context: {context}
Chat History: {chat_history}"""

system_message = SystemMessagePromptTemplate.from_template(system_template)
human_template = "Question: {question}"
human_message = HumanMessagePromptTemplate.from_template(human_template)
prompt = ChatPromptTemplate.from_messages([system_message, human_message])

We initialize the core components of the AI assistant: a semantic SearchCache, the EnhancedTavilyRetriever for web-based querying, and a ConversationBufferMemory to retain chat history across turns. It also defines a structured prompt using ChatPromptTemplate, guiding the LLM to act as a research assistant. The prompt enforces strict rules for factual accuracy, context usage, source citation, and concise answering, ensuring reliable and grounded responses.
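
To verify the prompt wiring before running the full chain, the template can be rendered with placeholder values (the context and question below are dummies):

preview = prompt.invoke({
    "context": "Document 1 [Score: 0.90]:\nTitle: Example\nSource: https://example.com\nContent: ...",
    "question": "What does the example say?",
    "chat_history": [],
})

# Prints the system and human messages that would be sent to the Gemini model.
for message in preview.to_messages():
    print(f"[{message.type}] {message.content[:80]}")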

def get_llm(model_name="gemini-2.0-flash-lite", temperature=0.2, response_mode="json"):
    try:
        return ChatGoogleGenerativeAI(
            model=model_name,
            temperature=temperature,
            convert_system_message_to_human=True,
            top_p=0.95,
            top_k=40,
            max_output_tokens=2048
        )
    except Exception as e:
        logger.error(f"Failed to initialize LLM: {e}")
        raise

output_parser = SearchResultsParser()

We define the get_llm function, which initializes a Google Gemini language model with configurable parameters such as model name, temperature, and decoding settings (e.g., top_p, top_k, and max tokens). It ensures robustness with error handling for failed model initialization. An instance of SearchResultsParser is also created to standardize and structure the LLM’s raw responses, enabling consistent downstream processing of answers and metadata.

def plot_search_metrics(search_history):
    if not search_history:
        print("No search history available")
        return

    df = pd.DataFrame(search_history)

    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    plt.plot(range(len(df)), df['response_time'], marker='o')
    plt.title('Search Response Times')
    plt.xlabel('Search Index')
    plt.ylabel('Time (seconds)')
    plt.grid(True)

    plt.subplot(1, 2, 2)
    plt.bar(range(len(df)), df['num_results'])
    plt.title('Number of Results per Search')
    plt.xlabel('Search Index')
    plt.ylabel('Number of Results')
    plt.grid(True)

    plt.tight_layout()
    plt.show()

The plot_search_metrics function visualizes performance trends from past queries using Matplotlib. It converts the search history into a DataFrame and draws two subplots: one showing response time per search and the other displaying the number of results returned. This aids in analyzing the system’s efficiency and search quality over time, helping developers fine-tune the retriever or identify bottlenecks in real-world usage.
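
Since the function only expects a list of dicts with query, num_results, and response_time keys, it can be exercised with synthetic records (the values below are fabricated) before any real searches are run:

fake_history = [
    {"timestamp": "2025-05-18T11:20:00", "query": "botw release year", "num_results": 5, "response_time": 1.42},
    {"timestamp": "2025-05-18T11:21:00", "query": "botw reception", "num_results": 4, "response_time": 0.97},
    {"timestamp": "2025-05-18T11:22:00", "query": "botw sales", "num_results": 3, "response_time": 1.10},
]

# Draws the two subplots: response time per search and result count per search.
plot_search_metrics(fake_history)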

def retrieve_with_fallback(query):
    cached_results = search_cache.search(query)

    if cached_results:
        logger.info(f"Retrieved {len(cached_results)} documents from cache")
        return cached_results

    logger.info("No cache hit, performing web search")
    search_results = enhanced_retriever.invoke(query)

    search_cache.add_documents(search_results)

    return search_results

def summarize_documents(documents, query):
    llm = get_llm(temperature=0)

    summarize_prompt = ChatPromptTemplate.from_template(
        """Create a concise summary of the following documents related to this query: {query}

        {documents}

        Provide a comprehensive summary that addresses the key points relevant to the query.
        """
    )

    chain = (
        {"documents": lambda docs: format_docs(docs), "query": lambda _: query}
        | summarize_prompt
        | llm
        | StrOutputParser()
    )

    return chain.invoke(documents)

These two functions enhance the assistant’s intelligence and efficiency. The retrieve_with_fallback function implements a hybrid retrieval mechanism: it first attempts to fetch semantically relevant documents from the local Chroma cache and, if unsuccessful, falls back to a real-time Tavily web search, caching the new results for future use. Meanwhile, summarize_documents leverages a Gemini LLM to generate concise summaries from retrieved documents, guided by a structured prompt that ensures relevance to the query. Together, they enable low-latency, informative, and context-aware responses.
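
A hedged usage sketch (both API keys are assumed to be set, and the queries are arbitrary): the first call typically misses the cache and triggers a Tavily search, while a semantically similar follow-up should be served from Chroma:

docs = retrieve_with_fallback("breath of the wild metacritic score")
print(f"First call returned {len(docs)} documents")

# A semantically similar follow-up should now hit the vector cache.
cached = retrieve_with_fallback("what score did breath of the wild get on metacritic?")
print(f"Second call returned {len(cached)} documents")

# Condense the retrieved documents into a query-focused summary with Gemini.
print(summarize_documents(docs, "breath of the wild metacritic score"))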

def advanced_chain(query_engine="enhanced", model="gemini-1.5-pro", include_history=True):
    llm = get_llm(model_name=model)

    if query_engine == "enhanced":
        retriever = lambda query: retrieve_with_fallback(query)
    else:
        retriever = enhanced_retriever.invoke

    def chain_with_history(input_dict):
        query = input_dict["question"]
        chat_history = memory.load_memory_variables({})["chat_history"] if include_history else []

        docs = retriever(query)
        context = format_docs(docs)

        result = prompt.invoke({
            "context": context,
            "question": query,
            "chat_history": chat_history
        })

        # Generate the answer first, then persist the exchange in conversation memory.
        response = llm.invoke(result)
        memory.save_context({"input": query}, {"output": response.content})

        return response

    return RunnableLambda(chain_with_history) | StrOutputParser()

The advanced_chain function defines a modular, end-to-end reasoning workflow for answering user queries using cached or real-time search. It initializes the specified Gemini model, selects the retrieval strategy (cached fallback or direct search), constructs a response pipeline incorporating chat history (if enabled), formats documents into context, and prompts the LLM using a system-guided template. The chain also logs the interaction in memory and returns the final answer, parsed into clean text. This design enables flexible experimentation with models and retrieval strategies while maintaining conversation coherence.

qa_chain = advanced_chain()

def analyze_query(query):
    llm = get_llm(temperature=0)

    analysis_prompt = ChatPromptTemplate.from_template(
        """Analyze the following query and provide:
        1. Main topic
        2. Sentiment (positive, negative, neutral)
        3. Key entities mentioned
        4. Query type (factual, opinion, how-to, etc.)

        Query: {query}

        Return the analysis in JSON format with the following structure:
        {{
            "topic": "main topic",
            "sentiment": "sentiment",
            "entities": ["entity1", "entity2"],
            "type": "query type"
        }}
        """
    )

    # Convert the model reply to text, then extract the JSON with SearchResultsParser
    # (piping the bound method lets LangChain wrap it in a RunnableLambda automatically).
    chain = analysis_prompt | llm | StrOutputParser() | output_parser.parse

    return chain.invoke({"query": query})

print("Advanced Tavily-Gemini Implementation")
print("="*50)
query = "what year was breath of the wild released and what was its reception?"
print(f"Query: {query}")

We initialize the final components of the intelligent assistant. qa_chain is the assembled reasoning pipeline ready to process user queries using retrieval, memory, and Gemini-based response generation. The analyze_query function performs a lightweight semantic analysis on a query, extracting the main topic, sentiment, entities, and query type using the Gemini model and a structured JSON prompt. The example query, about Breath of the Wild’s release and reception, showcases how the assistant is triggered and prepared for full-stack inference and semantic interpretation. The printed heading marks the start of interactive execution.

try:
    print("\nSearching for answer...")
    answer = qa_chain.invoke({"question": query})
    print("\nAnswer:")
    print(answer)

    print("\nAnalyzing query...")
    try:
        query_analysis = analyze_query(query)
        print("\nQuery Analysis:")
        print(json.dumps(query_analysis, indent=2))
    except Exception as e:
        print(f"Query analysis error (non-critical): {e}")
except Exception as e:
    print(f"Error in search: {e}")

history = enhanced_retriever.get_search_history()
print("\nSearch History:")
for i, h in enumerate(history):
    print(f"{i+1}. Query: {h['query']} - Results: {h['num_results']} - Time: {h['response_time']:.2f}s")

print("\nAdvanced search with domain filtering:")
specialized_retriever = EnhancedTavilyRetriever(
    max_results=3,
    search_depth="advanced",
    include_domains=["nintendo.com", "zelda.com"],
    exclude_domains=["reddit.com", "twitter.com"]
)

try:
    specialized_results = specialized_retriever.invoke("breath of the wild sales")
    print(f"Found {len(specialized_results)} specialized results")

    summary = summarize_documents(specialized_results, "breath of the wild sales")
    print("\nSummary of specialized results:")
    print(summary)
except Exception as e:
    print(f"Error in specialized search: {e}")

print("\nSearch Metrics:")
plot_search_metrics(history)

We demonstrate the complete pipeline in action. It performs a search using the qa_chain, displays the generated answer, and then analyzes the query for sentiment, topic, entities, and type. It also retrieves and prints each query’s search history, response time, and result count. Also, it runs a domain-filtered search focused on Nintendo-related sites, summarizes the results, and visualizes search performance using plot_search_metrics, offering a comprehensive view of the assistant’s capabilities in real-time use.

In conclusion, following this tutorial gives users a comprehensive blueprint for creating a highly capable, context-aware, and scalable RAG system that bridges real-time web intelligence with conversational AI. The Tavily Search API lets users directly pull fresh and relevant content from the web. The Gemini LLM adds robust reasoning and summarization capabilities, while LangChain’s abstraction layer allows seamless orchestration between memory, embeddings, and model outputs. The implementation includes advanced features such as domain-specific filtering, query analysis (sentiment, topic, and entity extraction), and fallback strategies using a semantic vector cache built with Chroma and GoogleGenerativeAIEmbeddings. Also, structured logging, error handling, and analytics dashboards provide transparency and diagnostics for real-world deployment.


Check out the Colab Notebook. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.

The post How to Build a Powerful and Intelligent Question-Answering System by Using Tavily Search API, Chroma, Google Gemini LLMs, and the LangChain Framework appeared first on MarkTechPost.
