OpenAI Cookbook

 

This article shows how to use OpenAI's Responses API and the file search tool to quickly retrieve information from PDF files and build a knowledge base. By parsing the PDFs, chunking them, embedding the chunks, and storing the embeddings, users can conveniently pull relevant content out of the knowledge base. The article walks through creating a vector store, uploading PDF files, and using the file search tool to retrieve information. Finally, it builds an evaluation dataset to measure the quality of the retrieval results, helping users build an effective knowledge-retrieval system.

📚 **Create a vector store:** First, create a vector store in the OpenAI API to hold the content of the PDF files. Through an API call you can specify the store's name and obtain its ID, preparing for the subsequent file uploads and retrieval.

📤 **Upload PDF files:** Upload the PDF files to the vector store. The API automatically reads each PDF, splits its content into multiple text chunks, generates embeddings, and stores the embeddings and text in the vector store for later querying.

🔍 **Use the file search tool:** Use the file search tool in OpenAI's Responses API to retrieve relevant content from the vector store based on the user's query. The tool pulls relevant information out of the knowledge base and generates an answer grounded in the retrieved content.

📊 **Evaluate retrieval results:** To measure retrieval quality, the article describes how to build an evaluation dataset. By generating questions and computing different metrics, you can assess how the retrieval system performs in a given scenario and optimize it accordingly.

Although RAG can be overwhelming, searching across PDF files shouldn't be complicated. One of the most widely adopted approaches today is to parse your PDFs, define a chunking strategy, upload those chunks to a storage provider, run embeddings on those chunks of text, and store the embeddings in a vector database. And that's only the setup: retrieving content in our LLM workflow also requires multiple steps.

This is where file search — a hosted tool you can use in the Responses API — comes in. It allows you to search your knowledge base and generate an answer based on the retrieved content. In this cookbook, we'll create a small set of questions based on PDFs extracted from OpenAI's blog (openai.com/news), upload those PDFs to a vector store on OpenAI, and use file search to fetch additional context from this vector store to answer those questions.

File search was previously available on the Assistants API. It's now available in the new Responses API, an API that can be stateful or stateless, and it comes with new features like metadata filtering.
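To give a feel for metadata filtering, here is a minimal sketch (not from this cookbook; the IDs and the "category" attribute are illustrative placeholders): it assumes you attach an attributes map when adding a file to the vector store, then pass a filters object alongside the file_search tool so only matching files are searched.

from openai import OpenAI

client = OpenAI()

# Attach attributes to a file when adding it to a vector store...
client.vector_stores.files.create(
    vector_store_id="vs_123",                 # placeholder vector store ID
    file_id="file-123",                       # placeholder file ID
    attributes={"category": "blog"},          # illustrative attribute, not used elsewhere in this cookbook
)

# ...then restrict file_search to files whose attributes match a filter
response = client.responses.create(
    model="gpt-4o-mini",
    input="What's Deep Research?",
    tools=[{
        "type": "file_search",
        "vector_store_ids": ["vs_123"],
        "filters": {"type": "eq", "key": "category", "value": "blog"},
    }],
)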

!pip install PyPDF2 pandas tqdm openai -q
from openai import OpenAI
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
import concurrent
import PyPDF2
import os
import pandas as pd
import base64

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

dir_pdfs = 'openai_blog_pdfs'  # have those PDFs stored locally here
pdf_files = [os.path.join(dir_pdfs, f) for f in os.listdir(dir_pdfs)]

We will create a Vector Store on the OpenAI API and upload our PDFs to it. OpenAI will read those PDFs, split the content into multiple chunks of text, run embeddings on those chunks, and store the embeddings and the text in the Vector Store. This will let us query the Vector Store for content relevant to a given query.

def upload_single_pdf(file_path: str, vector_store_id: str):
    file_name = os.path.basename(file_path)
    try:
        file_response = client.files.create(file=open(file_path, 'rb'), purpose="assistants")
        attach_response = client.vector_stores.files.create(
            vector_store_id=vector_store_id,
            file_id=file_response.id
        )
        return {"file": file_name, "status": "success"}
    except Exception as e:
        print(f"Error with {file_name}: {str(e)}")
        return {"file": file_name, "status": "failed", "error": str(e)}

def upload_pdf_files_to_vector_store(vector_store_id: str):
    pdf_files = [os.path.join(dir_pdfs, f) for f in os.listdir(dir_pdfs)]
    stats = {"total_files": len(pdf_files), "successful_uploads": 0, "failed_uploads": 0, "errors": []}

    print(f"{len(pdf_files)} PDF files to process. Uploading in parallel...")

    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        futures = {executor.submit(upload_single_pdf, file_path, vector_store_id): file_path for file_path in pdf_files}
        for future in tqdm(concurrent.futures.as_completed(futures), total=len(pdf_files)):
            result = future.result()
            if result["status"] == "success":
                stats["successful_uploads"] += 1
            else:
                stats["failed_uploads"] += 1
                stats["errors"].append(result)

    return stats

def create_vector_store(store_name: str) -> dict:
    try:
        vector_store = client.vector_stores.create(name=store_name)
        details = {
            "id": vector_store.id,
            "name": vector_store.name,
            "created_at": vector_store.created_at,
            "file_count": vector_store.file_counts.completed
        }
        print("Vector store created:", details)
        return details
    except Exception as e:
        print(f"Error creating vector store: {e}")
        return {}
store_name = "openai_blog_store"
vector_store_details = create_vector_store(store_name)
upload_pdf_files_to_vector_store(vector_store_details["id"])
Vector store created: {'id': 'vs_67d06b9b9a9c8191bafd456cf2364ce3', 'name': 'openai_blog_store', 'created_at': 1741712283, 'file_count': 0}
21 PDF files to process. Uploading in parallel...
100%|███████████████████████████████| 21/21 [00:09<00:00,  2.32it/s]
{'total_files': 21, 'successful_uploads': 21, 'failed_uploads': 0, 'errors': []}

Now that our vector store is ready, we can query it directly and retrieve relevant content for a specific query. Using the new vector store search API, we can find relevant items in our knowledge base without necessarily integrating the results into an LLM call.

query = "What's Deep Research?"
search_results = client.vector_stores.search(
    vector_store_id=vector_store_details['id'],
    query=query
)
for result in search_results.data:
    print(str(len(result.content[0].text)) + ' of character of content from ' + result.filename + ' with a relevant score of ' + str(result.score))
3502 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.9813588865322393
3493 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.9522476825143714
3634 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.9397930296526796
2774 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.9101975747303771
3474 of character of content from Deep research System Card _ OpenAI.pdf with a relevant score of 0.9036647613464299
3123 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.887120981288272
3343 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.8448454849432881
3262 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.791345286655509
3271 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.7485530025091963
2721 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.734033360849088

We can see that chunks of different sizes (and, under the hood, different texts) have been returned by the search query. They each have a different relevance score, calculated by our ranker, which uses hybrid search.
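If you want to act on these scores directly, one simple pattern is to drop low-scoring chunks client-side before handing the text to a model. A minimal sketch, reusing the search_results object from above (the 0.8 threshold is an arbitrary choice, not something this cookbook prescribes):

# Keep only chunks whose relevance score clears an (arbitrary) threshold
SCORE_THRESHOLD = 0.8
relevant_chunks = [r for r in search_results.data if r.score >= SCORE_THRESHOLD]

# Concatenate the surviving chunk texts into a single context string
context = "\n\n".join(r.content[0].text for r in relevant_chunks)
print(f"Kept {len(relevant_chunks)} chunks ({len(context)} characters of context)")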

However, instead of querying the vector store and then passing the retrieved data into a Responses or Chat Completions API call, an even more convenient way to use these search results in an LLM query is to plug in the file_search tool as part of the OpenAI Responses API.

query = "What's Deep Research?"
response = client.responses.create(
    input=query,
    model="gpt-4o-mini",
    tools=[{
        "type": "file_search",
        "vector_store_ids": [vector_store_details['id']],
    }]
)

# Extract annotations from the response
annotations = response.output[1].content[0].annotations

# Get top-k retrieved filenames
retrieved_files = set([result.filename for result in annotations])

print(f'Files used: {retrieved_files}')
print('Response:')
print(response.output[1].content[0].text)  # output[0] being the file search call, output[1] the model's message
Files used: {'Introducing deep research _ OpenAI.pdf'}
Response:
Deep Research is a new capability introduced by OpenAI that allows users to conduct complex, multi-step research tasks on the internet efficiently. Key features include:

1. **Autonomous Research**: Deep Research acts as an independent agent that synthesizes vast amounts of information across the web, enabling users to receive comprehensive reports similar to those produced by a research analyst.
2. **Multi-Step Reasoning**: It performs deep analysis by finding, interpreting, and synthesizing data from various sources, including text, images, and PDFs.
3. **Application Areas**: Especially useful for professionals in fields such as finance, science, policy, and engineering, as well as for consumers seeking detailed information for purchases.
4. **Efficiency**: The output is fully documented with citations, making it easy to verify information, and it significantly speeds up research processes that would otherwise take hours for a human to complete.
5. **Limitations**: While Deep Research enhances research capabilities, it is still subject to limitations, such as potential inaccuracies in information retrieval and challenges in distinguishing authoritative data from unreliable sources.

Overall, Deep Research marks a significant advancement toward automated general intelligence (AGI) by improving access to thorough and precise research outputs.

We can see that gpt-4o-mini was able to answer a query that required more recent, specialised knowledge about OpenAI's Deep Research. It used content from the file Introducing deep research _ OpenAI.pdf, whose chunks of text were the most relevant. If we want to go even deeper in the analysis of the chunks retrieved, we can also inspect the texts returned by the search engine by adding include=["output[*].file_search_call.search_results"] to our query.
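As a quick sketch of that option: the include value is taken verbatim from the text above, while the assumption here is that the retrieved chunks are then exposed on the file_search call item (output[0]) as a results list.

response = client.responses.create(
    input=query,
    model="gpt-4o-mini",
    tools=[{
        "type": "file_search",
        "vector_store_ids": [vector_store_details['id']],
    }],
    include=["output[*].file_search_call.search_results"],
)

# Inspect the chunks the file_search call retrieved (assumed here to live on output[0].results)
for result in response.output[0].results or []:
    print(result.filename, result.score)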

What is key for these information retrieval systems is to also measure the relevance & quality of the files retrieved for those answers. The following steps of this cookbook consist of generating an evaluation dataset and calculating different metrics over it. This is an imperfect approach, and we'd always recommend having a human-verified evaluation dataset for your own use cases, but it shows the methodology for evaluating retrieval. It will be imperfect because some of the generated questions might be generic (e.g., "What's said by the main stakeholder in this document?") and our retrieval test will have a hard time figuring out which document that question was generated for.

We will create functions that read through the PDFs we have locally and generate, for each one, a question that can only be answered by that document. This creates the evaluation dataset we can use afterwards.

def extract_text_from_pdf(pdf_path):
    text = ""
    try:
        with open(pdf_path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text
    except Exception as e:
        print(f"Error reading {pdf_path}: {e}")
    return text

def generate_questions(pdf_path):
    text = extract_text_from_pdf(pdf_path)

    prompt = (
        "Can you generate a question that can only be answered from this document?:\n"
        f"{text}\n\n"
    )

    response = client.responses.create(
        input=prompt,
        model="gpt-4o",
    )

    question = response.output[0].content[0].text
    return question

If we run the function generate_questions on the first PDF file, we can see the kind of question it generates.

generate_questions(pdf_files[0])
'What new capabilities will ChatGPT have as a result of the partnership between OpenAI and Schibsted Media Group?'

We can now generate all the questions for all the PDFs we've got stored locally.

# Generate questions for each PDF and store in a dictionary
questions_dict = {}
for pdf_path in pdf_files:
    questions = generate_questions(pdf_path)
    questions_dict[os.path.basename(pdf_path)] = questions
{'OpenAI partners with Schibsted Media Group _ OpenAI.pdf': 'What is the purpose of the partnership between Schibsted Media Group and OpenAI announced on February 10, 2025?',
 'OpenAI and the CSU system bring AI to 500,000 students & faculty _ OpenAI.pdf': 'What significant milestone did the California State University system achieve by partnering with OpenAI, making it the first of its kind in the United States?',
 '1,000 Scientist AI Jam Session _ OpenAI.pdf': 'What was the specific AI model used during the "1,000 Scientist AI Jam Session" event across the nine national labs?',
 'Announcing The Stargate Project _ OpenAI.pdf': 'What are the initial equity funders and lead partners in The Stargate Project announced by OpenAI, and who holds the financial and operational responsibilities?',
 'Introducing Operator _ OpenAI.pdf': 'What is the name of the new model that powers the Operator agent introduced by OpenAI?',
 'Introducing NextGenAI _ OpenAI.pdf': 'What major initiative did OpenAI launch on March 4, 2025, and which research institution from Europe is involved as a founding partner?',
 'Introducing the Intelligence Age _ OpenAI.pdf': "What is the name of the video generation tool used by OpenAI's creative team to help produce their Super Bowl ad?",
 'Operator System Card _ OpenAI.pdf': 'What is the preparedness score for the "Cybersecurity" category according to the Operator System Card?',
 'Strengthening America’s AI leadership with the U.S. National Laboratories _ OpenAI.pdf': "What is the purpose of OpenAI's agreement with the U.S. National Laboratories as described in the document?",
 'OpenAI GPT-4.5 System Card _ OpenAI.pdf': 'What is the Preparedness Framework rating for "Cybersecurity" for GPT-4.5 according to the system card?',
 'Partnering with Axios expands OpenAI’s work with the news industry _ OpenAI.pdf': "What is the goal of OpenAI's new content partnership with Axios as announced in the document?",
 'OpenAI and Guardian Media Group launch content partnership _ OpenAI.pdf': 'What is the main purpose of the partnership between OpenAI and Guardian Media Group announced on February 14, 2025?',
 'Introducing GPT-4.5 _ OpenAI.pdf': 'What is the release date of the GPT-4.5 research preview?',
 'Introducing data residency in Europe _ OpenAI.pdf': 'What are the benefits of data residency in Europe for new ChatGPT Enterprise and Edu customers according to the document?',
 'The power of personalized AI _ OpenAI.pdf': 'What is the purpose of the "Model Spec" document published by OpenAI for ChatGPT?',
 'Disrupting malicious uses of AI _ OpenAI.pdf': "What is OpenAI's mission as stated in the document?",
 'Sharing the latest Model Spec _ OpenAI.pdf': 'What is the release date of the latest Model Spec mentioned in the document?',
 'Deep research System Card _ OpenAI.pdf': "What specific publication date is mentioned in the Deep Research System Card for when the report on deep research's preparedness was released?",
 'Bertelsmann powers creativity and productivity with OpenAI _ OpenAI.pdf': 'What specific AI-powered solutions is Bertelsmann planning to implement for its divisions RTL Deutschland and Penguin Random House according to the document?',
 'OpenAI’s Economic Blueprint _ OpenAI.pdf': 'What date and location is scheduled for the kickoff event of OpenAI\'s "Innovating for America" initiative as mentioned in the Economic Blueprint document?',
 'Introducing deep research _ OpenAI.pdf': 'What specific model powers the "deep research" capability in ChatGPT that is discussed in this document, and what are its main features designed for?'}

We now have a dictionary of filename → question pairs that we can loop through, asking gpt-4o(-mini) each question without providing the document; file search should be able to find the relevant document in the Vector Store.

We'll convert our dictionary into rows and process them using gpt-4o-mini, checking whether the expected file appears among the retrieved results and at which rank.

rows = []
for filename, query in questions_dict.items():
    rows.append({"query": query, "_id": filename.replace(".pdf", "")})

# Metrics evaluation parameters
k = 5
total_queries = len(rows)
correct_retrievals_at_k = 0
reciprocal_ranks = []
average_precisions = []

def process_query(row):
    query = row['query']
    expected_filename = row['_id'] + '.pdf'
    # Call file_search via Responses API
    response = client.responses.create(
        input=query,
        model="gpt-4o-mini",
        tools=[{
            "type": "file_search",
            "vector_store_ids": [vector_store_details['id']],
            "max_num_results": k,
        }],
        tool_choice="required"  # forces the file_search call; while not necessary, it's better to enforce it as this is what we're testing
    )
    # Extract annotations from the response
    annotations = None
    if hasattr(response.output[1], 'content') and response.output[1].content:
        annotations = response.output[1].content[0].annotations
    elif hasattr(response.output[1], 'annotations'):
        annotations = response.output[1].annotations

    if annotations is None:
        print(f"No annotations for query: {query}")
        return False, 0, 0

    # Get top-k retrieved filenames
    retrieved_files = [result.filename for result in annotations[:k]]
    if expected_filename in retrieved_files:
        rank = retrieved_files.index(expected_filename) + 1
        rr = 1 / rank
        correct = True
    else:
        rr = 0
        correct = False

    # Calculate Average Precision
    precisions = []
    num_relevant = 0
    for i, fname in enumerate(retrieved_files):
        if fname == expected_filename:
            num_relevant += 1
            precisions.append(num_relevant / (i + 1))
    avg_precision = sum(precisions) / len(precisions) if precisions else 0

    if expected_filename not in retrieved_files:
        print("Expected file NOT found in the retrieved files!")

    if retrieved_files and retrieved_files[0] != expected_filename:
        print(f"Query: {query}")
        print(f"Expected file: {expected_filename}")
        print(f"First retrieved file: {retrieved_files[0]}")
        print(f"Retrieved files: {retrieved_files}")
        print("-" * 50)

    return correct, rr, avg_precision
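As a quick sanity check before running the whole set, we can call the function defined above on a single row (a minimal sketch; it simply prints the three values process_query returns for the first question):

# Run the evaluation on a single row first
correct, rr, avg_precision = process_query(rows[0])
print(f"Correct: {correct}, Reciprocal Rank: {rr}, Average Precision: {avg_precision}")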

Recall & Precision are 1 for this example, and since our file ranked first, MRR and MAP are also 1 on this example.

We can now execute this processing on our set of questions.

with ThreadPoolExecutor() as executor:
    results = list(tqdm(executor.map(process_query, rows), total=total_queries))

correct_retrievals_at_k = 0
reciprocal_ranks = []
average_precisions = []

for correct, rr, avg_precision in results:
    if correct:
        correct_retrievals_at_k += 1
    reciprocal_ranks.append(rr)
    average_precisions.append(avg_precision)

recall_at_k = correct_retrievals_at_k / total_queries
precision_at_k = recall_at_k  # In this context, same as recall
mrr = sum(reciprocal_ranks) / total_queries
map_score = sum(average_precisions) / total_queries
 62%|███████████████████▏           | 13/21 [00:07<00:03,  2.57it/s]
Expected file NOT found in the retrieved files!
Query: What is OpenAI's mission as stated in the document?
Expected file: Disrupting malicious uses of AI _ OpenAI.pdf
First retrieved file: Introducing the Intelligence Age _ OpenAI.pdf
Retrieved files: ['Introducing the Intelligence Age _ OpenAI.pdf']
--------------------------------------------------
 71%|██████████████████████▏        | 15/21 [00:14<00:06,  1.04s/it]
Expected file NOT found in the retrieved files!
Query: What is the purpose of the "Model Spec" document published by OpenAI for ChatGPT?
Expected file: The power of personalized AI _ OpenAI.pdf
First retrieved file: Sharing the latest Model Spec _ OpenAI.pdf
Retrieved files: ['Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf']
--------------------------------------------------
100%|███████████████████████████████| 21/21 [00:15<00:00,  1.38it/s]

The outputs logged above show either that a file wasn't ranked first when our evaluation dataset expected it to be, or that it wasn't retrieved at all. As expected from our imperfect evaluation dataset, some questions were generic and matched another document, so our retrieval system didn't retrieve the specific document those questions were generated from.

# Print the metrics with k
print(f"Metrics at k={k}:")
print(f"Recall@{k}: {recall_at_k:.4f}")
print(f"Precision@{k}: {precision_at_k:.4f}")
print(f"Mean Reciprocal Rank (MRR): {mrr:.4f}")
print(f"Mean Average Precision (MAP): {map_score:.4f}")
Metrics at k=5:
Recall@5: 0.9048
Precision@5: 0.9048
Mean Reciprocal Rank (MRR): 0.9048
Mean Average Precision (MAP): 0.8954
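For context on these numbers: with 21 queries and the two misses logged above, Recall@5 works out to 19/21 ≈ 0.9048. Because exactly one document is relevant per query, Precision@5 mirrors Recall@5 in this setup, and an MRR equal to that same value implies that whenever the expected file was retrieved at all, it was ranked first.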

With this cookbook we were able to see how to:

- Generate a dataset of evaluations using PDF context-stuffing (leveraging the vision modality of 4o) and traditional PDF readers
- Create a vector store and populate it with PDFs
- Get an LLM answer to a query, leveraging a RAG system available out of the box with the file_search tool call in OpenAI's Responses API
- Understand how chunks of text are retrieved, ranked, and used as part of the Responses API
- Measure recall, precision, MRR, and MAP on the previously generated dataset of evaluations

By using file search with the Responses API, you can simplify your RAG architecture and leverage it in a single API call. File storage, embeddings, and retrieval are all integrated in one tool!
