MarkTechPost@AI · April 5, 09:19
Building Your AI Q&A Bot for Webpages Using Open Source AI Models

This article shows how to build a question-answering system for webpage content using open-source AI models. The system is built on open-source models from Hugging Face, runs on Google Colab with no local setup, and is completely free. Given a URL, it extracts the page's content and provides accurate answers to the user's questions. The article walks through environment setup, library imports, model loading, the implementation of the Q&A functionality, and test examples. Readers will learn how to use modern NLP techniques to extract information from online content more effectively.

💻 Setup: First, create a new notebook on Google Colab and install the required libraries, including transformers, torch, beautifulsoup4, and requests, used respectively for the NLP models, the deep learning framework, HTML parsing, and web requests.

⚙️ Building the functionality: The core is a function that takes a URL as input, extracts the webpage content, and uses a pre-trained question-answering model to answer the user's questions. When the content is long, the function splits it into chunks to fit the model's context-length limit.

🤖 Model choice: The article uses the deepset/roberta-base-squad2 model, which is based on the RoBERTa architecture and fine-tuned on the SQuAD 2.0 dataset, striking a good balance between accuracy and speed.

💡 Testing and improvements: The article provides test examples showing how the system answers common questions about artificial intelligence. It also notes the system's limitations, such as handling very long webpages and understanding complex questions, and suggests future improvements such as semantic search, document summarization, and multilingual support.

In today’s information-rich digital landscape, navigating extensive web content can be overwhelming. Whether you’re researching for a project, studying complex material, or trying to extract specific information from lengthy articles, the process can be time-consuming and inefficient. This is where an AI-powered Question-Answering (Q&A) bot becomes invaluable.

This tutorial will guide you through building a practical AI Q&A system that can analyze webpage content and answer specific questions. Instead of relying on expensive API services, we'll use open-source models from Hugging Face to create a solution that's:

- Completely free to use
- Runnable in Google Colab, with no local setup required
- Built entirely on open-source components

By the end of this tutorial, you’ll have a functional web Q&A system that can help you extract insights from online content more efficiently.

What We’ll Build

We’ll create a system that:

- Takes a URL as input
- Extracts and processes the webpage content
- Accepts natural language questions about the content
- Provides accurate, contextual answers based on the webpage
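In code terms, the finished system reduces to two calls, using the functions we define in Steps 2 and 4 below:

# End-to-end usage of the two functions built in this tutorial
webpage_text = extract_text_from_url("https://en.wikipedia.org/wiki/Artificial_intelligence")
print(answer_question("What are the main goals of AI research?", webpage_text))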

Prerequisites

All you need is a Google account to run the notebook in Colab; no local installation or paid API keys are required.

Step 1: Setting Up the Environment

First, let’s create a new Google Colab notebook. Go to Google Colab and create a new notebook.

Let’s start by installing the necessary libraries:

# Install required packages
!pip install transformers torch beautifulsoup4 requests

This installs:

- transformers: Hugging Face's library for pre-trained NLP models
- torch: the PyTorch deep learning framework
- beautifulsoup4: for parsing HTML
- requests: for making HTTP requests

Step 2: Import Libraries and Set Up Basic Functions

Now let’s import all the necessary libraries and define some helper functions:

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import requests
from bs4 import BeautifulSoup
import re
import textwrap

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Function to extract readable text from a webpage
def extract_text_from_url(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'html.parser')

        # Drop non-content elements such as scripts, styles, and navigation
        for script_or_style in soup(['script', 'style', 'header', 'footer', 'nav']):
            script_or_style.decompose()

        text = soup.get_text()

        # Normalize whitespace and remove empty lines
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = '\n'.join(chunk for chunk in chunks if chunk)
        text = re.sub(r'\s+', ' ', text).strip()

        return text
    except Exception as e:
        print(f"Error extracting text from URL: {e}")
        return None

This code:

- Imports all necessary libraries
- Sets up our device (GPU if available, otherwise CPU)
- Creates a function to extract readable text content from a webpage URL
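Before wiring the extractor into the Q&A flow, it's worth a quick manual check on any public page (the URL below is just a placeholder):

# Quick check: print the first 200 characters of the extracted text
sample = extract_text_from_url("https://example.com")
if sample:
    print(sample[:200])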

Step 3: Load the Question-Answering Model

Now let’s load a pre-trained question-answering model from Hugging Face:

# Load pre-trained model and tokenizer
model_name = "deepset/roberta-base-squad2"
print(f"Loading model: {model_name}")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name).to(device)
print("Model loaded successfully!")

We’re using deepset/roberta-base-squad2, which is:

- Based on the RoBERTa architecture
- Fine-tuned on the SQuAD 2.0 question-answering dataset
- A good balance between accuracy and speed
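If you prefer a higher-level interface, the same model can also be loaded through transformers' built-in question-answering pipeline. This is an equivalent alternative to the manual loading above, not an extra step in this tutorial; the question and context below are purely illustrative:

# Optional alternative: the high-level question-answering pipeline
from transformers import pipeline
import torch

qa = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",
    device=0 if torch.cuda.is_available() else -1
)

result = qa(
    question="What dataset was the model fine-tuned on?",
    context="deepset/roberta-base-squad2 is a RoBERTa model fine-tuned on SQuAD 2.0."
)
print(result["answer"], result["score"])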

Step 4: Implement the Question-Answering Function

Now, let’s implement the core functionality – the ability to answer questions based on the extracted webpage content:

def answer_question(question, context, max_length=512):
    # Reserve room for the question and special tokens; note that the
    # chunking below is by characters, a rough proxy for the token budget
    max_chunk_size = max_length - len(tokenizer.encode(question)) - 5

    all_answers = []
    for i in range(0, len(context), max_chunk_size):
        chunk = context[i:i + max_chunk_size]

        inputs = tokenizer(
            question,
            chunk,
            add_special_tokens=True,
            return_tensors="pt",
            max_length=max_length,
            truncation=True
        ).to(device)

        with torch.no_grad():
            outputs = model(**inputs)

        # Most likely start and end token positions of the answer span
        answer_start = torch.argmax(outputs.start_logits)
        answer_end = torch.argmax(outputs.end_logits)

        start_score = outputs.start_logits[0][answer_start].item()
        end_score = outputs.end_logits[0][answer_end].item()
        score = start_score + end_score

        input_ids = inputs.input_ids.tolist()[0]
        tokens = tokenizer.convert_ids_to_tokens(input_ids)
        answer = tokenizer.convert_tokens_to_string(tokens[answer_start:answer_end+1])

        # Strip special tokens (RoBERTa uses <s>/</s> rather than [CLS]/[SEP])
        for special in (tokenizer.cls_token, tokenizer.sep_token):
            answer = answer.replace(special, "")
        answer = answer.strip()

        if answer and len(answer) > 2:
            all_answers.append((answer, score))

    if all_answers:
        all_answers.sort(key=lambda x: x[1], reverse=True)
        return all_answers[0][0]
    else:
        return "I couldn't find an answer in the provided content."

This function:

- Takes a question and the webpage content as input
- Handles long content by processing it in chunks
- Uses the model to predict the answer span (start and end positions)
- Processes multiple chunks and returns the answer with the highest confidence score
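Before pointing the function at a full webpage, you can sanity-check it on a short, hand-written context; the snippet and expected answer here are purely illustrative:

# Quick sanity check on a tiny context
sample_context = (
    "The Transformers library was created by Hugging Face. "
    "It provides thousands of pretrained models for NLP tasks."
)
print(answer_question("Who created the Transformers library?", sample_context))
# A well-behaved run should print something like: Hugging Face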

Step 5: Testing and Examples

Let’s test our system with some examples. Here’s the complete code:

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
webpage_text = extract_text_from_url(url)

print("Sample of extracted text:")
print(webpage_text[:500] + "...")

questions = [
    "When was the term artificial intelligence first used?",
    "What are the main goals of AI research?",
    "What ethical concerns are associated with AI?"
]

for question in questions:
    print(f"\nQuestion: {question}")
    answer = answer_question(question, webpage_text)
    print(f"Answer: {answer}")

This will demonstrate how the system works with real examples.

[Output of the above code]

Limitations and Future Improvements

Our current implementation has some limitations:

- It can struggle with very long webpages due to context-length limitations
- The model may not understand complex or ambiguous questions
- It works best with factual content rather than opinions or subjective material

Future improvements could include:

- Semantic search to surface the most relevant chunks before answering (sketched below)
- Document summarization
- Multilingual support
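As a minimal sketch of that semantic-search idea, assuming the sentence-transformers package (pip install sentence-transformers) and an illustrative embedding model, you could pre-select the chunks most relevant to the question before running the Q&A model:

# Sketch only: pre-select relevant chunks with sentence embeddings.
# Assumes sentence-transformers is installed; the model name is illustrative.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('all-MiniLM-L6-v2')

def top_chunks(question, context, chunk_size=500, k=3):
    # Split the page text into fixed-size character chunks
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    # Embed the question and the chunks, then rank chunks by cosine similarity
    q_emb = embedder.encode(question, convert_to_tensor=True)
    c_emb = embedder.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]
    best = scores.topk(min(k, len(chunks))).indices.tolist()
    return [chunks[i] for i in best]

# Answer using only the most relevant chunks
question = "What are the main goals of AI research?"
relevant = " ".join(top_chunks(question, webpage_text))
print(answer_question(question, relevant))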

Conclusion

Now you’ve successfully built an AI-powered Q&A system for webpages using open-source models. This tool can help you extract specific information from lengthy online content quickly and efficiently.

By utilizing Hugging Face’s powerful models and the flexibility of Google Colab, you’ve created a practical application that demonstrates the capabilities of modern NLP. Feel free to customize and extend this project to meet your specific needs.

Useful Resources


Here is the Colab Notebook.


