Machine Reading at Scale – Transfer Learning for Large Text Corpuses

This article explores the application of machine reading comprehension to large-scale text corpora and its importance in building enterprise-grade question answering systems. It introduces the concept of machine reading at scale (MRS) and, using the DrQA model as an example, shows how deep learning techniques can be used to build an MRS system. Through experiments on a Gutenberg children's book corpus, the article demonstrates the effectiveness of MRS at answering factoid questions, while also pointing out its limitations on non-factoid questions, offering a reference for enterprises building QA systems over large text collections.

🤔 **Machine Reading at Scale (MRS)**: Whereas traditional machine reading comprehension (MRC) focuses on relatively short text excerpts, MRS targets large-scale corpora, such as Wikipedia or internal enterprise documents, to build open-domain question answering systems.

📚 **The DrQA model**: As a representative MRS application, DrQA combines information retrieval with machine reading comprehension, searching and then reading text to extract answers from a large corpus. It uses bigram hashing and TF-IDF matching for document retrieval, and a multi-layer recurrent neural network to identify answers.

⚙️ **Gutenberg children's book corpus experiment**: The article evaluates DrQA's performance through experiments on a Gutenberg children's book corpus. The results show that DrQA performs well on factoid questions (e.g., "Who is the headmaster of Hogwarts?") but poorly on non-factoid questions (e.g., "Why?").

🖥️ **Deep Learning Virtual Machine (DLVM)**: The article uses the DLVM as the compute environment for training and testing the DrQA model. The DLVM is a variant of the DSVM specifically configured for training deep learning models, with GPU acceleration support.

💡 **MRS in the enterprise**: MRS can be applied in scenarios such as enterprise chatbots and customer service, helping enterprises build intelligent systems that understand complex sentences and extract information from large volumes of text data.

This post is authored by Anusua Trivedi, Senior Data Scientist at Microsoft.

This post builds on the MRC Blog where we discussed how machine reading comprehension (MRC) can help us “transfer learn” any text. In this post, we introduce the notion of and the need for machine reading at scale, and for transfer learning on large text corpuses.

Introduction

Machine reading for question answering has become an important testbed for evaluating how well computer systems understand human language. It is also proving to be a crucial technology for applications such as search engines and dialog systems. The research community has recently created a multitude of large-scale datasets over text sources including:

These new datasets have, in turn, inspired an even wider array of new question answering systems.

In the MRC blog post, we trained and tested different MRC algorithms on these large datasets. We were able to successfully transfer learn smaller text excerpts using these pretrained MRC algorithms. However, when we tried creating a QA system for the Gutenberg book corpus (English only) using these pretrained MRC models, the algorithms failed. MRC usually works on text excerpts or documents but fails for larger text corpuses. This leads us to a newer concept – machine reading at scale (MRS). Building machines that can perform machine reading comprehension at scale would be of great interest for enterprises.

Machine Reading at Scale (MRS)

Instead of focusing on only smaller text excerpts, Danqi Chen et al. came up with a solution to a much bigger problem which is machine reading at scale. To accomplish the task of reading Wikipedia to answer open-domain questions, they combined a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs.
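The retrieval half of this design can be illustrated with a stdlib-only sketch: score documents by TF-IDF over hashed unigram and bigram features. All names here are ours, and CRC32 stands in for the hash DrQA actually uses; this is an illustration of the idea, not DrQA's implementation.

```python
import math
import zlib
from collections import Counter

BUCKETS = 1 << 20  # fixed-size feature space for hashed n-grams

def hashed_grams(text):
    """Lowercased unigram + bigram features, hashed into BUCKETS buckets."""
    tokens = text.lower().split()
    grams = tokens + [" ".join(p) for p in zip(tokens, tokens[1:])]
    return [zlib.crc32(g.encode()) % BUCKETS for g in grams]

def retrieve(query, docs, k=3):
    """Rank documents by TF-IDF dot product with the query; return top-k indices."""
    tfs = [Counter(hashed_grams(d)) for d in docs]
    df = Counter(b for tf in tfs for b in set(tf))  # document frequency per bucket
    n = len(docs)
    idf = {b: math.log(n / d) for b, d in df.items()}
    q = Counter(hashed_grams(query))
    scores = [
        sum(qc * tf.get(b, 0) * idf.get(b, 0.0) for b, qc in q.items())
        for tf in tfs
    ]
    return sorted(range(n), key=lambda i: -scores[i])[:k]
```

Because bigrams like "wizard school" are features in their own right, phrase overlap is rewarded, not just shared words, while hashing keeps the feature space bounded regardless of corpus size.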

MRC is about answering a query about a given context paragraph. MRC algorithms typically assume that a short piece of relevant text is already identified and given to the model, which is not realistic for building an open-domain QA system.

In sharp contrast, methods that use information retrieval over documents must employ search as an integral part of the solution.

MRS strikes a balance between the two approaches. It is focused on simultaneously maintaining the challenge of machine comprehension, which requires the deep understanding of text, while keeping the realistic constraint of searching over a large open resource.

Why is MRS Important for Enterprises?

The adoption of enterprise chatbots has been rapidly increasing in recent times. To further advance these scenarios, research and industry have turned toward conversational AI approaches, especially in use cases such as banking, insurance and telecommunications, where there are large corpuses of text logs involved.

One of the major challenges for conversational AI is to understand complex sentences of human speech in the same way humans do. The challenge becomes more complex when we need to do this over large volumes of text. MRS can address both these concerns where it can answer objective questions from a large corpus with high accuracy. Such approaches can be used in real-world applications like customer service.

In this post, we want to evaluate the MRS approach to solve automatic QA capability across different large corpuses.

Training MRS – DrQA Model

DrQA is a system for reading comprehension applied to open-domain question answering. DrQA is specifically targeted at the task of machine reading at scale. In this setting, we are searching for an answer to a question in a potentially very large corpus of unstructured documents (which may not be redundant). Thus, the system must combine the challenges of document retrieval (i.e. finding relevant documents) with that of machine comprehension of text (identifying the answers from those documents).
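The two-stage structure described above can be sketched as a retriever feeding a reader, with candidate answers merged by score. Both components below are deliberately trivial word-overlap stand-ins of our own invention, not DrQA's TF-IDF retriever or RNN reader; only the pipeline shape is the point.

```python
def toy_retrieve(question, docs, k):
    """Stand-in retriever: rank documents by shared-word count with the question."""
    qwords = set(question.lower().split())
    scores = [len(qwords & set(d.lower().split())) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])[:k]

def toy_read(question, doc):
    """Stand-in reader: propose each sentence as a candidate answer,
    scored by word overlap with the question."""
    qwords = set(question.lower().split())
    return [
        (s.strip(), len(qwords & set(s.lower().split())))
        for s in doc.split(".") if s.strip()
    ]

def answer(question, docs, k=3):
    """DrQA-style pipeline: retrieve top-k documents, collect candidate
    answers from each via the reader, and rank all candidates by score."""
    candidates = []
    for i in toy_retrieve(question, docs, k):
        candidates.extend(toy_read(question, docs[i]))
    return sorted(candidates, key=lambda c: -c[1])
```

Swapping `toy_retrieve` and `toy_read` for the real retriever and a trained reader gives the actual system; the aggregation step is what lets answers be compared across documents.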

We use the Deep Learning Virtual Machine (DLVM) as the compute environment, with two NVIDIA Tesla P100 GPUs and the CUDA and cuDNN libraries. The DLVM is a specially configured variant of the Data Science Virtual Machine (DSVM) that makes it more straightforward to use GPU-based VM instances for training deep learning models. It is supported on Windows 2016 and the Ubuntu Data Science Virtual Machine. It shares the same core VM images – and hence the same rich toolset – as the DSVM, but is configured to make deep learning easier. All the experiments were run on a Linux DLVM with two NVIDIA Tesla P100 GPUs. We use the PyTorch backend to build the models, and we pip-installed all the dependencies in the DLVM environment.

We fork the Facebook Research GitHub repository for our blog work, and we train the DrQA model on the SQuAD dataset. We use the pretrained MRS model to evaluate our large Gutenberg corpuses using transfer learning techniques.

Children's Gutenberg Corpus

We created a Gutenberg corpus consisting of about 36,000 English books. We then created a subset of the Gutenberg corpus consisting of 528 children's books.

Pre-processing the children’s Gutenberg dataset:
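The exact pre-processing steps are not reproduced here, but a typical first pass over Project Gutenberg texts strips the license header and footer so only the book body is indexed. A minimal sketch (the marker regexes are simplified from the real Gutenberg layout, and the function name is ours):

```python
import re

# Project Gutenberg books delimit the body with START/END marker lines.
START = re.compile(r"\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG EBOOK", re.I)
END = re.compile(r"\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG EBOOK", re.I)

def strip_gutenberg_boilerplate(text):
    """Keep only the body between the start/end markers and collapse
    runs of blank lines so paragraphs stay clean."""
    lines = text.splitlines()
    start = next((i + 1 for i, l in enumerate(lines) if START.match(l.strip())), 0)
    end = next((i for i, l in enumerate(lines) if END.match(l.strip())), len(lines))
    body = "\n".join(lines[start:end]).strip()
    return re.sub(r"\n{3,}", "\n\n", body)
```

If a marker is missing, the sketch falls back to keeping the whole file rather than dropping the book.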

How Do We Create a Custom Corpus for DrQA?

We follow the instructions available here to create a compatible document retriever for the Gutenberg Children’s books.

To execute the DrQA model:

The pipeline returns the most probable answer list from the top three most matched documents.

We then run the interactive pipeline using this trained DrQA model to test the Gutenberg Children’s Book Corpus.

For environment setup, please follow ReadMe.md in GitHub to download the code and install dependencies. For all code and related details, please refer to our GitHub link here.

MRS Using DLVM

Please follow similar steps listed in this notebook to test the DrQA model on DLVM.

Learnings from Our Evaluation Work

In this post, we investigated the performance of the MRS model on our own custom dataset. We tested the performance of the transfer learning approach for creating a QA system for around 528 children’s books from the Project Gutenberg Corpus using the pretrained DrQA model. Our evaluation results are captured in the exhibits below and in the explanation that follows. Note that these results are particular to our evaluation scenario – results will vary for other documents or scenarios.

In the above examples, we tried questions beginning with What, How, Who, Where and Why – and there’s an important aspect about MRS that is worth noting, namely:

The green box represents the correct answer for each question. As we see here, for factoid questions, the answers chosen by the MRS model are in line with the correct answer. In the case of the non-factoid “Why” question, however, the correct answer is the third one, and it’s the only one that makes any sense.

Overall, our evaluation scenario shows that for generic large document corpuses, the DrQA model does a good job of answering factoid questions. Anusua @anurive  |  Email Anusua at antriv@microsoft.com for questions pertaining to this post.
