AWS Machine Learning Blog 2024年11月28日
Search enterprise data assets using LLMs backed by knowledge graphs
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文介绍了一种基于生成式AI的语义搜索解决方案,旨在帮助企业用户快速准确地查找各种企业数据源中的相关数据资产。该方案将亚马逊Bedrock上的大型语言模型(LLM)与基于Amazon Neptune构建的知识图谱相结合,创建了一个强大的搜索范例,允许用户使用自然语言查询跨Amazon S3存储的文档、AWS Glue Data Catalog中的数据湖表以及Amazon DataZone中的企业资产进行搜索。通过将知识图谱与基础模型相结合,该方案能够利用基础模型的归纳能力,同时将其语言理解和生成能力建立在结构良好的领域知识和逻辑推理基础上,从而实现更准确、更具上下文关联性和洞察力的搜索结果,提升企业决策效率和创新能力。

🤔 **挑战:**企业数据资产分散在各种来源,传统搜索方法难以提供全面且符合上下文的搜索结果,尤其对于非结构化数据或复杂查询。

💡 **解决方案:**集成大型语言模型(LLM)和知识图谱,构建一个基于自然语言的语义搜索平台,能够理解查询的意图和上下文,搜索跨越不同数据源(Amazon S3、AWS Glue Data Catalog、Amazon DataZone等)的数据资产。

⚙️ **核心技术:**利用Amazon Bedrock上的基础模型(FM)进行文本和语言处理,并结合Amazon Neptune构建的知识图谱,将领域知识融入到FM中,增强推理和逻辑能力。

🔄 **数据摄取:**构建一个数据摄取管道,从Amazon DataZone、AWS Glue、Amazon Athena等服务中提取元数据,将其转换为RDF格式存储在Amazon Neptune数据库中,并转换为文本存储在S3中,供Amazon Bedrock的知识库使用。

🖥️ **应用界面:**使用Streamlit构建一个聊天机器人界面,用户可以通过自然语言查询搜索数据资产,并获取相关结果。

Enterprises are facing challenges in accessing their data assets scattered across various sources because of increasing complexities in managing vast amount of data. Traditional search methods often fail to provide comprehensive and contextual results, particularly for unstructured data or complex queries.

Search solutions in modern big data management must facilitate efficient and accurate search of enterprise data assets that can adapt to the arrival of new assets. Customers want to search through all of the data and applications across their organization, and they want to see the provenance information for all of the documents retrieved. The application needs to search through the catalog and show the metadata information related to all of the data assets that are relevant to the search context. To accomplish all of these goals, the solution should include the following features:

In this post, we present a generative AI-powered semantic search solution that empowers business users to quickly and accurately find relevant data assets across various enterprise data sources. In this solution, we integrate large language models (LLMs) hosted on Amazon Bedrock backed by a knowledge base that is derived from a knowledge graph built on Amazon Neptune to create a powerful search paradigm that enables natural language-based questions to integrate search across documents stored in Amazon Simple Storage Service (Amazon S3), data lake tables hosted on the AWS Glue Data Catalog, and enterprise assets in Amazon DataZone.

Foundation models (FMs) on Amazon Bedrock provide powerful generative models for text and language tasks. However, FMs lack domain-specific knowledge and reasoning capabilities. Knowledge graphs available on Neptune provide a means to represent interconnected facts and entities with inferencing and reasoning abilities for domains. Equipping FMs with structured reasoning abilities using domain-specific knowledge graphs harnesses the best of both approaches. This allows FMs to retain their inductive abilities while grounding their language understanding and generation in well-structured domain knowledge and logical reasoning. In the context of enterprise data asset search powered by a metadata catalog hosted on services such Amazon DataZone, AWS Glue, and other third-party catalogs, knowledge graphs can help integrate this linked data and also enable a scalable search paradigm that integrates metadata that evolves over time.

Solution overview

The solution integrates with your existing data catalogs and repositories, creating a unified, scalable semantic layer across the entire data landscape. When users ask questions in plain English, the search is not just for keywords; it comprehends the query’s intent and context, relating it to relevant tables, documents, and datasets across your organization. This semantic understanding enables more accurate, contextual, and insightful search results, making the entire company’s data as accessible and simple to search as using a consumer search engine, but with the depth and specificity your business demands. This significantly enhances decision-making, efficiency, and innovation throughout your organization by unlocking the full potential of your data assets. The following video shows the sample working solution.

Using graph data processing and the integration of natural language-based search on embedded graphs, these hybrid systems can unlock powerful insights from complex data structures.

The solution presented in this post consists of an ingestion pipeline and a search application UI that the user can submit queries to in natural language while searching for data assets.

The following diagram illustrates the end-to-end architecture, consisting of the metadata API layer, ingestion pipeline, embedding generation workflow, and frontend UI.

The ingestion pipeline (3) ingests metadata (1) from services (2), including Amazon DataZone, AWS Glue, and Amazon Athena, to a Neptune database after converting the JSON response from the service APIs into an RDF triple format. The RDF is converted into text and loaded into an S3 bucket, which is accessed by Amazon Bedrock (4) as the source of the knowledge base. You can extend this solution to include metadata from third-party cataloging solutions as well. The end-users access the application, which is hosted on Amazon CloudFront (5).

A state machine in AWS Step Functions defines the workflow of the ingestion process by invoking AWS Lambda functions, as illustrated in the following figure.

The functions perform the following actions:

    Read metadata from services (Amazon DataZone, AWS Glue, and Athena) in JSON format. Enhance the JSON format metadata to JSON-LD format by adding context, and load the data to an Amazon Neptune Serverless database as RDF triples. The following is an example of RDF triples in N-triples file format:
    <arn:aws:glue:us-east-1:440577664410:table/default/market_sales_table#sales_qty_sold><http://www.w3.org/2000/01/rdf-schema#label> "sales_qty_sold" .<arn:aws:glue:us-east-1:440577664410:table/sampleenv_pub_db/mkt_sls_table#disnt> <http://www.w3.org/2000/01/rdf-schema#label> "disnt" .<arn:aws:glue:us-east-1:440577664410:table/sampleenv_pub_db/mkt_sls_table> <http://www.amazonaws.com/datacatalog/hasColumn> <arn:aws:glue:us-east-1:440577664410:table/sampleenv_pub_db/mkt_sls_table#item_id> .<arn:aws:glue:us-east-1:440577664410:table/sampledata_pub_db/raw_customer> <http://www.w3.org/2000/01/rdf-schema#label> "raw_customer" .

    For more details about RDF data format, refer to the W3C documentation.

    Run SPARQL queries in the Neptune database to populate additional triples from inference rules. This step enriches the metadata by using the graph inferencing and reasoning capabilities. The following is a SPARQL query that inserts new metadata inferred from existing triples:
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>INSERT  {    ?asset <http://www.amazonaws.com/datacatalog/exists_in_aws_account> ?account  }WHERE  {    ?asset <http://www.amazonaws.com/datacatalog/isTypeOf> "GlueTableAssetType" .    ?asset <http://www.amazonaws.com/datacatalog/catalogId> ?account .  }
    Read triples from the Neptune database and convert them into text format using an LLM hosted on Amazon Bedrock. This solution uses Anthropic’s Claude 3 Haiku v1 for RDF-to-text conversion, storing the resulting text files in an S3 bucket.

Amazon Bedrock Knowledge Bases is configured to use the preceding S3 bucket as a data source to create a knowledge base. Amazon Bedrock Knowledge Bases creates vector embeddings from the text files using the Amazon Titan Text Embeddings v2 model.

A Streamlit application is hosted in Amazon Elastic Container Service (Amazon ECS) as a task, which provides a chatbot UI for users to submit queries against the knowledge base in Amazon Bedrock.

Prerequisites

The following are prerequisites to deploy the solution:

Prepare the test data

A sample dataset is needed for testing the functionalities of the solution. In your AWS account, prepare a table using Amazon DataZone and Athena completing Step 1 through Step 8 in Amazon DataZone QuickStart with AWS Glue data. This will create a table and capture its metadata in the Data Catalog and Amazon DataZone.

To test how the solution is combining metadata from different data catalogs, create another table only in the Data Catalog, not in Amazon DataZone. On the Athena console, open the query editor and run the following query to create a new table:

CREATE TABLE raw_customer AS SELECT 203 AS cust_id, 'John Doe' AS cust_name

Deploy the application

Complete the following steps to deploy the application:

    To launch the CloudFormation template, choose Launch Stack or download the template file (yaml) and launch the CloudFormation stack in your AWS account.
    Modify the stack name or leave as default, then choose Next. In the Parameters section, input the Amazon Cognito user pool ID (CognitoUserPoolId) and application client ID (CognitoAppClientId). This is required for successful deployment of the stacks.
    Review and update other AWS CloudFormation parameters if required. You can use the default values for all the parameters and continue with the stack deployment.
    The following table lists the default parameters for the CloudFormation template.

    Parameter Name Description Default Value
    EnvironmentName Unique name to distinguish different web applications in the same AWS account (min length 1 and max length 4). dev
    S3DataPrefixKB S3 object prefix where the knowledge base source documents (metadata files) should be stored. knowledge_base
    Cpu CPU configuration of the ECS task. 512
    Memory Memory configuration of the ECS task. 1024
    ContainerPort Port for the ECS task host and container. 80
    DesiredTaskCount Number of desired ECS task count. 1
    MinContainers Minimum containers for auto scaling. Should be less than or equal to DesiredTaskCount. 1
    MaxContainers Maximum containers for auto scaling. Should be greater than or equal to DesiredTaskCount. 3
    AutoScalingTargetValue CPU utilization target percentage for ECS task auto scaling. 80
    Launch the stack.

The CloudFormation stack creates the required resources to launch the application by invoking a series of nested stacks. It deploys the following resources in your AWS account:

After the CloudFormation stack is deployed, a Step Functions workflow will run automatically that orchestrates the metadata extract, transform, and load (ETL) job, and stores the final results in Amazon S3. View the execution status and details of the workflow by fetching the state machine Amazon Resource Name (ARN) from the CloudFormation stack. If AWS Lake Formation is enabled for the AWS Glue databases and tables in the account, complete the following steps after the CloudFormation stack is deployed to update the permission and extract the metadata details from AWS Glue and update the metadata details to load to the knowledge base:

    Add a role to the AWS Glue Lambda function that grants access to the AWS Glue database.
    Fetch the state machine ARN from the CloudFormation stack.
    Run the state machine with default input values to extract the metadata details and write to Amazon S3.

You can search for the application stack name <MainStackName>-deploy-<EnvironmentName> (for example, mm-enterprise-search-deploy-dev) on the AWS CloudFormation console. Locate the web application URL in the stack outputs (CloudfrontURL). Launch the web application by choosing the URL link.

Use the application

You can access the application from a web browser using the domain name of the Amazon CloudFront distribution created in the deployment steps. Log in using a user credential that exists in the Amazon Cognito user pool.

Now you can submit a query using a text input. The AWS account used in this example contains sample tables related to sales and marketing. We ask the question, “How to query sales data?” The answer includes metadata on the table mkt_sls_table that was created in the previous steps.

We ask another question: “How to get customer names from sales data?” In the previous steps, we created the raw_customer table, which wasn’t published as a data asset in Amazon DataZone. The table only exists in the Data Catalog. The application returns an answer that combines metadata from Amazon DataZone and AWS Glue.

This powerful solution opens up exciting possibilities for enterprise data discovery and insights. We encourage you to deploy it in your own environment and experiment with different types of queries across your data assets. Try combining information from multiple sources, asking complex questions, and see how the semantic understanding improves your search experience.

Clean up

The total cost of running this setup is less than $10 per day. However, we recommend deleting the CloudFormation stack after use because the deployed resources incur costs. Deleting the main stack also deletes all the nested stacks except the VPC because of dependency. You also need to delete the VPC from the Amazon VPC console.

Conclusion

In this post, we presented a comprehensive and extendable multimodal search solution of enterprise data assets. The integration of LLMs and knowledge graphs shows that by combining the strengths of these technologies, organizations can unlock new levels of data discovery, reasoning, and insight generation, ultimately driving innovation and progress across a wide range of domains.

To learn more about LLM and knowledge graph use cases, refer to the following resources:


About the Authors

Sudipta Mitra is a Generative AI Specialist Solutions Architect at AWS, who helps customers across North America use the power of data and AI to transform their businesses and solve their most challenging problems. His mission is to enable customers achieve their business goals and create value with data and AI. He helps architect solutions across AI/ML applications, enterprise data platforms, data governance, and unified search in enterprises.

Gi Kim is a Data & ML Engineer with the AWS Professional Services team, helping customers build data analytics solutions and AI/ML applications. With over 20 years of experience in solution design and development, he has a background in multiple technologies, and he works with specialists from different industries to develop new innovative solutions using his skills. When he is not working on solution architecture and development, he enjoys playing with his dogs at a beach under the San Francisco Golden Gate Bridge.

Surendiran Rangaraj is a Data & ML Engineer at AWS who helps customers unlock the power of big data, machine learning, and generative AI applications for their business solutions. He works closely with a diverse range of customers to design and implement tailored strategies that boost efficiency, drive growth, and enhance customer experiences.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

生成式AI 语义搜索 知识图谱 数据资产 企业搜索
相关文章