Building Performant LLM Applications

 

This article describes the design and deployment of C3 Generative AI, an enterprise AI application that helps users retrieve, interact with, and reason over information from disparate enterprise and external source systems, such as documents, models, and engineering/physics simulations, through a natural language interface. It focuses on the engineering design process for deploying C3 Generative AI across different use-cases to ensure high performance and rapid time-to-results. By building domain-specific evaluation benchmarks and quantifying the application's performance on a range of customer-specific tasks, C3 Generative AI meets enterprise requirements for generative AI and LLMs and delivers a high-performance system with fast results.

💡 **Evaluation dataset curation:** To measure the value of the C3 Generative AI application, the article emphasizes building domain-specific evaluation benchmarks. By creating domain-specific questions with answer citations, the RAG (Retrieval-Augmented Generation) pipeline can be rigorously evaluated for retrieval, generation, and end-to-end quality.

🔍 **Automatic metadata extraction:** LLM-based document parsing automatically identifies domain-specific metadata tags (document type, author, published date, etc.), improving the efficiency of information retrieval. In a real-world Manufacturing use-case, C3 Generative AI reached 93% overall performance.

🚀 **Retriever model benchmarking:** Retrieval is the most critical component of the RAG pipeline, and improving retrieval quality is essential to overall performance. Retrieval quality is measured by the recall of relevant passages, ensuring the RAG pipeline captures the necessary context. In an Industrials use-case, C3 AI RAG achieved 91% accuracy on a heterogeneous dataset of 75 factual Q&A (including 20 unanswerable questions designed to mislead the model).

⚙️ **Generator configuration tuning:** Adjusting prompt engineering and decoding strategies optimizes the quality of the language model's output. After model tuning on 45 highly complex analytical reasoning Q&A, C3 AI RAG achieved 87% overall accuracy and correctly identified unanswerable questions 93% of the time.

✅ **Automatic evaluation pipeline:** To accelerate development iterations and enable continuous evaluation, the article adopts an automatic evaluation framework. The framework measures performance against a predefined rubric and is compared with human grading to validate the accuracy of LLM-scored results. Experiments show the framework is highly reliable for assessing answer quality, reaching a weighted F1 score of 89%.

Introduction

C3 Generative AI is an enterprise AI application that helps users retrieve, interact with, and reason over information across disparate enterprise and external source systems, from documents and models to engineering/physics simulations, through a natural language interface. This blog highlights the engineering design process for deploying C3 Generative AI to novel use-cases, ensuring high performance and rapid time-to-results.

The steps we’ll explore in this article are:

- Evaluation Dataset Curation
- Automatic Metadata Extraction
- Retriever Model Benchmarking
- Generator Configuration Tuning
- Automatic Evaluation
- Pipeline Hyperparameter Sweep

 

Business Context

Knowledge workers can spend up to 20% of their time searching for critical information, and two-thirds of enterprise data is never utilized, negatively impacting enterprise productivity [2, 3].
One of the biggest challenges for enterprises adopting generative AI and LLMs is trust.
C3 Generative AI provides a path to a high-performance system with rapid time-to-results that satisfies this enterprise requirement.

 

RAG Engineering Design Process

We demonstrate each step of the design process below, supported by experimental results showing the gains achieved by leveraging each of these steps:

    Evaluation Dataset Generation
    Across dozens of customer deployments, we identified that the best way to communicate the value of the C3 Generative AI application is to build domain-specific evaluation benchmarks and explicitly quantify the application's performance over a range of customer-specific tasks.
    We leverage an evolutionary question generation framework capable of creating domain-specific questions with answer citations for specific passages in documents. This is crucial for enabling a robust evaluation of our RAG pipeline by rigorously benchmarking retrieval, generative, & end-to-end quality. Additionally, the sample queries serve as a strong primer for users to understand how to best leverage the application, shown in Figure 4.

    We build a sophisticated approach to tackle the generation of more difficult & complex styles of questions that require strong contextual reasoning skills to resolve, as shown in Figure 5.
    Classes of difficult-to-generate questions:
    - Cross-Document & Multi-hop
    - Analytical (How has, What is the impact, Why)
    - Predictive (What might, How should)
    - Conditional (Given X can, How to X so that Y)
    - Hypothetical (If, Assuming)
    - Compare & Contrast (Between X & Y, Explain the differences)
    - Overview (Describe, Summarize, Generate)

    [Figure: https://c3.ai/wp-content/uploads/2024/08/building-performant-llm-applications-image4.png]

    We are able to construct difficult-to-answer questions that contain multiple parts which logically depend on one another to answer successfully. These closely mirror the expectations of users, particularly in highly analytical use-cases, such as the Financial Services examples in Table 1 below, where analysts need to comb through a large volume of text to build a comprehensive answer to a complex query.

    Table 1: Synthetically Curated Advanced Questions in a Financial Services Use-Case

    Overview: What are the main products of X from the discussion? Describe an overview of the competitive landscape of each of X's products.

    Analytical: Why did X struggle to establish presence in the market? How could/might X's acquisition decisions impact its future growth? What are some of the implications of the North America office being perceived as lagging behind EU and Asia?

    Comparative: What has been the user feedback on X's cloud product, and how has this evolved over the last four years? What are key strengths and weaknesses, and how do they compare with customers' stated desires for the product?

    Generative: Generate a 1-page email advocating for investment into company X. Explain the difference between X and Y in terms of reputation and their presence in the market.

    Another important element of production use-cases is handling unanswerable questions. These are seemingly plausible yet tricky queries that may inadvertently induce a hallucination in the language model, as shown in Table 2. Our goal with these questions is to study the rate at which the LLM correctly identifies that it cannot respond to the query due to a lack of correct information. We only seek to provide responses that are factually grounded in the underlying document corpus.

    Table 2: Examples of Unanswerable Questions in the Financial Services Use-Case

    - What specific event occurred in the last quarter that significantly altered the perception of auditing systems in the industry, leading to a market shift in digital transformation strategies?
    - Can you determine the subconscious or unknown factors that significantly influence consumers' decisions when choosing between X's and Y's core offerings, based on their hidden preferences?
    - What would a conversation about market collusion or monopolization between X and Y sound like if they discussed their competitive advantages and disadvantages in terms of customer preferences and technical capabilities?
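The exact evolutionary question-generation framework is proprietary, but the core mechanic of pairing every synthetic question (answerable or deliberately unanswerable) with the passage it cites can be sketched in a few lines. The snippet below is a minimal illustration, not the C3 AI implementation; `call_llm` is a hypothetical stand-in for whatever chat-completion client is in use, and the JSON output contract is assumed.

```python
import json
from dataclasses import dataclass

@dataclass
class EvalExample:
    question: str
    reference_answer: str   # "I don't know" for unanswerable questions
    citation: str           # source passage the answer must be grounded in
    doc_id: str
    answerable: bool

QUESTION_STYLES = ["analytical", "predictive", "conditional",
                   "hypothetical", "compare_contrast", "overview"]

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the deployed chat/completions client."""
    raise NotImplementedError

def generate_examples(doc_id: str, passage: str, style: str,
                      answerable: bool = True) -> list[EvalExample]:
    """Ask an LLM to write questions of a given style, each citing the passage."""
    if answerable:
        task = (f"Write 2 {style} questions that can be answered ONLY from the "
                "passage below, each with a concise reference answer.")
    else:
        task = ("Write 2 plausible-sounding questions that CANNOT be answered from "
                "the passage below. Set the reference answer to 'I don't know'.")
    prompt = (f"{task}\n\nPassage:\n{passage}\n\n"
              'Return a JSON list: [{"question": ..., "reference_answer": ...}]')
    rows = json.loads(call_llm(prompt))
    return [EvalExample(r["question"], r["reference_answer"], passage,
                        doc_id, answerable) for r in rows]
```

Iterating this over passages, question styles, and a mix of answerable and unanswerable prompts yields a benchmark in the spirit of Tables 1 and 2.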
    Automatic Metadata Extraction
    In this step, we leverage LLM-based document parsing to identify domain-specific metadata tags (Document Type, Author, Published Date, etc.) in an automatic ingestion pipeline. Real-world performance is shown in Figure 3, highlighting a deployment of C3 Generative AI in a Manufacturing use-case that turbocharges performance to an overall 93% on 70 factual Q&A. In comparison, directly asking the same domain-specific questions to ChatGPT [4] results in much weaker overall performance.

    [Figure: https://c3.ai/wp-content/uploads/2024/08/building-performant-llm-applications-image5.png]
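As a rough sketch of what such an ingestion step can look like, the function below asks an LLM for a fixed set of metadata fields and returns them as a tag dictionary. The field names are illustrative only, and `call_llm` is the same hypothetical client stand-in used in the earlier sketch.

```python
import json
from typing import Callable

# Illustrative tag schema; real deployments define their own domain-specific fields.
METADATA_FIELDS = ["document_type", "author", "published_date", "equipment_model"]

def extract_metadata(document_text: str,
                     call_llm: Callable[[str], str],
                     fields: list[str] = METADATA_FIELDS) -> dict:
    """Use an LLM to pull domain-specific metadata tags from raw document text."""
    prompt = ("Extract the following fields from the document and return a JSON "
              f"object. Use null when a field is not present.\nFields: {fields}\n\n"
              f"Document (truncated):\n{document_text[:4000]}")
    tags = json.loads(call_llm(prompt))
    # Tags are stored alongside the document chunks so retrieval can filter on
    # them, e.g. restricting search to documents with a given document_type.
    return {field: tags.get(field) for field in fields}
```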

    Retriever Model Benchmarking
    Retrieval is the most upstream component of the RAG pipeline & improving its quality precipitates compounding returns in overall performance. Poor retrieval can lead to the selection of irrelevant or less useful passages, which can degrade the performance of the entire system. One way to measure the quality of retrieval is by evaluating the recall of relevant passages—how well the retrieval component can identify and return all the pertinent information from the corpus. High recall indicates that the retrieval process is effectively capturing the necessary context, which is essential for the RAG pipeline to perform optimally.

    The value of solving information retrieval cannot be overstated, and is shown clearly in Figure 6, showcasing the results from a deployment in an Industrials use-case. On a heterogeneous dataset of 75 factual Q&A (with 20 curated unanswerable questions meant to trick the model), C3 AI RAG achieves 91% total human-evaluated accuracy and 90% correctness in identifying questions it cannot answer, meaning it avoids hallucinating on 18 of the 20 unanswerable questions.

    [Figure: https://c3.ai/wp-content/uploads/2024/08/building-performant-llm-applications-image6.png]
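Because every benchmark question carries a citation, recall is simple to compute: check whether the cited passage appears among the top-k retrieved chunks. A minimal sketch, reusing the `EvalExample` records from the dataset-generation sketch and a generic `retrieve(question, k)` function standing in for the deployed retriever:

```python
def recall_at_k(examples, retrieve, k: int = 5) -> float:
    """Fraction of answerable questions whose cited passage is in the top-k results.

    `retrieve(question, k)` is assumed to return a list of passage strings.
    """
    answerable = [ex for ex in examples if ex.answerable]
    hits = 0
    for ex in answerable:
        top_k = retrieve(ex.question, k)
        # Simple containment check; production systems usually match on chunk IDs.
        if any(ex.citation[:200] in passage or passage in ex.citation
               for passage in top_k):
            hits += 1
    return hits / len(answerable) if answerable else 0.0
```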

    Generator Configuration Tuning
    In many cases, once retrieval performance is optimized, the application is generally performant. However, in certain deployments, further work is done to improve the quality of outputs generated by the language model. This involves prompt engineering [5] and ablating across decoding strategies (beam search, nucleus sampling, multinomial sampling, etc.) [6] to identify a configuration that maximizes success with the LLM generator. In Figure 7, we showcase the results after model tuning on a synthetic dataset of 45 highly complex analytical reasoning Q&A: C3 AI RAG achieves 87% total accuracy and 93% correctness in identifying questions it cannot answer.

    [Figure: https://c3.ai/wp-content/uploads/2024/08/building-performant-llm-applications-image7.png]
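Such ablations are typically run as a small grid search over generation parameters, scoring each configuration on the same benchmark. The sketch below uses illustrative parameter values; the real search space, generator, and scoring function are deployment-specific.

```python
import itertools

# Illustrative decoding grid: greedy/beam decoding vs. nucleus sampling.
DECODING_GRID = {
    "temperature": [0.0, 0.3, 0.7],
    "top_p":       [0.9, 1.0],
    "num_beams":   [1, 4],
}

def tune_generator(examples, generate_answer, score_answer):
    """Return the decoding configuration with the best mean score on the eval set.

    `generate_answer(question, context, **config)` and `score_answer(pred, ref)`
    are stand-ins for the deployed generator and grader.
    """
    best_config, best_score = None, float("-inf")
    for values in itertools.product(*DECODING_GRID.values()):
        config = dict(zip(DECODING_GRID.keys(), values))
        scores = [score_answer(generate_answer(ex.question, ex.citation, **config),
                               ex.reference_answer)
                  for ex in examples]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_config, best_score = config, mean_score
    return best_config, best_score
```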

    Automatic Evaluation
    To speed up development iterations and enable continuous evaluation in deployment, we leverage an automatic evaluation framework that measures performance according to the rubric defined in Table 3.

    Table 3: Rubric for Grading Accuracy & Hallucination in RAG

    Case | Description | Class | Score
    Correct Answer for Answerable Question | The ground truth has an answer and the generated answer aligns with it. | Correct | 1
    Wrong Answer for Answerable Question | The ground truth has an answer and the generated answer gives a different, incorrect answer. | Hallucination | -1
    No Answer for Answerable Question | The ground truth has an answer and the generated answer says "I don't know." | No Answer | 0
    No Answer for Unanswerable Question | The ground truth says "I don't know" and the generated answer also states there is no answer. | Correct | 1
    Wrong Answer for Unanswerable Question | The ground truth says "I don't know" and the generated answer claims there is an answer. | Hallucination | -1
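The rubric maps directly onto a three-way classification per question. Below is a minimal LLM-as-judge sketch returning the same Correct / No Answer / Hallucination labels and scores; the prompt wording is illustrative, and `call_llm` is again a hypothetical client stand-in rather than the actual framework.

```python
from typing import Callable

RUBRIC_SCORES = {"Correct": 1, "No Answer": 0, "Hallucination": -1}

def judge_answer(question: str, reference_answer: str, generated_answer: str,
                 call_llm: Callable[[str], str]) -> tuple[str, int]:
    """Classify a generated answer per the Table 3 rubric and return (label, score)."""
    prompt = (
        "You are grading a retrieval-augmented QA system.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Generated answer: {generated_answer}\n\n"
        "Reply with exactly one label:\n"
        "Correct - the generated answer agrees with the reference, or both state "
        "the question cannot be answered\n"
        "No Answer - the reference has an answer but the generation declines to answer\n"
        "Hallucination - the generation asserts an answer that is wrong or unsupported"
    )
    label = call_llm(prompt).strip()
    if label not in RUBRIC_SCORES:
        label = "Hallucination"  # conservative default for unparseable judge output
    return label, RUBRIC_SCORES[label]
```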

    To validate the correctness of LLM-scored results, we perform an experimental study comparing the automatic evaluation framework with human grading of content. The methodology involves building 70 questions selected from a random subset of passages and tables across 47 documents. Each question cites a specific passage in the corresponding document. The scores from an LLM judge are compared with human scores along our two question classes: answerable and unanswerable questions. The results in Figure 8 show a strong weighted F1 score of 89%, demonstrating that the framework is a reliable proxy for human evaluation.

    [Figure: https://c3.ai/wp-content/uploads/2024/08/building-performant-llm-applications-image8.png]
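Comparing the judge against human grades is then a standard classification comparison over the three rubric labels. A sketch using scikit-learn's weighted F1, assuming parallel lists of human and LLM labels for the same questions:

```python
from sklearn.metrics import classification_report, f1_score

def validate_judge(human_labels: list[str], llm_labels: list[str]) -> float:
    """Compare LLM-judge labels against human grades on the same question set."""
    print(classification_report(human_labels, llm_labels))
    # Weighted F1 accounts for the imbalance between answerable and
    # unanswerable questions in the validation set.
    return f1_score(human_labels, llm_labels, average="weighted")
```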

    Pipeline Hyperparameter Sweep
    The final technique leveraged to further drive application performance is a hyperparameter sweep across pipeline configurations to identify the most effective setup for the Generative AI app, as shown in Figure 9. This is possible because the dataset produced in our dataset-generation step contains labeled reference answers for each question.

    [Figure: https://c3.ai/wp-content/uploads/2024/08/building-performant-llm-applications-image9.png]
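The pipeline-level sweep generalizes the single-stage ablations above: every combination of retrieval, prompting, and model choices is scored end-to-end on the labeled benchmark. A compact sketch; the configuration keys and values are illustrative, and `run_pipeline` and `grade` stand in for the deployed pipeline and the rubric grader.

```python
import itertools

# Illustrative end-to-end configuration space.
PIPELINE_GRID = {
    "chunk_size":   [512, 1024],
    "top_k":        [3, 5, 10],
    "llm":          ["claude-v2", "gpt-3.5", "gpt-4"],
    "prompt_style": ["concise", "chain_of_thought"],
}

def sweep_pipeline(examples, run_pipeline, grade):
    """Score every configuration end-to-end and return them best-first.

    `run_pipeline(question, **config)` runs retrieval + generation;
    `grade(prediction, example)` returns a rubric score in {1, 0, -1}.
    """
    results = []
    for values in itertools.product(*PIPELINE_GRID.values()):
        config = dict(zip(PIPELINE_GRID.keys(), values))
        mean_score = sum(grade(run_pipeline(ex.question, **config), ex)
                         for ex in examples) / len(examples)
        results.append((mean_score, config))
    return sorted(results, key=lambda r: r[0], reverse=True)
```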

    This allows us to identify well-suited language models for specific deployments; an example is shown below in Table 4, which finds that GPT-3.5 and GPT-4 excel in a State & Local Government use-case.

    Table 4: Results from LLM Ablation in a State & Local Government Deployment

    Metric | Claude v1 (AWS) | Claude v2 (AWS) | GPT-3.5 (Azure) | GPT-4 (Azure)

    Overall Metrics on 80 Questions
    Accuracy | 61.7% | 53.1% | 75.3% | 79.0%
    Hallucination | 22.2% | 24.7% | 19.8% | 11.1%
    No Answer | 16.1% | 22.2% | 4.9% | 8.6%

    50 Answerable Questions
    Accuracy | 66.0% | 68.0% | 64.0% | 78.0%
    Hallucination | 26.0% | 8.0% | 28.0% | 16.0%
    No Answer | 8.0% | 24.0% | 8.0% | 6.0%

    30 Unanswerable Questions
    Accuracy (No Answer) | 53.3% | 26.7% | 93.3% | 83.3%
    Hallucination | 46.7% | 73.3% | 6.7% | 16.7%

 

Conclusion

Through C3 Generative AI's rigorous engineering design process, we can construct and optimize a complex LLM-based AI pipeline that achieves strong performance across a wide range of sectors and heterogeneous Generative AI use-cases. This process has enabled C3 AI to achieve dozens of successful production deployments of its Generative AI application. As shown in Figure 10, these tools can be leveraged by customer developers and data scientists to rapidly bring new Generative AI use-cases to production.

[Figure: https://c3.ai/wp-content/uploads/2024/08/building-performant-llm-applications-image10.png]

 

Authors

Sravan Jayanthi, Romain Juban

References

[1] https://c3.ai/c3-generative-ai-getting-the-most-out-of-enterprise-data/
[2] https://www.mckinsey.com/~/media/McKinsey/Industries/Technology Media and Telecommunications/High Tech/Our Insights/The social economy/MGI_The_social_economy_Full_report.ashx
[3] https://www.alteryx.com/
[4] https://openai.com/index/chatgpt/
[5] https://platform.openai.com/docs/guides/prompt-engineering
[6] https://deci.ai/blog/from-top-k-to-beam-search-llm-decoding-strategies/

 
