MarkTechPost@AI 2024年07月02日
Understanding the Limitations of Large Language Models (LLMs): New Benchmarks and Metrics for Classification Tasks

The article examines the performance and limitations of large language models (LLMs) on classification tasks, including their behavior when no correct label is available, and introduces new benchmarks and evaluation metrics.

🎯 LLMs perform well when the correct label is provided, but when the gold label is deliberately omitted, the models still pick from the remaining options, raising concerns about their actual understanding and intelligence.

💡 In the context of LLMs, this lack of uncertainty poses two main problems: first, the models can work with any label set, including ones of questionable accuracy, whereas they should behave like humans and either identify the accurate label or point out its absence; second, the models are primarily generative and often forgo discriminative ability.

📋 The article presents three common classification tasks as benchmarks: the BANK77 intent-classification task, the MC-TEST multiple-choice question-answering task, and EQUINFER, a task that determines the correct equation from the surrounding paragraphs of a scientific paper.

📈 A new evaluation metric, OMNIACCURACY, is proposed; it combines results along two dimensions to assess LLM performance on classification tasks more accurately.

Large Language Models (LLMs) have shown impressive performance across a range of tasks in recent years, especially classification tasks. These models perform remarkably well when given gold labels or options that include the right answer. A significant limitation, however, is that if the gold labels are deliberately left out, LLMs still choose among the remaining possibilities, even when none of them are correct. This raises serious concerns about these models' actual comprehension and intelligence in classification scenarios.

In the context of LLMs, this absence of uncertainty presents two primary concerns:

    Versatility and Label Processing: LLMs can work with any set of labels, even ones whose accuracy is debatable. To avoid misleading users, they should ideally imitate human behavior by recognizing accurate labels or pointing out when they are absent. Due to their reliance on predetermined labels, traditional classifiers are not as flexible.
    Discriminative vs. Generative Capabilities: Because LLMs are mainly intended to be generative models, they frequently forgo discriminative capabilities. High-performance metrics indicate that classification tasks are easy. However, the existing benchmarks might not accurately reflect human-like behavior, which could overestimate the usefulness of LLMs.

In recent research, three common classification tasks have been provided as benchmarks to support further study.

    BANK77: An intent classification task.
    MC-TEST: A multiple-choice question-answering task.
    EQUINFER: A recently developed task that determines which of four options, based on surrounding paragraphs in scientific papers, is the correct equation.

This set of benchmarks has been named KNOW-NO, as it covers classification problems with different label sizes, lengths, and scopes, including instance-level and task-level label spaces.
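The without-gold setup can be illustrated with a short sketch: the correct option is dropped from a multiple-choice instance, and an ideal model is then expected to abstain rather than pick a label. The prompt template, option lettering, and example intents below are illustrative assumptions, not the paper's exact format.

```python
# Sketch of building a "without-gold" instance, KNOW-NO style:
# remove the gold option so that no listed answer is correct.

def make_without_gold(question, options, gold_index):
    """Drop the gold option; an ideal model should now abstain
    (e.g., answer 'none of the above') instead of forcing a choice."""
    reduced = [o for i, o in enumerate(options) if i != gold_index]
    lines = [question]
    # Label the remaining options A, B, C, ...
    lines += [f"{chr(65 + i)}. {o}" for i, o in enumerate(reduced)]
    lines.append("Answer with the correct option, or say 'none of the above'.")
    return "\n".join(lines)

prompt = make_without_gold(
    "Which intent matches 'My card never arrived'?",
    ["card_arrival", "card_linking", "top_up_failed"],  # hypothetical BANK77-like intents
    gold_index=0,
)
```

A model that still answers "A" or "B" here exhibits exactly the limitation the benchmark probes.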

A new metric named OMNIACCURACY has also been presented to assess the LLMs' performance with greater accuracy. This metric evaluates LLMs' classification skills by combining their results along two dimensions of the KNOW-NO framework, which are as follows.

    ACCURACY-W/-GOLD: This measures the conventional accuracy when the right label is provided.
    ACCURACY-W/O-GOLD: This measures accuracy when the correct label is not available.

OMNIACCURACY seeks to better approximate human-level discrimination intelligence in classification tasks by demonstrating the LLMs’ capacity to manage both situations in which correct labels are present and those in which they are not.
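As a hedged sketch, the two dimensions can be scored separately and then aggregated. Counting an abstention (e.g., "none of the above") as the correct behavior in the without-gold case, and averaging the two accuracies, are illustrative assumptions here; the paper's exact combination rule may differ.

```python
# Illustrative OMNIACCURACY-style scoring: accuracy with the gold
# label present, accuracy without it (abstention counts as correct),
# and a simple average of the two as the combined score.

ABSTAIN = "none of the above"

def accuracy_with_gold(preds, golds):
    """Conventional accuracy: prediction matches the gold label."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def accuracy_without_gold(preds):
    """With the gold label removed, the correct behavior is to abstain."""
    return sum(p == ABSTAIN for p in preds) / len(preds)

def omni_accuracy(preds_with, golds, preds_without):
    return 0.5 * (accuracy_with_gold(preds_with, golds)
                  + accuracy_without_gold(preds_without))

score = omni_accuracy(
    preds_with=["card_arrival", "top_up_failed"],
    golds=["card_arrival", "card_linking"],
    preds_without=["none of the above", "card_arrival"],
)
# Here each dimension scores 0.5, so the combined score is 0.5.
```

A model that aces the with-gold dimension but never abstains would be penalized, which is precisely the behavior OMNIACCURACY is designed to surface.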

The team has summarized their primary contributions as follows.

    This study is the first to draw attention to the limitations of LLMs in classification tasks when the correct answers are absent.
    CLASSIFY-W/O-GOLD has been introduced, which is a new framework to assess LLMs and describe this task accordingly.
    The KNOW-NO Benchmark has been presented, which comprises one newly-created task and two well-known categorization tasks. The purpose of this benchmark is to assess LLMs in the CLASSIFY-W/O-GOLD scenario.
    The OMNIACCURACY metric has been proposed, which combines outcomes when correct labels are present and when they are absent to evaluate LLM performance in classification tasks. It provides a more in-depth assessment of the models' capabilities and a clearer picture of how well they function across situations.

Check out the Paper. All credit for this research goes to the researchers of this project.

