arXiv:2508.03663v1 Announce Type: cross Abstract: Reproducibility is a cornerstone of scientific validation and of the authority it confers on its results. Reproducibility in machine learning evaluations leads to greater trust, confidence, and value. However, the ground-truth responses used in machine learning often necessarily come from humans, among whom disagreement is prevalent, and surprisingly little research has studied the impact of effectively ignoring that disagreement, as is typically done. One reason for this gap is that budgets for collecting human-annotated evaluation data are limited, and obtaining responses from multiple annotators for each example greatly increases the per-item annotation cost. We investigate the trade-off between the number of items ($N$) and the number of responses per item ($K$) needed for reliable machine learning evaluation. We analyze a diverse collection of categorical datasets for which multiple annotations per item exist, along with simulated distributions fit to these datasets, to determine the optimal $(N, K)$ configuration, given a fixed budget ($N \times K$), for collecting evaluation data and reliably comparing the performance of machine learning models. Our findings show, first, that accounting for human disagreement can be achieved with a budget $N \times K$ of no more than 1000 (and often much lower) for every dataset tested, on at least one metric. Moreover, this minimal $N \times K$ almost always occurred for $K > 10$. Furthermore, the nature of the trade-off between $K$ and $N$ -- or whether one exists at all -- depends on the evaluation metric, with metrics that are more sensitive to the full distribution of responses performing better at higher values of $K$. Our methods can help ML practitioners collect more effective test data by identifying the metric, the number of items, and the number of annotations per item that yield the most reliable evaluation for a given budget.
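
To make the budgeted $(N, K)$ sweep concrete, the following is a minimal sketch, not the authors' code, datasets, or metrics: it simulates per-item human label distributions, draws $K$ annotations for each of $N = B / K$ sampled items under a fixed budget $B$, scores two hypothetical models with an illustrative soft-accuracy metric, and reports how often each split reproduces the model ordering obtained from the full (infinite-annotator) distributions. The Dirichlet simulation, the noisy models, the metric, and all constants are assumptions for illustration only.

```python
# Minimal sketch (not the authors' implementation): compare (N, K) splits of a
# fixed annotation budget B = N * K by simulating per-item human label
# distributions and checking how often a model comparison reproduces the
# ordering computed from the true distributions. Requires NumPy >= 1.22.
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 3
BUDGET = 1000          # fixed total number of annotations, N * K <= BUDGET
POOL_SIZE = 5000       # hypothetical pool of items available for annotation
TRIALS = 200           # simulated evaluation-data collections per (N, K)

# True per-item response distributions (stand-in for human disagreement).
true_dists = rng.dirichlet(alpha=np.ones(NUM_CLASSES) * 0.8, size=POOL_SIZE)

def make_model(noise):
    """Hypothetical model: a noisy copy of the true distributions."""
    logits = np.log(true_dists + 1e-9) + rng.normal(0, noise, true_dists.shape)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

model_a, model_b = make_model(0.3), make_model(0.6)  # model_a is less noisy

def soft_accuracy(model_probs, label_counts):
    """Expected agreement between the model's argmax and a random annotator."""
    empirical = label_counts / label_counts.sum(axis=1, keepdims=True)
    return empirical[np.arange(len(empirical)), model_probs.argmax(axis=1)].mean()

# "True" gap between the models, computed from the full distributions.
true_gap = (soft_accuracy(model_a, true_dists) -
            soft_accuracy(model_b, true_dists))

for k in (1, 5, 10, 25, 50):
    n = BUDGET // k
    agree = 0
    for _ in range(TRIALS):
        items = rng.choice(POOL_SIZE, size=n, replace=False)
        # Draw K annotations per sampled item from its true distribution.
        counts = rng.multinomial(k, true_dists[items])
        gap = (soft_accuracy(model_a[items], counts) -
               soft_accuracy(model_b[items], counts))
        agree += (np.sign(gap) == np.sign(true_gap))
    print(f"N={n:4d}, K={k:2d}: comparison matches true ordering "
          f"in {agree / TRIALS:.0%} of trials")
```

The same loop structure extends to other metrics (e.g., distribution-sensitive ones such as cross-entropy against the empirical label distribution), which is where the paper's observation about metric-dependent trade-offs would show up in the $(N, K)$ reliability curves.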