Fullrank: Bayesian Noisy Sorting

Published on July 24, 2025 7:03 PM GMT

Fullrank is an interactive CLI tool for Bayesian inference of list rankings based on noisy comparisons. It takes a list of items, then efficiently prompts the user to compare pairs of items until the user decides that the posterior distribution has sufficiently low entropy. It can then sample from the resulting posterior distribution and compute various statistics.

GitHub Repository

Background

Deterministic sorting algorithms rank lists by comparing pairs of items. If an item is greater than another, it is moved higher in the list. However, sometimes it is uncertain which item is greater: for example, when comparing the performance of AI models or eliciting user preferences, individual comparisons are noisy rather than definitive.

Estimating rankings in the presence of this uncertainty is called noisy sorting. A common approach is to model comparisons between items as depending on a latent numerical value ("skill" or "rating") for each item. For example, the commonly used Bradley–Terry model assumes that

$$p(i \succ j) = \sigma(s_i - s_j)$$

where $s_i$ denotes the latent skill of item $i$ and $\sigma$ is the logistic function.
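To make the model concrete, here is a minimal sketch (my own illustration, not part of Resorter or Fullrank) of the Bradley–Terry win probability:

```python
import math

def bt_win_probability(s_i: float, s_j: float) -> float:
    """Bradley-Terry model: probability that item i beats item j,
    given latent skills s_i and s_j, via the logistic function."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

# Equal skills give a coin flip; a large skill gap approaches certainty.
p_equal = bt_win_probability(1.0, 1.0)  # 0.5
p_gap = bt_win_probability(3.0, 0.0)    # ~0.95
```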

Motivation

@gwern's Resorter is a CLI tool for noisy sorting of lists based on the Bradley–Terry model. However, its frequentist approach limits it in a few ways:

- There is no principled way to quantify the accuracy of the resulting ranking, or to decide when to stop comparing.
- It cannot answer probabilistic questions such as "what is the probability that item $i$ is truly the best?"

As a project to learn more about Bayesian inference, I decided to build a Bayesian version of Resorter.

Thurstonian Model

The Bradley–Terry model is quite nice for maximum-likelihood estimation, but I was unable to get it to work well in a Bayesian setting. Given a normal prior on the skills $s$, the posterior density is

$$p(s \mid \mathcal{D}) \propto \varphi(s) \prod_{k=1}^{m} \sigma(s_{w_k} - s_{l_k})$$

where $\varphi$ denotes the normal density, $m$ is the number of comparisons, and $w_k$ and $l_k$ are the winning and losing items in comparison $k$. It appears some researchers have designed efficient sampling procedures for this posterior, but frankly they are beyond me.
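The awkwardness is easy to see in code. Here is a sketch of the unnormalized log-posterior, assuming a standard normal prior on each skill (the function name and prior are my illustration): the product of logistic terms has no conjugate structure with the normal prior, which is what makes sampling hard.

```python
import math

def bt_log_posterior(s, comparisons):
    """Unnormalized log-posterior for the Bradley-Terry model with a
    standard normal prior on each skill (an assumption for this sketch).
    `comparisons` is a list of (winner_index, loser_index) pairs."""
    log_prior = -0.5 * sum(x * x for x in s)  # log N(0, I) density, up to a constant
    log_likelihood = 0.0
    for w, l in comparisons:
        # log sigma(s_w - s_l), computed stably via log1p
        log_likelihood -= math.log1p(math.exp(-(s[w] - s[l])))
    return log_prior + log_likelihood

lp = bt_log_posterior([0.5, -0.5], [(0, 1)])
```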

Instead, I used a probability model very similar to Bradley–Terry, but using a probit link instead of a logit link. That is, under the Thurstonian model,

$$p(i \succ j) = \Phi(s_i - s_j)$$

where $\Phi$ denotes the cumulative distribution function of the standard normal distribution.
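The only change from Bradley–Terry is the link function. A sketch (again my own illustration), using the error function to evaluate $\Phi$:

```python
import math

def thurstonian_win_probability(s_i: float, s_j: float) -> float:
    """Thurstonian model: p(i beats j) = Phi(s_i - s_j), where Phi is
    the standard normal CDF (a probit link instead of a logit link)."""
    return 0.5 * (1.0 + math.erf((s_i - s_j) / math.sqrt(2.0)))

p_even = thurstonian_win_probability(0.0, 0.0)   # 0.5, same as Bradley-Terry
p_ahead = thurstonian_win_probability(1.0, 0.0)  # Phi(1) ~ 0.841
```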

I'll now derive the posterior density in the Thurstonian model. For convenience, I'll represent the observed comparisons as a matrix $W$ mapping score vectors to probits for each comparison. That is, $W_{ki} = 1$ if item $i$ wins comparison $k$, $W_{ki} = -1$ if item $i$ loses comparison $k$, and $W_{ki} = 0$ otherwise. With a prior $s \sim N_n(0, \Omega)$ on the skills of the $n$ items, the posterior density over the $m$ comparisons is

$$p(s \mid \mathcal{D}) \propto \varphi_n(s; \Omega)\, \Phi_m(W s; I_m)$$

where $\varphi_n(\cdot\,; \Omega)$ is the $N_n(0, \Omega)$ density and $\Phi_m(\cdot\,; \Sigma)$ is the CDF of $N_m(0, \Sigma)$, so that $\Phi_m(Ws; I_m) = \prod_{k=1}^{m} \Phi((Ws)_k)$.
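The comparison matrix is straightforward to build; here is a sketch (hypothetical helper, not Fullrank's code):

```python
import numpy as np

def comparison_matrix(n_items: int, comparisons) -> np.ndarray:
    """Build the m-by-n matrix W described above: row k has +1 in the
    winner's column and -1 in the loser's column, so (W @ s)[k] is the
    skill difference fed to the probit for comparison k."""
    W = np.zeros((len(comparisons), n_items))
    for k, (winner, loser) in enumerate(comparisons):
        W[k, winner] = 1.0
        W[k, loser] = -1.0
    return W

# Three items; item 0 beat item 1, then item 2 beat item 1.
W = comparison_matrix(3, [(0, 1), (2, 1)])
```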

It turns out that the normalization constant can be represented quite nicely using the multivariate normal CDF $\Phi_m$:

$$Z = \int \varphi_n(s; \Omega)\, \Phi_m(W s; I_m)\, ds$$

And since $\Phi_m(Ws; I_m) = P(u \le Ws)$ for $u \sim N_m(0, I_m)$ drawn independently of $s \sim N_n(0, \Omega)$, we have

$$Z = P(u - Ws \le 0)$$

Likewise, $u - Ws \sim N_m(0,\ I_m + W \Omega W^\top)$, so $Z = \Phi_m(0;\ I_m + W \Omega W^\top)$. Therefore,

$$p(s \mid \mathcal{D}) = \frac{\varphi_n(s; \Omega)\, \Phi_m(Ws; I_m)}{\Phi_m(0;\ I_m + W \Omega W^\top)}$$

This is called a unified skew-normal (SUN) distribution, and it is the posterior of most probit models. Using the convention of Arellano-Valle and Azzalini[1], we can write

$$s \mid \mathcal{D} \sim \mathrm{SUN}_{n,m}(0,\ \Omega,\ \Delta,\ 0,\ \Gamma), \qquad \Delta = \omega^{-1} \Omega W^\top D^{-1}, \quad \Gamma = D^{-1}(I_m + W \Omega W^\top) D^{-1}$$

where $\omega$ is the diagonal matrix of prior standard deviations (so $\Omega = \omega \bar\Omega \omega$ with $\bar\Omega$ the prior correlation matrix) and $D = \bigl[\operatorname{diag}(I_m + W \Omega W^\top)\bigr]^{1/2}$.
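Assuming my reading of the Arellano-Valle and Azzalini parameterization is right, the posterior parameters can be computed as follows (a sketch to be checked against Fullrank's actual code):

```python
import numpy as np

def sun_posterior_params(W: np.ndarray, Omega: np.ndarray):
    """Compute the skewness and correlation parameters (Delta, Gamma) of
    the SUN posterior for a probit comparison model with prior
    s ~ N(0, Omega). Follows my reconstruction of the Arellano-Valle &
    Azzalini convention; verify before relying on it."""
    omega = np.sqrt(np.diag(Omega))           # prior marginal standard deviations
    S = np.eye(W.shape[0]) + W @ Omega @ W.T  # I_m + W Omega W^T
    d = np.sqrt(np.diag(S))
    Gamma = S / np.outer(d, d)                # normalized to a correlation matrix
    Delta = (Omega @ W.T) / omega[:, None] / d[None, :]  # omega^-1 Omega W^T D^-1
    return Delta, Gamma

# Two items, standard normal prior, one comparison in which item 0 won.
Delta, Gamma = sun_posterior_params(np.array([[1.0, -1.0]]), np.eye(2))
```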

Efficient Sampling

Arellano-Valle and Azzalini[1:1] also gives us a convolutional representation of the posterior. If

$$U_0 \sim N_n\!\left(0,\ \bar\Omega - \Delta \Gamma^{-1} \Delta^\top\right), \qquad U_1 \sim \mathrm{TN}_m(0,\ \Gamma;\ 0)$$

where $\mathrm{TN}_m(0, \Gamma;\, 0)$ denotes the $N_m(0, \Gamma)$ distribution truncated below $0$, then

$$s \mid \mathcal{D} \;\stackrel{d}{=}\; \omega\left(U_0 + \Delta \Gamma^{-1} U_1\right)$$

Fullrank exploits this fact to efficiently sample from the posterior using samples of $U_0$ and $U_1$.
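A naive version of this sampler might look like the sketch below, which draws the truncated component by rejection sampling (workable for small numbers of comparisons; Fullrank presumably uses something smarter, and all names here are my own):

```python
import numpy as np

def sample_sun_posterior(Delta, Gamma, Omega_bar, omega, n_samples, seed=0):
    """Sample the zero-mean SUN posterior via the convolutional
    representation s = omega * (U0 + Delta Gamma^-1 U1), where
    U0 ~ N(0, Omega_bar - Delta Gamma^-1 Delta^T) and U1 is N(0, Gamma)
    truncated below zero. A sketch, not Fullrank's actual sampler."""
    rng = np.random.default_rng(seed)
    m = Gamma.shape[0]
    G_inv_Dt = np.linalg.solve(Gamma, Delta.T)  # Gamma^-1 Delta^T, shape (m, n)
    cov0 = Omega_bar - Delta @ G_inv_Dt
    U0 = rng.multivariate_normal(np.zeros(len(omega)), cov0, size=n_samples)
    U1 = np.empty((n_samples, m))
    filled = 0
    while filled < n_samples:
        cand = rng.multivariate_normal(np.zeros(m), Gamma, size=n_samples)
        keep = cand[(cand >= 0.0).all(axis=1)]  # rejection: keep positive-orthant draws
        take = min(len(keep), n_samples - filled)
        U1[filled:filled + take] = keep[:take]
        filled += take
    return omega * (U0 + U1 @ G_inv_Dt)         # shape (n_samples, n_items)

# Two items, standard normal prior, one comparison in which item 0 won:
Delta = np.array([[1.0], [-1.0]]) / np.sqrt(3.0)
samples = sample_sun_posterior(Delta, np.eye(1), np.eye(2), np.ones(2), 2000)
# The winner's sampled skills should now sit above the loser's on average.
```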

Optimal Comparison Choice

Ideally, Fullrank should always present the user with the most informative comparison. That is, the comparison whose probit has maximal entropy.

Unified skew-normal distributions are closed under full-rank linear transformations, so each comparison probit is distributed according to a one-dimensional SUN distribution. At least in the case of a standard normal prior, each comparison has identical first and second moments, so intuitively the entropy should be controlled by the skewness.

Fullrank currently assumes that the entropy is decreasing in the $L^2$ norm of the skewness parameter $\Delta$, which seems to work well in practice. However, I haven't been able to prove that this works, and it definitely fails for certain non-scalar choices of prior covariance (though these are currently not supported anyway). If you have any better ideas for choosing comparisons, please let me know!
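Under this assumption, comparison selection reduces to scoring each candidate pair by the norm of its induced skewness. The sketch below uses the linear-transform rule for SUN distributions to get the skewness of each probit $w^\top s$; this is my reconstruction, and the exact rule in Fullrank may differ:

```python
from itertools import combinations
import numpy as np

def most_informative_pair(Omega, Delta, candidates=None):
    """Heuristic comparison chooser: among candidate pairs (i, j), pick
    the one whose comparison probit w^T s has the smallest L2 norm of
    induced skewness, taken as a proxy for maximal entropy. By the
    linear-transform rule for SUN distributions, the induced skewness
    is (w omega Delta) / sqrt(w Omega w^T)."""
    n = Omega.shape[0]
    omega = np.sqrt(np.diag(Omega))
    if candidates is None:
        candidates = list(combinations(range(n), 2))
    best_pair, best_norm = None, np.inf
    for i, j in candidates:
        w = np.zeros(n)
        w[i], w[j] = 1.0, -1.0
        scale = np.sqrt(w @ Omega @ w)      # std dev of the probit argument
        skew = (w * omega) @ Delta / scale  # induced 1-D skewness vector
        norm = np.linalg.norm(skew)
        if norm < best_norm:
            best_pair, best_norm = (i, j), norm
    return best_pair

# Three items, standard normal prior, item 0 has already beaten item 1:
Delta = np.array([[1.0], [-1.0], [0.0]]) / np.sqrt(3.0)
pair = most_informative_pair(np.eye(3), Delta)
# A fresh pair involving item 2 is preferred over re-comparing 0 and 1.
```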


[1] R. B. Arellano-Valle and A. Azzalini, "Some properties of the unified skew-normal distribution," Statistical Papers, vol. 63, pp. 461–487, 2022. doi:10.1007/s00362-021-01235-2

