Approximating Human Preferences Using a Multi-Judge Learned System

Published on July 31, 2025 6:01 PM GMT

TL;DR: We present a conceptual discussion and loose formalism regarding Expert Orchestration, with an emphasis on judges. We motivate the problem of finding the best way to combine multiple judges' scores and present a solution to it: learning the aggregation function. Then, we present the architecture and low-level details of the experiments we ran to learn this function. We conclude by discussing scalability concerns, room for improvement, further work, and safety considerations of our work.

Epistemic status: Me (José Faustino), Fernando Avalos, Eitan Sprejer and Augusto Bernardi participated in a 48-hour Apart Hackathon related to Expert Orchestration and got second place. This post is roughly a translation of the original work into a blog post. The original submission can be found here, and the GitHub repository with all of the code used can be found here. José wrote the whole blog post except the "Further experiments" section (which Fernando wrote), and many parts of this post were copied directly from the original submission. Organized sources and appendices can be found in the original submission. Personal views regarding what future work to pursue are my own (José), though the team shares most of those views. We thank the Apart and Martian teams for valuable feedback on this draft.

Acknowledgements

We would like to thank Apart Research and Martian for hosting the hackathon that initiated this project. Apart Research and Martian provided insightful feedback on our initial draft, and without their support this work would not exist.

Introduction

In this section, we introduce and formalize the main concepts and notions of the following work. 

Expert Orchestration Vision

The modern AI paradigm involves using big, expensive, hard-to-train single models, like OpenAI's o3, Anthropic's Claude 4 and DeepMind's Gemini 2.5 Pro, which we call Monolithic AIs. This paradigm comes with several problems, among them the fact that only a few players can actually train those models and that users don't really know how those models 'think'.

On the other hand, the number of existing models grows rapidly every day. Hugging Face hosts over 1.7M models, for example. Smaller models are cheaper to make and easier to interpret. Can we take advantage of this by designing a paradigm based on smaller models that outperforms the current paradigm in cost, performance and safety?

One possible alternative approach, called Expert Orchestration (EO), proposes to:

1. Use specialized models that evaluate other models in different capabilities (for example, domain expertise, ethics, bias, etc.). Those specialized models are called Judges.
2. Design intelligent systems that, based on judge evaluations of different models, choose what model is the best to perform a certain task. Those systems are called Routers.

In this approach, we use judges to evaluate small models and a router to direct prompts to those small models using the judge evaluations. It could then be possible to achieve the same performance as monolithic AIs at a smaller cost (since we're only using smaller models), with better interpretability/understanding of the models (since smaller models are usually easier to interpret/understand) and thus better safety properties.
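
To make the router picture concrete, here is a minimal toy sketch (ours, not from the original submission); the model names, judge names, scores and weighting scheme are all made up for illustration:

```python
# Hypothetical judge evaluations: judge_scores[model][judge] in [0, 4].
# Model and judge names are made up for illustration only.
judge_scores = {
    "small-math-model": {"math": 3.8, "ethics": 3.0, "style": 2.5},
    "small-chat-model": {"math": 2.0, "ethics": 3.5, "style": 3.9},
}

def route(task_weights: dict[str, float]) -> str:
    """Pick the model whose weighted judge scores are highest for this task."""
    def score(model: str) -> float:
        return sum(task_weights.get(judge, 0.0) * value
                   for judge, value in judge_scores[model].items())
    return max(judge_scores, key=score)

# A math-heavy task weights the "math" judge most heavily.
print(route({"math": 0.7, "ethics": 0.2, "style": 0.1}))  # -> small-math-model
```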

What is a metric?

Suppose we want to evaluate how good a bunch of different models are at "mathematics expertise". We call "mathematics expertise" a metric. We'll now decompose this problem into smaller, intuitive parts and explain what problems there are to solve.

First, "mathematics expertise" is very broad, and includes a bunch of different skills, for example:

So our "mathematics expertise" metric could be broken down into at least 3 other sub-metrics. Possibly, there are other ways to break down "mathematics expertise", including adding more sub-metrics to a given sub-metric.

Then, a question arises: what metric decomposition allows the judge system to achieve optimal performance? Is it better to let a judge evaluate a single 'high-level' metric (for example, just 'mathematics expertise') or a list of metrics (for example, a decomposition like the above, but one that is a huge list of sub-metrics) covering a specific capability? Intuition seems to tell us that the optimal number of metrics lies somewhere in the middle, but we don't know for sure.

In the rest of this work, we'll not focus on this first problem and just assume there is a set of metrics $\{m_1, \dots, m_k\}$, where each metric $m_i$ can be a high-level metric such as "mathematics expertise", "ethics", "bias" or a more specific metric such as the ones outlined in the bullet points above.

What are judges?

A second problem is: what exactly are judges? Suppose we have a set of 5 LLMs. There are (at least) two options for what a judge could actually do:

1. The judge could receive a prompt and an answer generated by one of the 5 LLMs, say LLM 2, and output a value for 'how well LLM 2 answered the question'. This judge would be evaluating each LLM on a metric (or a set of metrics). Here, the judge's score depends on the LLM that generated the answer.
2. The judge could be LLM-agnostic: it just receives a (prompt, answer) pair and outputs a value. This value represents 'how well this specific answer answers the prompt, based on a specific metric (or set of metrics)'. This judge would be evaluating the answer independently of any set of LLMs.

Implementation details are also an important question: will we implement judges as LLMs? If so, should we use a single LLM or multiple LLMs? Following the Expert Orchestration vision, if judges are implemented as LLMs, we should use small, interpretable ones.

How many judges for each metric?

There's also a third question: how many judges do we use for each metric? We could use a single judge for all metrics, one judge for each metric, multiple judges for each metric, etc.

To simplify, we'll formalize a judge as follows:

Let $\mathcal{P}$ be a space of prompts (that is, the set of all possible prompts one could give to an LLM) and $\mathcal{A}$ the space of possible answers to those prompts (that is, the set of all possible answers to those prompts).

A judge is a function that receives a (prompt, answer) pair and evaluates the answer to the prompt across some number of dimensions, i.e., a judge is a function

$$J: \mathcal{P} \times \mathcal{A} \to \mathbb{R}^d$$

where $d$ represents the number of dimensions the judge is evaluating. For example, if our metrics are domain expertise, ethics and writing skills, we could have $d = 3$, where the judge evaluates all three metrics; $d = 2$, where the judge evaluates domain expertise and ethics; or $d = 1$, where the judge evaluates a single metric.

Or it could also be that we only have one metric, but the judge evaluates $d$ different 'facets' of it. For example, the metric could be 'mathematics expertise' and the judge evaluates the quality from different perspectives: if $d = 2$, the judge could be evaluating 'how rigorous the explanation is' and 'how intuitive the explanation is'.

Note: It might be that considering a single metric with 10 'facets' is different - regarding performance - than considering 10 metrics with 1 'facet'. We're not sure. 
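
As an illustration of this formalism (a toy sketch of ours, not how our actual judges are implemented; those are LLMs, described later), a judge is just a callable from a (prompt, answer) pair to a vector of $d$ scores:

```python
from typing import Callable

import numpy as np

# A judge maps a (prompt, answer) pair to a vector of d scores.
Judge = Callable[[str, str], np.ndarray]

def toy_math_judge(prompt: str, answer: str) -> np.ndarray:
    """Hypothetical d=2 judge: scores 'rigor' and 'intuitiveness' on a 0-4 scale."""
    rigor = 4.0 if "therefore" in answer.lower() else 1.0
    intuitive = 4.0 if "intuitively" in answer.lower() else 1.0
    return np.array([rigor, intuitive])

scores = toy_math_judge("Prove that 2 is even.", "2 = 2*1, therefore 2 is even.")
print(scores)  # [4. 1.]
```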

The judge joining problem

Now, we introduce the main problem we aim to solve with our work.

We do expect that using multiple judges would make a judge system safer and perform better:

So it's interesting to research well-performing multi-judge architectures if we want to improve the safety of judge systems.

Suppose, for simplicity, we're given a set of $t$ judges, each of which evaluates across a single dimension, that is,

$$J_i: \mathcal{P} \times \mathcal{A} \to \mathbb{R}$$

for $i = 1, \dots, t$.

Assume, for simplicity, we have a single metric 'quality' and each judge evaluates a different 'facet of quality', e.g., 'legality', 'ethics', 'domain knowledge', etc.

Thus, each judge $J_i$ receives as input a (prompt, answer) pair $(p, a)$ and outputs an evaluation value $J_i(p, a) \in \mathbb{R}$.

How do we combine the scores of all $t$ judges? Should we take the average? The minimum? The maximum? Some sort of conditional logic with the judges (for example, if the score judge $J_1$ provided is above a certain number, use that value; otherwise, use the score judge $J_2$ provided, etc.)?

Mathematically, we want to choose a function

$$f: \mathbb{R}^t \to \mathbb{R}$$

that receives the evaluations $J_1(p,a), \dots, J_t(p,a)$ generated by the judges and outputs a single evaluation $f(J_1(p,a), \dots, J_t(p,a)) \in \mathbb{R}$.

For example, if we combine the judges using the average of the scores, then

$$f(x_1, \dots, x_t) = \frac{1}{t} \sum_{i=1}^{t} x_i$$

Or, if we combine the judges by taking the maximum of the scores, then

$$f(x_1, \dots, x_t) = \max_{i = 1, \dots, t} x_i$$
Simply taking the maximum, for example, is not a great idea for performance and safety purposes: it could be that inside our set of judges, one judge is 'contaminated' and provides high scores to bad-quality answers, while all the other judges give those answers low scores. Then the system will output a high score for a (prompt, answer) pair that should have a low score. Imagine, now, that the metric is safety and the contaminated judge provides high scores to unsafe answers; the system will then say unsafe answers are safe, simply because we combined the judges in a non-robust manner. We do expect the average, on the other hand, to be more robust to that.
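
A tiny sketch (ours, with made-up numbers) of why this matters: with one contaminated judge, the max aggregator is fooled while the mean stays low:

```python
import numpy as np

def mean_aggregator(scores: np.ndarray) -> float:
    return float(np.mean(scores))

def max_aggregator(scores: np.ndarray) -> float:
    return float(np.max(scores))

# Nine judges correctly give a bad answer low scores; one contaminated judge gives it 4.0.
scores = np.array([0.5] * 9 + [4.0])

print(mean_aggregator(scores))  # 0.85 -- stays low, fairly robust
print(max_aggregator(scores))   # 4.0  -- the single contaminated judge dominates
```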

Based on that, we can wonder: what is the best way to join multiple judges' scores?

The solution: learning the judge aggregator

Suppose we know the 'ground truth' way of scoring answers (for example, one human or multiple humans could have carefully annotated, for each (prompt, answer) pair, the true score of that answer to that prompt).

Call $f^*: \mathcal{P} \times \mathcal{A} \to \mathbb{R}$ this ground truth function that correctly scores each pair $(p, a)$.

(Recall that $\mathcal{P}$ is the space of prompts and $\mathcal{A}$ is the space of answers.)

Then, we can parameterize our function $f$ that combines the judges, that is, $f = f_\theta$, and search for the parameters $\theta$ such that $f_\theta(J_1(p,a), \dots, J_t(p,a))$ best approximates $f^*(p, a)$.

A way to do this is to compute some notion of 'error' or 'distance' between $f_\theta$ and $f^*$. We call this quantity the loss function and denote it by $\mathcal{L}(\theta)$. For didactic purposes, it could be, for example, the average quadratic error over a dataset of $N$ pairs $(p_n, a_n)$:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \Big( f_\theta\big(J_1(p_n, a_n), \dots, J_t(p_n, a_n)\big) - f^*(p_n, a_n) \Big)^2$$

Then, we can compute the parameters $\theta^*$ that minimize $\mathcal{L}(\theta)$.

Mathematically, we say that we want a function $f_{\theta^*}$ where

$$\theta^* = \arg\min_{\theta} \mathcal{L}(\theta)$$

In other words, we train a model $f_\theta$ to minimize the loss above.
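
As a minimal sketch of this optimization, assuming synthetic judge scores and ground-truth values (the models we actually trained are described in the low-level section below), here is plain gradient descent on the mean-squared loss for the simplest parametric aggregator, a learned weighted sum:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N (prompt, answer) pairs scored by t judges, plus ground-truth scores.
N, t = 1000, 10
J = rng.uniform(0.0, 4.0, size=(N, t))        # J[n, i] = score of judge i on pair n
theta_true = rng.dirichlet(np.ones(t))        # hidden "true" aggregation weights
y = J @ theta_true + rng.normal(0.0, 0.1, N)  # ground-truth scores f*(p_n, a_n)

# f_theta(x) = x . theta (a learned weighted sum); minimize the mean squared error.
theta = np.zeros(t)
lr = 0.01
for _ in range(2000):
    pred = J @ theta
    grad = 2.0 / N * J.T @ (pred - y)         # gradient of the MSE loss w.r.t. theta
    theta -= lr * grad

print("learned weights:", np.round(theta, 2))
print("MSE:", np.mean((J @ theta - y) ** 2))
```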

Conceptually, our solution is a better way to combine different judges based on some choice of ground truth (one could consider different ground truths: the preferred scores of a single individual, the aggregate of preferred scores of multiple individuals, etc.).

In the next two sections, we present concrete details of our implementation. If you only wish to read conceptual discussions, you can skip to the Work Discussion section.

The high-level architecture

In this section, we explain our implementation at a high level, from a concrete standpoint.

In order to solve this optimization problem, we need to:

1. Gather a dataset of (prompt, answer) pairs;
2. Obtain a ground-truth preference score for each pair;
3. Run each pair through every judge to get the judge scores;
4. Train a model $f_\theta$ to predict the ground-truth score from the judge scores.

Diagram of our experiment setup. Starting from prompt-answer pairs, on one hand we simulate the human preference scores using a llama-3.1-405b model, and on the other we get individual rubric scores using different judges, training a model to predict the true preference scores.

The low-level details

Here we present the implementation details of our solution. The initial experiments were done during the 48h Apart Hackathon and the further experiments were added since then. All the code can be found here.

Initial experiments

Judge Architecture

The judges used are simply LLMs with 'rubrics'. A 'rubric' is a prompt we pass to the LLM telling it what to evaluate and how to return the score. For details on the specific rubrics and models we used, see the Appendix of our original work. We used 10 judges with different rubrics.

After we fix our judges, we run the (prompt, answer) pairs through them to get each judge's score.
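
A sketch of what such a rubric judge can look like, assuming an OpenAI-compatible endpoint; the rubric text and model name below are placeholders, not the actual rubrics and judge models from our appendix:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and an API key in the environment

# Placeholder rubric -- the real rubrics are listed in the appendix of the original submission.
RUBRIC = (
    "You are a judge evaluating the RELEVANCE of an answer to a prompt. "
    "Reply with a single number between 0.0 and 4.0 and nothing else."
)

def judge_score(prompt: str, answer: str, judge_model: str = "gpt-4o-mini") -> float:
    """Query a rubric-prompted LLM judge and parse its numeric score."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Prompt:\n{prompt}\n\nAnswer:\n{answer}"},
        ],
        temperature=0.0,
    )
    return float(response.choices[0].message.content.strip())
```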

In the initial experiments, we tested two different models for constructing the learned function: a GAM (Generalized Additive Model) and an MLP (Multi-Layer Perceptron), the latter with 16 hidden units. A Generalized Additive Model is a type of regression model that combines interpretability with flexibility: it models the prediction as a sum of smooth, non-linear effects of each feature. This makes it well-suited for understanding how individual judge scores influence the overall aggregated score.

We chose these simple architectures to provide a proof of concept for the methodology employed, and to facilitate future work interpreting their learned parameters.
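
A sketch of how these two aggregators can be fit, using scikit-learn and pygam on synthetic stand-in data (the library choices and data here are ours for illustration; the actual training code is in the linked repository):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.neural_network import MLPRegressor
from pygam import LinearGAM

# X: (N, 10) matrix of judge scores, y: (N,) ground-truth preference scores.
# Synthetic stand-ins here; in the real experiments these come from the dataset and the judges.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 4.0, size=(10_000, 10))
y = 2.5 * (X @ rng.dirichlet(np.ones(10))) + rng.normal(0.0, 0.5, 10_000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# MLP aggregator with a single hidden layer of 16 units.
mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X_tr, y_tr)

# GAM aggregator: by default pygam fits a smooth spline term per judge score.
gam = LinearGAM().fit(X_tr, y_tr)

# Naive baseline: just average the ten judge scores.
baseline = X_te.mean(axis=1)

for name, pred in [("MLP", mlp.predict(X_te)),
                   ("GAM", gam.predict(X_te)),
                   ("Mean baseline", baseline)]:
    print(f"{name}: MSE={mean_squared_error(y_te, pred):.2f}, "
          f"MAE={mean_absolute_error(y_te, pred):.2f}")
```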

The results

Our results show that both the MLP and the GAM perform better than the naive approach of taking the mean of the judges' ratings.

| Model | MSE | MAE | R² |
|---|---|---|---|
| NN Model (MLP) | 3.06 | 1.35 | 0.57 |
| Model (GAM) | 3.11 | 1.36 | 0.56 |
| Naive Mean Baseline | 5.36 | 1.83 | 0.24 |

Further experiments

Given that GAMs are easily interpretable, we ran a partial dependence analysis to understand the behavior of the learned judge aggregator as we vary the judges' scores the model is aggregating.

The partial dependence of the aggregator on feature $j$ is

$$PD_j(x_j) = \mathbb{E}_{x_{-j}}\left[ f_\theta(x_j, x_{-j}) \right]$$

where $x_j$ is the score of the selected judge and $x_{-j}$ denotes the scores of all the other judges.

Intuitively speaking, a partial dependence analysis characterizes the behavior of a function as one feature is varied along its possible values. Making this possible requires keeping the other features "fixed". However, we can't fix the other features at arbitrary values of $x_{-j}$ and then measure the contribution of the selected feature. To overcome this obstacle, an expectation over all the possible values of $x_{-j}$ is taken, which more appropriately reflects the intuition behind fixing values.

In this context, the features are the judges' scores for every (question, answer) pair. All the judges gave scores in the 0.0-4.0 range.
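
The sketch below (ours) computes this expectation empirically: sweep one judge's score over the 0.0-4.0 range while keeping the other judges' scores as observed in the data, and average the aggregator's predictions.

```python
import numpy as np

def partial_dependence(model, X: np.ndarray, feature: int, grid: np.ndarray) -> np.ndarray:
    """Empirical partial dependence of `model` on one judge's score.

    For each grid value, set that judge's score to the value for every row of X
    (keeping all other judges' scores as observed) and average the predictions.
    """
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value
        pd_values.append(model.predict(X_mod).mean())
    return np.array(pd_values)

# Example (reusing `gam` and `X_te` from the previous sketch):
# grid = np.linspace(0.0, 4.0, 41)
# pd_curve = partial_dependence(gam, X_te, feature=0, grid=grid)
```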

By plotting the partial dependence of some of the judges, we obtain:

Partial dependence plots for non-safety features (Relevance, Reasoning, Conciseness, Style). The GAM shows a strong and consistent positive relationship: higher judge scores in these attributes lead to higher aggregated predictions.
Partial dependence plots for safety-related features (Harmlessness, Privacy, Bias). The GAM response is weaker and less consistent: curves are flat or oscillatory, indicating limited or unstable influence of these scores on the aggregated prediction.

From the plots, we observe that the GAM strongly reacts to non-safety features like relevance, reasoning, and style: the partial dependence curves rise steadily as these scores increase. In contrast, for safety-related features (harmlessness, privacy, bias), the curves are flatter or oscillatory, meaning the model either ignores these inputs or treats them inconsistently. This indicates a tendency to prioritize performance-oriented metrics over safety-related ones in the aggregation.

Work Discussion

In this section, we discuss how our solution might behave as models keep scaling, ideas that would make this particular work better, some ideas for further exploration, and safety aspects of our solution.

Scalability concerns

Room for improvement of this work

Next steps

Safety discussion

Multi-judge systems are expected to be more robust than single-judge systems. We produced a model that gives multi-judge evaluation systems better performance (relative to the baseline) while remaining interpretable and cheap to train, so we believe this is a contribution to AI safety.

Additionally, our architecture can be experimented with as described above and could generate insights on safety-relevant problems. For example, if we do "Experiment idea 4" above and show that we can learn $f_\theta$, then control it with another, non-learned function and still get near-optimal performance (relative to the judge evaluation), we would have a really safe multi-judge evaluation system that also performs really well.
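
For intuition only, here is a toy sketch of what "controlling a learned aggregator with a non-learned function" could look like; the safety-judge indices, threshold, and gating rule are hypothetical, not something we tested:

```python
import numpy as np

SAFETY_JUDGES = [4, 5, 6]   # hypothetical indices of harmlessness / privacy / bias judges
SAFETY_THRESHOLD = 2.0      # hypothetical minimum acceptable safety score

def controlled_aggregate(learned_f, scores: np.ndarray) -> float:
    """A non-learned control wrapped around a learned aggregator f_theta.

    If any safety judge scores the answer below the threshold, the aggregate is
    capped at that safety score, regardless of what the learned model predicts.
    """
    learned_score = float(learned_f(scores))
    min_safety = float(scores[SAFETY_JUDGES].min())
    if min_safety < SAFETY_THRESHOLD:
        return min(learned_score, min_safety)
    return learned_score

# Example usage with the MLP from the earlier sketch:
# controlled_aggregate(lambda s: mlp.predict(s[None, :])[0], judge_scores_for_one_pair)
```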

However, we highlight possible risks of our work:

To summarize, our simplified work leaves room for a lot of potentially promising experiments that might elucidate the optimal way of aggregating multiple judges, and we believe those would be very cool to follow up on.


