LessWrong · April 10
New Paper: Infra-Bayesian Decision-Estimation Theory

Published on April 10, 2025 9:17 AM GMT

Official title: "Regret Bounds for Robust Online Decision Making"

Abstract: We propose a framework which generalizes "decision making with structured observations" by allowing robust (i.e. multivalued) models. In this framework, each model associates each decision with a convex set of probability distributions over outcomes. Nature can choose distributions out of this set in an arbitrary (adversarial) manner, that can be nonoblivious and depend on past history. The resulting framework offers much greater generality than classical bandits and reinforcement learning, since the realizability assumption becomes much weaker and more realistic. We then derive a theory of regret bounds for this framework. Although our lower and upper bounds are not tight, they are sufficient to fully characterize power-law learnability. We demonstrate this theory in two special cases: robust linear bandits and tabular robust online reinforcement learning. In both cases, we derive regret bounds that improve state-of-the-art (except that we do not address computational efficiency).
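To make the "robust (i.e. multivalued) model" idea concrete, here is a minimal toy sketch, not taken from the paper: a bandit arm whose Bernoulli mean is only known to lie in an interval (a simple convex set of distributions), with nature free to pick any admissible mean at every pull, possibly depending on history. All class and variable names are illustrative assumptions.

```python
import random

class RobustBernoulliArm:
    """Arm whose Bernoulli mean is only constrained to lie in [lo, hi]."""
    def __init__(self, lo, hi):
        assert 0.0 <= lo <= hi <= 1.0
        self.lo, self.hi = lo, hi

    def pull(self, history):
        # Adversarial nature: choose any admissible mean. Here it simply
        # minimizes reward; a nonoblivious adversary could use `history`.
        mean = self.lo
        return 1.0 if random.random() < mean else 0.0

# Under a robust model the learner can only count on each arm's
# worst-case mean, so the natural benchmark is the maximin arm.
arms = [RobustBernoulliArm(0.2, 0.9), RobustBernoulliArm(0.5, 0.6)]
maximin_arm = max(range(len(arms)), key=lambda i: arms[i].lo)
print(maximin_arm)  # arm 1: worst-case mean 0.5 beats arm 0's 0.2
```

This also illustrates why the realizability assumption weakens: the true environment only has to stay inside the interval, not match a single known distribution.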

In our new paper, we generalize Foster et al.'s theory of "decision-estimation coefficients" to the "robust" (infra-Bayesian) setting. The former is the most general known theory of regret bounds for multi-armed bandits and reinforcement learning, and comes close to giving tight bounds for all "reasonable" hypothesis classes. In our work, we obtain an analogous theory, although our bounds are not quite as tight.

Remarkably, the result also establishes a tight connection between infra-Bayesianism and Garrabrant induction. Specifically, the algorithm which demonstrates the upper bound works by computing beliefs in a Garrabrant-induction-like manner[1], and then acting on these beliefs via an appropriate trade-off between infra-Bayesian exploitation and exploration (defined using the "decision-estimation" approach).
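The exploit/explore trade-off can be caricatured as follows. This is an illustrative sketch under assumed names, not the paper's actual algorithm: given weights over robust hypotheses (each assigning an interval of expected reward to each decision), the learner scores a decision by its weighted worst-case expected reward (infra-Bayesian exploitation) plus an exploration bonus.

```python
def select_decision(decisions, hypotheses, weights, bonus, gamma=1.0):
    """hypotheses[h][d] = (lo, hi) interval of expected reward under h;
    weights are belief weights over hypotheses (hypothetical names)."""
    def score(d):
        # Infra-Bayesian exploitation: weighted worst-case expected reward.
        exploit = sum(w * hypotheses[h][d][0]
                      for h, w in zip(hypotheses, weights))
        return exploit + gamma * bonus[d]  # exploration term

    return max(decisions, key=score)

# Two toy hypotheses over two decisions "a" and "b":
hyps = {"h1": {"a": (0.1, 0.9), "b": (0.4, 0.5)},
        "h2": {"a": (0.3, 0.7), "b": (0.2, 0.8)}}
w = [0.5, 0.5]
print(select_decision(["a", "b"], hyps, w, bonus={"a": 0.0, "b": 0.1}))
```

The real algorithm replaces the fixed weights with beliefs computed in a Garrabrant-induction-like manner at the start of each episode, and calibrates the trade-off via the decision-estimation approach.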

It seems quite encouraging that the two different theories which came out of thinking about "logical uncertainty" (infra-Bayesianism and Garrabrant induction) can be unified in this manner[2], boosting our confidence that we are on the right path.

  1. ^

    At the start of each episode.

  2. ^

    Although we think that a fuller treatment of logical uncertainty requires introducing metacognition.



