Recent Redwood Research project proposals

This post shares a collection of unpolished project proposals on AI control and alignment, spanning research directions including control protocols, monitoring protocols, collusion, elicitation, synthetic information and inputs, training-time alignment methods, RL and reward hacking, and better understanding and interpretability. The projects aim to explore how to control AI more effectively, prevent undesired behavior, and keep AI aligned with human values. The post highlights the diversity, importance, and potential research value of these projects, offering a useful reference for AI safety research.

🛡️ **Control protocol transfer**: Study whether control protocols perform consistently across different settings, to assess how well they generalize.

🕵️ **Human auditing and backdoor detection**: Investigate how human review can catch anomalous AI behavior, and how to improve human performance at this task.

🤖 **Training attack policies**: Explore the possibility of training models to generate their own attack strategies, to make control setups more adaptive.

📡 **Monitoring protocol optimization**: Design new monitoring protocols and make existing ones more efficient, so as to better catch potentially harmful AI actions.

🤝 **Collusion and untrusted monitoring**: Study how to monitor effectively when AI models may collude with one another, and how to reduce the risk of collusion.

🧠 **Synthetic information and inputs**: Study how to systematically make an AI believe false information, to test its sensitivity to external inputs.

Published on July 14, 2025 10:27 PM GMT

Previously, we've shared a few higher-effort project proposals relating to AI control in particular. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ widely in scope, area, and difficulty.

Control

These projects are all related to the field of AI Control. Many of them are extensions of Redwood's previous work in this area.

Basic open questions in control

Monitoring protocols

Control setups include a protocol for monitoring potentially problematic actions, and then for using that monitoring to reduce the chance that a malicious model can take bad actions. These proposals relate to new protocols for catching bad actions, and to making existing protocols more efficient.
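To make the shape of such a protocol concrete, here is a minimal sketch of trusted monitoring with a fixed audit budget. The names (`trusted_monitor_score`, `audit_budget`) and the budget value are illustrative assumptions, not code from our actual control evaluations.

```python
# Illustrative sketch only: trusted monitoring with a fixed audit budget.
# `trusted_monitor_score` and `audit_budget` are hypothetical names, not an
# actual implementation from our experiments.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Action:
    task_id: int
    content: str

def trusted_monitoring(
    actions: List[Action],
    trusted_monitor_score: Callable[[Action], float],  # higher = more suspicious
    audit_budget: float = 0.02,  # fraction of actions humans can afford to audit
) -> List[Action]:
    """Score each untrusted-model action with a trusted (weaker) monitor and
    flag the most suspicious fraction, up to the audit budget, for human audit."""
    ranked = sorted(actions, key=trusted_monitor_score, reverse=True)
    n_audits = max(1, int(len(actions) * audit_budget))
    return ranked[:n_audits]
```

Much of the protocol-design question is then what to do with actions that aren't audited, e.g. executing them directly, deferring suspicious-but-unaudited ones to a trusted model, or blocking them.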

Untrusted monitoring and collusion

Once we're working with very capable untrusted models, they may be able to easily fool any model dumb enough that we would consider it trusted. This means we might want to use the untrusted model itself, or a modified version of it, for monitoring. This is obviously problematic in the case that the policy model and the monitor model collude. These projects study this sort of collusion and the general dynamics of untrusted monitoring.
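One family of countermeasures is upfront validation: before relying on an untrusted monitor, test it on planted synthetic attacks that it cannot reliably distinguish from real ones. The sketch below is only illustrative; the function names and thresholds are placeholders rather than a prescription.

```python
# Rough illustration (not actual experiment code): upfront validation of an
# untrusted monitor against planted synthetic attacks.

from typing import Callable, List

def validate_untrusted_monitor(
    untrusted_monitor_score: Callable[[str], float],  # higher = more suspicious
    synthetic_attacks: List[str],  # known-bad actions planted by the blue team
    benign_actions: List[str],     # known-good actions for calibration
    flag_threshold: float,
    min_catch_rate: float = 0.95,
    max_false_alarm_rate: float = 0.2,
) -> bool:
    """Return True only if the monitor flags nearly all planted attacks without
    flagging benign actions indiscriminately."""
    catch_rate = sum(
        untrusted_monitor_score(a) >= flag_threshold for a in synthetic_attacks
    ) / max(1, len(synthetic_attacks))
    false_alarm_rate = sum(
        untrusted_monitor_score(a) >= flag_threshold for a in benign_actions
    ) / max(1, len(benign_actions))
    return catch_rate >= min_catch_rate and false_alarm_rate <= max_false_alarm_rate
```

A monitor that tries to collude by under-reporting attacks has to either miss the planted attacks here, or flag so aggressively that its false-alarm rate on benign actions gives it away.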

Elicitation, sandbagging, and diffuse threats (e.g. research sabotage)

Synthetic information and inputs

Training-time alignment methods

Science of (mis-)alignment

Alignment / training schemes

There might be various training tricks we could use to make it more likely we will end up with an aligned model.

RL and reward hacking

Better understanding and interpretability

Other


