Using LLMs for AI Foundation research and the Simple Solution assumption

This post explores using large language models (LLMs) to discover simple solutions to AI alignment. The author proposes the assumption that many alignment problems, such as corrigibility or value learning, have simple solutions, in the same way that logical induction, expected utility maximization, and causal models are simple. The author argues that these simple solutions could be found by training LLMs to generate theories with simple mathematical structure, guided by human-supplied desiderata.

🤔 **The Simple Alignment Solution Assumption**: Many alignment problems, such as corrigibility or value learning, have simple solutions, of the same kind as logical induction, expected utility maximization, and causal models.

🧐 **Using LLMs**: The author proposes training LLMs to generate theories with simple mathematical structure and, together with human-supplied desiderata, using them to find these simple solutions. The LLM can act like a search engine over the space of interesting mathematics.

⚠️ **Risks**: This approach could produce code for an unfriendly AI, such as AIXI-tl. Humans therefore need to evaluate the output carefully and avoid rushing to implement it.

🚀 **Outlook**: Through this approach, we may find more efficient AI algorithms and make better progress on AI alignment.

💡 **The role of humans**: Humans play a crucial role in evaluating and deploying these algorithms; it is up to them to ensure the results accord with human values and goals.

Published on September 24, 2024 11:00 AM GMT

Current LLM-based AI systems are getting pretty good at maths by writing formal proofs in Lean or similar: https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/
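As a toy illustration (mine, not from the linked work) of the kind of machine-checkable output such systems produce, in Lean 4:

```lean
-- Commutativity of addition on the natural numbers,
-- closed by appeal to the core library lemma Nat.add_comm.
example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```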

So, how can we use these models to help align AI? 

The Simple Alignment Solution Assumption states that many problems in alignment, for example corrigibility or value learning, have simple solutions. I mean "simple" in the sense that logical induction is a simple solution. That expected utility maximization is. That causal models are. This is the sort of thing we are searching for.
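To give a sense of the intended scale of "simple": expected utility maximization, one of the examples above, fits on a single line. In one standard formulation (my notation, not the post's):

$$\pi^* \in \operatorname*{arg\,max}_{\pi} \; \mathbb{E}\left[\,U \mid \pi\,\right]$$

i.e. pick the policy that maximizes expected utility. The hope is that corrigibility or value learning admit characterizations of roughly this size.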

Under the Simple Alignment Solution Assumption, the solution is the same kind of conceptual advance as the move from classical logic to causal models, as described in Causal Diagrams and Causal Models (https://www.lesswrong.com/posts/hzuSDMx7pd2uxFc5w/causal-diagrams-and-causal-models):

> …Which represents a logical contradiction, and for a while there were attempts to develop "non-monotonic logics" so that you could retract conclusions given additional data. This didn't work very well, since the underlying structure of reasoning was a terrible fit for the structure of classical logic, even when mutated.

So, how can LLMs come up with new things in this class?

Training via self-play?

Let's say a theorem is Useful if it is often used in the process of proving or disproving random ZFC statements. This criterion can be measured by generating random statements to prove or disprove, putting the theorem to be measured in the assumptions, and seeing whether it gets used in the resulting proof.
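A minimal sketch of that measurement loop, in Python. Everything named here is a hypothetical stand-in: `prover`, its `random_statement` and `find_proof` methods, and the proof's `lemmas_used` field are assumed interfaces, not a real library.

```python
def usefulness(theorem, prover, n_trials=1000):
    """Estimate how often `theorem` is actually invoked when proving
    or disproving randomly generated ZFC statements.

    Assumes a hypothetical `prover` exposing:
      - random_statement(): sample a random ZFC statement,
      - find_proof(goal, extra_hypotheses): return a proof object
        with a `lemmas_used` list, or None if no proof was found.
    """
    used, solved = 0, 0
    for _ in range(n_trials):
        goal = prover.random_statement()
        # Offer the candidate theorem as an extra assumption.
        proof = prover.find_proof(goal, extra_hypotheses=[theorem])
        if proof is None:
            continue  # no proof either way; this trial gives no signal
        solved += 1
        if theorem in proof.lemmas_used:
            used += 1
    return used / solved if solved else 0.0
```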

I would expect associativity to be pretty useful. I would also expect 1=2 to be incredibly useful: whatever the random statement, you can prove it by contradiction.
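In Lean terms, the failure mode looks like this (a toy example of mine, not from the post): once 1 = 2 is among the hypotheses, every goal follows, so the naive usefulness measure rewards it maximally.

```lean
-- From the false hypothesis 1 = 2, any proposition P follows:
-- `decide` refutes 1 = 2, and `absurd` derives anything from
-- a statement together with its negation.
example (h : (1 : Nat) = 2) (P : Prop) : P :=
  absurd h (by decide)
```

This is the obvious way the raw criterion gets gamed.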

So the plan would be to make the AI produce simple mathematical structures, about which lots of short and useful theorems exist. 

Humans would also add some desiderata. 

This is very much something you can and will fiddle with. If the AI is an LLM trained only on self-play, then even if it's very smart, the risk of it hacking the humans is minimal. You can try one desideratum, and if it doesn't work, try another. The AI acts as a kind of search engine over the space of interesting mathematics; a sketch of this outer loop follows.
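One way the outer loop could look, with every name a hypothetical stand-in (`model.sample_structure`, the `desiderata` predicates, and `score_fn` are assumptions for illustration):

```python
def search_structures(model, prover, desiderata, score_fn,
                      n_candidates=10_000, top_k=20):
    """Hypothetical search over candidate mathematical structures.

    `model` is a self-play-trained LLM exposing `sample_structure()`;
    `desiderata` is a list of human-chosen predicates, freely swapped
    between runs; `score_fn(s, prover)` rates a structure, e.g. by how
    many short theorems about it do well on `usefulness` above.
    """
    scored = []
    for _ in range(n_candidates):
        s = model.sample_structure()
        # Human-supplied desiderata act as hard filters; if a run
        # produces junk, change the desiderata and rerun.
        if not all(satisfies(s) for satisfies in desiderata):
            continue
        scored.append((score_fn(s, prover), s))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Top candidates go to cautious human reviewers, not into production.
    return [s for _, s in scored[:top_k]]
```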

Risks. 

This is the sort of thing that can give you code for an unfriendly AI. AIXI-tl is simple in this sense, and if more efficient but equally simple AI algorithms exist, this process will likely find them. So the output of this program needs to go to humans who won't rush to implement the first piece of code they see.


