Behavior Cloning for Alignment & Immortality

Published on August 17, 2024 11:42 PM GMT

Summary: The collection of a massive first-person video dataset for use in large model pretraining is technologically & economically feasible with modest resources. Such a dataset might enable high-fidelity imitation of human behavior, with positive implications for the capabilities and safety of trained models. Such models may also enable a type of “digital immortality” and provide a compelling alternative to cryonics.

Table of Contents

First Person Behavior Cloning

      Implications for Capabilities

      Implications for Alignment

      Digital Immortality

Implementation

      Data Requirements

      Hardware Options

      Economic Feasibility

      Open Questions & Organizational Governance

Closing Thoughts

First Person Behavior Cloning

Current frontier AI systems rely on pretraining via next-token prediction on web-scale (primarily text) datasets. In the limit, this training objective is sufficient to enable human-equivalent artificial intelligence: perfectly predicting the next words of the person who wrote a text implies an intelligence at least as great as the writer's. However, models produced by this process do not necessarily reflect the same type of intelligence as the people creating the data:

Consider that some of the text on the Internet isn't humans casually chatting. It's the results section of a science paper. It's news stories that say what happened on a particular day, where maybe no human would be smart enough to predict the next thing that happened in the news story in advance of it happening. As Ilya Sutskever compactly put it, to learn to predict text, is to learn to predict the causal processes of which the text is a shadow. Lots of what's shadowed on the Internet has a complicated causal process generating it.[...]

GPT-4 is still not as smart as a human in many ways, but it's naked mathematical truth that the task GPTs are being trained on is harder than being an actual human. And since the task that GPTs are being trained on is different from and harder than the task of being a human, it would be surprising - even leaving aside all the ways that gradient descent differs from natural selection - if GPTs ended up thinking the way humans do, in order to solve that problem.

– Yudkowsky, 2023[1]

In part, the increased difficulty of the model's task (and the correspondingly different character of the model's intelligence) is a consequence of the gap between the output of the generating process and the process itself: web text is far removed from the cognition of human writers. Human-generated text is the manifestation of complex processes whose inner workings are not well captured by the content of the text alone.

This gap not only makes it harder to train performant models (because the space of explanatory models for the observed output is so underconstrained); it also has negative implications for model safety. If we agree that robust generalization when predicting human behavior is important (as it likely is for tasks like modeling moral preferences), then the way web text obscures the underlying dynamics that produced it makes modeling those dynamics more difficult. In other words, the more information the output of a process contains about the process itself, the easier it is for a predictor trained on that output to generalize when simulating that process.
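One rough way to make this intuition precise, as a sketch (the notation is introduced here for illustration): let $\theta$ denote the latent generating process (the person's cognition) and $D$ the observed dataset. The uncertainty about the process that remains after seeing the data is

$$H(\theta \mid D) = H(\theta) - I(\theta; D),$$

so the more information the data carries about the process (the larger the mutual information $I(\theta; D)$), the less underconstrained the space of explanatory models consistent with it. The claim is then that, per hour of data, $I(\theta; D_{\mathrm{FPV}}) > I(\theta; D_{\mathrm{web\ text}})$.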

These disadvantages point to an obvious question: could an alternative pretraining dataset reduce this gap?

To address these challenges, I propose the construction of a massive, long-horizon, first-person-view (FPV) video dataset. Concretely, I advocate for thousands of volunteers to capture continuous video footage of all their waking hours using body-mounted cameras, for several years. I expect such a dataset to enable the creation of high-fidelity human behavior simulators, capable of predicting the next actions of a given person, conditioned on all the past recorded experiences of that person. A dataset on the scale of tens of millions of hours could be constructed with only 1000 volunteers, in less than 5 years, using commodity hardware, at a cost of under $2 million USD. I draw inspiration from past writing on whole-brain emulation[2], lifelogging[3], and behavior cloning[4].
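Schematically, the training objective is the same conditional next-token objective as above, applied to encoded video (a sketch; how frames and any action annotations are tokenized is left open):

$$\max_{\theta} \; \sum_{t} \log p_{\theta}\!\left(x_{t+1} \,\middle|\, x_{\le t},\, c\right),$$

where $x_{1:T}$ is a contributor's encoded FPV video stream and $c$ identifies the contributor. Sampling continuations from $p_{\theta}(\cdot \mid x_{\le t}, c)$ is then what "predicting the next actions of a given person, conditioned on their past recorded experiences" cashes out to.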

It is possible to imagine even richer sources of pretraining data than FPV video. Consider an fMRI recording a volunteer's brain activity continuously for 5 years: this would be a far richer source than observing outputs alone, since the human thought we want to simulate is, in principle, directly visible in that activity. Obviously, practical blockers prevent this from being workable (high cost, the low resolution of scans, the desire to live life outside of an MRI machine, etc.). While FPV data is clearly an incomplete representation of the generating process, since it does not capture the inner thoughts of the volunteer, the expectation is that it reflects far more information about human thought than web text does.

Implications for Alignment

From an alignment & safety perspective, FPV cloning inherits many of the desirable attributes of Whole Brain Emulation (WBE) alignment proposals.

It's worth acknowledging that the benefits of FPV cloning are somewhat diminished relative to WBE because of the greater amount of approximation involved. WBE usually implies a biologically inspired approach, one with access to physical information about the subject's brain; as a result, WBE is often assumed to offer strong guarantees of faithfully reconstructing the original person's behavior. Using observed behavior alone to train a simulator model leaves a larger gap between process and output. This increases the risk that the simulated mental machinery will deviate significantly from the original human cognition, and with it the risk that simulants will take undesirable actions in out-of-distribution scenarios. However, I think the risk here is not too great: the expectation is that the output-process gap in FPV video is small enough to constrain well-performing models to implement algorithms that are very similar (and generalize in similar ways) to those found in the human brain.

Digital Immortality

Separate from the issues of model safety & capabilities, an FPV behavior cloning model could enable a type of "digital immortality", providing a form of continued existence for contributors beyond the limits of their biological life. If we take substrate independence as true, it's not a stretch to consider a whole-brain emulation of you to really be you. However, the process of mapping brain matter for WBE would undoubtedly introduce inaccuracies and mismeasurement. To consider the WBE a valid reconstruction, we must accept that the validity of the emulation depends on the reconstruction being sufficiently close to the original. Being alive as yourself is then less a binary state and more a continuum, where more of you is alive the more faithful the reconstruction is. A recreation of human thought based on FPV behavior cloning sits lower on this continuum than a high-accuracy biological simulation, but I'd consider it high enough to count as a form of continued existence (albeit an imperfect one). From a practical standpoint, FPV cloning has a clearer path to implementation than alternatives like cryonics, where there's been little progress over the last 30 years in either the resurrection or the digital emulation of preserved brain matter. It is also less sensitive to the circumstances of your death.

Implementation

Data Requirements

It's difficult to estimate the volume or diversity of pretraining data required for a given set of capabilities. As with text modeling, we'd expect capabilities to emerge at unpredictable points as data and compute scale up. On the upper bound, we're limited by financial resources and by the difficulty of convincing a large number of volunteers to collect data. Assuming it's practically feasible to recruit 1000 volunteers, that each volunteer records 14 hours per day on average, and that all volunteers provide data consistently for 5 years, this would yield roughly 2917 years of footage. On the lower bound, we can look at current video-modeling efforts for reference. Although details aren't available for OpenAI's Sora, there are open-source alternatives like Stable Video Diffusion. According to the Stability AI whitepaper, their video diffusion model was trained on ~200 years of video, about 1 OOM less than our upper bound.[5]
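A quick back-of-envelope check of these numbers, under the assumptions stated above:

```python
# Dataset scale under the stated assumptions.
volunteers = 1000
hours_per_day = 14
years = 5

total_hours = volunteers * hours_per_day * 365 * years
total_years = total_hours / 8760  # hours per calendar year

print(f"{total_hours:,} hours ≈ {total_years:,.0f} years of footage")
# -> 25,550,000 hours ≈ 2,917 years

# Stable Video Diffusion reportedly trained on ~200 years of video,
# so the upper bound here is roughly one order of magnitude larger.
print(f"{total_years / 200:.0f}x the SVD training set")
```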

Hardware Options

There are a number of different form-factor options, including clip, necklace, pin, & glasses.

The scale of collected data (full-day usage, continuous video) requires different hardware considerations than past lifelogging equipment. In practice, this almost certainly means an external power source separate from the camera itself. Something like the belt-worn microphone battery packs used by performing artists could serve as inspiration.

The simplest implementation would be a small clip that attaches to the front of the shirt. The clip could be a small enclosure for a Raspberry Pi Camera Module, a low-cost, high-resolution camera[6], attached via a long cable to a waist-mounted battery/compute box. The box would contain a Raspberry Pi that records the video to an SD card and uploads it to long-term storage whenever a Wi-Fi signal is available.
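A minimal sketch of the waist-box software, assuming Raspberry Pi OS with the libcamera-vid capture tool and rsync installed; the hostnames, paths, and segment length are illustrative placeholders, not a tested design:

```python
# Sketch of the waist-box recording loop: capture fixed-length segments
# to the SD card, and drain them to remote storage when Wi-Fi is up.
import socket
import subprocess
import time
from pathlib import Path

SEGMENT_SECONDS = 600                 # record in 10-minute chunks
SPOOL = Path("/mnt/sdcard/spool")     # local buffer on the SD card
ARCHIVE_HOST = "storage.example.org"  # hypothetical long-term store
ARCHIVE_PATH = f"{ARCHIVE_HOST}:/archive/volunteer42/"  # hypothetical path

def wifi_available() -> bool:
    """Cheap connectivity probe: can we reach the archive host's SSH port?"""
    try:
        socket.create_connection((ARCHIVE_HOST, 22), timeout=2).close()
        return True
    except OSError:
        return False

def record_segment() -> Path:
    """Capture one fixed-length ~2K H.264 segment to the SD card."""
    out = SPOOL / f"{int(time.time())}.h264"
    subprocess.run(
        ["libcamera-vid", "-t", str(SEGMENT_SECONDS * 1000),
         "--width", "2560", "--height", "1440", "-o", str(out)],
        check=True,
    )
    return out

def upload_pending() -> None:
    """Drain the spool to the archive, deleting local copies on success."""
    for f in sorted(SPOOL.glob("*.h264")):
        subprocess.run(
            ["rsync", "--remove-source-files", "-t", str(f), ARCHIVE_PATH],
            check=False,  # a failed upload simply stays in the spool
        )

if __name__ == "__main__":
    SPOOL.mkdir(parents=True, exist_ok=True)
    while True:
        record_segment()
        if wifi_available():
            upload_pending()
```

A production version would record and upload concurrently rather than alternating between them, and would need crash recovery and monitoring of remaining SD card space; but nothing here requires exotic hardware.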

Economic Feasibility

Some rough estimates of the involved costs:

| Description | Amount |
| --- | --- |
| Cost per recording device | $300 USD |
| Recording devices for 1000 people | $300,000 USD |
| 14 hours per day × 5 years | 25,550 hours |
| Storage for 1 hour of 2K footage | 3 GB |
| Storage for 25,550 hours | 76,650 GB ≈ 77 TB |
| Storage for 1000 people | 77,000 TB ≈ 77 PB |
| Cost per terabyte (Aug 2024) | $10 USD |
| Storage cost per contributor | $770 USD |
| Storage cost for 1000 people | $770,000 USD |
| Total material costs for 1000 people | $1,070,000 USD |
| Organization management for 5 years (recruiting volunteers, logistics, etc.) | $300,000 USD |
| Total estimated dataset collection cost | $1,370,000 USD |
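The same arithmetic as a script, a sanity check of the table using its own rounding and Aug 2024 price assumptions:

```python
import math

# Inputs from the table above.
volunteers = 1000
device_cost_usd = 300
hours = 14 * 365 * 5          # 25,550 hours per contributor
gb_per_hour = 3               # ~3 GB per hour of 2K footage
usd_per_tb = 10
management_usd = 300_000      # recruiting, logistics, etc., over 5 years

tb_per_person = math.ceil(hours * gb_per_hour / 1000)  # 76.65 -> 77 TB
devices = device_cost_usd * volunteers                  # $300,000
storage = tb_per_person * usd_per_tb * volunteers       # $770,000
total = devices + storage + management_usd

print(f"{tb_per_person} TB/person, {tb_per_person * volunteers / 1000:.0f} PB total")
print(f"Total ≈ ${total:,}")                            # ≈ $1,370,000
```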

Open Questions & Organizational Governance

This project raises many practical and ethical concerns. A small sample: data privacy, model bias, how volunteers are selected, and what ethical norms should govern the collecting organization.

Closing Thoughts

I write this with several motivations: the capability and safety benefits that a high-fidelity behavior-cloning dataset could unlock, and the prospect of a practical form of digital immortality for its contributors.

  1. ^ Yudkowsky, "GPTs are Predictors, not Imitators", LessWrong, 2023.
  2. ^
  3. ^
  4. ^
  5. ^ Blattmann et al., "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets", 2023.
  6. ^


