Zeroth Principles of AI, March 25, 23:50
It Takes A Species

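This article discusses the importance of corpus construction in the development of large language models (LLMs), and the challenges and opportunities this creates for the humanities. The author points out that an AI's knowledge is bounded by its training corpus, so selecting and building that corpus is critical. As the technology advances, future AIs will be able to learn more domains, which makes building a “useful and balanced” corpus an important responsibility for the humanities: ensuring that AI goals align with human goals and that civilization moves forward. The article calls on the humanities and social sciences to take an active part in raising AI and in shaping its future.

🧠 AI learning depends on the corpus: LLMs learn language by reading large amounts of text, and the scope of their knowledge is limited by the training corpus. The corpus is the foundation of an AI's understanding of the world and sets the upper limit on what it can know.

💡 Expanding knowledge domains: As technology advances, AI will expand into domains such as business, math, law, medicine, and physics. The corpus will be key to defining what LLMs “believe”, with far-reaching consequences.

🤝 The importance of the humanities: AI development currently relies mainly on technical people, but in the future the humanities and social sciences will play a key role in raising AI, taking responsibility for building “useful and balanced” corpora and ensuring that AI goals align with human goals.

🌍 Building a consensus reality corpus: The author calls for a “Useful US Citizen’s Consensus Reality Corpus” so that AI learns from shared values and a shared understanding of reality, and thereby moves civilization forward.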

People in the humanities are aghast at the lies confabulated by current LLMs. To them, I say: Just wait. This will shortly become YOUR problem.

It takes a species to raise an AI.

We Techies have managed to cobble together a machine that can learn any language on the planet. This breakthrough was achieved after scouring the internet for text to read. It takes a lot of text to learn a language if you literally have no life.

Something that few people outside of the AI community understand is the importance of the AI's learning corpus. This is the collection of books and online texts we give it to read when we are raising it. The corpus is not only important, it is the only thing that matters. It sets a hard upper limit on how much the machine can know about anything in the world.

If it wasn't in the corpus, where would it have learned it?

This is a statement in Epistemology. That is the level you need to operate at in order to understand why AI works, as opposed to how it works. At this level, machine learning is like human learning, machine understanding works a lot like human understanding, abstraction is abstraction, and so on.

A modern LLM learns to understand language using mechanisms similar to those in human brains. Any knowledge it has of the world described in the corpus (as opposed to treating the corpus as a mere sample of “language”) is a bonus we hoped for but didn't really have a right to expect, because learning Math and Physics and Cooking wasn't a goal, at least initially. Language is hard enough. I have never trained a GPT-style model from scratch myself, but my own LLM design needs to read my smallish corpus several times because during the first read-throughs it is still just learning character combinations.

Which means that even if it was in the corpus, the system may not have learned it. If you tried to learn Finnish from scratch by reading a Finnish encyclopedia from end to end, you wouldn't understand enough Finnish to learn actual content until maybe halfway through the work.
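As a toy illustration of the corpus acting as an upper limit, here is a minimal sketch of a character-level bigram counter in Python. It is not my LLM design and it is not how a GPT-style model works; the function names `train_bigrams` and `probability` and the tiny corpus are made up for this example. It only shows the "character combinations" stage: the model can assign probability to a combination only if that combination appeared in what it read.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus: str) -> dict:
    """Count how often each character is followed by each other character."""
    counts = defaultdict(Counter)
    for current, following in zip(corpus, corpus[1:]):
        counts[current][following] += 1
    return counts


def probability(counts: dict, current: str, following: str) -> float:
    """Estimated probability that `following` comes right after `current`."""
    followers = counts.get(current, Counter())
    total = sum(followers.values())
    return followers[following] / total if total else 0.0


# A hypothetical, very small corpus.
counts = train_bigrams("the cat sat on the mat")

print(probability(counts, "t", "h"))  # 0.5: "th" was in the corpus
print(probability(counts, "q", "u"))  # 0.0: "q" never appeared, so nothing was learned
```

However common "qu" is in English, this model has no way of knowing it, because nothing it read contained a "q". Real LLMs generalize far beyond counted pairs, but the source of what they know is the same.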

But now that these devices know languages, we will be raising more competent ones that know more about the world. Learning is expensive, so we will initially prioritize profitable problem domains like Business, Math, Law, Medicine or Physics. Improvements in hardware and algorithms will let them learn more domains, and get deeper into each.

The world knowledge corpus will largely define what LLMs believe. Potentially all of them. This is a major responsibility, and the task needs proper attention.

-- * --

Techies are trying to create a useful system out of something that starts out without ANY common sense, no body, no smell, no touch, and (at least in the beginning) no vision, no sound. Just an input sense of text, and maybe voice.

English majors and their ilk are sitting on the sidelines. Some are criticizing the results, clearly expecting an intelligent system, perhaps even an AI oracle, rather than a system that merely understands language well.

Techies got this far without applying much specialized skill in Education, Psychology, Ethics, Law, or Politics, by just grabbing all the text we could find on the Internet and calling it a corpus.

It will become a job for the Humanities to raise our AIs and to worry about AI alignment: to ensure that their goals align with our own, and to assemble the corpora that will create useful and well-balanced AIs that can move civilization forward for the benefit of all.

We need a Useful US Citizen’s Consensus Reality Corpus.

Who gets to curate it?
