Zeroth Principles of AI, January 28
Early Language Models

This article explores the connection between modern large language models (LLMs) and an early spelling checker developed by Robert Tappan Morris, author of the Morris Worm. The program detected spelling errors by learning the statistics of three-letter combinations in text, an approach that, unlike traditional dictionary-based natural language processing, requires no language to be specified in advance. Modern LLMs likewise understand language by learning it, a holistic approach. The early checker shipped on Unix systems as typo.c; it learned from a corpus and then flagged errors in new text, making it language agnostic in the same spirit as LLMs and marking an early beginning of natural language understanding.

💡 Early spelling checker: the typo.c program, developed by Robert Tappan Morris, detects spelling errors by learning the statistics of three-letter combinations in text, with no need to specify the language in advance.

📚 Holistic learning: unlike traditional natural language processing (NLP), the program takes a holistic approach, understanding a language by learning it from a corpus, much as modern large language models (LLMs) do.

⚙️ Language agnosticism: the program works on text in any language without language-specific tuning, in contrast to traditional dictionary-based NLP methods.

📝 Practical deployment: the program shipped on Unix systems as typo.c, learning from a corpus to detect errors in new text and marking an early beginning of natural language understanding.

There is a surprising connection between the Morris Worm and ChatGPT.

Model Free Methods Workshop

In my video Model Free Methods Workshop I have the students analyze four problems at a high level, approaching each one in two different ways: first they solve it using a Reductionist Stance, and then using a Holistic Stance.

I knew when I held the workshop that Holism was a dirty word in academia, so I initially used Model Free Methods as a euphemism.

The last problem is about mind reading. But the first one is in many ways more interesting.

The first task is to write a program to detect spelling errors in text. Not correct them, just detect them.

In the workshop video I ask “What is the first question you have to ask if you are writing a spelling error catching program Reductionistically?”

The surprising (or not) answer is “You have to pick a language”.

If we are creating a spelling error detector using 20th Century NLP, then we would use word lists or dictionaries, and possibly grammars or WordNet. Such dictionaries are obviously language specific, since they list words in the target language.
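For contrast, here is a minimal sketch of that Reductionist, dictionary-based approach, in Python. The tiny inline word list is a hypothetical stand-in for a real lexicon; committing to it is exactly the moment we commit to a language.

```python
# A minimal sketch of 20th Century dictionary-based spelling detection.
# The word list is the language-specific part: to handle another
# language, you must swap in another list.
ENGLISH_WORDS = {"the", "cat", "sat", "on", "mat"}  # hypothetical stand-in for a full lexicon

def misspelled(text, dictionary):
    """Flag every word that does not appear in the dictionary."""
    return [w for w in text.split() if w.lower() not in dictionary]

print(misspelled("Teh cat sat on the mat", ENGLISH_WORDS))  # -> ['Teh']
```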

Can you do this without specifying the language before you start the project? Well, modern LLMs like GPT already know dozens of languages and can effectively detect the errors in your documents. And some can rewrite them. They are not using NLP, they are using 21st Century NLU: Natural Language Understanding.

How come LLMs can understand multiple languages? They learned them. And systems that learn are Holistic. So there is the clue: our Holistic Stance solution to detecting spelling errors is a system that is able to learn all the languages we want to use it on.

Now we can ask “What was the first and simplest Learned Language Model?”

The First Language Model

Here is an algorithm that was published many years ago.

The idea is to gather (“learn”) trigram statistics for the target language and then to flag all trigrams that have not previously been encountered (with at least some frequency) in the training corpus. This is not 100% reliable, but it can pragmatically outperform many of the old NLP approaches, and it is language agnostic, which NLP never was.
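Here is a minimal sketch of the idea in Python (the original was written in C); the function names and the frequency threshold are illustrative assumptions, not taken from typo.c.

```python
# A sketch of trigram-based spelling error detection: learn trigram
# counts from a corpus, then flag words containing rare or unseen trigrams.
from collections import Counter

def trigrams(word):
    """Yield the three-letter sequences in a single word."""
    w = word.lower()
    for i in range(len(w) - 2):
        yield w[i:i + 3]

def learn(corpus):
    """Learning mode: count every trigram seen in the training text."""
    counts = Counter()
    for word in corpus.split():
        counts.update(trigrams(word))
    return counts

def flag_suspects(text, counts, min_freq=1):
    """Detect mode: flag words containing a trigram seen fewer than
    min_freq times in the training corpus."""
    return [word for word in text.split()
            if any(counts[t] < min_freq for t in trigrams(word))]
```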

We note that this algorithm belongs to the category “Statistical Methods for Natural Language Understanding”. Modern LLMs are more “context-exploiting” than they are “statistical”, but that’s another show.

An example shows how the trigram algorithm works in practice: you are about to submit your fourth paper to a yearly conference. How can you flag the typos in it?

You can feed the three older papers you published through the program in trigram-statistics-gathering mode. Then feed the fourth (new) paper through in detect mode, and it will point out all words that contain trigrams it has not seen before. This is amazingly effective: no matter how much jargon there is in your papers, if it recurs from year to year it is not going to be flagged.

The four papers could have been written in French and it would have worked anyway. The system learns some amount of the language from its corpus, which is the first three papers.
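Using the sketch above, that workflow might look like this (the file names are hypothetical):

```python
# Hypothetical usage: learn trigram statistics from three past papers,
# then flag suspect words in the new one.
past_papers = ["paper1.txt", "paper2.txt", "paper3.txt"]
corpus = " ".join(open(path).read() for path in past_papers)
counts = learn(corpus)

for word in flag_suspects(open("paper4.txt").read(), counts):
    print(word)
```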

Very clever idea. Sounds straightforward. Was it ever made available as a commercial product? Yes, some of the earliest UNIX release tapes contained a program named typo.c that did exactly this. You can find the program on GitHub.

And it is the first program that I know of that learns a Model for a Human Language and then uses it to successfully accomplish a language understanding task. As such, I think it is historically significant.

The author of the program? Robert Tappan Morris, professor at MIT CSAIL, who also created the Morris Worm.
