The Gradient, November 26, 2024
Car-GPT: Could LLMs finally make self-driving cars happen?

 

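This article explores the potential of large language models (LLMs) in autonomous driving. It first introduces the basic principles of LLMs, including tokenization, the Transformer architecture, and language processing. It then analyzes how LLMs apply to the different stages of autonomous driving, such as perception, planning, and decision making. The article points out that LLMs can process images, sensor data, and other inputs to identify objects in the environment, predict driving paths, and ultimately generate driving commands. Although research on LLMs in autonomous driving is still at an early stage, their strong language understanding and generation capabilities offer new ideas and possibilities for solving autonomous driving problems.

🤔 **LLM basics:** LLMs convert text into tokens and process language with the Transformer architecture to understand and generate text. For example, ChatGPT converts the input text into a series of numbers, processes them with a Transformer model, and finally generates the corresponding text output.

🚗 **LLMs in autonomous driving perception:** LLMs can process images and sensor data, identify objects in the environment (such as vehicles, pedestrians, and road signs), and perform prediction and tracking. For example, the PromptTrack model combines object detection with an LLM and can identify objects in an image and assign unique IDs, similar to a 4D perception model.

🗺️ **LLMs in autonomous driving planning and decision making:** Based on perception information, LLMs can predict the best driving decision, such as choosing a path or judging whether to yield. For example, the Talk2BEV model combines perception information with an LLM, generates a bird's-eye view, and predicts the driving path, optimizing driving decisions.

❓ **LLMs in autonomous driving question answering:** LLMs can power a scene-based Q&A system that users interact with in natural language to get relevant information. For example, a user can ask which objects are present in the current environment, or how to drive next.

🔄 **LLMs in autonomous driving data generation:** LLMs can generate a variety of driving scenarios and training data to improve the performance of autonomous driving models. For example, LLMs can generate different road, weather, and traffic conditions, giving autonomous driving models richer training data.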

In 1928, London was in the middle of a terrible health crisis, devastated by bacterial diseases like pneumonia, tuberculosis, and meningitis. Confined in sterile laboratories, scientists and doctors were stuck in a relentless cycle of trial and error, using traditional medical approaches to solve complex problems.

Then, in September 1928, an accidental event changed the course of the world. A Scottish doctor named Alexander Fleming forgot to close a petri dish (the transparent circular box you used in science class), which got contaminated by mold. Fleming then noticed something peculiar: all the bacteria close to the mold were dead, while the others survived.

"What was that mold made of?" wondered Fleming. This was when he discovered that penicillin, a substance produced by the mold, was a powerful bacterial killer. This groundbreaking discovery led to the antibiotics we use today. In a world where doctors were relying on existing well-studied approaches, penicillin was the unexpected answer.

Self-driving cars may be at a similar turning point. Back in the 2010s, most of them were built using what we call a "modular" approach. The "autonomous" software part is split into several modules, such as Perception (the task of seeing the world), Localization (the task of accurately localizing yourself in the world), and Planning (the task of creating a trajectory for the car to follow, and implementing the "brain" of the car). Finally, all these go to the last module: Control, which generates commands such as "steer 20° right". So this was the well-known approach.

But a decade later, companies started to take another discipline very seriously: End-To-End learning. The core idea is to replace every module with a single neural network predicting steering and acceleration, but as you can imagine, this introduces a black box problem.

The 4 Pillars of Self-Driving Cars are Perception, Localization, Planning, and Control. Could a Large Language Model replicate them? (source)
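To make the modular idea concrete, here is a minimal sketch of the four pillars chained together. Everything in it is an assumption made for illustration: the `sensor_frame` dictionary, the function names, the 30 m range, and the gain of 10 degrees per metre are all hypothetical toy values, not how a real stack works.

```python
from dataclasses import dataclass


@dataclass
class Command:
    steering_deg: float  # positive = steer right
    throttle: float      # 0.0 (stop) to 1.0 (full)


def perception(sensor_frame):
    """Toy perception: keep only obstacles less than 30 m ahead."""
    return [o for o in sensor_frame["obstacles"] if 0 < o["ahead_m"] < 30]


def localization(sensor_frame):
    """Toy localization: lateral offset from the lane centre in metres (+ = right)."""
    return sensor_frame["lane_offset_m"]


def planning(obstacles, lane_offset_m):
    """Toy planner: go back to the lane centre, slow down if something is ahead."""
    target_speed = 0.2 if obstacles else 0.6
    return {"lateral_correction_m": -lane_offset_m, "target_speed": target_speed}


def control(plan):
    """Toy controller: proportional steering, 10 degrees per metre of correction."""
    return Command(
        steering_deg=10.0 * plan["lateral_correction_m"],
        throttle=plan["target_speed"],
    )


def modular_drive(sensor_frame):
    """Each module feeds the next one; every intermediate result can be inspected."""
    obstacles = perception(sensor_frame)
    lane_offset_m = localization(sensor_frame)
    return control(planning(obstacles, lane_offset_m))


# The car is 2 m left of the lane centre with an obstacle 12 m ahead.
frame = {"obstacles": [{"ahead_m": 12.0}], "lane_offset_m": -2.0}
print(modular_drive(frame))  # Command(steering_deg=20.0, throttle=0.2)
```

An End-To-End system would replace the whole chain inside `modular_drive` with one learned function from sensor data to a command, which is exactly why the intermediate results stop being inspectable.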

These approaches are known, but don’t solve the self-driving problem yet. So, we could be wondering: "What if LLMs (Large Language Models), currently revolutionizing the world, were the unexpected answer to autonomous driving?"

This is what we're going to see in this article, beginning with a simple explanation of what LLMs are and then diving into how they could benefit autonomous driving.

Preamble: LLMs-what?

Before you read this article, you must know something: I'm not an LLM pro, at all. This means, I know too well the struggle to learn it. I understand what it's like to google "learn LLM"; then see 3 sponsored posts asking you to download e-books (in which nothing concrete appears)... then see 20 ultimate roadmaps and GitHub repos, where step 1/54 is to view a 2-hour long video (and no one knows what step 54 is because it's so looooooooong).

So, instead of putting you through this pain myself, let's just break down what LLMs are in 3 key ideas:

    Tokenization
    Transformers
    Processing Language

Tokenization

In ChatGPT, you input a piece of text, and it returns text, right? Well, what's actually happening is that your text is first converted into tokens.

Example of tokenization of a sentence, each word becomes a "token"

But what's a token, you might ask? Well, a token can correspond to a word, a character, or anything we want. Think about it: if you were to send a sentence to a neural network, you wouldn't plan on sending actual words, would you?

The input of a neural network is always a number, so you need to convert your text into numbers; this is tokenization.

What tokenization actually is: A conversion from words to numbers

Depending on the model (ChatGPT, LLaMA, etc.), a token can mean different things: a word, a subword, or even a character. We could take the English vocabulary and define tokens as whole words, or take parts of words (subwords) and handle even more complex inputs. For example, the word "a" could be token 1, and the word "abracadabra" could be token 121.
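Here is a minimal sketch of the word-level case, assuming a tiny vocabulary built from two example sentences. The helper names (`build_vocab`, `encode`, `decode`) and the `<unk>` token are my own choices for illustration; real models like ChatGPT use subword tokenizers that are more involved than this.

```python
def build_vocab(sentences):
    """Give every distinct lowercase word an integer id, in order of first appearance."""
    vocab = {"<unk>": 0}  # id 0 is reserved for words we have never seen
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Words -> numbers. Unknown words fall back to the <unk> id."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def decode(token_ids, vocab):
    """Numbers -> words, using the reversed vocabulary."""
    id_to_word = {token_id: word for word, token_id in vocab.items()}
    return " ".join(id_to_word[token_id] for token_id in token_ids)


vocab = build_vocab(["a car turns right", "a pedestrian crosses"])
ids = encode("a car crosses", vocab)
print(ids)                 # [1, 2, 6]
print(decode(ids, vocab))  # a car crosses
```

Notice that a word outside the vocabulary (say "truck") collapses to `<unk>`, which is one reason subword tokenization is popular: unseen words can still be split into known pieces.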

Transformers

Now that we understand how to convert a sentence into a series of numbers, we can send that series into our neural network! At a high level, we have the following structure:

A Transformer is an Encoder-Decoder Architecture that takes a sequence of tokens as input and outputs another series of tokens

If you start looking around, you will see that some models are based on an encoder-decoder architecture, some others are purely encoder-based, and others, like GPT, are purely decoder-based.

Whatever the case, they all share the core Transformer building blocks: multi-head attention, layer normalization, addition and concatenation, cross-attention, etc.

This is just a series of attention blocks getting you to the output. So how does this word prediction work?
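To give a feel for what one attention block computes, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It assumes the queries, keys, and values are simply the token vectors themselves; a real Transformer adds learned projections, multiple heads, and normalization on top, and the 2-dimensional vectors are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, return a weighted average of the values."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity between this query and every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Mix the value vectors according to the attention weights.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Three made-up 2-dimensional token vectors attending to each other.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attention(tokens, tokens, tokens)
for row in mixed:
    print([round(x, 3) for x in row])
```

Each output row is a blend of all the input vectors, which is how every token gets to "look at" the rest of the sentence.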

The Output / Next-Word Prediction

The Encoder learns features and understands context... But what does the decoder do? In the case of object detection, the decoder is predicting bounding boxes. In the case of segmentation, the decoder is predicting segmentation masks. What about here?

In our case, the decoder is trying to generate a series of words; we call this task "next-word prediction".

Of course, it does so by predicting numbers, i.e. tokens. This characterizes our full model.

I would say the loss function for this particular output produces a near-0 value.
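To see why the loss would be near 0, here is a small sketch of next-word prediction with a softmax over a toy vocabulary and a cross-entropy loss. The four-word vocabulary and the scores (logits) are made up for the example; a real model produces one score per token in a vocabulary of tens of thousands.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token(logits):
    """Greedy decoding: pick the id with the highest probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def cross_entropy(logits, target_id):
    """Loss is -log(probability given to the correct token)."""
    return -math.log(softmax(logits)[target_id])


vocab = ["the", "car", "turns", "right"]
logits = [0.1, 0.2, 0.1, 9.0]  # hypothetical scores after "the car turns"

predicted = vocab[next_token(logits)]
loss = cross_entropy(logits, target_id=3)  # the true next word is "right"
print(predicted)       # right
print(round(loss, 4))  # a value very close to 0
```

When the model puts almost all of its probability on the correct token, the loss is close to 0; had the true word been "the", the same scores would give a loss of almost 9.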

Now, there are many "concepts" that you should learn on top of this intro: everything Transformer and Attention related, but also few-shot learning, pretraining, finetuning, and more...

Ok... but what does it have to do with self-driving cars? I think it's time to move to stage 2.

Chat-GPT for Self-Driving Cars

The thing is, you've already been through the tough part. The rest simply is: "How do I adapt this to autonomous driving?". Think about it; we have a few modifications to make:

So, let's begin with the end:

What self-driving car tasks could LLM solve?

There are many tasks involved in autonomous driving, but not all of them are GPT-isable. The most active research areas in 2023 have been:

LLMs in Perception

In Perception, the input is a series of images, and the output is usually a set of objects, lanes, etc. In the case of LLMs, we have 3 core tasks: Detection, Prediction, and Tracking. Below is an example with Chat-GPT, where you send it an image and ask it to describe what's going on:

A GPT-4 Vision model can return the objects in the image, just like object detectors do (source)

Other models such as HiLM-D and MTD-GPT can also do this, and some also work on videos. Models like PromptTrack also have the ability to assign unique IDs (this car in front of me is ID #3), similar to a 4D Perception model.

PromptTrack combines the DETR object detector with Large Language Models

In this model, multi-view images are sent to an Encoder-Decoder network that is trained to predict annotations of objects (such as bounding boxes and attention maps). These maps are then combined with a prompt like 'find the vehicles that are turning right'. The next block then finds the 3D Bounding Box localization and assigns IDs using a bipartite graph matching algorithm like the Hungarian Algorithm.
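The ID assignment step can be sketched as a minimum-cost matching between existing tracks and new detections. To stay dependency-free, this sketch brute-forces every possible assignment instead of running the Hungarian Algorithm (same optimal result, but only practical for a handful of objects). It also assumes 2D box centres rather than 3D boxes, and at least as many detections as tracks. `assign_ids` and the coordinates are hypothetical, not PromptTrack's code.

```python
import math
from itertools import permutations


def assign_ids(tracks, detections):
    """Match detections to tracks by minimum total centre distance.

    tracks: dict mapping track id -> (x, y) centre from the previous frame.
    detections: list of (x, y) centres in the current frame.
    Returns one id per detection; unmatched detections get fresh ids.
    Assumes len(detections) >= len(tracks).
    """
    track_ids = list(tracks)
    best_cost, best_perm = math.inf, ()
    # Try every way of giving each track a distinct detection.
    for perm in permutations(range(len(detections)), len(track_ids)):
        cost = sum(
            math.dist(tracks[t], detections[d]) for t, d in zip(track_ids, perm)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm

    assigned = {d: t for t, d in zip(track_ids, best_perm)}
    next_id = max(track_ids, default=0) + 1
    for d in range(len(detections)):
        if d not in assigned:  # a new object enters the scene
            assigned[d] = next_id
            next_id += 1
    return [assigned[d] for d in range(len(detections))]


tracks = {1: (0.0, 10.0), 3: (4.0, 20.0)}
detections = [(4.2, 19.0), (0.1, 10.5), (-3.0, 30.0)]
print(assign_ids(tracks, detections))  # [3, 1, 4]
```

The car that was ID #3 in the previous frame keeps ID #3 even though it now appears first in the detection list, and the newcomer gets ID #4.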

This is cool, but this isn't the "best" application of LLMs so far:

LLMs in Decision Making, Navigation, and Planning

If Chat-GPT can find objects in an image, it should be able to tell you what to do with these objects, shouldn't it? Well, this is the task of Planning, i.e. defining a path from A to B based on the current perception. While there are numerous models developed for this task, the one that stood out to me was Talk2BEV:

Talk2BEV takes perception one step further and also tells you what to do

The main difference between models for planning and Perception-only models is that here, we're going to train the model on human behavior to suggest ideal driving decisions. We're also going to change the input from multi-view to Bird's Eye View, since it is much easier to understand.

This model works both with LLaVA and ChatGPT4, and here is a demo of the architecture:

Talk2BEV (source)

As you can see, this isn't purely "prompt" based, because the core object detection model stays Bird's Eye View Perception, but the LLM is used to "enhance" that output by suggesting which regions to crop, where to look, and which path to predict. We're talking about "language enhanced BEV Maps".
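To picture what a "language enhanced BEV Map" could look like as data, here is a rough sketch: each object in the Bird's Eye View carries a position plus a short text description, and the whole map is flattened into a prompt for an LLM. The `BEVObject` fields, the coordinate convention, and the prompt layout are all my own assumptions, not Talk2BEV's actual format.

```python
import math
from dataclasses import dataclass


@dataclass
class BEVObject:
    object_id: int
    x: float          # metres ahead of the ego car (+) or behind it (-)
    y: float          # metres to the left (+) or right (-)
    description: str  # text that a vision-language model might attach


def nearest_ahead(objects):
    """Geometric query answered without any LLM: closest object in front."""
    ahead = [o for o in objects if o.x > 0]
    return min(ahead, key=lambda o: math.hypot(o.x, o.y), default=None)


def to_prompt(objects, question):
    """Flatten the map into text so that a language model can reason over it."""
    lines = [
        f"id={o.object_id} pos=({o.x:.1f}, {o.y:.1f}) desc={o.description}"
        for o in objects
    ]
    return "BEV objects:\n" + "\n".join(lines) + f"\nQuestion: {question}"


bev_map = [
    BEVObject(1, 8.0, 0.0, "white van, braking"),
    BEVObject(2, -5.0, 1.5, "motorcycle"),
    BEVObject(3, 20.0, -3.0, "parked truck"),
]
print(nearest_ahead(bev_map).object_id)  # 1
print(to_prompt(bev_map, "Should I slow down?"))
```

The point of the design is the split: geometry stays in the BEV perception output, while the text descriptions let the LLM answer questions a pure detector cannot.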

Other models like DriveGPT send the output of Perception to Chat-GPT, which is finetuned to output the driving trajectory directly.

The DriveGPT model is pure madness... when trained correctly! (modified from source)

I could go on and on, but I think you get the point. If we summarize, I would say that:

The Q&A applications are very similar, so let's see the last application of LLMs:

LLMs for Image Generation

Ever tried Midjourney or DALL-E? Aren't they super cool? Yes, and there are MUCH COOLER things than this when it comes to autonomous driving. In fact, have you heard of Wayve's GAIA-1 model? The model takes text and images as input and directly produces videos, like this:

These videos are generated by Wayve's GAIA-1 model

The architecture takes images, actions, and text prompts as input, and then uses a World Model (an understanding of the world and its interactions) to produce a video.

Architecture of GAIA-1 (source)

You can find more examples on Wayve's YouTube channel and this dedicated post.

Similarly, you can see MagicDrive, which takes the output of Perception as input and uses that to generate scenes:

(source)

Other models, like Driving Into the Future and Driving Diffusion can directly generate future scenarios based on the current ones. You get the point; we can generate scenes in an infinite way, get more data for our models, and have this endless positive loop.

We've just seen 3 prominent families of LLM usage in self-driving cars: Perception, Planning, and Generation. The real question is...

Could we trust LLMs in self-driving cars?

And by this, I mean... What if your model hallucinates? What if its replies are completely absurd, as ChatGPT's sometimes are? I remember, back in my first days in autonomous driving, big groups were already skeptical about Deep Learning, because it wasn't "deterministic" (as they call it).

We don't like Black Boxes, which is one of the main reasons End-To-End will struggle to get adopted. Is ChatGPT any better? I don't think so, and I would even say it's worse in many ways. However, LLMs are becoming more and more transparent, and the black box problem could eventually be solved.

To answer the question "Can we trust them?"... it's very early in the research, and I'm not sure anyone has really used them "online", meaning "live", in a car, on the streets, rather than at headquarters just for training or image generation purposes. I would definitely picture a Grok model on a Tesla someday just for Q&A purposes. So for now, I will give you my cowardly and safe answer...

It's too early to tell!

Because it really is. The first wave of papers mentioning LLMs in Self-Driving Cars is from mid-2023, so let's give it some time. In the meantime, you could start with this survey that shows all the evolutions to date.

Alright, time for the best part of the article...

The LLMs 4 AD Summary

Next Steps

If you want to get started on LLMs for self-driving cars, there are several things you can do:


Author Bio

Jérémy Cohen is a self-driving car engineer and founder of Think Autonomous, a platform to help engineers learn about cutting-edge technologies such as self-driving cars and advanced Computer Vision. In 2022, Think Autonomous won the prize for Top Global Business of the Year in the Educational Technology Category, and Jérémy Cohen was named among the 2023 40 Under 40 Innovators in Analytics Insight magazine, the largest printed magazine on Artificial Intelligence. You can join 10,000 engineers reading his private daily emails on self-driving cars here.

Citation

For attribution in academic contexts or books, please cite this work as

Jérémy Cohen, "Car-GPT: Could LLMs finally make self-driving cars happen?", The Gradient, 2024.

BibTeX citation:

@article{cohen2024cargpt,
    author = {Jérémy Cohen},
    title = {Car-GPT: Could LLMs finally make self-driving cars happen?},
    journal = {The Gradient},
    year = {2024},
    howpublished = {\url{https://thegradient.pub/car-gpt}},
}
