2022 is a year of generalist models! With the bloom of multimodal pretraining, and especially of unified models, we see the opportunity to build a generalist model capable of processing tasks across different modalities and their combinations. Thus, we propose OFA, namely One-For-All, a unified multimodal pretrained model that unifies understanding and generation tasks across modalities in a single framework, and we pretrain OFA with instruction-based multitask pretraining that endows it with multiple capabilities. We have open-sourced both the pretrained and finetuned models to the community, hoping this pioneering work can help accelerate the development of generalist models.
Background

Multimodal pretraining has developed rapidly ever since BERT was transferred to cross-modal representation learning. Representative studies include UNITER, ViLBERT, etc. These studies directly incorporate the Transformer-based BERT into a single-stream or dual-stream framework for multimodal pretraining, and transform the image into a sequence of object features that are concatenated with the word embeddings as the input to the Transformer. Later in 2021, with the rise of the Vision Transformer, methods emerged that got rid of object-level features, which depend on complex preprocessing pipelines such as Faster R-CNN: for example, ViLT, the simplest, based on patch projection, the CLIP-based CLIP-ViL, etc. A subsequent milestone was SimVLM, which leverages a T5/BART-style encoder-decoder method for multimodal pretraining and achieved new SoTA results on many tasks. This progress should be regarded as the foundation of the unified multimodal pretrained models of 2022, including our OFA, Unified-IO, Flamingo, BEiT-3, etc.
Method
What OFA aims to achieve is the unification of tasks, modalities, and architecture. We argue that a unified model should have three properties: being task-agnostic, being modality-agnostic, and task comprehensiveness. To explain further, “task-agnostic” means the unified model should accept new tasks without modifying its architecture or training method; “modality-agnostic” means it should accept inputs of different modalities without needing to know what they are or requiring complex, modality-specific preprocessing; and “task comprehensiveness” means it should learn as many tasks as possible so that it can transfer to unseen tasks by composing existing capabilities. We therefore propose three types of unification for OFA, namely the unification of modalities, architecture, and tasks. Let’s go through them one by one.
For the unification of modalities, one key issue is the tokenization, or discretization, of inputs from different modalities; without it, generation would require other solutions such as diffusion models. Text tokenization needs no change, but images and bounding boxes need to be discretized. Owing to the success of vector quantization and Transformer-based text-to-image generation, images can be represented as sequences of VQ tokens. Inspired by Pix2Seq, bounding boxes can likewise be discretized into location bins.
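To make the bounding-box discretization concrete, here is a minimal sketch of mapping continuous coordinates to bin tokens. The bin count and the “<bin_k>” token naming are illustrative assumptions, not OFA’s exact vocabulary.

```python
# A minimal sketch (not the official OFA code) of discretizing a bounding box
# into location tokens. The bin count and the "<bin_k>" token naming are
# illustrative assumptions.

def box_to_tokens(box, image_w, image_h, num_bins=1000):
    """Map (x0, y0, x1, y1) pixel coordinates to discrete location tokens."""
    x0, y0, x1, y1 = box
    normalized = [x0 / image_w, y0 / image_h, x1 / image_w, y1 / image_h]
    tokens = []
    for c in normalized:
        # Clamp to [0, 1], then quantize into one of `num_bins` bins.
        idx = min(int(max(0.0, min(1.0, c)) * num_bins), num_bins - 1)
        tokens.append(f"<bin_{idx}>")
    return tokens

# Example: a box covering the left half of a 640x480 image.
print(box_to_tokens((0, 0, 320, 480), 640, 480))
# ['<bin_0>', '<bin_0>', '<bin_500>', '<bin_999>']
```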

We choose the universal Transformer encoder-decoder architecture, owing to its successful use in unified NLP models such as T5. Note that to feed images into the Transformer, we embed them with the first three blocks of a ResNet. For the Transformer itself, we modify the design by incorporating Normformer techniques for training stability and transfer performance.
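As a rough illustration of this layout (not the released OFA implementation), the sketch below embeds image patches with the early ResNet stages, embeds text tokens with an embedding table, and feeds the concatenated sequence to a vanilla encoder-decoder Transformer; all sizes, the vocabulary size, and module choices are placeholder assumptions.

```python
# Toy seq2seq multimodal model: ResNet stages 1-3 as the visual embedder,
# a shared token embedding for text, one encoder-decoder Transformer on top.
# Sizes are placeholders, not OFA's actual configuration.
import torch
import torch.nn as nn
import torchvision.models as tvm

class ToySeq2SeqMultimodal(nn.Module):
    def __init__(self, vocab_size=50000, d_model=256):
        super().__init__()
        resnet = tvm.resnet50(weights=None)
        # Keep only the stem and the first three residual stages.
        self.visual = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,
        )
        self.visual_proj = nn.Linear(1024, d_model)  # layer3 outputs 1024 channels
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, src_tokens, tgt_tokens):
        # Image -> grid of patch features -> sequence of embeddings.
        feat = self.visual(image)                  # (B, 1024, H/16, W/16)
        feat = feat.flatten(2).transpose(1, 2)     # (B, N_patches, 1024)
        img_seq = self.visual_proj(feat)           # (B, N_patches, d_model)
        txt_seq = self.tok_embed(src_tokens)       # (B, N_text, d_model)
        encoder_in = torch.cat([img_seq, txt_seq], dim=1)
        decoder_in = self.tok_embed(tgt_tokens)
        out = self.transformer(encoder_in, decoder_in)
        return self.lm_head(out)                   # next-token logits

model = ToySeq2SeqMultimodal()
logits = model(torch.randn(1, 3, 256, 256),
               torch.randint(0, 100, (1, 8)),
               torch.randint(0, 100, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 50000])
```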

Multitask learning is the key innovation of OFA. Specifically, we pretrain the model on 8 tasks, including 5 vision-language tasks, 2 vision tasks, and 1 language task. The vision-language tasks are visual grounding, grounded captioning, visual question answering, image-text matching, and image captioning. The vision tasks are detection and image infilling. The language task is text infilling. To help the model differentiate tasks, we insert an instruction, which is simply a piece of text describing the task. Thus, we expect the model to perform zero-shot generation given a new instruction indicating an unseen task.
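For illustration, the snippet below sketches what such instruction templates might look like. The wording is paraphrased for this post and the placeholder syntax (<text>, <region>, <question>) is an assumption, not OFA’s exact prompt format.

```python
# A hedged illustration of instruction-style prompts for the 8 pretraining
# tasks. Wording is paraphrased; the placeholder syntax is an assumption.
TASK_INSTRUCTIONS = {
    "image_captioning":    "What does the image describe?",
    "visual_grounding":    'Which region does the text "<text>" describe?',
    "grounded_captioning": "What does the region <region> describe?",
    "image_text_matching": 'Does the image describe "<text>"?',
    "vqa":                 "<question>",  # the question itself acts as the instruction
    "detection":           "What are the objects in the image?",
    "image_infilling":     "What is the image in the middle part?",
    "text_infilling":      'What is the complete text of "<text>"?',
}

def build_input(task, **slots):
    """Fill the placeholders of a task instruction (toy helper)."""
    prompt = TASK_INSTRUCTIONS[task]
    for name, value in slots.items():
        prompt = prompt.replace(f"<{name}>", value)
    return prompt

print(build_input("visual_grounding", text="a dog catching a frisbee"))
# Which region does the text "a dog catching a frisbee" describe?
```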

To make this research as reproducible as possible, our pretraining relies only on public datasets. We therefore expect researchers following this work to be able to reproduce our results with our open-sourced code.
We have released OFA models in 5 sizes: OFA-Tiny (33M parameters), OFA-Medium (93M), OFA-Base (180M), OFA-Large (470M), and OFA-Huge (930M). See the table below for more statistics.

Experiments
We have conducted experiments on multiple cross-modal and unimodal tasks. For vision-language understanding, we evaluate the models on VQA and SNLI-VE. We find that the huge-size model achieves performance comparable to the 80B-parameter Flamingo and to the 2B-parameter CoCa pretrained on 5B image-text pairs, and we achieve the best performance on visual entailment. For vision-language generation, we focus on classical image captioning, where OFA achieves SoTA performance under both cross-entropy optimization and CIDEr optimization. We have also cast visual grounding as a generation task, and we find that even the base-size OFA outperforms the previous SoTA, with scaling of model size consistently bringing further improvements. This demonstrates the significance of unifying modalities and tasks.
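Since visual grounding is cast as generation, the model’s output is a short sequence of location tokens that is mapped back to a box. Continuing the earlier binning sketch (same assumed bin count and token naming), the decoding side could look like this:

```python
# A toy inverse of the earlier binning sketch: the model emits four location
# tokens, which we map back to pixel coordinates. Same assumptions as before.
import re

def tokens_to_box(tokens, image_w, image_h, num_bins=1000):
    """Convert four generated "<bin_k>" tokens back to a pixel-space box."""
    coords = [int(re.fullmatch(r"<bin_(\d+)>", t).group(1)) / num_bins
              for t in tokens]
    x0, y0, x1, y1 = coords
    return (x0 * image_w, y0 * image_h, x1 * image_w, y1 * image_h)

print(tokens_to_box(["<bin_0>", "<bin_0>", "<bin_500>", "<bin_999>"], 640, 480))
# (0.0, 0.0, 320.0, 479.52)
```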



Additionally, we test OFA on text-to-image generation, as we believe the image infilling task in pretraining endows it with the capability to generate image codes. We show that OFA achieves a low (i.e., good) FID score in this evaluation, and further finetuning on a larger dataset significantly boosts its performance. See the cases below.


For the unimodal tasks, we evaluate OFA on the GLUE benchmark for NLU, Gigaword summarization for NLG, and ImageNet classification for visual understanding. We show that OFA is competitive with both RoBERTa and DeBERTa, whereas previous multimodal pretrained models often fall far behind the NLU SoTAs. Similarly, OFA achieves good performance on NLG and outperforms the previous best models. On image classification, it also achieves performance similar to that of self-supervised vision models such as BEiT and MAE.
We also observe that OFA, thanks to multitask pretraining, demonstrates potential in transferring to unseen tasks and unseen domains. We illustrate this with two cases below.

In the first case, we set up a new task called Grounded VQA, a combination of VQA and grounded captioning. All we need to change is the instruction: the new instruction, which carries both the question and the region information, directs the model to give a correct answer. This case demonstrates the model’s capacity for compositional generalization, understanding the instruction and combining two learned capabilities to perform the new task.
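As a purely hypothetical illustration of such a composed instruction (the exact wording used in our demo is not reproduced here), one could combine the question with region tokens along these lines:

```python
# Hypothetical composed instruction for Grounded VQA: the question plus the
# region expressed as location tokens. Wording and token format are assumed.
region_tokens = " ".join(["<bin_120>", "<bin_80>", "<bin_560>", "<bin_410>"])
question = "what color is the car in the region?"
prompt = f"{question} region: {region_tokens}"
print(prompt)
# what color is the car in the region? region: <bin_120> <bin_80> <bin_560> <bin_410>
```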

We also find that OFA can transfer to unseen domains effectively. One example is visual grounding on anime images: OFA performs well in this setup because its pretraining data include some anime images and it has learned visual grounding on general-domain data. This again shows the compositional ability of the unified model.
Conclusion
This is the starting point of our research toward a truly “One-For-All” model, or in other words, a generalist model. We show that this research direction is promising, as the Transformer is a powerful architecture and tasks and modalities can be unified in a single training framework. Just as GPT-3 emerged in NLP, we believe a powerful foundation model for multimodal representation learning will arrive soon.