Interpreting the output of neural networks is often challenging because it entails putting into words patterns that may not be easily expressible in human language. This often results in forced explanations that do not reflect the true decision-making process of the model. However, for CLIP-ViT models there is a natural way to map the image features of each component of the Transformer network to text-based concepts.

Contrastive Language-Image Pre-training (CLIP) [Rad21L] has been the dominant approach for pre-training multimodal models in the vision-language domain. Given a set of image-text pairs, CLIP models are trained to maximize the cosine similarity between the embeddings of the image and the text. If the training dataset is large enough and well curated [Xu23D], the resulting models can achieve impressive performance on a wide range of tasks, from segmentation, to zero-shot image classification, to even image generation (e.g. as famously done in DALL-E).

The interpretability of machine learning models, and neural networks in particular, is often centered around “putting into words” why a model makes a certain decision. This has proved to be extremely challenging, with most acclaimed methods often creating a false sense of understanding, as we discuss in our series on explainable AI. However, models that are inherently multimodal offer a natural way to map image features to text and so provide a more interpretable output.

A recent paper [Gan24I] proposes a method to decompose the image representation of CLIP into a set of text-based concepts (see Figure 1). Specifically, it focuses on CLIP-ViT, which uses a Vision Transformer (ViT) as the image encoder [Dos20I].

Figure 1. Some attention heads in CLIP-ViT seem to focus on extracting specific patterns. The figure shows the training images that have the highest cosine similarity with a given input image at the level of a specific attention head. For each head we can indeed find a common element, captured by a one-word textual description (Figure 5 in [Gan24I]).

Given an image $I$ and text descriptions $t$, CLIP uses two encoders, $M_{text}$ for text and $M_{image}$ for images, that map to a shared latent space. During training, the cosine similarity between the embeddings of corresponding image-text pairs is maximized, while the similarity between non-corresponding pairs is minimized.

For the image representation $M_{image}(I)$ many architectures have been studied, with ViT being one of the most successful. In short, ViT divides the image into patches, flattens them, and then processes them through a Transformer network, which is a series of multi-head self-attention layers (MSA) followed by feed-forward layers (FFN). Formally, if $Z^l$ is the output of the $l$-th layer of the Vision Transformer (with $Z^0$ the input image patches), we can write:

\begin{equation}
\text{ViT}(I) = Z^0 + \sum_{l=1}^{L}\left[\text{MSA}^l(Z^{l-1}) + \text{FFN}^l\left(\text{MSA}^l(Z^{l-1}) + Z^{l-1}\right)\right],
\tag{1}
\end{equation}

where $L$ is the total number of layers.
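To see what Equation 1 is saying, the following minimal sketch accumulates the per-layer contributions written into the residual stream and checks that, together with the input $Z^0$, they reproduce the final output exactly. Toy linear layers stand in for the real MSA and FFN blocks of CLIP-ViT; all names and sizes here are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the sublayers of each Transformer block (not CLIP's actual
# modules): plain linear maps keep the bookkeeping of Equation (1) visible.
d, n_layers, n_tokens = 16, 4, 5
msa = [nn.Linear(d, d) for _ in range(n_layers)]   # plays the role of MSA^l
ffn = [nn.Linear(d, d) for _ in range(n_layers)]   # plays the role of FFN^l

Z = torch.randn(n_tokens, d)   # Z^0: embedded image patches (plus class token)
Z0 = Z.clone()

contributions = []             # direct contribution of every sublayer
for l in range(n_layers):
    attn_out = msa[l](Z)                  # MSA^l(Z^{l-1})
    ffn_out = ffn[l](Z + attn_out)        # FFN^l(Z^{l-1} + MSA^l(Z^{l-1}))
    contributions += [attn_out, ffn_out]
    Z = Z + attn_out + ffn_out            # residual update -> Z^l

# Equation (1): the final representation equals the input plus the sum of all
# per-layer MSA and FFN terms, each written additively into the residual stream.
reconstructed = Z0 + sum(contributions)
print(torch.allclose(Z, reconstructed, atol=1e-5))  # True
```

Mean-ablation, discussed below, amounts to replacing one of these contribution terms with its average over the training dataset and measuring how the model's accuracy changes.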
Figure 2. Representation of the Vision Transformer architecture. The input image is first split into patches, which are then projected onto a smaller space and to which positional embeddings are added. The result is fed to a standard Transformer encoder and finally to an MLP head, used for classification (Figure 1 in [Dos20I]).

Then, the Vision Transformer output is projected onto the text embedding space. So if $P \in \mathbb{R}^{d' \times d}$ denotes the projection matrix, $M_{image}(I) = P \cdot \text{ViT}(I)$, with $d'$ chosen to be the same as the dimensionality of the text encoder output. Note that the projection matrix $P$ is learned during training. See Figure 2 for a representation of the CLIP-ViT architecture.

Figure 3. Direct effects of the MSA layers on model performance. By progressively replacing the output of the MSA layers with their average (cumulatively, starting from the early layers up to the layer indicated on the $x$ axis), the classification accuracy only drops substantially upon modifying the later layers. This is true for all three ViT architectures considered (Figure 2 in [Gan24I]).

Equation 1 offers a natural way to study the direct effect of each layer of the Vision Transformer on the final image representation. One way to do this is through mean-ablation, i.e. by replacing the output of a specific component (e.g. a layer or even a single attention head) with its mean value (calculated across the training dataset) and measuring the drop in performance.

The first interesting observation of the paper is that the large majority of the direct effects in the ViT encoder come from attention layers in the later stages of the architecture (see Figure 3), while simultaneously mean-ablating the direct effects of the FFN layers only has a marginal impact on the model's accuracy (a drop of only $1$ to $3\%$, see Figure 4).

The paper therefore proceeds to study only the late attention layers, and rewrites the MSA contribution in Equation 1 as:

$$ \sum_{l=1}^{L} P \cdot \text{MSA}^l(Z^{l-1}) = \sum_{l=1}^{L} \sum_{h=1}^{H} \sum_{i=0}^{N} c_{l,h,i}, $$

where $H$ is the number of attention heads and $N$ is the number of patches.

Figure 4. Mean-ablating all the FFN layers of the network only causes a small drop in zero-shot classification accuracy (Table 1 in [Gan24I]).

The contribution of the attention blocks can therefore be expressed as a sum of the $c_{l,h,i}$ terms, each of which represents the direct contribution of a single attention head $h$ in layer $l$ for patch $i$. Importantly, each $c_{l,h,i}$ is a $d'$-dimensional vector and lives in the same space as the text embeddings. Calculating the cosine similarity between $c_{l,h,i}$ and a text embedding $M_{text}(t)$ therefore provides a natural way to interpret the effect of each component of the ViT.

At first, the paper focuses on the aggregated effect of each attention head, i.e. $c_{head}^{l,h} = \sum_{i=0}^N c_{l,h,i}$. To do so, one can take a randomly selected set of input images and a large set of text descriptions $M_{text}(t_j)$ for $j = 1, \dots, J$, and project $c_{head}^{l,h}$ onto the direction of each text embedding. Depending on the magnitude of the projection, one can then identify the text descriptions that are most descriptive of the attention head. For the exact details of the implementation, refer to “Algorithm 1: TextSpan” in [Gan24I].

Figure 5. Examples of text representations for the reported attention heads. For each, the four images with the highest similarity between $c_{head}^{l,h}$ and the text are also reported (Figure 4 in [Gan24I]).

Figures 1 and 5 show some of the results of this decomposition. Figure 6 also offers a more fine-grained analysis of the attention heads, presenting heatmaps of the image patches with the highest cosine similarity to the text embeddings.
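To make the head-to-text matching more concrete, here is a minimal greedy sketch under simplifying assumptions: it takes precomputed per-image head contributions and candidate text embeddings and repeatedly picks the description that explains the most remaining variance, projecting that direction out before the next pick. All array names, shapes, and the helper `greedy_text_basis` are hypothetical, and the actual TextSpan algorithm in [Gan24I] differs in its details.

```python
import numpy as np

def greedy_text_basis(C, T, texts, k=3):
    """Greedily pick k descriptions whose directions explain most of the
    variance of one head's contributions. A simplified sketch in the spirit
    of "Algorithm 1: TextSpan", not a faithful reimplementation.

    C: (n_images, d') per-image head contributions c_head^{l,h}
    T: (J, d') text embeddings M_text(t_j), one row per candidate description
    """
    C = C - C.mean(axis=0)                        # centre, so scores measure variance
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    chosen = []
    for _ in range(k):
        # score each candidate by the variance of the projections onto it
        scores = ((C @ T.T) ** 2).sum(axis=0)
        j = int(np.argmax(scores))
        chosen.append(texts[j])
        # project the selected direction out of the contributions and the
        # remaining candidates (Gram-Schmidt style), then repeat
        direction = T[j] / np.linalg.norm(T[j])
        C = C - np.outer(C @ direction, direction)
        T = T - np.outer(T @ direction, direction)
    return chosen

# Toy example with made-up sizes: a 64-dimensional shared space, 200 images,
# and 500 candidate descriptions.
rng = np.random.default_rng(0)
C = rng.normal(size=(200, 64))
T = rng.normal(size=(500, 64))
texts = [f"description {j}" for j in range(500)]
print(greedy_text_basis(C, T, texts))
```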
The paper concludes with some limitations of this approach. For a start, only direct influences are studied, but the predictions of each layer do not happen in isolation, and changes in the early layers can impact the values at later stages. Additional insight could come from studying higher-order interactions, at the expense of simplicity and potentially interpretability.

Figure 6. For the three reported heads, this image shows the descriptions with the highest (green border) and lowest (red border) similarity between $c_{head}^{l,h}$ and the provided text. Taking the top-left case as an example: within this head, which specialises in geolocation, the image patches that determine that the photo was taken in Paris are the same ones that negatively affect the probability that it was taken in the Australian bushlands (Figure 6 in [Gan24I]).

More importantly, however, not all heads have clear roles. This can be alleviated by adding more candidate text descriptions, but there is also the chance that some parts of the model lack a coherent interpretation when taken in isolation.

CLIP models have had a remarkable impact on the field of multimodal learning, and studies like this help make them more understandable on a component-by-component level. The hope is that this will lead to more robust and interpretable models in the future, inverting a trend that has seen explainable AI techniques remain of limited use outside academia.