Interpreting the output of neural networks is often challenging because it entails putting into words patterns that may not be easily expressible in human language. This often results in forced explanations that do not reflect the true decision-making process of the model. However, for CLIP-ViT models there is a natural way to map the image features of each component of the Transformer network to text-based concepts.

Contrastive Language-Image Pre-training (CLIP) [Rad21L] has been the dominant approach for pre-training multimodal models in the vision-language domain. Given a set of image-text pairs, CLIP models are trained to maximize the cosine similarity between the embeddings of the image and the text. If the training dataset is large enough and well curated [Xu23D], the resulting models can achieve impressive performance on a wide range of tasks, from segmentation, to zero-shot image classification, to even image generation (e.g. as famously done in DALL-E).

The interpretability of machine learning models, and neural networks in particular, is often centered around “putting into words” why a model makes a certain decision. This has proved to be extremely challenging, with most acclaimed methods often creating a false sense of understanding, as we discuss in our series on explainable AI. However, models that are inherently multimodal offer a natural way to map image features to text and so provide a more interpretable output.

A recent paper [Gan24I] proposes a method to decompose the image representation of CLIP into a set of text-based concepts (see Figure 1). Specifically, it focuses on CLIP-ViT, which uses a Vision Transformer (ViT) as the image encoder [Dos20I].

Figure 1. Some attention heads in CLIP-ViT seem to focus on extracting specific patterns. The figure shows the training images that have the highest cosine similarity with a given input image at the level of a specific attention head. For each head we can indeed find a common element, captured by a one-word textual description (Figure 5 in [Gan24I]).

Given an image $I$ and text descriptions $t$, CLIP uses two encoders, $M_{text}$ for text and $M_{image}$ for images, that map to a shared latent space. During training, the cosine similarity between the embeddings of corresponding image-text pairs is maximized, while the similarity between non-corresponding pairs is minimized.

For the image representation $M_{image}(I)$ many architectures have been studied, with ViT being one of the most successful. In short, ViT divides the image into patches, flattens them, and then processes them through a Transformer network, which is a series of multi-head self-attention layers (MSA) followed by feed-forward layers (FFN). Formally, if $Z^l$ is the output of the $l$-th layer of the Vision Transformer (with $Z^0$ the input image patches), we can write:

\begin{equation}
\text{ViT}(I) = Z^0 + \sum_{l=1}^{L}\left[\text{MSA}^l(Z^{l-1}) + \text{FFN}^l\left(\text{MSA}^l(Z^{l-1}) + Z^{l-1}\right)\right],
\tag{1}
\end{equation}

where $L$ is the total number of layers.
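To see what Equation 1 is saying, the following minimal sketch accumulates the per-layer contributions written into the residual stream and checks that, together with the input $Z^0$, they reproduce the final output exactly. Toy linear layers stand in for the real MSA and FFN blocks of CLIP-ViT; all names and sizes here are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the sublayers of each Transformer block (not CLIP's actual
# modules): plain linear maps keep the bookkeeping of Equation (1) visible.
d, n_layers, n_tokens = 16, 4, 5
msa = [nn.Linear(d, d) for _ in range(n_layers)]   # plays the role of MSA^l
ffn = [nn.Linear(d, d) for _ in range(n_layers)]   # plays the role of FFN^l

Z = torch.randn(n_tokens, d)   # Z^0: embedded image patches (plus class token)
Z0 = Z.clone()

contributions = []             # direct contribution of every sublayer
for l in range(n_layers):
    attn_out = msa[l](Z)                  # MSA^l(Z^{l-1})
    ffn_out = ffn[l](Z + attn_out)        # FFN^l(Z^{l-1} + MSA^l(Z^{l-1}))
    contributions += [attn_out, ffn_out]
    Z = Z + attn_out + ffn_out            # residual update -> Z^l

# Equation (1): the final representation equals the input plus the sum of all
# per-layer MSA and FFN terms, each written additively into the residual stream.
reconstructed = Z0 + sum(contributions)
print(torch.allclose(Z, reconstructed, atol=1e-5))  # True
```

Mean-ablation, discussed below, amounts to replacing one of these contribution terms with its average over the training dataset and measuring how the model's accuracy changes.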
Figure 2. Representation of the Vision Transformer architecture. The input image is first split into patches, which are then projected onto a smaller space and to which positional embeddings are added. The result is fed to a standard Transformer encoder and finally to an MLP head, used for classification (Figure 1 in [Dos20I]).

Then, the Vision Transformer output is projected onto the text embedding space. So if $P \in \mathbb{R}^{d' \times d}$ denotes the projection matrix, $M_{image}(I) = P \cdot \text{ViT}(I)$, with $d'$ chosen to be the same as the dimensionality of the text encoder output. Note that the projection matrix $P$ is learned during training. See Figure 2 for a representation of the CLIP-ViT architecture.

Figure 3. Direct effects of the MSA layers on model performance. By progressively replacing the output of the MSA layers with their average (cumulatively, starting from the early layers up to the layer indicated on the $x$ axis), the classification accuracy only drops substantially upon modifying the later layers. This is true for all three ViT architectures considered (Figure 2 in [Gan24I]).

Equation 1 offers a natural way to study the direct effect of each layer of the Vision Transformer on the final image representation. One way to do this is through mean-ablation, i.e. by replacing the output of a specific component (e.g. a layer or even a single attention head) with its mean value (calculated across the training dataset) and measuring the drop in performance.

The first interesting observation of the paper is that the large majority of the direct effects in the ViT encoder come from attention layers in the later stages of the architecture (see Figure 3), while simultaneously mean-ablating the direct effects of the FFN layers only has a marginal impact on the model's accuracy (a drop of only $1$ to $3\%$, see Figure 4).

The paper therefore proceeds to study only the late attention layers, and rewrites the MSA contribution in Equation 1 as:

$$ \sum_{l=1}^{L} P \cdot \text{MSA}^l(Z^{l-1}) = \sum_{l=1}^{L} \sum_{h=1}^{H} \sum_{i=0}^{N} c_{l,h,i}, $$

where $H$ is the number of attention heads and $N$ is the number of patches.

Figure 4. Mean-ablating all the FFN layers of the network only causes a small drop in zero-shot classification accuracy (Table 1 in [Gan24I]).

The contribution of the attention blocks can therefore be expressed as a sum of the $c_{l,h,i}$ terms, each of which represents the direct contribution of a single attention head $h$ in layer $l$ for patch $i$. Importantly, each $c_{l,h,i}$ is a $d'$-dimensional vector and lives in the same space as the text embeddings. Calculating the cosine similarity between $c_{l,h,i}$ and a text embedding $M_{text}(t)$ therefore provides a natural way to interpret the effect of each component of the ViT.

At first, the paper focuses on the aggregated effect of each attention head, i.e. $c_{head}^{l,h} = \sum_{i=0}^N c_{l,h,i}$. To do so, one can take a randomly selected set of input images and a large set of text descriptions $M_{text}(t_j)$ for $j = 1, \dots, J$, and project $c_{head}^{l,h}$ onto the direction of each text embedding. Depending on the magnitude of the projection, one can then identify the text descriptions that are most descriptive of the attention head. For the exact details of the implementation, refer to “Algorithm 1: TextSpan” in [Gan24I].

Figure 5. Examples of text representations for the reported attention heads. For each, the four images with the highest similarity between $c_{head}^{l,h}$ and the text are also reported (Figure 4 in [Gan24I]).

Figures 1 and 5 show some of the results of this decomposition. Figure 6 also offers a more fine-grained analysis of the attention heads, presenting heatmaps of the image patches with the highest cosine similarity to the text embeddings.
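To make the head-to-text matching more concrete, here is a minimal greedy sketch under simplifying assumptions: it takes precomputed per-image head contributions and candidate text embeddings and repeatedly picks the description that explains the most remaining variance, projecting that direction out before the next pick. All array names, shapes, and the helper `greedy_text_basis` are hypothetical, and the actual TextSpan algorithm in [Gan24I] differs in its details.

```python
import numpy as np

def greedy_text_basis(C, T, texts, k=3):
    """Greedily pick k descriptions whose directions explain most of the
    variance of one head's contributions. A simplified sketch in the spirit
    of "Algorithm 1: TextSpan", not a faithful reimplementation.

    C: (n_images, d') per-image head contributions c_head^{l,h}
    T: (J, d') text embeddings M_text(t_j), one row per candidate description
    """
    C = C - C.mean(axis=0)                        # centre, so scores measure variance
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    chosen = []
    for _ in range(k):
        # score each candidate by the variance of the projections onto it
        scores = ((C @ T.T) ** 2).sum(axis=0)
        j = int(np.argmax(scores))
        chosen.append(texts[j])
        # project the selected direction out of the contributions and the
        # remaining candidates (Gram-Schmidt style), then repeat
        direction = T[j] / np.linalg.norm(T[j])
        C = C - np.outer(C @ direction, direction)
        T = T - np.outer(T @ direction, direction)
    return chosen

# Toy example with made-up sizes: a 64-dimensional shared space, 200 images,
# and 500 candidate descriptions.
rng = np.random.default_rng(0)
C = rng.normal(size=(200, 64))
T = rng.normal(size=(500, 64))
texts = [f"description {j}" for j in range(500)]
print(greedy_text_basis(C, T, texts))
```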
The paper concludes with some limitations of this approach. For a start, only direct influences are studied, but the predictions of each layer do not happen in isolation, and changes in the early layers can impact the values at later stages. Additional insight could come from studying higher-order interactions, at the expense of simplicity and potentially interpretability.

Figure 6. For the three reported heads, this image shows the descriptions with the highest (green border) and lowest (red border) similarity between $c_{head}^{l,h}$ and the provided text. Taking the top-left case as an example: within this head, which specialises in geolocation, the image patches that determine that the photo was taken in Paris are the same ones that negatively affect the probability that it was taken in the Australian bushlands (Figure 6 in [Gan24I]).

More importantly, however, not all heads have clear roles. This can be alleviated by adding more candidate text descriptions, but there is also the chance that some parts of the model lack a coherent interpretation when taken in isolation.

CLIP models have had a remarkable impact on the field of multimodal learning, and studies like this help make them more understandable on a component-by-component level. The hope is that this will lead to more robust and interpretable models in the future, inverting a trend that has seen explainable AI techniques remain of limited use outside academia.