Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders

Published on August 1, 2024 9:08 PM GMT


 

This is a follow-up to a previous post on finding interpretable and steerable features in CLIP.

Introduction

CLIP is a neural network commonly used to guide image diffusion. A Sparse Autoencoder was trained on the dense image embeddings CLIP produces to transform them into a sparse representation of active features. These features seem to represent individual units of meaning. They can also be manipulated in groups (combinations of multiple active features) that represent intuitive concepts. These groups can be understood entirely visually, and often encode surprisingly rich and interesting conceptual detail.

By directly manipulating these groups as single units, image generation can be edited and guided without using prompting or language input. Concepts that were difficult to specify or edit by text prompting become easy and intuitive to manipulate in this new visual representation.

Since many models use the same CLIP joint representation space that this work analyzed, this technique works to control many popular image models out of the box.

Summary of Results

Training Sparse Autoencoders on CLIP

 

CLIP is used as part of many popular diffusion models. Credit to Justin P for the graphic.

 

CLIP is a joint image and text embedding model trained using contrastive learning. It consists of two neural networks trained jointly - an image encoder and a text encoder. CLIP is popular in diffusion models (above) as a component to condition the diffusion process based on a text or image input.

We analyzed the outputs of the image encoder model by embedding images from a large dataset. In the previous post[1], we trained a Sparse Autoencoder on CLIP image embeddings and found that the features seemed to map onto monosemantic visual concepts, which we labelled with machine autointerpretation[2].
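As a concrete illustration, here is a minimal sketch of how such dense image embeddings can be produced with an off-the-shelf CLIP image encoder. The specific checkpoint is an assumption (ViT-bigG-14 is chosen because it produces 1280-dimensional image embeddings, matching the width described below), not necessarily the encoder used in this work.

```python
import torch
from PIL import Image
import open_clip

# Checkpoint choice is an assumption: ViT-bigG-14 produces 1280-dimensional
# image embeddings, matching the embedding width described below.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-bigG-14", pretrained="laion2b_s39b_b160k"
)
model.eval()

@torch.no_grad()
def embed_image(path: str) -> torch.Tensor:
    """Return the dense CLIP image embedding for one image."""
    image = preprocess(Image.open(path)).unsqueeze(0)  # (1, 3, H, W)
    return model.encode_image(image)                   # (1, 1280)
```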

In this post, we improved the SAE training with the goal of steering CLIP embeddings directly. Because there aren't great metrics for what counts as human-interpretable, we chose between runs in the sweep by sampling the feature visualizations from each trained SAE.

We chose a vanilla SAE with a 64x expansion on the 1280-dimensional image embeddings. This results in 81,920 learned features.

This run was trained on a larger and more varied set of 5 million image embeddings from LAION-2B, roughly 3x more data than the run in the previous post used.
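For concreteness, a minimal sketch of a vanilla SAE of this shape is shown below. The loss weighting and other hyperparameters are illustrative assumptions, not the exact training setup used for this run.

```python
import torch
import torch.nn as nn

class VanillaSAE(nn.Module):
    """Vanilla sparse autoencoder over CLIP image embeddings (sketch)."""

    def __init__(self, d_in: int = 1280, expansion: int = 64, l1_coeff: float = 1e-3):
        super().__init__()
        d_hidden = d_in * expansion              # 1280 * 64 = 81,920 learned features
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_in)
        self.l1_coeff = l1_coeff                 # sparsity penalty weight (illustrative)

    def forward(self, x: torch.Tensor):
        acts = torch.relu(self.encoder(x))       # non-negative, sparse feature activations
        recon = self.decoder(acts)               # reconstruction of the CLIP embedding
        loss = (recon - x).pow(2).mean() + self.l1_coeff * acts.abs().sum(-1).mean()
        return recon, acts, loss
```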

Training Performance

Some specific details on our training performance: a portion of the learned features are dead, and the SAE achieves roughly 65% explained variance on a new dataset.

 

Weights

The trained weights of the sparse autoencoder are open source here.

Inspecting Images by Feature Activations

SAEs trained on CLIP can find hundreds of active features in an image. While these features seem interpretable, manipulating and steering them becomes a challenging endeavor: there are many active features, and multiple features often seem to represent the same thing, e.g. several similar features for cats or for patterns.
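As a sketch of this inspection step (assuming the SAE interface sketched above), one can encode an image embedding and rank features by activation:

```python
import torch

@torch.no_grad()
def top_features(sae, clip_embedding: torch.Tensor, top_n: int = 20):
    """Count active SAE features for one image and return the strongest ones."""
    acts = torch.relu(sae.encoder(clip_embedding)).squeeze(0)  # (n_features,)
    n_active = int((acts > 0).sum())                           # often hundreds are nonzero
    values, indices = acts.topk(top_n)
    return n_active, list(zip(indices.tolist(), values.tolist()))
```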

These features, though plentiful, are not actually redundant. We find that the SAE learns to encode specific nuances of a subject as distinct features, something it does less well in a higher-sparsity regime. Sparser SAEs tend to have features that encode the subject matter more generically, combining fewer features per concept. This comes at the cost of a higher reconstruction error, which makes it harder to analyze a particular subject in all of its visual detail.

If each feature is indeed important to expressing the overall concept, as we claim, then we need new ways to work with larger conceptual units so that these features are easier for humans to use and interpret.

Performing Iterated Grouping

Our approach to improving control for image generation is to find groupings of features.

When grouping features, our first aim is to slice them along boundaries that feel natural and intuitive to humans. We also want a grouping in which each group contributes meaningfully, i.e. is clearly important to the end result, without small groups that have little or no perceived effect.

Conveniently, we find that activation mass is a reasonable proxy for importance. Features that have a low activation value tend to be less salient than those that have a high value. For instance, features that correspond to fog produce less prominent fog effects as the activation value decreases. Decreasing the value of features about subjects (e.g. an animal in the picture) makes the subject smaller until it disappears.

We can exploit this property by aiming for groupings where each group has a reasonable share of the activation mass. Inspired by previous work on the surprising effectiveness of k-means[3], we also employ k-means to cluster the linear directions of active features.

But how do we choose the value of k — the number of clusters? Some images are more complex than others, and so should be grouped into more clusters in order to form a coherent set of groups. But we also don’t want groups that are too small, which contain information that is relatively unimportant and contributes little to understanding the concepts in the embedding.

Our current approach is to try a range of values of k and choose the largest k for which every cluster meets a specified activation-mass threshold. Because activation mass is a reasonable proxy for importance, this at least ensures that all groups contribute meaningfully to the overall concept, even if activation mass isn't distributed evenly across the clusters.

More specifically, this process can be formalized as:

1. Encode an input image to obtain sparse feature directions $d_i$ and activations $a_i$.
2. Normalize the feature directions to unit length: $\hat{d}_i = d_i / \lVert d_i \rVert$ for each feature $i$.
3. Apply spherical k-means to cluster $\{\hat{d}_i\}$ into $k$ clusters, weighted by their activations $a_i$. This is done for a range of $k$ values to try multiple different numbers of clusters.
4. Select the largest $k$ for which all clusters meet a minimum activation mass requirement: $\sum_{i \in C_j} a_i \geq \tau$ for all $j \in \{1, \dots, k\}$, where $C_1, \dots, C_k$ are the clusters resulting from k-means, and $\tau$ is a user-selectable threshold from a predefined set of values.

In our dataset, we found a maximum $k$ of 20 and a threshold $\tau = 5$ to be most useful, but this depends on the specific SAE trained.
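A rough sketch of this procedure is below. It uses standard k-means on unit-normalized directions as a stand-in for spherical k-means, with activation values as sample weights; the exact library calls and the fallback behavior are assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def iterated_grouping(directions, activations, k_values=range(2, 21), tau=5.0):
    """Group active SAE feature directions into concept clusters (sketch).

    directions:  (n_active, d) decoder directions of the active features
    activations: (n_active,) activation values of those features
    Returns cluster labels for the largest k whose clusters all reach an
    activation mass of at least tau; falls back to a single group otherwise.
    """
    # Unit-normalize so Euclidean k-means approximates clustering by direction.
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)

    best_labels = np.zeros(len(unit), dtype=int)
    for k in k_values:
        if k > len(unit):
            break
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(unit, sample_weight=activations)
        mass = np.array([activations[labels == j].sum() for j in range(k)])
        if (mass >= tau).all():
            best_labels = labels  # keep the largest k that satisfies the threshold
    return best_labels
```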

Feature Visualization

Previous work on other image models (Olah et al. 2017)[4] focused on feature visualization via optimization.

In this work, because the features we find represent linear directions in CLIP embedding space, we can apply a simpler approach and use pre-trained models conditioned on CLIP to visualize the effect of features. We intervene on the sparse representation with a set of features, then diffuse an image from the resulting embedding to form an icon.
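A minimal sketch of that intervention is below; `decode_fn` stands in for any CLIP-image-embedding-conditioned generator (for example, a Kandinsky 2.2 decoder pipeline) and is a hypothetical placeholder rather than a specific API.

```python
import torch

@torch.no_grad()
def visualize_group(sae, clip_embedding: torch.Tensor, group_idx,
                    scale: float = 2.0, decode_fn=None):
    """Amplify one group of SAE features and decode an icon from the result."""
    acts = torch.relu(sae.encoder(clip_embedding))  # sparse feature activations
    acts[..., group_idx] *= scale                   # intervene on the chosen feature group
    steered = sae.decoder(acts)                     # back into CLIP embedding space
    return decode_fn(steered) if decode_fn is not None else steered
```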

Doing this visualization requires some tuning in order to optimize for different facets of what we care about when interpreting features:

These specific values were found by iteratively trying and exploring many different visualization approaches. They’re also specific to the diffusion model we chose, and we found the optimal values are slightly different for other diffusion models we tried, like Stable Cascade.

Applications

Many open source generative image models use CLIP as conditioning for image diffusion. Notably, though the prototype uses Kandinsky 2.2, steering works equally well for other leading open source models like Stable Cascade.

Limitations

Related Work

Some related work on Sparse Autoencoders trained on CLIP:

Conclusion

We find that an arbitrary image can be decomposed into its constituent concepts, and these concepts can be used to effectively steer generative image models. Concepts, which are groups of features found via k-means, can be manipulated as a unit to yield more interpretable and useful user affordances. This suggests that grouping active features together could be more widely useful for SAEs in other domains.

Thanks to friends who reviewed a draft of this post, including Linus Lee, David McSharry, Mehran Jalali, Noa Nabeshima, and others.

 

  1. ^
  2. ^
  3. ^
  4. ^
  5. ^
  6. ^


