[Paper] Automated Feature Labeling with Token-Space Gradient Descent

Published on April 30, 2025 10:22 AM GMT

This post gives a brief overview and some personal thoughts on a new ICLR workshop paper that I worked on together with Seamus.

TLDR:

In this project, we developed a proof of concept for a novel way to automatically label features: directly optimizing the feature label via token-space gradient descent. We show its performance on several synthetic toy features. We have discontinued developing this method because it didn't perform as well as we had hoped, and I'm now more excited about research in other directions.

Automated Feature Labeling

A central method for Mechanistic Interpretability is decomposing neural network activations into linear features via methods like Sparse Autoencoders, Transcoders, or Crosscoders. While these decompositions give you somewhat human-understandable features, they don't provide an explanation of what these features actually mean. Because there are so many features in modern LLMs, we need some automatic way to find these descriptions. This is where automatic feature labeling methods come in.

Previous Methods

Previous methods for automated feature labeling work by collecting text in which specific tokens activate the feature, and then prompting an LLM to hypothesize what it means for the feature to be active in these situations. Once you have such a hypothesis, you can validate it by having an LLM predict when it thinks the feature should be active, and checking that against the actual activations. If the predictions don't match the actual activations, you can prompt the LLM again with the counterexamples and ask it to come up with a new feature label.
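As a rough illustration of this generate-then-validate loop, here is a minimal sketch. The two helper functions are trivial stand-ins for the real LLM calls, and the tokens and feature are invented for the example; none of this reflects the paper's actual prompts.

```python
# Toy sketch of the generate-then-validate labeling loop described above.
tokens = ["dog", "cat", "snake", "seven"]
activation = {"dog": 1, "cat": 1, "snake": 1, "seven": 0}  # feature fires on animals

def propose_label(counterexamples):
    # Stand-in for: "prompt an LLM with activating contexts (and any
    # counterexamples) and ask for a feature-label hypothesis".
    return "animal" if counterexamples else "mammal"

def predict_active(label, token):
    # Stand-in for: "ask an LLM whether it expects the feature described
    # by `label` to activate on `token`".
    category = {"mammal": {"dog", "cat"}, "animal": {"dog", "cat", "snake"}}
    return token in category[label]

label, counterexamples = None, []
for _ in range(5):  # refine until predictions match the actual activations
    label = propose_label(counterexamples)
    counterexamples = [t for t in tokens
                       if int(predict_active(label, t)) != activation[t]]
    if not counterexamples:
        break

print(label)  # the loop rejects "mammal" (it misses "snake") and settles on "animal"
```

The failure mode the post discusses lives in `propose_label`: if the LLM never generates the right hypothesis, no amount of validation fixes it.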

This process crucially depends on LLMs generating correct hypotheses for feature labels given the right data, which might be problematic for several reasons.

So we developed a method where LLMs only do the rating—they only take a feature label and see what activations they would predict given this label—but the hypotheses themselves are generated via gradient descent.

Our method

Our method is fundamentally different from previous approaches. Instead of asking an LLM to generate hypotheses about what a feature means, we use the LLM as a discriminator to evaluate potential labels.

The key insight is that we can frame feature labeling as an optimization problem. A good feature label is one that allows accurate prediction of when a feature will activate. By using an LLM to assess whether a token matches a given feature description, we create a differentiable pipeline that enables gradient-based optimization of the label itself.
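To make this concrete, here is one way to write the prediction-error part of the objective; the notation is chosen for this post, not taken from the paper. Let $p$ be a probability distribution over the vocabulary $V$ (the soft label), let $a_t \in \{0, 1\}$ be the feature's activation on dataset token $t$, and let $r(t, w) \in [0, 1]$ be the LLM rater's probability that token $t$ is an example of label $w$. Then the predicted activation and a cross-entropy loss over $N$ tokens are

```latex
\hat{a}_t = \sum_{w \in V} p_w \, r(t, w), \qquad
\mathcal{L}(p) = -\frac{1}{N} \sum_{t=1}^{N}
  \left[ a_t \log \hat{a}_t + (1 - a_t) \log\left(1 - \hat{a}_t\right) \right].
```

Since $\hat{a}_t$ is linear in $p$ and $r$ comes from the LLM's own logits, $\mathcal{L}$ is differentiable with respect to the logits parameterizing $p$, which is what makes gradient descent on the label possible.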

Here's how it works:

    1. We represent a potential feature label as a probability distribution over tokens.
    2. For each token in our dataset, we ask the LLM: "Is this token an example of [feature label]?"
    3. We compare the LLM's binary prediction to the actual feature activation.
    4. We use gradient descent to update our token probability distribution to minimize prediction error.
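The four steps above can be sketched end-to-end with a toy, fully deterministic rater. The matrix `R` below is an invented stand-in for the LLM's yes-probabilities (the real method backpropagates through the LLM itself), and the vocabulary, tokens, and feature are made up for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy vocabulary of candidate single-token labels (a stand-in for the
# full model vocabulary used in the paper).
labels = ["animal", "number", "color"]

# Dataset tokens and the synthetic feature's binary activations
# (the feature fires only on animal tokens).
tokens = ["dog", "cat", "seven", "blue", "tree"]
y = np.array([1.0, 1.0, 0.0, 0.0, 0.0])

# R[t, l] stands in for the frozen LLM rater: the probability it assigns
# to "token t is an example of label l".
R = np.array([
    [0.95, 0.05, 0.05],  # dog
    [0.95, 0.05, 0.05],  # cat
    [0.05, 0.95, 0.05],  # seven
    [0.05, 0.05, 0.95],  # blue
    [0.10, 0.05, 0.05],  # tree
])

logits = np.zeros(len(labels))  # token-space parameters being optimized
lr, eps = 1.0, 1e-9
for _ in range(300):
    p = softmax(logits)   # step 1: probability distribution over label tokens
    pred = R @ p          # step 2: expected "matches feature" probability per token
    # Step 3: mean binary cross-entropy between rater predictions and activations;
    # grad_pred is its derivative w.r.t. pred.
    grad_pred = (pred - y) / np.clip(pred * (1.0 - pred), eps, None) / len(y)
    grad_p = R.T @ grad_pred  # backprop through the rater average
    # Step 4: backprop through softmax and take a gradient step.
    grad_logits = p * (grad_p - np.dot(p, grad_p))
    logits -= lr * grad_logits

best = labels[int(np.argmax(softmax(logits)))]
print(best)  # -> "animal"
```

With these made-up rater probabilities, the probability mass concentrates on "animal", mirroring the trajectory described below for the synthetic animal feature.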

This process is visualized in the plot above. The LLM predicts whether each token matches the feature description, and we compare these predictions to the actual feature activations (here a synthetic feature that is 1 for tokens describing animals and 0 otherwise). The bottom plot shows the optimization trajectory in token space, where the probability mass concentrates on the token "animal" after several hundred steps.

Our loss function combines three components.

By optimizing these objectives simultaneously, we can find interpretable and accurate feature labels.

Results

We tested our method on several synthetic features and found that it successfully converged to meaningful single-token labels in many cases.

The optimization trajectories show how token probability distributions evolve during training for successfully labeled features.

However, we also encountered limitations and failure cases.

Next Steps

Several limitations could be addressed in future work:

    1. Single-token labels: Currently, our method only optimizes single tokens, which limits the complexity of concepts it can express. This could be improved by optimizing sequences of multiple tokens.
    2. Hierarchical categories: When insufficient examples of distinguishing cases are present, the method tends to default to broader categories. This could be addressed with more sophisticated sampling strategies.
    3. Model capability: If the evaluation model can't reliably classify instances of a feature even when given the correct label, our method will fail.
    4. Hyperparameter tuning: The method currently requires some human intervention to tune hyperparameters for different features.

It's pretty cool that this method works at all. We think more work in this direction could be exciting, particularly for safety-critical interpretability tasks where having an alternative to fully LLM-generated explanations might be valuable.

However, we're not super hyped about expanding the method ourselves, for a few reasons.

If anyone is interested in working on this approach, our code is available on GitHub. We're happy to chat more about details; the prompts, datasets, and hyperparameters can be found in the paper.



