AI News · April 23, 20:17
How does AI judge? Anthropic studies the values of Claude

Anthropic's research team has developed a privacy-preserving method for observing and classifying the values the AI model Claude exhibits in real-world use. The study analysed 700,000 user conversations and identified the values Claude expresses, spanning practical, epistemic, social, protective, and personal values. The results show that Claude's values are broadly consistent with Anthropic's “helpful, honest, and harmless” objectives, though rare instances of values contrary to the training objectives were also found. The methodology helps monitor AI behaviour and provides early warning for detecting AI misuse, underscoring the importance of understanding the values AI expresses.

🔍 Anthropic developed a privacy-preserving method for analysing the values the AI model Claude expresses in real user conversations, offering insight into how the AI actually behaves.

📊 The study analysed 700,000 anonymised conversations from Claude.ai users, of which 308,210 were selected for in-depth value analysis, revealing the values Claude expresses, ranked by frequency of occurrence.

💡 The study found that Claude mainly expresses five broad categories of values: practical, epistemic, social, protective, and personal. These are broadly consistent with Anthropic's “helpful, honest, and harmless” objectives.

⚠️ Although Claude's values are broadly consistent with its training objectives, the study also found rare expressions of values that run counter to them, likely linked to users bypassing the AI's safety guardrails.

🔄 The study shows that Claude adjusts the values it expresses to the conversational context; for example, when handling relationship advice it emphasises values such as “healthy boundaries” and “mutual respect”.

AI models like Anthropic’s Claude are increasingly asked not just for factual recall, but for guidance involving complex human values. Whether it’s parenting advice, workplace conflict resolution, or help drafting an apology, the AI’s response inherently reflects a set of underlying principles. But how can we truly understand which values an AI expresses when interacting with millions of users?

In a research paper, the Societal Impacts team at Anthropic details a privacy-preserving methodology designed to observe and categorise the values Claude exhibits “in the wild.” This offers a glimpse into how AI alignment efforts translate into real-world behaviour.

The core challenge lies in the nature of modern AI. These aren’t simple programs following rigid rules; their decision-making processes are often opaque.

Anthropic says it explicitly aims to instil certain principles in Claude, striving to make it “helpful, honest, and harmless.” This is achieved through techniques like Constitutional AI and character training, where preferred behaviours are defined and reinforced.

However, the company acknowledges the uncertainty. “As with any aspect of AI training, we can’t be certain that the model will stick to our preferred values,” the research states.

“What we need is a way of rigorously observing the values of an AI model as it responds to users ‘in the wild’ […] How rigidly does it stick to the values? How much are the values it expresses influenced by the particular context of the conversation? Did all our training actually work?”

Analysing Anthropic Claude to observe AI values at scale

To answer these questions, Anthropic developed a sophisticated system that analyses anonymised user conversations. This system removes personally identifiable information before using language models to summarise interactions and extract the values being expressed by Claude. The process allows researchers to build a high-level taxonomy of these values without compromising user privacy.
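The paper does not publish its pipeline, but the workflow described above (scrub identifying details, extract the values each reply expresses, then aggregate) can be sketched roughly as follows. Everything here is an illustrative stand-in: the regex scrubber, the keyword lexicon in `VALUE_CUES`, and the function names are assumptions, and the real study used language models rather than keyword matching to extract values.

```python
import re
from collections import Counter

# Hypothetical sketch of a privacy-preserving value-extraction pipeline:
# scrub identifying details, tag the values a reply expresses, aggregate counts.
# None of this is Anthropic's actual code.

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),   # email addresses
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),      # phone-like numbers
]

# Toy lexicon mapping surface cues to value labels (illustrative only).
VALUE_CUES = {
    "transparency": ["to be transparent", "i should be clear that"],
    "clarity": ["to put it simply", "in other words"],
    "healthy boundaries": ["set boundaries", "healthy boundaries"],
    "historical accuracy": ["the historical record", "historically accurate"],
}

def scrub_pii(text: str) -> str:
    """Remove personally identifiable details before any further analysis."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

def extract_values(assistant_reply: str) -> set[str]:
    """Return the set of value labels whose cues appear in the reply."""
    lowered = assistant_reply.lower()
    return {
        value
        for value, cues in VALUE_CUES.items()
        if any(cue in lowered for cue in cues)
    }

def build_value_counts(conversations: list[list[dict]]) -> Counter:
    """Aggregate value labels across many conversations into frequency counts."""
    counts: Counter = Counter()
    for conversation in conversations:
        for turn in conversation:
            if turn["role"] != "assistant":
                continue
            reply = scrub_pii(turn["content"])
            counts.update(extract_values(reply))
    return counts

if __name__ == "__main__":
    demo = [[
        {"role": "user", "content": "Email me at jane@example.com about my ex."},
        {"role": "assistant", "content": "It may help to set boundaries here."},
    ]]
    print(build_value_counts(demo).most_common())  # [('healthy boundaries', 1)]
```

In the real system, the aggregated counts feed the taxonomy described below rather than a flat counter.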

The study analysed a substantial dataset: 700,000 anonymised conversations from Claude.ai Free and Pro users over one week in February 2025, predominantly involving the Claude 3.5 Sonnet model. After filtering out purely factual or non-value-laden exchanges, 308,210 conversations (approximately 44% of the total) remained for in-depth value analysis.

The analysis revealed a hierarchical structure of values expressed by Claude. Five high-level categories emerged, ordered by prevalence:

    Practical values: Emphasising efficiency, usefulness, and goal achievement.
    Epistemic values: Relating to knowledge, truth, accuracy, and intellectual honesty.
    Social values: Concerning interpersonal interactions, community, fairness, and collaboration.
    Protective values: Focusing on safety, security, well-being, and harm avoidance.
    Personal values: Centred on individual growth, autonomy, authenticity, and self-reflection.

These top-level categories branched into more specific subcategories like “professional and technical excellence” or “critical thinking.” At the most granular level, frequently observed values included “professionalism,” “clarity,” and “transparency” – fitting for an AI assistant.
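Read together, the five categories, their subcategories, and the granular values form a tree. The toy structure below uses only labels named in this article, and the placement of each subcategory and granular value under a particular branch is a guess for illustration; the paper's published taxonomy is far larger and is the authoritative source for those assignments.

```python
# Illustrative sketch of the reported hierarchy: category -> subcategory -> granular values.
# Branch assignments below are guesses; only the labels themselves come from the article.
VALUE_TAXONOMY = {
    "practical": {"professional and technical excellence": ["professionalism", "clarity"]},
    "epistemic": {"critical thinking": ["transparency", "epistemic humility"]},
    "social": {"interpersonal interaction": ["mutual respect", "healthy boundaries"]},
    "protective": {"harm avoidance": ["patient wellbeing"]},
    "personal": {"individual growth": ["authenticity", "self-reflection"]},
}

def granular_values(taxonomy: dict) -> list[str]:
    """Flatten the hierarchy into the list of granular value labels."""
    return [v for subcats in taxonomy.values() for vals in subcats.values() for v in vals]

print(granular_values(VALUE_TAXONOMY))
```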

Critically, the research suggests Anthropic’s alignment efforts are broadly successful. The expressed values often map well onto the “helpful, honest, and harmless” objectives. For instance, “user enablement” aligns with helpfulness, “epistemic humility” with honesty, and values like “patient wellbeing” (when relevant) with harmlessness.
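Those correspondences can be written down directly; the mapping below contains only the three examples the article gives and is not an exhaustive or official crosswalk.

```python
# The three value-to-objective correspondences named above; purely illustrative.
HHH_ALIGNMENT = {
    "user enablement": "helpful",
    "epistemic humility": "honest",
    "patient wellbeing": "harmless",
}

for value, objective in HHH_ALIGNMENT.items():
    print(f"{value!r} maps onto the {objective!r} objective")
```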

Nuance, context, and cautionary signs

However, the picture isn’t uniformly positive. The analysis identified rare instances where Claude expressed values starkly opposed to its training, such as “dominance” and “amorality.”

Anthropic suggests a likely cause: “The most likely explanation is that the conversations that were included in these clusters were from jailbreaks, where users have used special techniques to bypass the usual guardrails that govern the model’s behavior.”

Far from being solely a concern, this finding highlights a potential benefit: the value-observation method could serve as an early warning system for detecting attempts to misuse the AI.
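As a rough sketch of what such an early-warning check might look like (this is not Anthropic's system), a monitoring job could flag any conversation whose extracted values fall into clusters that should never surface under normal guardrails, such as “dominance” or “amorality”, and route it for human review.

```python
from dataclasses import dataclass

# Values that, per the article, appeared only in rare, likely-jailbroken
# conversations; treating them as tripwires is an illustrative assumption.
TRIPWIRE_VALUES = {"dominance", "amorality"}

@dataclass
class ConversationRecord:
    conversation_id: str
    expressed_values: set[str]  # output of the value-extraction stage

def flag_possible_misuse(records: list[ConversationRecord]) -> list[str]:
    """Return IDs of conversations expressing values that contradict training."""
    return [
        r.conversation_id
        for r in records
        if r.expressed_values & TRIPWIRE_VALUES
    ]

if __name__ == "__main__":
    sample = [
        ConversationRecord("c-001", {"clarity", "professionalism"}),
        ConversationRecord("c-002", {"dominance"}),
    ]
    print(flag_possible_misuse(sample))  # -> ['c-002']
```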

The study also confirmed that, much like humans, Claude adapts its value expression based on the situation.

When users sought advice on romantic relationships, values like “healthy boundaries” and “mutual respect” were disproportionately emphasised. When asked to analyse controversial history, “historical accuracy” came strongly to the fore. This demonstrates a level of contextual sophistication beyond what static, pre-deployment tests might reveal.
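One way to make “disproportionately emphasised” concrete is to compare how often a value appears within one kind of conversation against its rate across the whole corpus; a ratio well above 1 signals context-specific emphasis. The sketch and the counts below are hypothetical and are not the statistic or figures used in the paper.

```python
from collections import Counter

def emphasis_ratio(topic_counts: Counter, overall_counts: Counter, value: str) -> float:
    """How over-represented a value is within one topic vs. all conversations."""
    topic_total = sum(topic_counts.values())
    overall_total = sum(overall_counts.values())
    topic_rate = topic_counts[value] / topic_total if topic_total else 0.0
    overall_rate = overall_counts[value] / overall_total if overall_total else 0.0
    return topic_rate / overall_rate if overall_rate else float("inf")

# Made-up counts: 'healthy boundaries' shows up far more often in
# relationship-advice conversations than across the corpus as a whole.
relationship = Counter({"healthy boundaries": 40, "clarity": 10})
overall = Counter({"healthy boundaries": 50, "clarity": 500, "professionalism": 450})

print(round(emphasis_ratio(relationship, overall, "healthy boundaries"), 1))  # ~16.0
```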

Furthermore, Claude’s interaction with user-expressed values proved multifaceted, ranging from mirroring and supporting the user’s values in most exchanges to reframing them or, in a small minority of cases, resisting them outright.

Limitations and future directions

Anthropic is candid about the method’s limitations. Defining and categorising “values” is inherently complex and potentially subjective. Using Claude itself to power the categorisation might introduce bias towards its own operational principles.

This method is designed for monitoring AI behaviour post-deployment, requiring substantial real-world data and cannot replace pre-deployment evaluations. However, this is also a strength, enabling the detection of issues – including sophisticated jailbreaks – that only manifest during live interactions.

The research concludes that understanding the values AI models express is fundamental to the goal of AI alignment.

“AI models will inevitably have to make value judgments,” the paper states. “If we want those judgments to be congruent with our own values […] then we need to have ways of testing which values a model expresses in the real world.”

This work provides a powerful, data-driven approach to achieving that understanding. Anthropic has also released an open dataset derived from the study, allowing other researchers to further explore AI values in practice. This transparency marks a vital step in collectively navigating the ethical landscape of sophisticated AI.

See also: Google introduces AI reasoning control in Gemini 2.5 Flash

