Anthropic details its AI safety strategy

AI News 14小时前

../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

Anthropic公司近日详细阐述了其AI模型Claude的安全策略，旨在确保模型在提供帮助的同时，避免传播有害信息。该策略的核心是Anthropic的“Safeguards”团队，该团队由政策专家、数据科学家、工程师和威胁分析师组成，专注于识别和防范潜在的滥用行为。其安全方法是多层次的防御体系，从制定使用政策到主动追踪新兴威胁。文章详细介绍了使用政策的制定、统一危害框架的应用、外部专家进行的政策漏洞测试，以及在2024年美国大选期间为确保信息准确性而采取的措施。此外，还强调了在模型开发初期就融入安全原则的重要性，通过与危机支持专家的合作，使Claude能够以更具同理心的方式处理敏感对话。模型在发布前会经过安全评估、风险评估和偏见评估等多重严苛测试，以确保其合规性和公平性。模型上线后，通过自动化系统和人工审查相结合的方式进行实时监控，利用专门的“分类器”模型来识别和处理违规行为，并持续追踪大规模滥用趋势，积极与各方合作，共同构建更完善的安全保障。

🗄️ **多层级安全防御体系**：Anthropic的AI安全策略并非单一措施，而是构建了一个多层级的防御体系。这包括明确的使用政策，为Claude的交互设定了行为规范，覆盖了选举诚信、儿童安全以及金融、医疗等敏感领域的负责任使用。同时，通过“统一危害框架”系统性地评估潜在的负面影响，并邀请外部专家进行“政策漏洞测试”，主动发现模型潜在的弱点。

🧠 **安全原则内嵌与专业合作**：安全理念从模型训练的早期阶段就已被融入。Safeguards团队与开发者紧密合作，将安全价值观直接植入Claude模型。此外，Anthropic还与专业机构合作，例如与危机支持领导者ThroughLine合作，教会Claude如何以关怀和恰当的方式处理关于心理健康和自残的敏感对话，而不是简单拒绝，从而使其能够拒绝非法活动、恶意代码编写或诈骗等请求。

🧪 **多维度严苛测试确保上线安全**：在Claude的任何新版本发布前，都会经过严格的多维度评估。这包括“安全评估”，检测模型在复杂对话中的合规性；“风险评估”，针对网络威胁或生物风险等高风险领域进行专项测试，有时还会引入政府和行业伙伴的协助；以及“偏见评估”，确保模型对所有用户都能提供可靠、准确的答案，并测试是否存在政治偏见或基于性别、种族的倾斜回应。这些测试旨在验证训练效果并识别是否需要额外的保护措施。

🔍 **持续监控与威胁狩猎**：Claude上线后，Anthropic通过自动化系统和人工审查相结合的方式进行持续监控。专门的“分类器”模型能实时识别政策违规行为，并触发相应的干预措施，如引导模型规避有害回复或对违规用户发出警告甚至关闭账户。此外，团队还利用隐私保护工具分析使用趋势，通过“分层汇总”等技术识别大规模滥用行为，并积极在不良行为者可能出现的论坛中搜寻新的威胁，展现了其“永不休眠”的安全策略。

Anthropic has detailed its safety strategy to try and keep its popular AI model, Claude, helpful while avoiding perpetuating harms.

Central to this effort is Anthropic’s Safeguards team; who aren’t your average tech support group, they’re a mix of policy experts, data scientists, engineers, and threat analysts who know how bad actors think.

However, Anthropic’s approach to safety isn’t a single wall but more like a castle with multiple layers of defence. It all starts with creating the right rules and ends with hunting down new threats in the wild.

First up is the Usage Policy, which is basically the rulebook for how Claude should and shouldn’t be used. It gives clear guidance on big issues like election integrity and child safety, and also on using Claude responsibly in sensitive fields like finance or healthcare.

To shape these rules, the team uses a Unified Harm Framework. This helps them think through any potential negative impacts, from physical and psychological to economic and societal harm. It’s less of a formal grading system and more of a structured way to weigh the risks when making decisions. They also bring in outside experts for Policy Vulnerability Tests. These specialists in areas like terrorism and child safety try to “break” Claude with tough questions to see where the weaknesses are.

We saw this in action during the 2024 US elections. After working with the Institute for Strategic Dialogue, Anthropic realised Claude might give out old voting information. So, they added a banner that pointed users to TurboVote, a reliable source for up-to-date, non-partisan election info.

Teaching Claude right from wrong

The Anthropic Safeguards team works closely with the developers who train Claude to build safety from the start. This means deciding what kinds of things Claude should and shouldn’t do, and embedding those values into the model itself.

They also team up with specialists to get this right. For example, by partnering with ThroughLine, a crisis support leader, they’ve taught Claude how to handle sensitive conversations about mental health and self-harm with care, rather than just refusing to talk. This careful training is why Claude will turn down requests to help with illegal activities, write malicious code, or create scams.

Before any new version of Claude goes live, it’s put through its paces with three key types of evaluation.

Safety evaluations:

Risk assessments:

Bias evaluations:

This intense testing helps the team see if the training has stuck and tells them if they need to build extra protections before launch.

Anthropic’s never-sleeping AI safety strategy

Once Claude is out in the world, a mix of automated systems and human reviewers keep an eye out for trouble. The main tool here is a set of specialised Claude models called “classifiers” that are trained to spot specific policy violations in real-time as they happen.

If a classifier spots a problem, it can trigger different actions. It might steer Claude’s response away from generating something harmful, like spam. For repeat offenders, the team might issue warnings or even shut down the account.

The team also looks at the bigger picture. They use privacy-friendly tools to spot trends in how Claude is being used and employ techniques like hierarchical summarisation to spot large-scale misuse, such as coordinated influence campaigns. They are constantly hunting for new threats, digging through data, and monitoring forums where bad actors might hang out.

However, Anthropic says it knows that ensuring AI safety isn’t a job they can do alone. They’re actively working with researchers, policymakers, and the public to build the best safeguards possible.

(Lead image by Nick Fewings)

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

The post Anthropic details its AI safety strategy appeared first on AI News.

Teaching Claude right from wrong

Anthropic’s never-sleeping AI safety strategy

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签