An Outsider’s Roadmap into AI Safety Research (2025)

 

This article is a comprehensive guide for people hoping to move into AI safety research from data science or other fields. It stresses the urgency of AI safety research, noting that far fewer researchers worldwide work on safety than on advancing AI capabilities. Drawing on the author's own experience and a range of resources, it lays out the key research directions within AI safety, including interpretability research, alignment research, and AI control and governance, and analyzes each direction's characteristics, required skills, and career prospects. It also provides concrete steps from theoretical study to hands-on practice, covering the necessary technical skills, mathematical foundations, and deep learning knowledge, as well as how to build a portfolio and engage with the community to become more competitive. The guide aims to help readers understand the path into AI safety and to give them actionable advice.

🔍 **The urgency and opportunity of AI safety research**: The article points out that although AI capabilities are advancing rapidly, the number of researchers focused on AI safety is far smaller than the number working to advance those capabilities, leaving a large gap between supply and demand. This imbalance offers motivated professionals a valuable opportunity to make important contributions in an emerging but critical field.

🧭 **The core research areas of AI safety**: AI safety research centers on three key areas: interpretability research (understanding AI's internal workings, akin to doing neuroscience on artificial minds), alignment research (ensuring AI systems remain consistent with human values and goals, akin to designing incentive systems), and AI control and governance (combining technology and policy to set safety standards and regulatory frameworks for advanced AI systems). Each area has its own challenges and methodologies, suited to practitioners with different strengths and interests.

💻 **The essential skill stack and learning path**: A successful AI safety researcher needs solid programming ability (Python and deep learning frameworks such as PyTorch), a firm mathematical foundation (linear algebra, calculus, probability and statistics), and a deep understanding of deep learning theory. The learning path should run from foundational concepts to concrete practice, such as implementing Transformer models and training and analyzing language models, culminating in a GitHub portfolio that demonstrates your abilities.

🤝 **The importance of community involvement and hands-on experience**: The article stresses the value of participating in AI safety communities such as LessWrong and the AI Alignment Forum. Asking questions, sharing small experiments, and joining discussions not only deepens theoretical understanding but also builds connections and opens up practical opportunities, such as red teaming or model interpretability experiments (e.g., activation patching), which are crucial for career development.

Published on July 21, 2025 2:03 AM GMT

One month ago, I decided to transition from data science to AI safety research. As someone currently on this journey, I want to share what I've learned about making this transition from an outsider's perspective. Since I'm starting my AI safety career, I’ve leaned on advice from experts and community resources. Any errors are my own. Please feel encouraged to comment with suggestions or improvements!

This post is for those who, like me, look at the field of AI safety with fascination and uncertainty. Maybe you're a software engineer, a physicist, or working in another field, wondering if you could contribute to reducing AI risks. The truth is... You can!

I've written this guide by synthesizing insights from Neel Nanda's mechanistic interpretability guides, 80,000 Hours' career research, and the broader alignment community, while adding my own personal perspective. You should absolutely check out those resources; they offer invaluable complementary perspectives.

This guide covers up-to-date information on why AI safety has become increasingly urgent, the available roles and research areas, and the theoretical knowledge and practical skills required. I'll also touch on some of the latest developments as of 2025.

A note on scope: This guide focuses on technical AI safety research, the engineering and scientific work aimed at making AI systems safer, specifically in interpretability and alignment. If you're more interested in AI policy or governance, some information will also apply, but you'll want to focus on policy-specific resources.

Why AI Safety, Why Now?

When I first discovered AI safety through Robert Miles and 80,000 Hours, one statistic stuck with me: as of 2022, only a few hundred researchers worldwide were dedicating themselves full-time to ensuring powerful AI systems remain aligned with human values. Meanwhile, tens of thousands work on advancing AI capabilities.

The numbers are staggering. For every person asking "How do we make this safe?" there are dozens asking "How do we make this more powerful?"

My work as a data scientist has taught me that even seemingly “simple” ML systems with clear objectives and rigorous testing can behave in surprising, unintended ways. Now scale that complexity up. We’re deploying systems that can write code, perform strategic reasoning, and shape human behavior, yet we often can’t fully explain how they work or predict how they’ll act in unfamiliar scenarios.

The landscape has changed drastically in the last year. Recent capability jumps from GPT-3 to o3, the development of reasoning models, and rapid improvements in agentic AI systems have made transformative AI feel closer. What felt like a distant-future problem now feels like something we should have started working on years ago.

Major labs like Anthropic, OpenAI, and Google DeepMind are actively hiring, but the pool of qualified candidates remains tiny. This creates an unusual opportunity. The field is mature enough to offer fulfilling career paths and funding, but young enough that newcomers with the right skills can make contributions.

Let's now understand how the roles in the field have evolved.

Mapping the Territory

Roles

In many research groups, titles still fall into two loose buckets: "Research Engineer" and "Research Scientist". Engineers tend to focus on building systems, coding solutions, and scaling experiments, while scientists more often frame hypotheses and write papers.

In practice, the boundary between the two roles has always been fuzzy, and the distinction keeps shrinking. The core problems in AI safety demand a tight, iterative loop between theory and implementation, so the people who thrive are those who move seamlessly between both. They are able to debug a training run and also articulate why a research direction matters for long-term safety.

I’ll call this hybrid profile a full-stack researcher (borrowing the spirit of “full-stack data scientist/engineer”). Today's AI labs seek people who can blend theory and implementation. Meaningful contributions require both deep thinking and effective execution. You might see job postings titled "Research Engineer/Scientist" reflecting this shift.

Critical Domains

The AI safety landscape is vast, but three domains stand out as both accessible to newcomers and crucial for the field's progress. Understanding these will help you choose your initial focus.

    Interpretability Research: Reverse-Engineering Intelligence

Imagine being handed a black box that makes life-or-death decisions, and your job is to figure out how it works. That's interpretability research in a nutshell. It is essentially doing neuroscience on artificial minds, trying to understand not just what they do, but how and why they do it.

Current interpretability researchers are tackling questions like how models represent concepts internally and which computational circuits are responsible for specific behaviors.

Recent breakthroughs are remarkable. Anthropic's team discovered "features" in Claude 3 Sonnet that correspond to concepts ranging from cities and people to more abstract ideas like deception and bias. Their latest work shows they can trace entire computational circuits, revealing how Claude plans ahead when writing poetry and shares concepts across languages.

Even more concerning, joint research between Anthropic and Redwood Research demonstrated that models can engage in "alignment faking", i.e. strategically deceiving their creators during training. This highlights why understanding these systems' inner workings isn't just scientifically interesting, it's urgent.

As a physicist, I find this work the most exciting. It's like having a particle accelerator for intelligence: smashing inputs into models and analyzing the activation patterns to understand the fundamental processes at play.

    Alignment Research: Engineering AI That Shares Our Values

If interpretability is about understanding AI, alignment is about steering it. This tackles the core challenge: how do we ensure that as AI systems become more powerful, they remain beneficial to humanity?

Alignment research feels like designing incentive systems for incredibly powerful entities. You're part game theorist, part philosopher, part engineer, working on problems like how to specify what we actually want and how to oversee systems more capable than their overseers.

Current approaches have evolved rapidly, with major breakthroughs emerging over the past year. Early techniques like Constitutional AI train models on principles rather than examples. More recently, we have run into concerning new challenges like the previously discussed "alignment faking". The field now focuses on scalable oversight through debate protocols and weak-to-strong generalization, where weaker models supervise stronger ones.

What excites me most is how alignment research combines rigorous technical work with deep philosophical questions. You need to understand both gradient descent and moral philosophy, both game theory and human values. These researchers aren't just building better algorithms, they're tackling fundamental questions about what we want from intelligence itself.

    AI Control and Governance: Technical Policy for Advanced Systems

A rapidly growing domain combines technical expertise with policy understanding. As AI capabilities advance, we need people who can both understand the technical details and translate them for policymakers.

This includes work like developing safety standards and regulatory frameworks for advanced AI systems and translating technical findings for policymakers.

Government institutes like the UK's AI Safety Institute and the US AI Safety Institute (housed at NIST) are actively hiring for these hybrid technical-policy roles.

Choosing Your Focus

Choose interpretability if you love taking things apart to see how they work. You'll thrive if you enjoy rapid experimentation and don't mind that your discoveries might raise more questions than they answer.

Choose alignment if you're motivated by designing robust systems that work even when things go wrong. You'll excel if you like thinking about incentives, edge cases, and "what could go wrong?" scenarios. Great for people with security mindsets or experience in adversarial settings.

Choose technical governance if you want to bridge the gap between cutting-edge research and real-world implementation. You'll succeed if you can translate complex technical concepts for diverse audiences and enjoy working at the intersection of technology and policy.

Remember, these areas are deeply connected. Your choice isn't permanent, it's a starting point for a career that will likely span multiple domains. The key is picking a direction that excites you, then building the comprehensive skill set needed to contribute meaningfully to humanity's most important technical challenge.

The Essential Skill Stack

Based on job postings, researcher conversations, and my own transition experience, here's what you need to build. I've organized this into three phases with concrete checkpoints.

Phase 1: Core Technical Skills

AI safety research lives in code. Whether you're probing transformer internals or scaling alignment experiments, you'll be programming daily. This foundation is non-negotiable.

    Programming Excellence

Python mastery is essential. Almost all modern AI work happens in Python. You need comfort with data structures, NumPy operations, and clean coding practices.

Deep learning frameworks: PyTorch dominates research settings. Start with basic neural network training, then work up to transformer implementations, custom layers, distributed training, and more. You can learn these skills while entering the field.
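
To make "basic neural network training" concrete, here's a minimal PyTorch sketch; the synthetic data, tiny model, and hyperparameters are placeholders I chose for illustration, not anything prescribed by the resources above.

```python
# A minimal sketch of basic neural network training in PyTorch, using
# synthetic data so it runs anywhere; a real project would swap in a proper
# dataset and model (everything here is illustrative).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1024, 20)             # 1024 fake samples, 20 features
y = (X.sum(dim=1) > 0).long()         # toy binary labels

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for i in range(0, len(X), 64):    # simple mini-batching
        xb, yb = X[i:i + 64], y[i:i + 64]
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```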

Software engineering fundamentals: Git workflows, testing, and experiment tracking (wandb). You'll also need to learn to manage GPU memory and debug CUDA errors.
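
As a small illustration of experiment tracking, here's a hedged sketch using wandb; the project name and logged metrics are made up, and it assumes you've already run `wandb login`. Logging gradient norms alongside the loss is one habit that pays off when debugging training runs later.

```python
# Minimal experiment-tracking sketch with Weights & Biases (wandb).
# Assumes `wandb login` has been run; project name and metrics are illustrative.
import torch
import wandb

def grad_norm(model: torch.nn.Module) -> float:
    """Total L2 norm of all parameter gradients (useful for debugging training)."""
    return torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])
    ).item()

wandb.init(project="ai-safety-upskilling", config={"lr": 1e-3, "batch_size": 64})

# Inside your training loop, after loss.backward():
# wandb.log({"loss": loss.item(), "grad_norm": grad_norm(model)})
```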

Self-assessment checkpoint: Can you reproduce a simple ML paper from scratch? DeepMind has noted that reproducing papers in "a few hundred hours" indicates research readiness. This isn't just about coding, it forces you to understand experimental nuances and debug when your results don't match.

    Mathematical Foundations

You don't need to become a mathematician, but you do need enough mathematical fluency to follow papers and implement algorithms without constantly getting stuck. You can't meaningfully engage with safety research without understanding the mathematical concepts that drive it.

    Linear algebra

      Begin with 3Blue1Brown's linear algebra series for an intuitive understanding of vectors, matrices as linear maps, and change of basis. Focus on grasping why a vector space is a geometric object.
      Understanding eigenvalues and eigenvectors will also be crucial for analyzing linear transformations.
      When interpretability researchers talk about "features," they're describing directions in high-dimensional vector spaces (see the short NumPy sketch after this list). When you're analyzing attention patterns, understand that you're working with matrix operations.

    Calculus and optimization

      Solidify your understanding of gradients and the chain rule to build an intuition for backpropagation. 3Blue1Brown's calculus series provides an outstanding visual intuition.
      Gradients drive everything from basic training to RLHF reward modeling.
      You need to understand why certain optimization techniques work and how to debug when they don't.

    Probability and statistics

      Distributions, log-likelihood, and maximum likelihood estimators
      Random variables and the central limit theorem
      Khan Academy's statistics course offers solid fundamentals with practice problems
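
To connect the linear algebra back to interpretability, here's the short NumPy sketch promised above: the activation vector and "feature" direction are random placeholders, but the operation (a projection onto a direction) is exactly what's meant by a feature being a direction in activation space.

```python
# Tiny illustration of "features as directions in activation space".
# The vectors are random placeholders; in real interpretability work the
# activation would come from a model and the direction from e.g. a probe or SAE.
import numpy as np

rng = np.random.default_rng(0)
d_model = 768                                   # hidden size of a GPT-2-sized model
activation = rng.normal(size=d_model)           # one residual-stream activation
feature_direction = rng.normal(size=d_model)    # a candidate "feature" direction
feature_direction /= np.linalg.norm(feature_direction)

# How strongly does this activation express the feature? Just a dot product.
feature_score = activation @ feature_direction
cosine_sim = feature_score / np.linalg.norm(activation)
print(f"projection: {feature_score:.3f}, cosine similarity: {cosine_sim:.3f}")
```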

Timeline reality check: If you have a STEM background, you can probably refresh and connect these concepts in a couple of weeks. If you're starting from scratch, you'll probably need 6+ months. And don't feel bad if it takes longer! Math builds on itself, and rushing leads to shaky foundations. A good start is to try applying these concepts to small ML projects; many concepts will click when you see the math working in code.

The goal is to have a solid foundation so that when you read a technical paper or glance at a codebase, the mathematics doesn't block your understanding of the core ideas.

    Deep Learning Fundamentals

This is where theory meets practice. You need to understand not just how to use deep learning, but how it actually works under the hood.

Key milestones:

    Implement core architectures from scratch
      Build a fully connected MLP in NumPy, writing both forward and backward passes yourself. You will then truly understand what frameworks like PyTorch automate for you.
      (Optional bonus) Create a mini autodiff engine (scalar/tensor) to reinforce how gradient-based learning works.
    Master practical training with PyTorch
      Implement your MLP in PyTorch, then expand to a small CNN or simple RNN. Train on MNIST, focusing on identifying and debugging common issues like vanishing/exploding gradients, poor initialization, and overfitting.
      Learn how to implement mini-batch SGD and Adam, weight decay, dropout, batch/layer normalization, learning rate schedules, and gradient clipping.
    Deep dive into transformer internals (a minimal attention sketch follows this list)
    Train and analyze a toy language model
      Train and fine-tune a mini-GPT. This will help you narrow the gap between understanding transformers in theory and working with them in practice.
      Visualize attention maps, layer activations, and loss curves to build your interpretability intuition.
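
Here's the minimal attention sketch referenced in the milestones above: single-head, scaled dot-product attention with an optional causal mask. The shapes and names are illustrative, not taken from any particular codebase.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal: bool = True):
    """Single-head attention: q, k, v have shape (seq_len, d_head)."""
    scores = q @ k.T / math.sqrt(q.shape[-1])         # (seq, seq) similarity matrix
    if causal:                                        # mask future positions, GPT-style
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    pattern = torch.softmax(scores, dim=-1)           # attention pattern (rows sum to 1)
    return pattern @ v, pattern                       # weighted mix of values, plus the pattern

seq_len, d_head = 8, 16
q, k, v = (torch.randn(seq_len, d_head) for _ in range(3))
out, pattern = scaled_dot_product_attention(q, k, v)
print(out.shape, pattern.shape)    # torch.Size([8, 16]) torch.Size([8, 8])
```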

Self-assessment checkpoint: Can you explain what's happening in each layer of GPT-2? Can you implement attention from scratch? Can you diagnose why a training run is failing from loss curves and gradient norms?

Phase 2: AI Safety Domain Knowledge

This is where we transition from general ML knowledge to understanding what makes AI safety unique. Both alignment and interpretability share core concepts.

We should start with AI Safety Fundamentals. These free courses from BlueDot Impact have trained 4,000+ people, with alumni now working at Anthropic, OpenAI, and the UK's AI Safety Institute. Choose the Alignment track for technical safety approaches or the Governance track for policy focus. Applications open twice yearly, but you can access the course content at any time. However, taking the courses with a cohort provides valuable peer learning and networking opportunities that solo study can't match.

Core Fundamentals:

    Alignment Concepts:

      Constitutional AI: How to train models using explicit principles rather than human feedback alone
      RLHF: The standard method for training models on human preferences
      Reward hacking: When AI systems exploit objectives in unintended ways (a toy sketch follows below)
      Deceptive alignment: Models appearing aligned during training but pursuing different goals when deployed
      Circuit breakers: AI systems learning to interrupt their own dangerous outputs
      Alignment faking: Models strategically deceiving evaluators during training

    Interpretability Concepts:

      Features: Directions in a model's activation space that correspond to human-interpretable concepts
      Sparse autoencoders (SAEs): Tools for decomposing activations into more interpretable features
      Circuits: Chains of components inside the model that implement a specific behavior
      Activation patching: Swapping activations between runs to test which components causally matter for an output
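
To make "reward hacking" from the alignment concepts above concrete, here's a toy, entirely hypothetical sketch: an optimizer maximizes a misspecified proxy (answer length standing in for helpfulness) and ends up far from the true objective.

```python
# Toy reward hacking illustration: optimizing a misspecified proxy reward.
# Both objective functions are made up purely for illustration.
import numpy as np

def true_quality(answer_len: int) -> float:
    # Hypothetical "true" objective: quality peaks at a moderate length,
    # then degrades as the answer becomes padded and rambling.
    return -abs(answer_len - 50) / 50

def proxy_reward(answer_len: int) -> float:
    # Misspecified proxy: longer answers always score higher.
    return answer_len / 100

# Naive "policy": pick the answer length that maximizes the proxy reward.
candidate_lengths = np.arange(1, 500)
best_for_proxy = candidate_lengths[np.argmax([proxy_reward(l) for l in candidate_lengths])]

print(f"Length chosen by proxy-optimizer: {best_for_proxy}")
print(f"Proxy reward: {proxy_reward(best_for_proxy):.2f}")
print(f"True quality: {true_quality(best_for_proxy):.2f} (vs. {true_quality(50):.2f} at the true optimum)")
```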

Hands-On

For alignment-focused work: Start with red teaming, i.e. trying to break safety-trained models using creative prompts. Work through the StackLLaMA tutorial to understand RLHF's three phases: supervised fine-tuning, reward model training, and RL optimization. Practice creating constitutions (sets of principles) for specific domains.
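
As a hedged sketch of the reward-model-training phase, here's the standard pairwise preference loss; the tiny linear reward head and random "embeddings" are placeholders for the language-model-based reward models used in real pipelines like StackLLaMA.

```python
# Sketch of RLHF phase two: training a reward model on preference pairs.
# Real pipelines score full text with a language-model backbone; random
# "embeddings" stand in here so the snippet runs on its own.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d = 128
reward_model = nn.Linear(d, 1)                  # toy reward head
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

chosen = torch.randn(32, d)                     # embeddings of preferred responses
rejected = torch.randn(32, d)                   # embeddings of dispreferred responses

for step in range(100):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    # Bradley-Terry style objective: the preferred response should get a higher reward.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(f"final preference loss: {loss.item():.3f}")
```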

For interpretability-focused work: Explore pre-trained SAEs on Neuronpedia to see how features activate on different inputs. Work through Neel Nanda's TransformerLens exercises and ARENA tutorials to understand mechanistic interpretability basics. For example, implement activation patching, the technique for finding which model components causally matter for specific outputs.
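
And here's roughly what activation patching looks like in code, sketched with TransformerLens; the prompts, layer choice, and hook name follow the conventions in Nanda's tutorials, so treat the details as assumptions to verify against the ARENA exercises rather than a reference implementation.

```python
# Minimal activation-patching sketch with TransformerLens (API per Neel Nanda's
# tutorials; verify the details against the ARENA exercises).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

clean_tokens = model.to_tokens("The Eiffel Tower is in the city of")
corrupt_tokens = model.to_tokens("The Colosseum is in the city of")

# Cache activations from the clean prompt.
_, clean_cache = model.run_with_cache(clean_tokens)

layer = 8
hook_name = f"blocks.{layer}.hook_resid_pre"

def patch_resid(resid, hook):
    # Overwrite the final-token residual stream with the clean run's activation.
    resid[:, -1, :] = clean_cache[hook_name][:, -1, :]
    return resid

corrupt_logits = model(corrupt_tokens)
patched_logits = model.run_with_hooks(
    corrupt_tokens, fwd_hooks=[(hook_name, patch_resid)]
)

paris = model.to_single_token(" Paris")
print("corrupted logit for ' Paris':", corrupt_logits[0, -1, paris].item())
print("patched logit for ' Paris':  ", patched_logits[0, -1, paris].item())
```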

Community

Start reading and following posts on LessWrong, the AI Alignment Forum, and the Effective Altruism Forum, then gradually engage. Ask questions, share small experiments, comment thoughtfully on posts.

Self-assessment checkpoint: Can you explain both alignment and interpretability approaches to a friend? Can you run basic red teaming or activation patching experiments? Can you follow recent safety papers without getting lost? Can you explain the current open problems in the field?

Phase 3: Research & Portfolio Development

This phase transforms you from someone who understands AI safety to someone who can contribute to it.

Build in Public

GitHub portfolio: Document everything. Your learning journey becomes your credential. Include paper reproductions, training and fine-tuning experiments, and interpretability explorations like those described below.

Community contributions: Start answering questions on forums, writing summaries of papers you've read, or creating tutorials for concepts you've mastered. This demonstrates research communication skills.

Research Experience

Research isn’t just about discovering new insights, it’s about creating knowledge and sharing it effectively. Strong communication and collaboration skills are essential for advancing AI safety.

Key abilities include communicating results clearly, collaborating with other researchers, and presenting your work to both technical and non-technical audiences.

MATS program: The ML Alignment & Theory Scholars program in Berkeley offers a 10-week research mentorship. With more than 300 alumni and strong placement rates at top labs, it's become a primary pathway into the field. Applications typically open in September and February.

Independent projects: Design your own research projects. Reproduce interpretability papers, try novel visualization techniques, or investigate specific aspects of model behavior.

Collaboration opportunities: Look for researchers posting project ideas on the Alignment Forum or AI Safety Discord. Many established researchers welcome motivated collaborators.

Self-assessment checkpoint: Do you have concrete research outputs others can evaluate? Can you present your work clearly to both technical and non-technical audiences? Are people in the community starting to recognize your contributions?

Conclusion

Most importantly, you don't need to master everything before contributing. Many successful researchers started contributing while still learning. The field values people who can think clearly about important problems, regardless of their formal credentials.

Final Thoughts

As I write this, I'm still only one month into this journey, somewhere between Phase 1 and Phase 2. While the technical bar is high, the community actively wants newcomers to succeed, and there are so many resources and opportunities to help you make this transition.

The path from "fascinated observer" to "contributing researcher" is steep, but it's also more clearly marked than I expected. The resources exist, the community is welcoming, and the need is real. If you've read this far, you already have one of the most important requirements. You're taking the problem seriously.

Finally, I want to thank you for reading to the end. This is my first post on LessWrong, so I am very grateful! I'd welcome any challenges to my reasoning, additional resources I might have missed, or perspectives from those further along in their AI safety journey.

I would also like to thank Claude for helping me edit this post. All ideas, research, and conclusions are my own.


