Published on July 23, 2025 9:41 PM GMT
Join us for the AI Safety x Physics Grand Challenge, a research hackathon designed to engage physicists in technical AI safety research. While we expect LessWrong community members with both technical AI safety and physics expertise to benefit most from this event, we encourage anyone interested in exploring this intersection to sign up.
Dates: July 25th to July 27th (this weekend)
Location: Remote, with in-person hubs in several locations
Prizes: $2,000 total prize money
Apart Research is running the hackathon, in collaboration with PIBBSS and Timaeus. Hackathon speakers include Jesse Hoogland (Timaeus), Paul Riechers (Simplex), and Dmitry Vaintrob (PIBBSS). Participants will get research support from mentors with expertise spanning the physics and AI safety space, including Martin Biehl, Jesse Hoogland, Daniel Kunin, Andrew Mack, Eric Michaud, Garrett Merz, Paul Riechers, Adam Scherlis, Alok Singh, and Dmitry Vaintrob.
Vision
In an effort to diversify the AI safety research landscape, we aim to leverage a physics perspective to explore novel approaches and identify blind spots in current work. In particular, we think this perspective could make significant progress in narrowing the currently large theory-practice gap in AI safety. Work in this direction is timely: there are signs that AI safety (interpretability especially) needs strong theory to support the wealth of empirical efforts that have so far been leading the field. We think that physics, which uses math but is not math, is our best bet for meeting this need.
Our Approach
As a scientific practice with strong theoretical foundations, physics has deep ties with other mathematically founded disciplines, including computer science. These fields progress largely in parallel, and we see high value in uncovering, re-expressing, and linking established ideas across disciplines and in new contexts. By connecting physicists with AI safety and ML experts, the goal of this hackathon is to identify and pull at the threads with the highest potential for impact in AI safety. Although there is a lot of good physics-for-AI literature out there, we take a predominantly ‘problem first’ approach, to avoid restricting solutions to specific physics fields, methods, and tools. We’re excited to reframe these old perspectives and find new ones!
We offer two different ways for people to get involved:
Project Track. This is a typical hackathon submission based on the starter materials or an original idea. We expect most of these to provide incremental theoretical progress and/or empirical evidence for (or against) an existing idea. This track is best suited to higher-context participants (i.e., those with experience running experiments on NNs, or those with an existing physics-AI agenda they want to accelerate progress on).
Exploration Track. Part of the idea for this hackathon is to open up and test the limits of a ‘physics for AI safety’ opportunity space. We have thought of some promising directions, but there are likely many more that would be exciting. Instead of traditional ‘hacking’, this will be a written submission, for example:
- A proposal, outlining a novel physics-based solution to a key AI safety problem, with supporting material drawn from prior work.
- A distillation or literature review bridging a CS/ML/AI idea with one from physics, with clear reference to an AI safety problem area.
We think this track has the potential to surface some truly innovative ideas, while also allowing lower-context participants to get involved. For example, an ‘exploration’ participant or group could partner with a ‘project’ group, resulting in a small research gain and a better understanding of the bigger picture.
Core Premise
Physicists are particularly good at:
- Shifting between different methodologies (theory, experiment, simulation)
- Balancing theoretical rigor with messy, real-world empirics
- Identifying ‘universal’ patterns in complex systems and understanding their underlying assumptions
- Multi-scale analysis, in particular knowing when a description should be reductive or emergent (scale separated)
- Building ‘toy’ models and abstracting useful information from them
- Handling uncertainty (in a statistical/probabilistic sense, but also intuitively)
However, there are a few traps we could fall into, so it will be important to keep the following in mind:
- What kinds of AI safety questions can physics answer? Perhaps more importantly, what kinds of questions can physics not answer? These can run the gamut from ‘What kinds of explanations can scaling laws provide?’ to ‘Are there fundamental limits to our enumeration/understanding of potential risks of AI systems?’
- When should physics intuition support assumptions about AI systems, and when should it work to upend or improve them?
- When are cows really spherical? For example, when can we trust linear or Gaussian approximations in NNs? Similarly, when does a toy model no longer apply?
- Terms that are meaningful in physics – like universality, causality, and locality – may not cleanly map onto AI systems. We should be careful of conflating an analogy with universal law.
Problem Space
As a preliminary guide for participants, we’ve organized our view of promising general directions within the intersection of physics and AI safety into the following five problem areas (which are by no means comprehensive or fully distinct). Ideas outside of this list are welcome, as long as they speak to both an AI safety problem and a physics-based solution.
We took overall inspiration for open problems from Open Philanthropy’s Technical AI Safety (TAIS) RFP, the recent Open Problems in Mechanistic Interpretability review, and the research questions laid out in Foundational Challenges in Assuring Alignment and Safety of Large Language Models.
Problem Area 1: Bridging the Theory/Practice Gap for AI Interpretability
The crux: Neural networks are complex systems with broad and flexible representational power. Mechanistic interpretability researchers are actively pursuing methods to reverse-engineer a trained network by decomposing it into human-interpretable parts (for example, methods based on sparse dictionary learning). To date, these methods have met with mixed success, with a central issue being their lack of solid theoretical and conceptual foundations. Can physicists bridge the theory/practice gap? How can theoretical analyses of inductive biases, feature learning, and optimization in neural networks help us create better interpretability tools?
Problem Description:
It’s important to distinguish between ‘interpretability tools’ and ‘interpretability’ in general, and take a broader view (beyond ‘mechanistic’) in defining the latter. In particular, we wish to include more theoretical directions that may be seen as ‘computational interpretability’ (i.e., tracking belief states rather than neural circuits), ‘developmental interpretability’ (i.e., monitoring degeneracies in the learning landscape), or those built on more speculative techniques like renormalization. While mechanistic interpretability focuses on identifying which internal components work together to implement specific behaviors, these directions can help identify the kinds of explanations that could robustly describe and predict these behaviors. In particular, they could help us reach consensus on:
- What counts as a mechanism or feature
- How to distinguish real structure from phenomenological artifacts
- A ‘model natural’ framework for interpreting models in terms of their own internal abstractions, not just human categories
Connection to Physics:
Physicists are historically good at closing the theory-practice gap, finding structure in messy data, and getting rid of irrelevant degrees of freedom at different levels of abstraction. They have experience tracking degrees of freedom, model parameters, and observables in controlled experiments. In NNs, this is essential for homing in on precise causal relationships among network internals.
Example Research Directions: extracting predictive states from neural representations using computational mechanics; renormalization group techniques for coarse-graining features; testing the natural abstraction hypothesis; formal models of superposition; detecting feature phase transitions with singular learning theory; information bottleneck approaches; geometric properties of sparse autoencoder latent spaces.
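As a concrete entry point into the sparse-dictionary-learning and SAE-geometry directions above, here is a minimal sketch of a sparse autoencoder of the kind used in that line of work. This is an illustrative sketch rather than part of the starter materials; the layer sizes and L1 coefficient are arbitrary placeholders, not recommended settings.

```python
# Minimal sparse autoencoder for dictionary-learning interpretability work.
# Illustrative only: sizes and the L1 coefficient are arbitrary placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activations -> dictionary coefficients
        self.decoder = nn.Linear(d_dict, d_model)  # dictionary coefficients -> reconstruction

    def forward(self, x: torch.Tensor):
        codes = torch.relu(self.encoder(x))  # non-negative, hopefully sparse codes
        recon = self.decoder(codes)
        return recon, codes

def sae_loss(x, recon, codes, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse codes.
    return ((x - recon) ** 2).mean() + l1_coeff * codes.abs().mean()

# Toy usage on random 'activations' standing in for a model's residual stream.
x = torch.randn(64, 512)
sae = SparseAutoencoder(d_model=512, d_dict=4096)
recon, codes = sae(x)
loss = sae_loss(x, recon, codes)
loss.backward()
```

Questions about the geometry of the learned dictionary (feature directions, clustering, superposition) can then be asked directly of the decoder weights and the code statistics.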
Problem Area 2: Learning, Inductive Biases, and Generalization
The crux: Large models often generalize far beyond the regimes where traditional statistical learning theory applies. They memorize noisy data, then suddenly grok patterns. They learn from context at inference time without parameter updates. These behaviors suggest the presence of strong inductive biases—implicit assumptions that shape learning dynamics and generalization—but we don’t yet know how to characterize or control them. Physicists, with their fluency in emergent behavior, dualities, and phase transitions, could offer the theoretical tools to explain why models generalize the way they do—and how we might shape that generalization.
Problem Description:
We want to understand how architecture, initialization, data distribution, and training procedures give rise to inductive biases—and in turn, how these biases control what is learned, when generalization occurs, and why models may fail.
This includes:
- Generalization phenomena in overparameterized regimes (e.g., grokking, double descent)
- In-context learning, where models appear to learn from inference-time data but the mechanism is poorly understood
- Inductive biases from architecture, such as attention, equivariance, or modular structure
- Memorization vs. generalization tradeoffs and phase transitions
- Feature vs. kernel regimes, and how to interpolate between them
Our goal is to move beyond empirical curve-fitting and develop physically grounded, predictive models of the learning process—especially those that clarify when generalization will succeed, fail, or change character.
Connection to Physics:
In a sense, physics is the science of inductive bias. It tells us that the universe favors certain configurations over others—symmetric over asymmetric, low-energy over high-energy, local over nonlocal—and builds theories to explain why. Physicists are trained to ask: What regularities are baked into a system? What constraints guide its evolution? Which degrees of freedom matter, and which can be ignored? This makes them well-suited to reasoning about how generalization arises from structural and statistical biases hidden in the architecture, data, and training process.
Example Research Directions: unbalanced initialization and rapid representation learning; new models of grokking and phase transitions; comparing the inductive bias of physics-informed architectures with standard models; generalization in Bayesian statistical learning and the "Bayes quartet"; concept emergence under physical priors like sparsity and locality; applying non-standard analysis to analyze scaling regimes and learning coefficients.
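To make the grokking direction concrete, the standard toy setting is modular arithmetic with a held-out fraction of the table. The sketch below (illustrative only; the modulus and train fraction are placeholder choices) generates such a dataset:

```python
# Toy dataset for grokking experiments: learn (a + b) mod p from a fraction
# of the addition table and watch test accuracy long after train accuracy
# saturates. Modulus and train fraction below are placeholder choices.
import numpy as np

def modular_addition_split(p: int = 97, train_frac: float = 0.3, seed: int = 0):
    rng = np.random.default_rng(seed)
    pairs = np.array([(a, b) for a in range(p) for b in range(p)])
    labels = (pairs[:, 0] + pairs[:, 1]) % p
    idx = rng.permutation(len(pairs))
    n_train = int(train_frac * len(pairs))
    train, test = idx[:n_train], idx[n_train:]
    return (pairs[train], labels[train]), (pairs[test], labels[test])

(train_x, train_y), (test_x, test_y) = modular_addition_split()
# A small transformer or MLP trained on train_x/train_y typically memorizes
# quickly and only generalizes to test_x/test_y much later ('grokking').
```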
Problem Area 3: Mathematical Models of Data Structure
The crux: Intelligent systems that generalize well often do so by internalizing a model of the world—a latent structure that supports abstraction, prediction, and decision-making. Misalignment arises when this internal model – gleaned from the data – encodes goals, beliefs, or causal relationships incorrectly.
Problem Description: Can we develop useful mathematical models to describe the data structure internally represented by generally intelligent AI systems? How can methods from physics help us better understand abstraction, generalization, world modeling, and transfer learning in AI systems? We seek to:
- Identify when and how natural abstractions emerge from data
- Model how different inductive biases interact with the underlying data structure
- Understand how ontological shifts and representation instability relate to misalignment or deception
Connection to Physics: Physicists tackle a similar challenge every day: nature’s fundamental structure is unknown, but we can gradually uncover it by collecting and interpreting experimental data, guided by theory. To do so, physicists develop mathematical models that capture the essence of phenomena.
Example Research Directions: toy models of structured data-generating processes; investigating belief state geometry in transformers trained on sequential data; formal models of latent variables in natural data; dimensionality reduction in high-dimensional spaces; applying singular learning theory to understand learning degeneracies from structured data.
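One concrete handle on the belief-state direction is the computational-mechanics setup: sequences generated by a small hidden Markov model, together with the Bayesian belief states an ideal predictor would track. The sketch below is illustrative only; the transition and emission matrices are arbitrary placeholders.

```python
# Generate symbol sequences from a small hidden Markov model and compute the
# Bayesian belief state (posterior over hidden states) after each symbol.
# The transition and emission matrices are arbitrary placeholders.
import numpy as np

T = np.array([[0.9, 0.1],   # hidden-state transition probabilities
              [0.2, 0.8]])
E = np.array([[0.7, 0.3],   # P(symbol | hidden state)
              [0.1, 0.9]])

def sample_sequence(length: int, seed: int = 0):
    rng = np.random.default_rng(seed)
    state, symbols = rng.choice(2), []
    for _ in range(length):
        symbols.append(rng.choice(2, p=E[state]))
        state = rng.choice(2, p=T[state])
    return symbols

def belief_states(symbols, prior=np.array([0.5, 0.5])):
    # Recursive Bayesian filter: reweight by the likelihood of the observed
    # symbol, renormalize, then propagate through the transition matrix.
    belief, history = prior.copy(), []
    for s in symbols:
        belief = belief * E[:, s]
        belief /= belief.sum()
        history.append(belief.copy())  # posterior over the state that emitted s
        belief = belief @ T            # predictive distribution for the next state
    return np.array(history)

seq = sample_sequence(1000)
beliefs = belief_states(seq)
# The direction above asks whether a transformer trained to predict such
# sequences comes to encode something like `beliefs` in its activations.
```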
Problem Area 4: Quantitative Bounds on AI Behavior
The crux: A central goal of technical AI safety is to rigorously evaluate alignment – even in regimes where misbehavior might be rare, hard to anticipate or understand, or overlooked during training. But today’s systems offer little in the way of provable guarantees. What is it possible to know about AI systems, their behavior, and various asymptotic limits?
Problem Description:
To what extent can we predict quantitative bounds on the behavior of powerful AI systems? The end goal is not just safer training, but safety guarantees that hold even when capabilities exceed current testing regimes.
- Bounding model outputs in the presence of distributional shift or adversarial perturbations
- Quantifying rare but catastrophic behaviors (e.g., heavy-tailed distributions)
- Analyzing stability in multi-agent systems
- Formalizing unpredictability via dynamical systems analysis or chaos theory
- Connecting statistical uncertainty in model behavior to logical or agent-based reasoning frameworks
Connection to Physics:
Physicists routinely quantify the unknown, bounding system behavior under noise, drift, or instability. To support this, they have a good understanding of theoretical limits and how far from them the physical system of interest is.
Example Research Directions: analyzing stability in human-AI feedback loops using control theory; provable unpredictability and connections with chaos theory; understanding stability regimes in AI training including edge-of-stability phenomena; modeling rare events and separating signal from noise; probing scale separation for identifying safe asymptotic regions.
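As a toy illustration of the chaos-theory framing of unpredictability (a standard textbook example, not part of the starter materials), the sketch below estimates the largest Lyapunov exponent of the logistic map; a positive exponent puts a quantitative horizon on how far ahead a system's behavior can be predicted.

```python
# Estimate the largest Lyapunov exponent of the logistic map x -> r*x*(1-x).
# A positive exponent means nearby trajectories diverge exponentially,
# bounding how far ahead behavior can be predicted from imprecise knowledge.
import numpy as np

def lyapunov_logistic(r: float, x0: float = 0.2,
                      n_steps: int = 100_000, burn_in: int = 1_000):
    x, total = x0, 0.0
    for i in range(n_steps):
        x = r * x * (1.0 - x)
        if i >= burn_in:
            total += np.log(abs(r * (1.0 - 2.0 * x)))  # log |f'(x)|
    return total / (n_steps - burn_in)

print(lyapunov_logistic(3.5))  # negative: periodic, predictable regime
print(lyapunov_logistic(4.0))  # approximately ln(2) > 0: chaotic regime
```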
Problem Area 5: Scaling Laws, Phase Transitions, and Emergence
The crux: AI performance improves as a power law (in model size, dataset size, or compute) across a wide variety of modalities, architectures, and tasks. Yet emergent capabilities and sudden training behavior that depend on scale seem to defy simple extrapolation. Is there an end to the ‘bitter lesson’?
Problem Description:
While scaling laws are widely observed, they are still poorly understood. We want to model the limits of the scaling hypothesis, understand transitions in representation and capability, and develop toy models where emergence can be rigorously analyzed.
- Are observed power laws truly universal, or artifacts of particular model/data choices?
- How can apparently smooth scaling laws be reconciled with the emergence of discrete capabilities (e.g., grokking)?
Connection to Physics:
In many ways, physics is the science of scale. Critical phenomena, universality, renormalization, and scale separation all help to explain, characterize, and predict behavior in physical systems. Can similar models of data, training, and inference provide intuition and grounding for empirical scaling laws?
Example Research Directions: scaling laws from the data manifold dimension; scaling laws from discrete power-law-distributed skills; transfer learning phase diagrams in the low-data limit; controlled scaling experiments on synthetic data; models for transient capabilities and emergent in-context learning; non-standard analysis as a formal language for scale.
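To show what an empirical scaling-law analysis looks like in miniature, the sketch below fits the commonly used saturating power law L(N) = a * N^(-b) + c to synthetic losses across model sizes; the data and parameter values are made up for illustration.

```python
# Fit a saturating power law L(N) = a * N**(-b) + c to (synthetic) losses
# measured at several model sizes N. Real scaling-law studies fit such forms
# jointly in model size, data, and compute; this is just the 1-D version.
import numpy as np
from scipy.optimize import curve_fit

def power_law(N, a, b, c):
    return a * N ** (-b) + c

# Synthetic "measurements": true exponent 0.3, irreducible loss 1.5, plus noise.
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8, 1e9])
losses = power_law(sizes, a=50.0, b=0.3, c=1.5)
losses += np.random.default_rng(0).normal(0, 0.01, sizes.shape)

params, _ = curve_fit(power_law, sizes, losses, p0=[10.0, 0.5, 1.0], maxfev=10_000)
a_fit, b_fit, c_fit = params
print(f"fitted exponent b = {b_fit:.2f}, irreducible loss c = {c_fit:.2f}")
```

The interesting scientific questions start where this exercise ends: why these exponents take the values they do, and when the fitted form should be expected to break.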