This paper presents a new SBI algorithm that utilizes transformer architectures and score-based diffusion models. Unlike traditional approaches, it can estimate the posterior, the likelihood, and other arbitrary conditionals once trained. It also handles missing data, leverages known dependencies in the simulator, and performs well on common benchmarks.

Recent developments in simulation-based inference (SBI) have mainly focused on parameter estimation by approximating either the posterior distribution or the likelihood. The general idea is to use samples from the prior distribution and corresponding simulated data to train neural networks that learn a density in the parameters (for posterior estimation) or a density in the data (for likelihood estimation). Successful applications of this approach employed normalizing flows [Pap19S], score matching [Sha22S], or flow matching [Wil23F] (covered in our recent pills).

The Simformer (All-in-one simulation-based inference, [Glo24A]) takes a different approach. It uses a transformer architecture that is trained on the joint distribution $p(\theta, x) =: p(\hat{x})$, taking both parameters and data as input. It thereby encodes parameters and data and predicts the score function to perform inference via score-based diffusion models [Son19G].

Figure 1 ([Glo24A]): All-in-one simulation-based inference. Given a prior over parameters, a simulator, and observed data, the Simformer can infer the posterior, the likelihood, and other conditionals of parameters and data.

Once trained, this so-called Simformer has the following properties (Figure 1):

- it handles both parametric and non-parametric (i.e., $\infty$-dimensional) simulators,
- can take into account known dependencies from the simulator,
- allows sampling all conditionals from the joint distribution,
- and can be conditioned on intervals, i.e., ranges of data or parameter values.

These remarkable properties emerge from the design of the Simformer: a transformer architecture with an SBI-tailored tokenizer and attention masks, and the use of score matching and guided diffusion.

A tokenizer for SBI

Transformers process inputs in the form of uniformly sized vectors called tokens. [Glo24A] propose to construct each token from a unique identifier, its value, and a condition state (Figure 2, top left). The condition state is a mask $M_C \in \{0, 1\}^d$ that indicates whether the given input $(\theta, x) \in \mathbb{R}^d$ is currently observed (conditioned) or unobserved (latent). During training, the mask is resampled for every iteration so that the transformer learns the representations for both condition states. For example, setting $M_C^{(i)}=1$ for data and $M_C^{(i)}=0$ for parameters corresponds to training a conditional diffusion model of the posterior distribution (and vice versa for the likelihood). To handle different data and parameter modalities, the Simformer can rely on specialized embedding networks, as commonly used in SBI algorithms. Given the input embeddings defined by the tokenizer, the Simformer proceeds with the common transformer architecture composed of attention heads, normalization, and feed-forward layers (Figure 2, right). The output is an estimate of the score function to be used in the common score-based training scheme (see below).

Figure 2 ([Glo24A]): The Simformer architecture. The tokenizer encodes an id, a value, and a condition state for both parameters and data to enable arbitrary conditioning at training and inference time. The attention blocks of the transformer are constrained by attention masks that encode known dependencies between parameters and data.
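To make the tokenization more concrete, here is a minimal NumPy sketch (purely illustrative; the function name `build_tokens` and the embedding size are our own assumptions, not the authors' implementation) of how each variable could be mapped to a token consisting of an identifier embedding, its value, and a condition-state flag:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_tokens(values, id_embeddings, condition_mask):
    """Map each variable of the concatenated vector (theta, x) to a token:
    [identifier embedding | value | condition flag].

    values:         (d,) concatenated parameters and data
    id_embeddings:  (d, e) one embedding per variable identifier (learned in practice)
    condition_mask: (d,) 1 = observed (conditioned), 0 = latent
    """
    tokens = np.concatenate(
        [
            id_embeddings,              # which variable this token represents
            values[:, None],            # its current (possibly noisy) value
            condition_mask[:, None],    # observed vs. latent state
        ],
        axis=1,
    )
    return tokens  # shape (d, e + 2)

# Toy example: 2 parameters followed by 3 data dimensions.
d, e = 5, 8
id_embeddings = rng.normal(size=(d, e))
x_hat = rng.normal(size=d)  # stands in for one joint sample from p(theta, x)

# This particular mask conditions on the data only, which corresponds to
# learning the posterior p(theta | x); during training it would be redrawn
# at random in every iteration.
condition_mask = np.array([0, 0, 1, 1, 1])

tokens = build_tokens(x_hat, id_embeddings, condition_mask)
print(tokens.shape)  # (5, 10)
```

In an actual training loop, the identifier embeddings would be learned jointly with the transformer, and the condition mask would be redrawn in every iteration, so that a single model amortizes all conditionals.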
Modeling dependency structures with attention masks

Simulators are not always black boxes. For some simulators, practitioners may know about certain dependency structures between parameters and data. For example, a subset of parameters may be independent, and others may influence only specific data dimensions. The Simformer makes it possible to incorporate these dependency structures into the transformer architecture using attention masks $M_E$ [Wei23G] (Figure 2, lower left). The attention masks can be dynamically adapted at training and inference time, depending on the input values and condition state.

Training and inference

The Simformer is trained as a score-based diffusion model using denoising score matching, which is explained nicely in [Glo24A], sections 2.3 and 3.3. In essence, the transformer outputs an estimate of the score for a given attention and conditioning mask, $s_{\phi}^{M_E}(\hat{x}_t^{M_C}, t)$. The transformer parameters $\phi$ are optimized using the loss

$$ \ell(\phi, M_C, t, \hat{x}_0, \hat{x}_t) = (1 - M_C) \cdot \left( s_{\phi}^{M_E}(\hat{x}_t^{M_C}, t) - \nabla_{\hat{x}_t} \log p_t(\hat{x}_t \mid \hat{x}_0) \right)^2, $$

where $\hat{x}_0$ is a data sample, $\hat{x}_t$ is a noisy sample, $p_t(\hat{x}_t \mid \hat{x}_0)$ is defined through the tractable noise process, and the factor $(1 - M_C)$ ensures that the loss is applied only to the unobserved variables.
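To illustrate how such a masked denoising score-matching objective could be computed, here is a minimal NumPy sketch; the Gaussian noise kernel, the `score_net` interface, and the single-sample formulation are simplifying assumptions for illustration rather than the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(1)

def masked_dsm_loss(score_net, x0, condition_mask, t, alpha, sigma):
    """Single-sample sketch of a (1 - M_C)-weighted denoising score-matching
    loss, assuming a Gaussian noise kernel p_t(x_t | x_0) = N(alpha(t) * x_0,
    sigma(t)^2 * I).  The exact noise schedule in the paper may differ.
    """
    a, s = alpha(t), sigma(t)
    x_t = a * x0 + s * rng.normal(size=x0.shape)

    # Conditioned (observed) variables keep their clean values as input.
    x_t = np.where(condition_mask == 1, x0, x_t)

    # Score of the tractable noise kernel -- the regression target.
    target = -(x_t - a * x0) / s**2

    # The network sees the (partially noisy) values, the mask, and the time.
    estimate = score_net(x_t, condition_mask, t)

    # Only the unobserved (latent) variables contribute to the loss.
    weight = 1.0 - condition_mask
    return np.sum(weight * (estimate - target) ** 2)

# Toy usage with a stand-in "score network" and noise schedule.
x0 = rng.normal(size=5)                       # one joint sample (theta, x)
mask = np.array([0, 0, 1, 1, 1])              # condition on the data part
dummy_score_net = lambda x, m, t: -x          # placeholder for the transformer
alpha = lambda t: np.exp(-0.5 * t)
sigma = lambda t: np.sqrt(1.0 - np.exp(-t))
print(masked_dsm_loss(dummy_score_net, x0, mask, t=0.3, alpha=alpha, sigma=sigma))
```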
After training, the Simformer allows sampling from arbitrary conditionals, e.g., the posterior $p(\theta \mid x)$ or the likelihood $p(x \mid \theta)$, by masking the input accordingly. Samples are generated by integrating the reverse diffusion process over all unobserved variables while holding the observed variables constant.

Observation intervals via guided diffusion

The Simformer makes it possible to condition not only on fixed observed values but also on observation intervals, e.g., to obtain the posterior distribution given a range of possible data values. This is achieved via guided diffusion with arbitrary functions [Ban23U], a technique to sample from a diffusion model given additional context $y$. See paper section 3.4 and appendix A2.2 for details.

Figure 3 ([Glo24A], Figure 4): Benchmarking results. Left: comparison between the Simformer and normalizing flow-based neural posterior estimation (NPE) on four common benchmarking tasks. Right: Simformer performance for arbitrary conditionals on additional benchmarking tasks. Performance is measured as C2ST accuracy between estimated and reference posterior samples; 0.5 is best.

Results

[Glo24A] evaluate their Simformer approach on four common and two new SBI benchmarking tasks against the popular SBI algorithm neural posterior estimation (NPE). Additionally, they demonstrate the capabilities of their approach on two time-series models (Lotka-Volterra and SIRD) and on a popular simulator from neuroscience (the Hodgkin-Huxley model).

Benchmarking

On the benchmarking tasks, the Simformer outperforms NPE in terms of posterior accuracy, as measured by classifier two-sample test accuracy (C2ST, Figure 3, left column). Notably, by exploiting known dependencies in the benchmarking tasks, the Simformer can be substantially more simulation efficient than NPE (Figure 3, orange and dark red lines). Additionally, the Simformer performed well in terms of C2ST when challenged to estimate arbitrary posterior conditionals (Figure 3, right column).

Time-series data

Figure 4 ([Glo24A], Figure 5): Lotka-Volterra model demonstration. The Simformer allows obtaining both posterior predictive samples and posterior samples simultaneously when conditioned on a small set of unstructured observations (panel a). Predictive and parameter uncertainty decreases with more observations (panel b). Posterior and data samples are accurate (panel c).

For the application to the Lotka-Volterra model, the Simformer was trained on the full time-series data and then queried to perform inference given only a small set of unstructured observations (Figure 4). Because of its ability to generate samples given arbitrary conditionals, it is possible to condition the Simformer on only the four observations and to obtain posterior samples (Figure 4, panel a, right) and posterior predictive samples (Figure 4, panel a, left) simultaneously and without running the simulator. When given more observations, the posterior and posterior predictive uncertainties decrease, as expected (Figure 4, panel b).

The authors further use the SIRD model to demonstrate how to perform inference with infinite-dimensional parameters (section 4.3), and the Hodgkin-Huxley model to show the use of guided diffusion to perform inference given observation intervals (section 4.4).

Limitations

Despite its remarkable properties of arbitrary conditioning, simulation efficiency, and exploitation of dependency structures, the Simformer also comes with limitations. Compared to normalizing flow-based architectures (like NPE), it inherits the limitation of diffusion models that sampling and evaluation require solving the reverse SDE. Additionally, it requires substantially more memory and computational resources during training than normalizing flows.

An implementation of the Simformer and code to reproduce the paper results is available on GitHub. It has not been incorporated into one of the common SBI software packages yet.
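To make the conditional sampling procedure (and the sampling cost mentioned under limitations) concrete, the following minimal NumPy sketch integrates a reverse-time SDE with Euler-Maruyama steps while holding the observed variables fixed; the variance-preserving schedule, step count, and `score_net` interface are illustrative assumptions, not the repository's API:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_conditional(score_net, x_obs, condition_mask, n_steps=500, beta=1.0):
    """Sketch of conditional sampling: integrate the reverse-time SDE of a
    variance-preserving diffusion with Euler-Maruyama steps, updating only
    the latent variables and holding observed ones fixed at their values.
    """
    d = condition_mask.shape[0]
    latent = condition_mask == 0

    x = rng.normal(size=d)        # start from the reference distribution at t = 1
    x[~latent] = x_obs[~latent]   # pin the observed variables
    dt = 1.0 / n_steps

    for i in range(n_steps, 0, -1):
        t = i * dt
        score = score_net(x, condition_mask, t)
        # Reverse-time drift f(x, t) - g(t)^2 * score, with f = -0.5 * beta * x
        # and g = sqrt(beta); one Euler-Maruyama step backwards in time.
        drift = -0.5 * beta * x - beta * score
        x_next = x - drift * dt + np.sqrt(beta * dt) * rng.normal(size=d)
        x = np.where(latent, x_next, x)   # observed variables stay constant
    return x

# Toy usage with a stand-in score network; a real run would use the trained
# Simformer, e.g., conditioned on a handful of Lotka-Volterra observations.
d = 5
mask = np.array([0, 0, 1, 1, 1])          # observe the last three entries
x_obs = np.zeros(d)
x_obs[2:] = [0.4, -1.2, 0.7]
dummy_score_net = lambda x, m, t: -x      # score of a standard normal
print(sample_conditional(dummy_score_net, x_obs, mask))
```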