Model misspecification is a critical challenge in simulation-based inference (SBI), particularly in neural SBI, where methods rely on simulated data to train neural networks. These methods often assume that simulators accurately represent the true data-generating process, but in practice this assumption is frequently violated. Such discrepancies can result in observed data that are out-of-distribution relative to the simulations, leading to biased posterior distributions and unreliable inferences. This blog post reviews recent work on model misspecification in SBI, discussing its definitions, methods for detection and mitigation, and open challenges. The aim is to emphasize the importance of developing robust SBI methods that can accommodate the complexities of real-world applications.

## Introduction

Simulation-based inference (SBI) provides a powerful framework for applying Bayesian inference to study complex systems where direct likelihood computation is infeasible [Cra20F]. By using simulated data to approximate posterior distributions, SBI has found applications across diverse scientific fields, including neuroscience, physics, climate science, and epidemiology [Gon20T, Bre20S, Wat21M, Wit20S]. However, these methods often assume that the simulator is a faithful representation of the true data-generating process. In practice, this assumption is frequently violated, leading to model misspecification. In this blog post, we provide an overview of the currently available approaches to detect and mitigate model misspecification in SBI, and discuss open challenges.

In standard Bayesian inference, model misspecification can lead to biased or misleading posterior estimates. In neural SBI, however, the problem is particularly severe because the posterior or likelihood is approximated with neural networks trained on simulated data. Neural networks are known to produce arbitrarily incorrect predictions when probed with out-of-distribution (OOD) data [Sze14I], and with a misspecified simulator, the observed data $\mathbf{x}_o$ is effectively OOD relative to the training distribution. This can lead to highly unreliable posterior estimates, distorted uncertainty quantification, and incorrect scientific conclusions.

An illustrative example of model misspecification has been provided by Ward et al. (2022) [War22R] using a simplified version of the Susceptible, Infected, Recovered (SIR) model that is commonly used in epidemiology. This simulator estimates key parameters such as the infection rate $\beta$ and recovery rate $\gamma$, with observations summarized by metrics like the maximum number of infections, the timing of peak infection, and the autocorrelation of the infection counts. In this setting, misspecification is introduced through a delay in weekend infection counts, with cases shifted to the following Monday. This subtle mismatch between real and simulated data structures can lead to biased posterior estimates and unreliable uncertainty quantification in neural SBI [War22R, Can22I].
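To make this concrete, below is a minimal sketch of such a setup: a toy discrete-time SIR-style simulator with hand-crafted summary statistics, and a weekend-reporting corruption applied only to the "observed" data. This is not the exact model of [War22R]; all function names, parameter values, and the Poisson observation noise are illustrative assumptions.

```python
import numpy as np

def sir_simulator(beta, gamma, n_days=56, population=100_000, i0=10, rng=None):
    """Toy discrete-time SIR model returning noisy daily infection counts."""
    rng = np.random.default_rng(rng)
    s, i = population - i0, i0
    daily = []
    for _ in range(n_days):
        new_inf = beta * s * i / population            # new infections this day
        new_rec = gamma * i                            # new recoveries this day
        s, i = s - new_inf, i + new_inf - new_rec
        daily.append(rng.poisson(max(new_inf, 0.0)))   # Poisson reporting noise (assumption)
    return np.asarray(daily, dtype=float)

def weekend_delay(daily):
    """Hypothetical misspecification: weekend counts are reported the following Monday."""
    shifted = daily.copy()
    for day, count in enumerate(daily):
        dow = day % 7                                  # day 0 taken to be a Monday
        if dow in (5, 6):                              # Saturday or Sunday
            shifted[day] = 0.0
            monday = day + (7 - dow)
            if monday < len(shifted):
                shifted[monday] += count
    return shifted

def summaries(daily):
    """Summary statistics mentioned above: peak height, peak timing, lag-1 autocorrelation."""
    ac = np.corrcoef(daily[:-1], daily[1:])[0, 1] if daily.std() > 0 else 0.0
    return np.array([daily.max(), float(np.argmax(daily)), ac])

# The inference network only ever sees clean simulator outputs, while the
# "observed" data carry the reporting artifact, making them out-of-distribution.
x_s = summaries(sir_simulator(beta=0.4, gamma=0.1, rng=0))
x_o = summaries(weekend_delay(sir_simulator(beta=0.4, gamma=0.1, rng=1)))
```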
The sensitivity of neural networks to OOD data underscores the importance of robust diagnostics, and addressing model misspecification is crucial for ensuring the reliability of SBI in real-world applications. Below, we comment on the definition of model misspecification in the context of SBI, review recent methods to detect and mitigate its effects, and outline open challenges for future research.

## Defining Model Misspecification

Model misspecification occurs when the assumptions of the model do not align with the true data-generating process, leading to unreliable inferences. In Bayesian inference, this problem arises when the true data-generating process cannot be captured within the family of distributions defined by the model. Walker (2013) provides a foundational definition [Wal13B]:

> A statistical model $p(\mathbf{x}_s | \theta)$ that relates a parameter of interest $\theta \in \Theta$ to a conditional distribution over simulated observations $\mathbf{x}_s$ is said to be misspecified if the true data-generating process $p(\mathbf{x}_o)$ of the real observations $\mathbf{x}_o \sim p(\mathbf{x}_o)$ does not belong to the family of distributions $\{p(\mathbf{x}_s | \theta); \theta \in \Theta\}$.

This structural definition provides a theoretical basis for understanding model misspecification but does not fully address its practical implications in SBI workflows.

## Model Misspecification in SBI

SBI is particularly sensitive to model misspecification because the model is defined through a simulator, and inference relies entirely on simulator-generated data. Unlike classical Bayesian inference, where the likelihood function is explicit, simulators in SBI may introduce subtle discrepancies that propagate through the inference pipeline, resulting in biased posterior estimates.

### Model Misspecification in Approximate Bayesian Computation

The issue of model misspecification in SBI was first systematically addressed by Frazier et al. (2020) [Fra19M] in the context of Approximate Bayesian Computation (ABC, [Sis18H]). The general approach of ABC is to obtain approximate posterior samples by comparing simulated and observed data using a distance metric and accepting only those parameters that generate simulations very close to the observed data. When the data is high-dimensional, it is common to use hand-crafted or learned summary statistics. Under misspecification, however, the ABC posterior does not concentrate on the true parameters but instead on "pseudo-true" parameters that minimize the discrepancy between simulated and observed summary statistics. This leads to biased posteriors and unreliable credible intervals. The choice of summary statistics is central to this problem, as they determine how well simulated data align with observed data. While foundational for understanding misspecification, ABC's reliance on handcrafted summary statistics limits its relevance to neural SBI methods, which use neural networks for feature extraction.
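To illustrate the rejection scheme described above, here is a minimal, generic sketch of rejection ABC. The `simulator`, `summaries`, and `prior_sample` callables are user-supplied placeholders, and the tolerance `epsilon` is an illustrative choice.

```python
import numpy as np

def rejection_abc(s_obs, simulator, summaries, prior_sample,
                  n_draws=100_000, epsilon=0.5, rng=None):
    """Accept parameters whose simulated summaries lie within `epsilon`
    (Euclidean distance) of the observed summaries `s_obs`."""
    rng = np.random.default_rng(rng)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)                   # theta ~ p(theta)
        s_sim = summaries(simulator(theta, rng))    # simulate and summarize
        if np.linalg.norm(s_sim - s_obs) < epsilon:
            accepted.append(theta)
    return np.asarray(accepted)                     # approximate posterior samples

# Under misspecification no parameter reproduces s_obs well, so the accepted
# samples concentrate around "pseudo-true" parameters rather than the true ones.
```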
### Model Misspecification in Neural SBI

Neural SBI methods eliminate the need for manually chosen summary statistics by using neural networks to approximate posterior distributions (or likelihoods or likelihood ratios) based on simulations. A popular neural SBI method is neural posterior estimation (NPE, [Pap16F]), where a neural network is used to learn a parametric approximation of the posterior distribution (e.g., a mixture of Gaussians, a normalizing flow, or a diffusion model) using simulated data. However, this flexibility introduces new vulnerabilities: neural networks trained on simulations can fail catastrophically when applied to observed data that lie outside the training distribution. This issue has been systematically studied by Cannon et al. (2022) in the context of neural SBI [Can22I].

However, before we dive into the methods to mitigate misspecification in SBI, let us note that there are at least three different sources of inaccuracies in the neural SBI workflow:

- Misspecification of the Simulator: The true data-generating process does not belong to the family of distributions induced by the simulator. This corresponds to the classical Bayesian notion of misspecification described by Walker (2013). For example, if a simulator lacks the capacity to model key features of the observed data, the resulting posterior may fail to capture the true parameter values accurately.
- Misspecification of the Prior: Misspecification can also occur when the prior used in the inference process does not incorporate the "true parameter" underlying the data-generating process. Prior mismatch can distort posterior estimates, leading to inferences that reflect artifacts of the assumed prior rather than the true underlying process.
- Errors in the Inference Procedure: Even if the simulator and prior are correctly specified, the inference algorithm itself may introduce errors, such as systematically biased posteriors or uncalibrated uncertainty estimates, e.g., due to underfitting or overfitting during neural-network training.

The third case reflects a general challenge in neural SBI. Efforts to address these issues include calibration tests such as simulation-based calibration [Tal20V], expected coverage diagnostics [Dei22T, Mil21T], and classifier-based calibration tests [Zha21D, Lin24L]. These tools focus on validating posterior accuracy and uncertainty quantification, and usually assume that the prior and the simulator are well-specified.

The second case, prior misspecification, is a general challenge in Bayesian inference and can be addressed with standard Bayesian workflow tools like prior predictive checks [Gel20B]. It has therefore received less attention in the SBI-specific literature, with only brief discussions in works like Wehenkel & Gamella et al. (2024) [Weh24A].

Thus, the primary focus of most work on model misspecification in the SBI literature is the first case, with the aim of detecting and mitigating simulator-related misspecification. In the remainder of this post, we give an overview of these approaches.

## Addressing Model Misspecification

Recent works have introduced a range of methods to address model misspecification in SBI. These approaches can be broadly categorized into four strategies: learning explicit mismatch models, detecting misspecification through learned summary statistics, learning misspecification-robust statistics, and aligning simulated and observed data using optimal transport. Each method has unique strengths and limitations, which we summarize below.

### Learning Explicit Misspecification Models

Figure 1 (adapted from [War22R]): Visualization of the robust neural posterior estimation (RNPE) framework.

Ward et al. (2022) [War22R] propose Robust Neural Posterior Estimation (RNPE), an extension of NPE, to address misspecification by explicitly modeling discrepancies between observed and simulated data. RNPE introduces an error model, $p(\mathbf{y} | \mathbf{x})$, where $\mathbf{y}$ represents observed data and $\mathbf{x}$ simulated data. This error model captures mismatches, enabling the "denoising" of observed data into latent variables $\mathbf{x}$ that are consistent with the simulator (Figure 1).

The method trains a standard NPE on simulated data while enabling its application to potentially misspecified observed data through a denoising step. This is achieved by combining a marginal density model $q(\mathbf{x})$ trained on simulated data with the explicitly assumed error model $p(\mathbf{y} | \mathbf{x})$. The error model is parametrized and trained alongside the NPE density estimator. Using Monte Carlo sampling, denoised latent variables $\mathbf{x}_m \sim p(\mathbf{x} | \mathbf{y})$ are obtained and used to approximate the posterior $p(\theta | \mathbf{x}_m)$.
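The final step, averaging the NPE posterior over denoised data, can be sketched as follows. The `denoise` and `npe_sample` callables stand in for RNPE's trained components (the MCMC-based denoiser built from the error model and the marginal $q(\mathbf{x})$, and the trained NPE, respectively); this shows only the Monte Carlo mixture, not the authors' implementation.

```python
import numpy as np

def rnpe_posterior_samples(y_obs, denoise, npe_sample,
                           n_denoised=200, n_per_draw=10, rng=None):
    """Monte Carlo approximation of the RNPE posterior:
    p(theta | y) ~= (1/M) * sum_m q_NPE(theta | x_m),  with x_m ~ p(x | y).

    `denoise(y, rng)` draws a simulator-consistent latent x from p(x | y);
    `npe_sample(x, n)` draws n samples from the trained NPE posterior q(theta | x).
    """
    rng = np.random.default_rng(rng)
    theta_draws = []
    for _ in range(n_denoised):
        x_m = denoise(y_obs, rng)                    # denoised latent data
        theta_draws.append(npe_sample(x_m, n_per_draw))
    return np.concatenate(theta_draws, axis=0)       # samples from the posterior mixture
```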
The results presented in [War22R] demonstrate that RNPE enables misspecification-robust NPE across three benchmarking tasks and an intractable example application. By explicitly modeling the error for each data dimension, the approach also facilitates model criticism, allowing practitioners to identify features in the data that are more likely to be misspecified. However, the method relies on selecting an appropriate error model, such as the "spike-and-slab" model, which may not generalize to all misspecification scenarios. Furthermore, the approach is computationally intensive, requiring additional inference steps, and is most effective in low-dimensional data spaces.

### Detecting Misspecification with Learned Summary Statistics

Figure 2 (adapted from [Sch24D]): Simulated data is used to train a neural network that maps into a latent space designed to detect misspecification. At inference time, the observed data is mapped into the latent space to detect misspecification.

Schmitt et al. (2024) [Sch24D] focus on detecting misspecification using learned summary statistics. Their method employs a summary network, $h_\psi(\mathbf{x})$, to encode both observed and simulated data into a structured summary space, typically following a multivariate Gaussian distribution (Figure 2). Discrepancies between distributions in this space are quantified using metrics like Maximum Mean Discrepancy (MMD), with significant divergences indicating misspecification.

The training procedure for this approach remains the same as in standard neural SBI methods, except for an additional MMD term in the NPE loss function:

$$\mathcal{L}_{\phi, \psi} = \mathcal{L}_{\text{inference}}(\phi) + \lambda \cdot \text{MMD}^2\left[ p(h_{\psi}(\mathbf{x})), \mathcal{N}(\mathbf{0}, \mathbb{I}) \right].$$

Intuitively, the additional MMD loss term encourages the embedding network to impose a Gaussian structure on the latent summary space, while not directly affecting the quality of the posterior estimation ensured by the standard NPE loss [Sch24D]. At inference time, the learned embedding network can then be used to detect misspecification for unseen, e.g., observed, data points.

This approach is adaptable to diverse data types and does not require explicit knowledge of the true data-generating process. Additionally, it is amortized, i.e., it can be applied to new observed data without re-training because the training does not depend on $\mathbf{x}_o$. However, its performance depends on the design of the summary network and the choice of divergence metric. While effective for detecting misspecification, it does not directly correct for it, instead providing insights for iterative simulator refinement.
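A minimal PyTorch sketch of this regularized objective is shown below; the summary network, the precomputed NPE loss term, and the single-bandwidth RBF kernel are illustrative assumptions rather than the exact setup of [Sch24D].

```python
import torch

def mmd2_rbf(a, b, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets `a` and `b` (RBF kernel)."""
    k = lambda u, v: torch.exp(-torch.cdist(u, v).pow(2) / (2 * bandwidth**2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

def detection_loss(npe_loss, summary_net, x_sim_batch, lam=1.0):
    """Standard NPE loss plus an MMD term pulling embedded simulations toward N(0, I)."""
    z = summary_net(x_sim_batch)     # h_psi(x) for a batch of simulations
    z_ref = torch.randn_like(z)      # reference samples from N(0, I)
    return npe_loss + lam * mmd2_rbf(z, z_ref)

# At test time, a large MMD between the embedded observed data and fresh N(0, I)
# samples, compared to its typical value on held-out simulations, flags
# potential misspecification.
```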
### Learning Misspecification-Robust Summary Statistics

Huang & Bharti et al. (2023) [Hua23L] propose a method for learning summary statistics that are both informative about the parameters and robust to misspecification. Their approach is similar to the detection approach above in that it extends the standard NPE loss with an MMD term. However, this term directly takes into account the embedded observed data and balances robustness to misspecification with informativeness [Hua23L]:

$$\mathcal{L} = \mathcal{L}_{\text{inference}} + \lambda \cdot \text{MMD}^2\left[ h_{\psi}(\mathbf{x}_s), h_{\psi}(\mathbf{x}_o) \right].$$

Here, $h_\psi$ represents the summary network, $\mathbf{x}_{s}$ and $\mathbf{x}_{o}$ are simulated and observed data, respectively, and $\lambda$ controls the trade-off between inference accuracy and robustness. Unlike the detection method above, this approach directly adjusts the summary network during training to mitigate the impact of misspecification on posterior estimation.

Benchmarking results presented in Huang & Bharti et al. (2023) demonstrate improved performance compared to the RNPE approach, with the additional advantage of applicability to high-dimensional data. However, the method has several limitations. The modified loss function introduces additional complexity, and its success depends on selecting appropriate divergence metrics and regularization parameters, which often require domain-specific tuning. Additionally, the learned embedding is tailored to a specific $\mathbf{x}_o$, so the method loses its ability to amortize over different observations. Furthermore, because robustness is learned implicitly during training and operates in the latent space, there is limited direct control over how and where misspecification is mitigated.
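The corresponding objective differs from the detection loss above only in the second MMD argument. A hedged PyTorch sketch, again with an illustrative single-bandwidth RBF kernel and assuming a batch of observed data points, might look as follows.

```python
import torch

def mmd2_rbf(a, b, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets `a` and `b` (RBF kernel)."""
    k = lambda u, v: torch.exp(-torch.cdist(u, v).pow(2) / (2 * bandwidth**2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

def robust_stats_loss(npe_loss, summary_net, x_sim_batch, x_obs_batch, lam=1.0):
    """Standard NPE loss plus an MMD penalty between embedded simulated and observed data.
    Because the penalty involves the observed data directly, the learned statistics
    become robust to their misspecification but are tied to this particular x_o."""
    z_sim = summary_net(x_sim_batch)   # h_psi(x_s)
    z_obs = summary_net(x_obs_batch)   # h_psi(x_o)
    return npe_loss + lam * mmd2_rbf(z_sim, z_obs)
```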
### Addressing Misspecification with Optimal Transport

Figure 3 (adapted from [Weh24A]): Visualization of ROPE; the top row shows the standard NPE approach of learning an embedding network and a posterior estimator. Additionally, a calibration set is used to fine-tune the embedding network for embedding observed real-world data, and to learn an optimal transport mapping. At inference time, the OT mapping is used to obtain a misspecification-robust posterior estimate as a weighted sum of NPE posteriors.

Wehenkel & Gamella et al. (2024) [Weh24A] propose a method called ROPE that combines NPE with optimal transport (OT) to address model misspecification. Their approach is designed for specific scenarios where a calibration set of real-world observations and their corresponding ground-truth parameter values is available. For instance, this may occur in expensive real-world experiments where ground-truth parameters can be measured, while a cheaper but misspecified simulator models only parts of the underlying processes. The calibration set is used to learn an optimal transport map $T$ that aligns simulated and observed data distributions.

The method begins by applying standard NPE to the simulated (misspecified) data to train an embedding network $h_\psi(\mathbf{x}_s)$ and a posterior estimator $q(\theta | \mathbf{x}_s)$. Next, the embedding network is fine-tuned on the labeled calibration set, resulting in a modified embedding network $h_\phi(\mathbf{x}_o)$ tailored to the observed data. This fine-tuned network ensures that embeddings of observed data align better with those of simulated data (Figure 3).

At inference time, a transport map $T$ is learned using OT, aligning the distributions of embedded simulated data $h_\psi(\mathbf{x}_s)$ and embedded observed data $h_\phi(\mathbf{x}_o)$. The resulting transport matrix $P^\star$ is then used to compute a mixture model for the desired real-world data posterior:

$$\tilde{p}(\theta | \mathbf{x}_o) = \sum_{j=1}^{N_s} \alpha_{ij} \, q(\theta | \mathbf{x}_s^j),$$

where $\alpha_{ij} = N_o P^\star_{ij}$ (with $i$ indexing the observation $\mathbf{x}_o$), $N_o$ is the size of the calibration set, and $\mathbf{x}_s^j$ are $N_s$ simulated samples generated by running the simulator on prior parameters $\theta_j \sim p(\theta)$. The weights $\alpha_{ij}$ from the OT solution combine the posteriors $q(\theta | \mathbf{x}_s^j)$, providing a robust posterior estimate for the observed data $\mathbf{x}_o$ (a sketch of this mixture step is given at the end of this subsection).

An interesting property of this approach is that as $N_s$, the number of simulated samples, grows, the mixture posterior $\tilde{p}(\theta | \mathbf{x}_o)$ approaches the prior $p(\theta)$. This underconfidence property provides a mechanism to ensure that posterior estimates remain conservative and avoid overconfidence in the presence of severe misspecification. However, this effect introduces a trade-off: while increasing $N_s$ improves robustness to misspecification, it also reduces the informativeness of the posterior, potentially leading to overly broad parameter estimates. Selecting $N_s$ appropriately is therefore crucial, as it balances reliability and uncertainty quantification against the ability to extract meaningful parameter constraints (see [Weh24A] for heuristics).

While conceptually elegant and flexible, this method relies on access to calibration data, i.e., observed data with known ground-truth parameters, which may not be available in fields like cosmology or neuroscience. This reliance on calibration data limits its applicability to specific use cases.
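Given the transport plan, the mixture step can be sketched as below; `P_star`, `npe_sample`, and `x_sim` are placeholders for the computed OT plan, the trained NPE, and the simulated dataset, and the resampling scheme is an illustrative choice rather than the authors' implementation.

```python
import numpy as np

def rope_mixture_samples(P_star, npe_sample, x_sim, i, n_samples=1_000, rng=None):
    """Sample from the ROPE mixture posterior for the i-th observed data point:
    p_tilde(theta | x_o) = sum_j alpha_ij * q(theta | x_s^j),  alpha_ij = N_o * P*_ij.

    `P_star` is the (N_o x N_s) optimal-transport plan between embedded observed
    and simulated data; `npe_sample(x, n)` draws n samples from the trained NPE.
    """
    rng = np.random.default_rng(rng)
    n_obs = P_star.shape[0]                      # N_o, size of the observed/calibration set
    alpha = n_obs * P_star[i]                    # mixture weights for observation i
    alpha = alpha / alpha.sum()                  # guard against numerical drift
    counts = rng.multinomial(n_samples, alpha)   # how many draws from each component
    draws = [npe_sample(x_sim[j], int(c)) for j, c in enumerate(counts) if c > 0]
    return np.concatenate(draws, axis=0)
```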
### Summary of Approaches

The methods discussed above tackle different facets of model misspecification in SBI, ranging from explicit error modeling to the development of robust summary statistics and the alignment of simulated and observed data distributions. While each approach demonstrates unique strengths, their applicability varies depending on the specific misspecification scenario, computational complexity, and the availability of calibration data.

However, the diversity of definitions, notations, and evaluation settings across these works highlights the need for a unified framework to define and compare methods. Similarly, the varying hyperparameter choices, methodological complexity, and absence of standardized benchmarks make it challenging for practitioners to navigate and apply these approaches effectively. These gaps motivate the need for better methods, common definitions, accessible benchmarks, and practical user guides, as we outline below.

## Open Challenges

The recent works outlined above have made significant progress in addressing model misspecification in SBI, introducing methods for detecting and mitigating its effects. However, the problem of model misspecification in SBI is far from fully resolved. While these methods offer valuable insights and tools, we highlight key challenges that need to be addressed to further advance the field:

- Better Methods for Detecting and Addressing Model Misspecification: While recent methods have improved our ability to diagnose and mitigate model misspecification, significant limitations remain. Many current techniques focus on specific aspects of misspecification, such as identifying discrepancies in summary statistics or aligning data distributions via optimal transport. However, these approaches often require additional modeling assumptions, computational overhead, or prior knowledge about the nature of the misspecification. A key challenge is to develop more flexible and scalable methods that can:
  - Detect misspecification in a principled and data-driven manner, without relying on predefined summary statistics or manual tuning.
  - Provide interpretable diagnostics that help practitioners understand the sources and consequences of misspecification in their models.
  - Offer robust mitigation strategies that work across different types of misspecification, without requiring large amounts of additional data or computationally expensive corrections.
- A Common and Precise Definition of Model Misspecification in SBI: As highlighted in this post, model misspecification in SBI can arise from different sources, including mismatches between the simulator and the true data-generating process, prior misspecification, and errors introduced by the inference procedure itself. A common and formally precise definition of these different cases is essential for unifying the field. Such a framework would provide clarity for researchers and practitioners, enabling a more systematic comparison of methods and their applicability to specific types of model misspecification.
- Common Benchmarking Tasks for Evaluating Methods: Another obstacle to progress in addressing model misspecification is the lack of an established set of benchmarking tasks tailored to the different cases of model misspecification. Current evaluations often focus on specific scenarios or datasets, limiting the generalizability of conclusions, but there are promising developments. For instance, Wehenkel & Gamella et al. [Weh24A] re-used tasks proposed by Ward et al. [War22R] and introduced several new tasks designed to probe different aspects of model misspecification. These efforts provide a valuable starting point, but they need to be integrated into a common benchmarking framework and made accessible through an open-source software platform. Such a framework would enable researchers to rigorously test new methods under a variety of realistic model misspecification conditions, facilitating fair comparisons and encouraging the development of approaches that are robust across diverse settings.
- Practical Guidelines for Detecting and Addressing Model Misspecification: For SBI to be widely adopted in practice, there is a need for clear guidelines or a practitioner's guide on how to detect and address model misspecification, e.g., similar to the Bayesian workflow introduced in [Gel20B]. Such a guide should include recommendations for diagnosing model misspecification using available tools, selecting appropriate mitigation methods, and interpreting posterior results under potential misspecification. This would help bridge the gap between theoretical advancements and real-world applications, ensuring that practitioners can confidently apply SBI methods in the presence of model misspecification.

Addressing these challenges will pave the way for more robust and practical SBI methods capable of handling model misspecification effectively. A unified framework, rigorous benchmarks, and practical guidelines will not only advance research on model misspecification but also simplify its handling in applied settings.
Together, these efforts will strengthen SBI as a reliable tool for scientific inference in complex and realistic scenarios.

## References

- [Cra20F] The frontier of simulation-based inference. Kyle Cranmer, Johann Brehmer, Gilles Louppe. Dec 2020.
- [Gon20T] Training deep neural density estimators to identify mechanistic models of neural dynamics. Pedro J Gonçalves, Jan-Matthis Lueckmann, Michael Deistler, Marcel Nonnenmacher, Kaan Öcal, Giacomo Bassetto, Chaitanya Chintaluri, William F Podlaski, Sara A Haddad, Tim P Vogels, David S Greenberg, Jakob H Macke. Sep 2020.
- [Bre20S] Simulation-Based Inference Methods for Particle Physics. Johann Brehmer, Kyle Cranmer. Dec 2020.
- [Wat21M] Model calibration using ESEm v1.1.0 – an open, scalable Earth system emulator. Duncan Watson-Parris, Andrew Williams, Lucia Deaconu, Philip Stier. Dec 2021.
- [Wit20S] Simulation-Based Inference for Global Health Decisions. Christian Schroeder Witt, Bradley Gram-Hansen, Nantas Nardelli, Andrew Gambardella, Rob Zinkov, Puneet Dokania, N. Siddharth, Ana Belen Espinosa-Gonzalez, Ara Darzi, Philip Torr, Atılım Güneş Baydin. May 2020.
- [Sze14I] Intriguing properties of neural networks. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus. 2014.
- [War22R] Robust Neural Posterior Estimation and Statistical Model Criticism. Daniel Ward, Patrick Cannon, Mark Beaumont, Matteo Fasiolo, Sebastian Schmon. Dec 2022.
- [Can22I] Investigating the Impact of Model Misspecification in Neural Simulation-based Inference. Patrick Cannon, Daniel Ward, Sebastian M. Schmon. Sep 2022.
- [Wal13B] Bayesian inference with misspecified models. Stephen G. Walker. Oct 2013.
- [Fra19M] Model Misspecification in ABC: Consequences and Diagnostics. David T. Frazier, Christian P. Robert, Judith Rousseau. Jul 2019.
- [Sis18H] Handbook of Approximate Bayesian Computation. Scott A. Sisson, Yanan Fan, Mark Beaumont. Sep 2018.
- [Pap16F] Fast ε-free Inference of Simulation Models with Bayesian Conditional Density Estimation. George Papamakarios, Iain Murray. 2016.
- [Tal20V] Validating Bayesian Inference Algorithms with Simulation-Based Calibration. Sean Talts, Michael Betancourt, Daniel Simpson, Aki Vehtari, Andrew Gelman. Oct 2020.
- [Dei22T] Truncated proposals for scalable and hassle-free simulation-based inference. Michael Deistler, Pedro J. Goncalves, Jakob H. Macke. Dec 2022.
- [Mil21T] Truncated Marginal Neural Ratio Estimation. Benjamin K Miller, Alex Cole, Patrick Forré, Gilles Louppe, Christoph Weniger. 2021.
- [Zha21D] Diagnostics for conditional density models and Bayesian inference algorithms. David Zhao, Niccolò Dalmasso, Rafael Izbicki, Ann B. Lee. Dec 2021.
- [Lin24L] L-C2ST: local diagnostics for posterior approximations in simulation-based inference. Julia Linhart, Alexandre Gramfort, Pedro L. C. Rodrigues. May 2024.
- [Gel20B] Bayesian Workflow. Andrew Gelman, Aki Vehtari, Daniel Simpson, Charles C. Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, Martin Modrák. Nov 2020.
- [Weh24A] Addressing Misspecification in Simulation-based Inference through Data-driven Calibration. Antoine Wehenkel, Juan L. Gamella, Ozan Sener, Jens Behrmann, Guillermo Sapiro, Marco Cuturi, Jörn-Henrik Jacobsen. May 2024.
- [Hua23L] Learning Robust Statistics for Simulation-based Inference under Model Misspecification. Daolang Huang, Ayush Bharti, Amauri Souza, Luigi Acerbi, Samuel Kaski. May 2023.
- [Sch24D] Detecting Model Misspecification in Amortized Bayesian Inference with Neural Networks: An Extended Investigation. Marvin Schmitt, Paul-Christian Bürkner, Ullrich Köthe, Stefan T. Radev. Jun 2024.