This paper presents a new SBI algorithm that utilizes transformer architectures and score-based diffusion models. Unlike traditional approaches, it can estimate the posterior, the likelihood, and other arbitrary conditionals once trained. It also handles missing data, leverages known dependencies in the simulator, and performs well on common benchmarks.

Recent developments in simulation-based inference (SBI) have mainly focused on parameter estimation by approximating either the posterior distribution or the likelihood. The general idea is to use samples from the prior distribution and corresponding simulated data to train neural networks that learn a density in the parameters (for posterior estimation) or a density in the data (for likelihood estimation). Successful applications of this approach employed normalizing flows [Pap19S], score matching [Sha22S], or flow matching [Wil23F] (covered in our recent pills).

The Simformer (All-in-one simulation-based inference, [Glo24A]) takes a different approach. It uses a transformer architecture that is trained on the joint distribution $p(\theta, x) =: p(\hat{x})$, taking both parameters and data as input. It thereby encodes parameters and data and predicts the score function to perform inference via score-based diffusion models [Son19G].

Figure 1 ([Glo24A]): All-in-one simulation-based inference. Given a prior over parameters, a simulator, and observed data, the Simformer can infer the posterior, the likelihood, and other conditionals of parameters and data.

Once trained, this so-called Simformer has the following properties (Figure 1):

- it handles both parametric and non-parametric (i.e., $\infty$-dimensional) simulators,
- can take into account known dependencies from the simulator,
- allows sampling all conditionals from the joint distribution,
- and can be conditioned on intervals, i.e., ranges of data or parameter values.

These remarkable properties emerge from the design of the Simformer: a transformer architecture with an SBI-tailored tokenizer and attention masks, and the use of score matching and guided diffusion.

A tokenizer for SBI

Transformers process inputs in the form of uniformly sized vectors called tokens. [Glo24A] propose to construct each token from a unique identifier, its value, and a condition state (Figure 2, top left). The condition state is a mask $M_C \in \{0, 1\}^d$ that indicates whether the given input $(\theta, x) \in \mathbb{R}^d$ is currently observed (conditioned) or unobserved (latent). During training, the mask is resampled for every iteration so that the transformer learns the representations for both condition states. For example, setting $M_C^{(i)}=1$ for data and $M_C^{(i)}=0$ for parameters corresponds to training a conditional diffusion model of the posterior distribution (and vice versa for the likelihood). To handle different data and parameter modalities, the Simformer can rely on specialized embedding networks, as commonly used in SBI algorithms. Given the input embeddings defined by the tokenizer, the Simformer proceeds with the common transformer architecture composed of attention heads, normalization, and feed-forward layers (Figure 2, right). The output is an estimate of the score function to be used in the common score-based training scheme (see below).

Figure 2 ([Glo24A]): The Simformer architecture. The tokenizer encodes an id, a value, and a condition state for both parameters and data to enable arbitrary conditioning at training and inference time. The attention blocks of the transformer are constrained by attention masks that encode known dependencies between parameters and data.
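To make the tokenization more concrete, here is a minimal NumPy sketch (purely illustrative; the function name `build_tokens` and the embedding size are our own assumptions, not the authors' implementation) of how each variable could be mapped to a token consisting of an identifier embedding, its value, and a condition-state flag:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_tokens(values, id_embeddings, condition_mask):
    """Map each variable of the concatenated vector (theta, x) to a token:
    [identifier embedding | value | condition flag].

    values:         (d,) concatenated parameters and data
    id_embeddings:  (d, e) one embedding per variable identifier (learned in practice)
    condition_mask: (d,) 1 = observed (conditioned), 0 = latent
    """
    tokens = np.concatenate(
        [
            id_embeddings,              # which variable this token represents
            values[:, None],            # its current (possibly noisy) value
            condition_mask[:, None],    # observed vs. latent state
        ],
        axis=1,
    )
    return tokens  # shape (d, e + 2)

# Toy example: 2 parameters followed by 3 data dimensions.
d, e = 5, 8
id_embeddings = rng.normal(size=(d, e))
x_hat = rng.normal(size=d)  # stands in for one joint sample from p(theta, x)

# This particular mask conditions on the data only, which corresponds to
# learning the posterior p(theta | x); during training it would be redrawn
# at random in every iteration.
condition_mask = np.array([0, 0, 1, 1, 1])

tokens = build_tokens(x_hat, id_embeddings, condition_mask)
print(tokens.shape)  # (5, 10)
```

In an actual training loop, the identifier embeddings would be learned jointly with the transformer, and the condition mask would be redrawn in every iteration, so that a single model amortizes all conditionals.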
Modeling dependency structures with attention masks

Simulators are not always black boxes. For some simulators, practitioners may know about certain dependency structures between parameters and data. For example, a subset of parameters may be independent, and others may influence only specific data dimensions. The Simformer makes it possible to incorporate these dependency structures into the transformer architecture using attention masks $M_E$ [Wei23G] (Figure 2, lower left). The attention masks can be dynamically adapted at training and inference time, depending on the input values and condition state.

Training and inference

The Simformer is trained as a score-based diffusion model using denoising score matching, which is explained nicely in [Glo24A], sections 2.3 and 3.3. In essence, the transformer outputs an estimate of the score for a given attention and conditioning mask, $s_{\phi}^{M_E}(\hat{x}_t^{M_C}, t)$. The transformer parameters $\phi$ are optimized using the loss

$$ \ell(\phi, M_C, t, \hat{x}_0, \hat{x}_t) = (1 - M_C) \cdot \left( s_{\phi}^{M_E}(\hat{x}_t^{M_C}, t) - \nabla_{\hat{x}_t} \log p_t(\hat{x}_t \mid \hat{x}_0) \right)^2, $$

where $\hat{x}_0$ is a data sample, $\hat{x}_t$ is a noisy sample, $p_t(\hat{x}_t \mid \hat{x}_0)$ is defined through the tractable noise process, and the factor $(1 - M_C)$ ensures that the loss is applied only to the unobserved variables.
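To illustrate how such a masked denoising score-matching objective could be computed, here is a minimal NumPy sketch; the Gaussian noise kernel, the `score_net` interface, and the single-sample formulation are simplifying assumptions for illustration rather than the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(1)

def masked_dsm_loss(score_net, x0, condition_mask, t, alpha, sigma):
    """Single-sample sketch of a (1 - M_C)-weighted denoising score-matching
    loss, assuming a Gaussian noise kernel p_t(x_t | x_0) = N(alpha(t) * x_0,
    sigma(t)^2 * I).  The exact noise schedule in the paper may differ.
    """
    a, s = alpha(t), sigma(t)
    x_t = a * x0 + s * rng.normal(size=x0.shape)

    # Conditioned (observed) variables keep their clean values as input.
    x_t = np.where(condition_mask == 1, x0, x_t)

    # Score of the tractable noise kernel -- the regression target.
    target = -(x_t - a * x0) / s**2

    # The network sees the (partially noisy) values, the mask, and the time.
    estimate = score_net(x_t, condition_mask, t)

    # Only the unobserved (latent) variables contribute to the loss.
    weight = 1.0 - condition_mask
    return np.sum(weight * (estimate - target) ** 2)

# Toy usage with a stand-in "score network" and noise schedule.
x0 = rng.normal(size=5)                       # one joint sample (theta, x)
mask = np.array([0, 0, 1, 1, 1])              # condition on the data part
dummy_score_net = lambda x, m, t: -x          # placeholder for the transformer
alpha = lambda t: np.exp(-0.5 * t)
sigma = lambda t: np.sqrt(1.0 - np.exp(-t))
print(masked_dsm_loss(dummy_score_net, x0, mask, t=0.3, alpha=alpha, sigma=sigma))
```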
After training, the Simformer allows sampling from arbitrary conditionals, e.g., the posterior $p(\theta \mid x)$ or the likelihood $p(x \mid \theta)$, by masking the input accordingly. Samples are generated by integrating the reverse diffusion process over all unobserved variables while holding the observed variables constant.

Observation intervals via guided diffusion

The Simformer makes it possible to condition not only on fixed observed values but also on observation intervals, e.g., to obtain the posterior distribution given a range of possible data values. This is achieved via guided diffusion with arbitrary functions [Ban23U], a technique to sample from a diffusion model given additional context $y$. See paper section 3.4 and appendix A2.2 for details.

Figure 3 ([Glo24A], Figure 4): Benchmarking results. Left: comparison between the Simformer and normalizing flow-based neural posterior estimation (NPE) on four common benchmarking tasks. Right: Simformer performance for arbitrary conditionals on additional benchmarking tasks. Performance is measured as C2ST accuracy between estimated and reference posterior samples; 0.5 is best.

Results

[Glo24A] evaluate their Simformer approach on four common and two new SBI benchmarking tasks against the popular SBI algorithm neural posterior estimation (NPE). Additionally, they demonstrate the capabilities of their approach on two time-series models (Lotka-Volterra and SIRD) and on a popular simulator from neuroscience (the Hodgkin-Huxley model).

Benchmarking

On the benchmarking tasks, the Simformer outperforms NPE in terms of posterior accuracy, as measured by classifier two-sample test accuracy (C2ST, Figure 3, left column). Notably, by exploiting known dependencies in the benchmarking tasks, the Simformer can be substantially more simulation efficient than NPE (Figure 3, orange and dark red lines). Additionally, the Simformer performed well in terms of C2ST when challenged to estimate arbitrary posterior conditionals (Figure 3, right column).

Time-series data

Figure 4 ([Glo24A], Figure 5): Lotka-Volterra model demonstration. The Simformer allows obtaining both posterior predictive samples and posterior samples simultaneously when conditioned on a small set of unstructured observations (panel a). Predictive and parameter uncertainty decreases with more observations (panel b). Posterior and data samples are accurate (panel c).

For the application to the Lotka-Volterra model, the Simformer was trained on the full time-series data and then queried to perform inference given only a small set of unstructured observations (Figure 4). Because of its ability to generate samples given arbitrary conditionals, it is possible to condition the Simformer on only the four observations and to obtain posterior samples (Figure 4, panel a, right) and posterior predictive samples (Figure 4, panel a, left) simultaneously and without running the simulator. When given more observations, the posterior and posterior predictive uncertainties decrease, as expected (Figure 4, panel b).

The authors further use the SIRD model to demonstrate how to perform inference with infinite-dimensional parameters (section 4.3), and the Hodgkin-Huxley model to show the use of guided diffusion to perform inference given observation intervals (section 4.4).

Limitations

Despite its remarkable properties of arbitrary conditioning, simulation efficiency, and exploitation of dependency structures, the Simformer also comes with limitations. Compared to normalizing flow-based architectures (like NPE), it inherits the limitation of diffusion models that sampling and evaluation require solving the reverse SDE. Additionally, it requires substantially more memory and computational resources during training than normalizing flows.

An implementation of the Simformer and code to reproduce the paper results is available on GitHub. It has not been incorporated into one of the common SBI software packages yet.
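To make the conditional sampling procedure (and the sampling cost mentioned under limitations) concrete, the following minimal NumPy sketch integrates a reverse-time SDE with Euler-Maruyama steps while holding the observed variables fixed; the variance-preserving schedule, step count, and `score_net` interface are illustrative assumptions, not the repository's API:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_conditional(score_net, x_obs, condition_mask, n_steps=500, beta=1.0):
    """Sketch of conditional sampling: integrate the reverse-time SDE of a
    variance-preserving diffusion with Euler-Maruyama steps, updating only
    the latent variables and holding observed ones fixed at their values.
    """
    d = condition_mask.shape[0]
    latent = condition_mask == 0

    x = rng.normal(size=d)        # start from the reference distribution at t = 1
    x[~latent] = x_obs[~latent]   # pin the observed variables
    dt = 1.0 / n_steps

    for i in range(n_steps, 0, -1):
        t = i * dt
        score = score_net(x, condition_mask, t)
        # Reverse-time drift f(x, t) - g(t)^2 * score, with f = -0.5 * beta * x
        # and g = sqrt(beta); one Euler-Maruyama step backwards in time.
        drift = -0.5 * beta * x - beta * score
        x_next = x - drift * dt + np.sqrt(beta * dt) * rng.normal(size=d)
        x = np.where(latent, x_next, x)   # observed variables stay constant
    return x

# Toy usage with a stand-in score network; a real run would use the trained
# Simformer, e.g., conditioned on a handful of Lotka-Volterra observations.
d = 5
mask = np.array([0, 0, 1, 1, 1])          # observe the last three entries
x_obs = np.zeros(d)
x_obs[2:] = [0.4, -1.2, 0.7]
dummy_score_net = lambda x, m, t: -x      # score of a standard normal
print(sample_conditional(dummy_score_net, x_obs, mask))
```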