MarkTechPost@AI · January 19
Google AI Introduces ZeroBAS: A Neural Method to Synthesize Binaural Audio from Monaural Audio Recordings and Positional Information without Training on Any Binaural Data

Google AI introduces ZeroBAS, a zero-shot neural method for mono-to-binaural speech synthesis that requires no binaural training data. The method uses geometric time warping (GTW) and amplitude scaling (AS) to generate an initial binaural signal from the source position, then refines it with a pretrained denoising vocoder. ZeroBAS generalizes well across diverse room conditions and matches or exceeds supervised methods in both objective and subjective evaluations, offering strong technical support for immersive experiences such as AR and VR.

👂 Geometric time warping (GTW): converts the monaural input into two channels by simulating the interaural time difference (ITD) between the ears, laying the foundation for binaural synthesis.

🔊 Amplitude scaling (AS): simulates the interaural level difference (ILD) based on the distance from the source to each ear, further enhancing the spatial realism of the binaural audio, especially for high-frequency sounds.

✨ Pretrained denoising vocoder: iteratively refines the GTW- and AS-processed signals with WaveFit, reducing audio artifacts and ensuring high-quality binaural output.

Humans possess an extraordinary ability to localize sound sources and interpret their environment using auditory cues, a phenomenon termed spatial hearing. This capability enables tasks such as identifying speakers in noisy settings or navigating complex environments. Emulating such auditory spatial perception is crucial for enhancing the immersive experience in technologies like augmented reality (AR) and virtual reality (VR). However, the transition from monaural (single-channel) to binaural (two-channel) audio synthesis—which captures spatial auditory effects—faces significant challenges, particularly due to the limited availability of multi-channel and positional audio data.

Traditional mono-to-binaural synthesis approaches often rely on digital signal processing (DSP) frameworks. These methods model auditory effects using components such as the head-related transfer function (HRTF), room impulse response (RIR), and ambient noise, typically treated as linear time-invariant (LTI) systems. Although DSP-based techniques are well-established and can generate realistic audio experiences, they fail to account for the nonlinear acoustic wave effects inherent in real-world sound propagation.
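Such an LTI pipeline amounts to convolving the mono signal with a per-ear impulse response. A minimal sketch in NumPy, using toy placeholder HRIRs (real systems would use measured head-related impulse responses of equal length):

```python
import numpy as np

def dsp_binaural(mono, hrir_left, hrir_right):
    """Render binaural audio by convolving a mono signal with
    left/right head-related impulse responses (an LTI model)."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])

# Toy example: the right-ear HRIR adds a 2-sample delay and attenuation.
mono = np.array([1.0, 0.5, 0.25])
hrir_left = np.array([1.0, 0.0, 0.0])
hrir_right = np.array([0.0, 0.0, 0.6])
out = dsp_binaural(mono, hrir_left, hrir_right)
```

Because convolution with fixed filters is linear and time-invariant by construction, this model cannot capture the nonlinear wave effects mentioned above, which is exactly the limitation the article points out.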

Supervised learning models have emerged as an alternative to DSP, leveraging neural networks to synthesize binaural audio. However, such models face two major limitations: first, the scarcity of position-annotated binaural datasets, and second, susceptibility to overfitting to specific acoustic environments, speaker characteristics, and training datasets. The need for specialized equipment for data collection further constrains these approaches, making supervised methods costly and less practical.

To address these challenges, researchers from Google have proposed ZeroBAS, a zero-shot neural method for mono-to-binaural speech synthesis that does not rely on binaural training data. This innovative approach employs parameter-free geometric time warping (GTW) and amplitude scaling (AS) techniques based on source position. These initial binaural signals are further refined using a pretrained denoising vocoder, yielding perceptually realistic binaural audio. Remarkably, ZeroBAS generalizes effectively across diverse room conditions, as demonstrated using the newly introduced TUT Mono-to-Binaural dataset, and achieves performance comparable to, or even better than, state-of-the-art supervised methods on out-of-distribution data.

The ZeroBAS framework comprises a three-stage architecture as follows:

In stage 1, geometric time warping (GTW) transforms the monaural input into two channels (left and right) by simulating interaural time differences (ITD) based on the relative positions of the sound source and the listener's ears. GTW computes the time delay for each ear channel, and the warped signals are then linearly interpolated to generate the initial binaural channels.

In stage 2, amplitude scaling (AS) enhances the spatial realism of the warped signals by simulating the interaural level difference (ILD) according to the inverse-square law. Human perception of sound spatiality relies on both ITD and ILD, with the latter dominant for high-frequency sounds. Using the Euclidean distances of the source from the left and right ears, the amplitudes of the two channels are scaled accordingly.

In stage 3, the warped and scaled signals are iteratively refined by a pretrained denoising vocoder, WaveFit. This vocoder leverages log-mel spectrogram features and denoising diffusion probabilistic models (DDPMs) to generate clean binaural waveforms. By applying the vocoder iteratively, the system mitigates acoustic artifacts and ensures high-quality binaural audio output.
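Stages 1 and 2 (WaveFit refinement omitted) can be sketched as follows. This approximates the fractional per-ear delays with whole samples rather than the paper's linear interpolation, and the geometry, sample rate, and ear positions are illustrative assumptions:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def gtw_as(mono, src, ear_left, ear_right, sr=48000):
    """Stage 1 (GTW): delay each channel by its source-to-ear travel
    time to simulate the ITD. Stage 2 (AS): scale each channel by
    1/distance^2 (inverse-square law) to simulate the ILD."""
    d_l = np.linalg.norm(np.array(src) - np.array(ear_left))
    d_r = np.linalg.norm(np.array(src) - np.array(ear_right))
    # GTW: integer-sample approximation of each ear's time delay.
    delay_l = int(round(d_l / SPEED_OF_SOUND * sr))
    delay_r = int(round(d_r / SPEED_OF_SOUND * sr))
    left = np.concatenate([np.zeros(delay_l), mono])
    right = np.concatenate([np.zeros(delay_r), mono])
    n = max(len(left), len(right))
    left = np.pad(left, (0, n - len(left)))
    right = np.pad(right, (0, n - len(right)))
    # AS: inverse-square amplitude scaling per ear.
    return np.stack([left / d_l**2, right / d_r**2])

# Source 1 m to the listener's left; ears 17 cm apart on the x-axis.
sig = np.ones(4)
out = gtw_as(sig, src=(-1.0, 0.0),
             ear_left=(-0.085, 0.0), ear_right=(0.085, 0.0))
```

With this geometry the left channel both starts earlier and is louder than the right, reproducing the ITD and ILD cues the two stages are designed to supply before the vocoder cleans up the artifacts of such a crude warp.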

ZeroBAS was evaluated on two datasets (results in Tables 1 and 2): the Binaural Speech dataset and the newly introduced TUT Mono-to-Binaural dataset. The latter was designed to test the generalization capabilities of mono-to-binaural synthesis methods in diverse acoustic environments. In objective evaluations, ZeroBAS demonstrated significant improvements over DSP baselines and approached the performance of supervised methods despite not being trained on binaural data. Notably, ZeroBAS achieved superior results on the out-of-distribution TUT dataset, highlighting its robustness across varied conditions.

Subjective evaluations further confirmed the efficacy of ZeroBAS. Mean Opinion Score (MOS) assessments showed that human listeners rated ZeroBAS’s outputs as slightly more natural than those of supervised methods. In MUSHRA evaluations, ZeroBAS achieved comparable spatial quality to supervised models, with listeners unable to discern statistically significant differences.

Even though this method is quite remarkable, it does have some limitations. ZeroBAS struggles to directly process phase information because the vocoder lacks positional conditioning, and it relies on general models instead of environment-specific ones. Despite these constraints, its ability to generalize effectively highlights the potential of zero-shot learning in binaural audio synthesis.

In conclusion, ZeroBAS offers a fascinating, room-agnostic approach to binaural speech synthesis that achieves perceptual quality comparable to supervised methods without requiring binaural training data. Its robust performance across diverse acoustic environments makes it a promising candidate for real-world applications in AR, VR, and immersive audio systems.


Check out the Paper and Details. All credit for this research goes to the researchers of this project.


