MarkTechPost@AI · September 3, 2024
WavTokenizer: A Breakthrough Acoustic Codec Model Redefining Audio Compression


Large-scale language models have made significant progress in generative tasks involving multi-speaker speech synthesis, music generation, and audio generation. Integrating the speech modality into unified multimodal large models has also become popular, as seen in models like SpeechGPT and AnyGPT. These advancements rest largely on the discrete acoustic codec representations produced by neural codec models, yet bridging the gap between continuous speech and token-based language models remains challenging. While current acoustic codec models offer good reconstruction quality, there is still room for improvement in areas such as high compression rates and semantic depth.

Existing methods address the challenges of acoustic codec models along three main lines. The first improves reconstruction quality, with AudioDec demonstrating the importance of discriminators and DAC raising quality through techniques such as quantizer dropout. The second pursues stronger compression, with HiFi-Codec's parallel GRVQ structure and Language-Codec's MCRVQ mechanism both achieving good performance with fewer quantizers. The third aims to deepen understanding of the codec space, with TiCodec modeling time-independent and time-dependent information separately and FACodec disentangling content, style, and acoustic details.
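To make the trade-off between multi-quantizer and single-quantizer codecs concrete, here is a minimal NumPy sketch of residual vector quantization (RVQ), the building block these codecs share. The codebook count, size, and latent dimension are purely illustrative, not the actual configurations of WavTokenizer, DAC, or HiFi-Codec.

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Quantize one latent frame with residual vector quantization:
    each stage quantizes the residual left over by the previous stage."""
    residual = frame.copy()
    indices = []
    for cb in codebooks:                        # one codebook per quantizer layer
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))             # nearest code in this stage
        indices.append(idx)
        residual = residual - cb[idx]           # pass the residual downstream
    return indices                              # one token per quantizer layer

def rvq_decode(indices, codebooks):
    """Reconstruct the frame by summing the selected codes."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

rng = np.random.default_rng(0)
dim, codebook_size = 8, 256                     # illustrative sizes only
frame = rng.normal(size=dim)

# An 8-stage RVQ emits 8 tokens per latent frame; a single-quantizer codec
# in the spirit of WavTokenizer emits only 1 token per frame.
codebooks_8 = [rng.normal(size=(codebook_size, dim)) for _ in range(8)]
codebooks_1 = [rng.normal(size=(codebook_size, dim))]

print(len(rvq_encode(frame, codebooks_8)))      # 8 tokens per frame
print(len(rvq_encode(frame, codebooks_1)))      # 1 token per frame
```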

A team from Zhejiang University, Alibaba Group, and Meta's Fundamental AI Research has proposed WavTokenizer, a novel acoustic codec model that offers significant advantages over previous state-of-the-art models in the audio domain. WavTokenizer achieves extreme compression by reducing the number of quantizer layers and the temporal dimension of the discrete codec, needing only 40 or 75 tokens for one second of 24 kHz audio. Its design also includes a broader VQ space, extended contextual windows, improved attention networks, a powerful multi-scale discriminator, and an inverse Fourier transform structure. It demonstrates strong performance across domains such as speech, audio, and music.
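Fewer tokens per second translate directly into a lower bitrate, since each token carries log2(codebook size) bits. The short calculation below shows the arithmetic; the 4096-entry codebook is an assumed, illustrative size rather than a confirmed WavTokenizer hyperparameter.

```python
import math

def codec_bitrate(tokens_per_second: int, codebook_size: int) -> float:
    """Bits per second for a single-codebook discrete codec:
    each token encodes log2(codebook_size) bits."""
    return tokens_per_second * math.log2(codebook_size)

# Codebook size is an assumed illustrative value, not a reported hyperparameter.
for tps in (40, 75):
    print(f"{tps} tokens/s -> {codec_bitrate(tps, 4096):.0f} bps")
# 40 tokens/s -> 480 bps, 75 tokens/s -> 900 bps
```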

The architecture of WavTokenizer is designed for unified modeling across domains such as multilingual speech, music, and general audio. Its large version is trained on approximately 80,000 hours of data drawn from datasets including LibriTTS, VCTK, and CommonVoice. The medium version uses a 5,000-hour subset, while the small version is trained on 585 hours of LibriTTS data. WavTokenizer's performance is evaluated against state-of-the-art codec models using the official weight files of frameworks such as Encodec and HiFi-Codec. Training is carried out on NVIDIA A800 80 GB GPUs with 24 kHz input samples, and the model is optimized with the AdamW optimizer using specific learning-rate and decay settings.
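The training details are reported only at a high level, so the following PyTorch sketch merely illustrates the general setup: a placeholder convolutional model optimized with AdamW on 24 kHz waveforms. The model, learning rate, weight decay, and schedule are assumptions for illustration, not WavTokenizer's actual architecture or hyperparameters.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder stack standing in for the codec's encoder/decoder; the
# optimizer settings below are illustrative defaults, not reported values.
model = torch.nn.Sequential(
    torch.nn.Conv1d(1, 32, kernel_size=7, padding=3),
    torch.nn.GELU(),
    torch.nn.Conv1d(32, 1, kernel_size=7, padding=3),
)

optimizer = AdamW(model.parameters(), lr=2e-4, weight_decay=1e-2)
scheduler = CosineAnnealingLR(optimizer, T_max=1000)

waveform = torch.randn(4, 1, 24_000)          # a batch of 1-second clips at 24 kHz
for step in range(10):
    reconstruction = model(waveform)
    loss = torch.nn.functional.l1_loss(reconstruction, waveform)  # toy reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```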

The results demonstrate the strong performance of WavTokenizer across datasets and metrics. WavTokenizer-small outperforms the state-of-the-art DAC model by 0.15 on UTMOS, a metric that closely tracks human perception of audio quality, on the LibriTTS test-clean subset. Moreover, using only 40 or 75 tokens, the model outperforms DAC's 100-token configuration across all metrics, proving its effectiveness for audio reconstruction with a single quantizer. On objective metrics such as STOI, PESQ, and F1 score, WavTokenizer performs comparably to Vocos with 4 quantizers and SpeechTokenizer with 8 quantizers.
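For context, PESQ and STOI are intrusive metrics computed from a reference/reconstruction pair. The sketch below shows how they are typically obtained with the open-source pesq and pystoi packages; UTMOS, which relies on a pretrained MOS-prediction model, is omitted. This is a generic evaluation sketch, not the paper's evaluation code.

```python
import numpy as np
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

def objective_scores(reference: np.ndarray, reconstruction: np.ndarray,
                     sr: int = 16_000) -> dict:
    """Score a (reference, reconstruction) pair with standard intrusive metrics.
    PESQ wideband mode expects 16 kHz input, so 24 kHz codec output would be
    resampled to 16 kHz before scoring in practice."""
    return {
        "pesq_wb": pesq(sr, reference, reconstruction, "wb"),         # perceptual quality
        "stoi": stoi(reference, reconstruction, sr, extended=False),  # intelligibility
    }

# Usage sketch: `reference` and `reconstruction` would be real speech arrays
# loaded from disk (e.g. with soundfile) and resampled to 16 kHz; random
# noise is not a valid PESQ input.
```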

In conclusion, WavTokenizer represents a significant advance in acoustic codec models, quantizing one second of speech, music, or audio into just 75 or 40 high-quality tokens. The model achieves results comparable to existing models on the LibriTTS test-clean dataset while offering extreme compression. The team conducted a comprehensive analysis of the design motivations behind the VQ space and the decoder, and validated the importance of each new module through ablation studies. These findings suggest that WavTokenizer has the potential to reshape audio compression and reconstruction across various domains, and the researchers plan further work to solidify its position as a cutting-edge solution among acoustic codec models.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
