MarkTechPost@AI · April 29, 14:50
The WAVLab Team Releases VERSA: A Comprehensive and Versatile Evaluation Toolkit for Assessing Speech, Audio, and Music Signals

Researchers from Carnegie Mellon University and other institutions have introduced VERSA, a Python toolkit for evaluating speech, audio, and music generation. VERSA integrates 65 evaluation metrics, offers 729 configurable metric variants, and supports multiple file formats. Metrics are controlled through unified YAML configuration files, simplifying the evaluation workflow. VERSA performs strongly in benchmark comparisons, supporting independent, dependent, and distributional evaluation modes and exceeding the coverage of existing tools. The toolkit aims to reduce subjective variability, improve comparability, and raise research efficiency by consolidating diverse evaluation methods, thereby advancing generative sound technology. VERSA has been released publicly on GitHub and is intended to become a foundational tool for evaluating sound generation tasks.

🛠️ VERSA is a Python toolkit that integrates 65 evaluation metrics and offers 729 configurable metric variants for assessing the quality of generated speech, audio, and music signals.

💽 VERSA supports multiple file formats, including PCM, FLAC, MP3, and Kaldi-ARK, and provides unified YAML configuration files to simplify metric selection and setup.

🎼 VERSA covers 54 metrics applicable to speech tasks, 22 for general audio, and 22 for music generation, offering unprecedented flexibility.

📊 VERSA includes two core scripts, 'scorer.py' and 'aggregate_result.py', which handle metric computation and result aggregation respectively, streamlining evaluation and report generation.

🔗 VERSA supports evaluation with matching and non-matching audio references, text transcriptions, and visual cues, making it suitable for multimodal generative evaluation scenarios.

AI models have made remarkable strides in generating speech, music, and other forms of audio content, expanding possibilities across communication, entertainment, and human-computer interaction. The ability to create human-like audio through deep generative models is no longer a futuristic ambition but a tangible reality that is impacting industries today. However, as these models grow more sophisticated, the need for rigorous, scalable, and objective evaluation systems becomes critical. Evaluating the quality of generated audio is complex because it involves not only measuring signal accuracy but also assessing perceptual aspects such as naturalness, emotion, speaker identity, and musical creativity. Traditional evaluation practices, such as human subjective assessments, are time-consuming, expensive, and prone to psychological biases, making automated audio evaluation methods a necessity for advancing research and applications.

One persistent challenge in automated audio evaluation lies in the diversity and inconsistency of existing methods. Human evaluations, despite being a gold standard, suffer from biases such as range-equalizing effects and require significant labor and expert knowledge, particularly in nuanced areas like singing synthesis or emotional expression. Automatic metrics have filled this gap, but they vary widely depending on the application scenario, such as speech enhancement, speech synthesis, or music generation. Moreover, there is no universally adopted set of metrics or standardized framework, leading to scattered efforts and incomparable results across different systems. Without unified evaluation practices, it becomes increasingly difficult to benchmark the performance of audio generative models and track genuine progress in the field.

Existing tools and methods each cover only parts of the problem. Toolkits like ESPnet and SHEET offer evaluation modules, but focus heavily on speech processing, providing limited coverage for music or mixed audio tasks. AudioLDM-Eval, Stable-Audio-Metric, and Sony Audio-Metrics attempt broader audio evaluations but still suffer from fragmented metric support and inflexible configurations. Metrics such as Mean Opinion Score (MOS), PESQ (Perceptual Evaluation of Speech Quality), SI-SNR (Scale-Invariant Signal-to-Noise Ratio), and Fréchet Audio Distance (FAD) are widely used; however, most tools implement only a handful of these measures. Also, reliance on external references, whether matching or non-matching audio, text transcriptions, or visual cues, varies significantly between tools. Centralizing and standardizing these evaluations in a flexible and scalable toolkit has remained an unmet need until now.
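For concreteness, SI-SNR, one of the metrics named above, can be computed in a few lines of NumPy. The sketch below is an illustrative implementation of the standard scale-invariant SNR definition, not code taken from VERSA or any of the toolkits discussed here.

```python
import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-Invariant Signal-to-Noise Ratio (in dB) between an estimated
    and a reference waveform of equal length."""
    # Remove DC offsets so the metric is invariant to constant shifts.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the "target" component;
    # everything left over is treated as noise.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

# Example: a clean sine tone versus a lightly corrupted copy of it.
t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.05 * np.random.randn(len(t))
print(f"SI-SNR: {si_snr(noisy, clean):.2f} dB")
```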

Researchers from Carnegie Mellon University, Microsoft, Indiana University, Nanyang Technological University, the University of Rochester, Renmin University of China, Shanghai Jiaotong University, and Sony AI introduced VERSA, a new evaluation toolkit. VERSA stands out by offering a Python-based, modular toolkit that integrates 65 evaluation metrics, leading to 729 configurable metric variants. It uniquely supports speech, audio, and music evaluation within a single framework, a feature that no prior toolkit has comprehensively achieved. VERSA also emphasizes flexible configuration and strict dependency control, allowing easy adaptation to different evaluation needs without incurring software conflicts. Released publicly via GitHub, VERSA aims to become a foundational tool for benchmarking sound generation tasks, thereby making a significant contribution to the research and engineering communities.

The VERSA system is organized around two core scripts: ‘scorer.py’ and ‘aggregate_result.py’. The ‘scorer.py’ handles the actual computation of metrics, while ‘aggregate_result.py’ consolidates metric outputs into comprehensive evaluation reports. Input and output interfaces are designed to support a range of formats, including PCM, FLAC, MP3, and Kaldi-ARK, accommodating various file organizations from wav.scp mappings to simple directory structures. Metrics are controlled through unified YAML-style configuration files, allowing users to select metrics from a master list (general.yaml) or create specialized setups for individual metrics (e.g., mcd_f0.yaml for Mel Cepstral Distortion evaluation). To further simplify usability, VERSA ensures minimal default dependencies while providing optional installation scripts for metrics that require additional packages. Local forks of external evaluation libraries are incorporated, ensuring flexibility without strict version locking, enhancing both usability and system robustness.
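To make the configuration-driven workflow concrete, here is a minimal sketch of how a user might assemble a metric list and call the scorer from Python. The metric names, YAML schema, and command-line flags shown are illustrative assumptions; the authoritative formats are the general.yaml and per-metric configuration files (such as mcd_f0.yaml) shipped with the VERSA repository.

```python
import subprocess
import yaml  # PyYAML

# Hypothetical metric list in the style of VERSA's YAML configs
# (the real schema is defined by general.yaml in the repository).
config = [
    {"name": "mcd_f0"},        # Mel Cepstral Distortion / F0 metrics (assumed entry)
    {"name": "pesq"},          # reference-based speech quality (assumed entry)
    {"name": "signal_metric"}, # placeholder for an independent metric (assumed entry)
]

with open("my_metrics.yaml", "w") as f:
    yaml.safe_dump(config, f)

# Invoke the scorer; the flag names below are illustrative assumptions,
# so check the repository's documentation for the actual interface.
subprocess.run(
    [
        "python", "scorer.py",
        "--score_config", "my_metrics.yaml",
        "--pred", "generated_wav.scp",   # generated audio (wav.scp mapping or directory)
        "--gt", "reference_wav.scp",     # matching references, if the chosen metrics need them
        "--output_file", "result.jsonl",
    ],
    check=True,
)
```

The per-utterance scores written by the scorer can then be passed to 'aggregate_result.py' to produce the consolidated report described above.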

When benchmarked against existing solutions, VERSA outperforms them significantly. It supports 22 independent metrics that do not require reference audio, 25 dependent metrics based on matching references, 11 metrics that rely on non-matching references, and five distributional metrics for evaluating generative models. For instance, independent metrics such as SI-SNR and VAD (Voice Activity Detection) are supported, alongside dependent metrics like PESQ and STOI (Short-Time Objective Intelligibility). The toolkit covers 54 metrics applicable to speech tasks, 22 to general audio, and 22 to music generation, offering unprecedented flexibility. Notably, VERSA supports evaluation using external resources, such as textual captions and visual cues, making it suitable for multimodal generative evaluation scenarios. Compared to other toolkits, such as AudioCraft (which supports only six metrics) or Amphion (15 metrics), VERSA offers unmatched breadth and depth.
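The difference between the metric categories is easiest to see in code. The toy functions below contrast a reference-dependent measure (a log-spectral distance, which needs a matching reference) with a reference-free one (a crude energy-based activity ratio); they are generic, textbook-style illustrations of the two call patterns, not VERSA's implementations.

```python
import numpy as np

def log_spectral_distance(ref, gen, n_fft=512, hop=128, eps=1e-10):
    """Dependent metric: compares the generated signal against a matching
    reference frame by frame in the log-magnitude spectral domain (in dB)."""
    def stft_mag(x):
        frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
        return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))
    r, g = stft_mag(ref), stft_mag(gen)
    n = min(len(r), len(g))
    diff = 20 * np.log10((r[:n] + eps) / (g[:n] + eps))
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))

def active_frame_ratio(gen, frame=400, rel_threshold=0.05):
    """Independent metric: a rough, reference-free activity estimate
    computed only from the generated signal's frame energies."""
    frames = gen[: len(gen) // frame * frame].reshape(-1, frame)
    energy = (frames ** 2).mean(axis=1)
    return float((energy > rel_threshold * energy.max()).mean())

t = np.linspace(0, 1, 16000, endpoint=False)
reference = np.sin(2 * np.pi * 220 * t)
generated = reference + 0.05 * np.random.randn(len(t))
print("LSD (needs a reference):", round(log_spectral_distance(reference, generated), 2))
print("Active-frame ratio (no reference):", active_frame_ratio(generated))
```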

The research demonstrates that VERSA enables consistent benchmarking by minimizing subjective variability, improving comparability by providing a unified metric set, and enhancing research efficiency by consolidating diverse evaluation methods into a single platform. By offering more than 700 metric variants simply through configuration adjustments, researchers no longer have to piece together different evaluation methods from multiple fragmented tools. This consistency in evaluation fosters reproducibility and fair comparisons, both of which are critical for tracking advancements in generative sound technologies.

Several Key Takeaways from the Research on VERSA include:

- VERSA integrates 65 evaluation metrics, expandable to 729 configurable variants, in a single Python toolkit.
- It covers speech, general audio, and music generation in one framework, with 54, 22, and 22 applicable metrics respectively.
- Metrics span independent, matching-reference, non-matching-reference, and distributional categories, and can draw on text and visual cues for multimodal evaluation.
- Unified YAML configuration, minimal default dependencies, and two core scripts ('scorer.py' and 'aggregate_result.py') keep the evaluation workflow simple and reproducible.
- The toolkit is publicly available on GitHub and is intended as a foundation for benchmarking sound generation systems.


Check out the Paper, Demo on Hugging Face, and GitHub Page.

The post The WAVLab Team Releases VERSA: A Comprehensive and Versatile Evaluation Toolkit for Assessing Speech, Audio, and Music Signals appeared first on MarkTechPost.
