MarkTechPost@AI, November 28, 2024
This AI Paper Introduces BEST-STD (Spoken Term Detection): A Novel Bidirectional Mamba-Enhanced Speech Tokenization Framework for Efficient Spoken Term Detection

 

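BEST-STD is a novel speech tokenization framework designed to address two challenges in traditional spoken term detection (STD): handling out-of-vocabulary (OOV) terms and high computational demands. The framework uses a bidirectional Mamba encoder to encode speech into discrete, speaker-agnostic semantic tokens and combines them with text-based retrieval algorithms for efficient spoken term retrieval. In evaluations on the LibriSpeech and TIMIT datasets, BEST-STD showed clear gains in token consistency and spoken content retrieval, along with good scalability and efficiency. The work is a step forward for spoken term detection and could improve the accessibility and searchability of audio data.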

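🤔 **Bidirectional Mamba encoder:** BEST-STD uses a bidirectional Mamba encoder that processes audio in both forward and backward directions to capture long-range dependencies, projects the audio into a high-dimensional embedding space, and discretizes it into token sequences through vector quantization. This encoding produces highly consistent token sequences that are robust to speaker and acoustic variation.

🚀 **Self-supervised learning:** BEST-STD uses dynamic time warping (DTW) to align different utterances of the same term and generate frame-level anchor-positive pairs. The model is then trained with self-supervised learning to produce consistent token representations, which improves its ability to generalize.

🔍 **Inverted index:** BEST-STD stores tokenized sequences in an inverted index and retrieves matches by comparing token similarity. This reduces reliance on computationally intensive DTW matching and makes the system more efficient on large datasets.

📊 **Performance gains:** On the LibriSpeech and TIMIT datasets, BEST-STD outperforms traditional STD methods and state-of-the-art tokenization models (such as HuBERT, WavLM, and SpeechTokenizer) in token consistency and in spoken content retrieval (MAP and MRR). The gains are especially pronounced for out-of-vocabulary terms.

💡 **Addressing the OOV challenge:** By generating speaker-agnostic semantic tokens, BEST-STD addresses the difficulty traditional STD methods have with out-of-vocabulary terms and adapts better to different datasets and term types.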

Spoken term detection (STD) is a critical area in speech processing, enabling the identification of specific phrases or terms in large audio archives. This technology is extensively used in voice-based searches, transcription services, and multimedia indexing applications. By facilitating the retrieval of spoken content, STD plays a pivotal role in improving the accessibility and usability of audio data, especially in domains like podcasts, lectures, and broadcast media.

A significant challenge in spoken term detection is the effective handling of out-of-vocabulary (OOV) terms and the computational demands of existing systems. Traditional methods often depend on automatic speech recognition (ASR) systems, which are resource-intensive and prone to errors, particularly for short-duration audio segments or under variable acoustic conditions. Further, these methods struggle to segment continuous speech accurately, which makes it difficult to identify specific terms without context.

Existing approaches to STD include ASR-based techniques that use phoneme or grapheme lattices, as well as dynamic time warping (DTW) and acoustic word embeddings for direct audio comparisons. While these methods have their merits, they are limited by speaker variability, computational inefficiency, and challenges in processing large datasets. Current tools also struggle to generalize to different datasets, especially for terms not encountered during training.

Researchers from the Indian Institute of Technology Kanpur and imec – Ghent University have introduced a novel speech tokenization framework named BEST-STD. This approach encodes speech into discrete, speaker-agnostic semantic tokens, enabling efficient retrieval with text-based algorithms. By incorporating a bidirectional Mamba encoder, the framework generates highly consistent token sequences across different utterances of the same term. This method eliminates the need for explicit segmentation and handles OOV terms more effectively than previous systems.

The BEST-STD system uses a bidirectional Mamba encoder, which processes audio input in both forward and backward directions to capture long-range dependencies. Each layer of the encoder projects audio data into high-dimensional embeddings, which are discretized into token sequences through a vector quantizer. The model employs a self-supervised learning approach, leveraging dynamic time warping to align utterances of the same term and create frame-level anchor-positive pairs. The system uses an inverted index for storing tokenized sequences, allowing for efficient retrieval by comparing token similarity. During training, the system generates consistent token representations, ensuring invariance to the speaker and acoustic variations.
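To make two of the steps above concrete, here is a minimal sketch of DTW alignment used to produce frame-level anchor-positive pairs, and of nearest-codeword vector quantization. It is an illustration under stated assumptions, not the authors' implementation: the function names (`dtw_align`, `quantize`), the Euclidean frame distance, and the toy one-dimensional "embeddings" and codebook are all hypothetical, and the article does not specify how the actual quantizer or alignment is configured.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(seq_a, seq_b):
    """Return the (i, j) frame pairs on a minimum-cost DTW path.

    Each pair can be read as an (anchor frame, positive frame) pair
    drawn from two utterances of the same term.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance seq_a only
                cost[i][j - 1],      # advance seq_b only
            )
    # Backtrack from the end to recover the aligned frame pairs.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: euclidean(frame, codebook[k]))
        for frame in frames
    ]


# Toy data (assumed): two "utterances" of the same term at different speeds.
utt_a = [[0.0], [1.0], [2.0]]
utt_b = [[0.0], [0.0], [1.0], [2.0]]
codebook = [[0.0], [1.0], [2.0]]

pairs = dtw_align(utt_a, utt_b)
tokens_a = quantize(utt_a, codebook)
tokens_b = quantize(utt_b, codebook)
```

In this toy case the slower utterance repeats its first frame, so DTW pairs frame 0 of `utt_a` with frames 0 and 1 of `utt_b`, and the two token sequences differ only by that repeated token.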

The BEST-STD framework demonstrated superior performance in evaluations conducted on the LibriSpeech and TIMIT datasets. Compared to traditional STD methods and state-of-the-art tokenization models like HuBERT, WavLM, and SpeechTokenizer, BEST-STD achieved significantly higher Jaccard similarity scores for token consistency, with unigram scores reaching 0.84 and bigram scores at 0.78. The system outperformed baselines on spoken content retrieval tasks in mean average precision (MAP) and mean reciprocal rank (MRR). For in-vocabulary terms, BEST-STD achieved MAP scores of 0.86 and MRR scores of 0.91 on the LibriSpeech dataset, while for OOV terms, the scores reached 0.84 and 0.90 respectively. These results underline the system’s ability to effectively generalize across different term types and datasets.
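For readers unfamiliar with the metrics above, the following sketch shows one common way to compute n-gram Jaccard similarity between two token sequences, along with average precision and reciprocal rank for a single query (MAP and MRR are their means over all queries). The helper names are hypothetical, and the exact definitions used in the paper's evaluation may differ; in particular, this average precision assumes every relevant item appears somewhere in the ranked list.

```python
def ngrams(tokens, n):
    """Set of contiguous n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(tokens_a, tokens_b, n=1):
    """Jaccard similarity between the n-gram sets of two token sequences."""
    a, b = ngrams(tokens_a, n), ngrams(tokens_b, n)
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(a | b)


def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance is a list of booleans, best-ranked result first.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def reciprocal_rank(ranked_relevance):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


# Toy token sequences (assumed) for two utterances of the same term.
seq_a = [1, 2, 3, 4]
seq_b = [1, 2, 3, 5]
unigram_score = jaccard(seq_a, seq_b, n=1)  # 3 shared of 5 distinct tokens
bigram_score = jaccard(seq_a, seq_b, n=2)   # 2 shared of 4 distinct bigrams

# Relevance of ranked results for two hypothetical queries.
queries = [[True, False, True], [False, True]]
mean_ap = sum(average_precision(q) for q in queries) / len(queries)
mean_rr = sum(reciprocal_rank(q) for q in queries) / len(queries)
```

Bigram scores are naturally lower than unigram scores because a single token mismatch breaks two adjacent bigrams, which is consistent with the gap between the two figures reported above.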

Notably, the BEST-STD framework also excelled in retrieval speed and efficiency, benefiting from an inverted index for tokenized sequences. This approach reduced reliance on computationally intensive DTW-based matching, making it scalable for large datasets. The bidirectional Mamba encoder, in particular, proved more effective than transformer-based architectures due to its ability to model fine-grained temporal information critical for spoken term detection.
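The idea of replacing pairwise DTW matching with an index lookup can be sketched as follows. This is a simplified illustration, assuming the index is keyed on token bigrams and that candidates are ranked by the number of query bigrams they share; the class name `TokenInvertedIndex`, the bigram choice, and the scoring rule are assumptions for the example rather than details taken from the paper.

```python
from collections import Counter, defaultdict


class TokenInvertedIndex:
    """Inverted index from token n-grams to the utterances containing them."""

    def __init__(self, n=2):
        self.n = n
        self.postings = defaultdict(set)  # n-gram -> set of utterance ids

    def _grams(self, tokens):
        """Distinct contiguous n-grams of a token sequence."""
        return {
            tuple(tokens[i:i + self.n])
            for i in range(len(tokens) - self.n + 1)
        }

    def add(self, utt_id, tokens):
        """Index one tokenized utterance under each of its n-grams."""
        for gram in self._grams(tokens):
            self.postings[gram].add(utt_id)

    def search(self, query_tokens, top_k=5):
        """Rank utterances by how many query n-grams they contain."""
        counts = Counter()
        for gram in self._grams(query_tokens):
            for utt_id in self.postings.get(gram, ()):
                counts[utt_id] += 1
        # Sort by descending overlap, then by id for a stable order.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:top_k]


# Toy archive (assumed): token sequences for three utterances.
index = TokenInvertedIndex(n=2)
index.add("utt1", [5, 7, 7, 2, 9])
index.add("utt2", [1, 5, 7, 3])
index.add("utt3", [4, 4, 4])

results = index.search([5, 7, 2])
```

Only utterances that share at least one n-gram with the query are ever touched, so lookup cost depends on the size of the matching posting lists rather than on the size of the whole archive.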

In conclusion, BEST-STD marks a significant advance in spoken term detection. By addressing the limitations of traditional methods, it offers a robust and efficient solution for audio retrieval tasks. The use of speaker-agnostic tokens and a bidirectional Mamba encoder both improves performance and supports adaptability to diverse datasets. The framework shows promise for real-world applications and could improve the accessibility and searchability of audio archives.




The post This AI Paper Introduces BEST-STD (Spoken Term Detection): A Novel Bidirectional Mamba-Enhanced Speech Tokenization Framework for Efficient Spoken Term Detection appeared first on MarkTechPost.
