MarkTechPost@AI, August 6, 2024
Bytedance Researchers Present Cross Language Agent – Simultaneous Interpretation (CLASI): A High-Quality And Human-Like Simultaneous Speech Translation (SiST) System

 

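ByteDance researchers introduce CLASI to tackle the hard problems of simultaneous interpretation and to improve translation quality and efficiency.

🎯 CLASI uses a data-driven policy learning method that imitates how human interpreters split full sentences into smaller chunks, giving it a rigorous read-write policy for simultaneous interpretation. It segments on syntactic markers and contextual meaning, which improves translation accuracy and fluency.

💾 Because LLMs struggle to learn rare terminology, CLASI adds two modules: a memory that records speech context and an external knowledge database containing terms and their matched translations. To limit the noise and slowdown that the external database can introduce, the researchers propose Multi-Modal Retrieval Augmented Generation (MM-RAG).

📈 CLASI is trained in three stages: pretraining, continual training, and fine-tuning. The LLM and the audio encoder are each pretrained on large internal datasets; the team then runs continual training on billions of tokens of low-quality synthetic speech translation data to align the modalities, and finally fine-tunes on a small amount of human-annotated data to make the model more robust and produce better translations.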

Simultaneous speech translation (SiST), the translation of spoken words into another language in real time, is one of the most difficult challenges in translation, and it paves the way for instantaneous communication across language barriers. Machine-assisted automatic interpretation has drawn a lot of attention in natural language processing (NLP). Traditional simultaneous translation systems typically chain streaming Automatic Speech Recognition (ASR), punctuation, and Machine Translation (MT) models in a cascade. Unfortunately, the ASR module is a common source of latency and error propagation in such cascaded systems.

Academic SiST models and commercial SiST engines have come a long way, yet translation quality still needs to improve. Human evaluations of the currently available SiST systems show that they deliver less than 42% of the correct information to listeners, which significantly limits the efficacy of communication from a user-centered standpoint. A human interpreter, on the other hand, can convey at least 70% of the intended meaning and often more than 95%. The researchers therefore use 80% to denote highly qualified human interpreters in this work. Given the enormous success of LLMs in machine and speech translation, they propose using LLMs for the SiST task.

Integrating an LLM into SiST is not straightforward. First, SiST needs a read-write policy, under which the LLM must produce only a partial translation for the speech received so far. Second, LLMs struggle to learn rare terms or terminologies from training data, so reaching human-equivalent performance is challenging. Finally, performance on the SiST task is still held back by the shortage of training data. In response to these challenges, researchers from ByteDance have introduced CLASI, a Cross Language Agent that achieves Simultaneous Interpretation through the repeated execution of various operations.

CLASI overcomes the first obstacle by emulating how human interpreters segment full sentences into smaller, more manageable pieces based on syntactic markers and contextual meaning. A data-driven policy learning method lets CLASI learn and apply a rigorous read-write policy for SiST. To address the second obstacle, the CLASI agent is equipped with two additional modules: a memory that records speech context and an external knowledge database of terminologies and their matched translations. However, the external knowledge database can introduce noise and slow the system down. To mitigate this, the researchers propose a new method called Multi-Modal Retrieval Augmented Generation (MM-RAG), in which a multi-modal retriever searches the external database for relevant information, improving the efficiency of the CLASI agent.
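To make the idea of a read-write policy concrete, here is a minimal Python sketch. It is not CLASI's learned policy: it works on text tokens rather than audio, and it uses a hand-written rule (a hypothetical set of punctuation tokens plus a `max_chunk` limit) as a stand-in for the data-driven segmentation the article describes. The names `BOUNDARY_TOKENS`, `read_write_policy`, and `segment` are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumption: punctuation tokens stand in for the syntactic markers that a
# learned policy would pick up on. CLASI learns this from data instead.
BOUNDARY_TOKENS = {",", ".", "?", "!", ";"}


def read_write_policy(
    tokens: Iterable[str], max_chunk: int = 8
) -> Iterator[Tuple[str, List[str]]]:
    """Yield ("READ", [token]) for each input token and ("WRITE", chunk)
    whenever the buffered tokens form a chunk that should be translated."""
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        yield ("READ", [token])
        # WRITE when a boundary marker arrives or the buffer grows too long.
        if token in BOUNDARY_TOKENS or len(buffer) >= max_chunk:
            yield ("WRITE", buffer)
            buffer = []
    if buffer:
        # Flush whatever is left when the stream ends.
        yield ("WRITE", buffer)


def segment(tokens: Iterable[str], max_chunk: int = 8) -> List[List[str]]:
    """Collect only the chunks that the policy decided to translate."""
    return [
        chunk
        for action, chunk in read_write_policy(tokens, max_chunk)
        if action == "WRITE"
    ]


if __name__ == "__main__":
    stream = "when the meeting ends , we will publish the full report .".split()
    for chunk in segment(stream):
        print(" ".join(chunk))
```

In a real system each WRITE action would hand the chunk to the translation model, so the quality of the segmentation directly trades latency (short chunks) against context (long chunks).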

The retrieved information and the memory context are added to the LLM agent’s prompt so that in-context learning can improve the translation. To tackle the data scarcity of the SiST task, the researchers use a three-stage training methodology: pretraining, continual training, and fine-tuning. The LLM and the audio encoder are pretrained separately on massive internal datasets. The model is then continually trained on billions of tokens of low-quality synthetic speech translation data to align the speech and text modalities. So that the LLM makes better use of the contextual information from the retriever and the preceding translation, this stage also includes several tasks that strengthen its in-context learning capability. Finally, a small quantity of human-annotated data is used to fine-tune the model, making it more robust and producing better translations by mimicking the behavior of professional human interpreters. Because SiST frequently involves compaction, abstraction, and paraphrasing, traditional automatic evaluation metrics may not accurately reflect simultaneous interpretation performance.
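The step of placing retrieved terminology and memory context into the prompt can be sketched roughly as follows. This is an illustration under loud assumptions: the retriever here is a plain text substring match, whereas MM-RAG uses a multi-modal retriever; the glossary entries, the prompt wording, and the names `GLOSSARY`, `retrieve_terms`, and `build_prompt` are all hypothetical and do not come from the paper.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# Assumption: a tiny in-memory glossary stands in for the external knowledge
# database of terms and their matched translations.
GLOSSARY: Dict[str, str] = {
    "read-write policy": "读写策略",
    "simultaneous interpretation": "同声传译",
}


def retrieve_terms(
    chunk: str, glossary: Dict[str, str], top_k: int = 2
) -> List[Tuple[str, str]]:
    """Return up to top_k glossary entries whose source term occurs in the chunk.
    A substring match is a simplification of a multi-modal retriever."""
    lowered = chunk.lower()
    hits = [(term, target) for term, target in glossary.items() if term in lowered]
    return hits[:top_k]


def build_prompt(
    chunk: str, memory: Deque[Tuple[str, str]], glossary: Dict[str, str]
) -> str:
    """Assemble a prompt from retrieved terms, recent memory, and the new chunk."""
    lines = ["Translate the new segment from English to Chinese."]
    terms = retrieve_terms(chunk, glossary)
    if terms:
        lines.append("Terminology:")
        lines.extend(f"- {source} => {target}" for source, target in terms)
    if memory:
        lines.append("Previous segments:")
        lines.extend(f"- {source} => {target}" for source, target in memory)
    lines.append(f"New segment: {chunk}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Memory keeps only the most recent segment pairs.
    memory: Deque[Tuple[str, str]] = deque(maxlen=2)
    memory.append(("Good morning.", "早上好。"))
    print(build_prompt("CLASI learns a read-write policy", memory, GLOSSARY))
```

Limiting retrieval to a few matching entries reflects the concern raised in the article that an external database can add noise and slow the system if too much of it reaches the prompt.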

The researchers therefore propose a new evaluation metric, Valid Information Proportion (VIP), which aligns with how human interpreters are judged. The primary goal of SiST is real-time communication, and VIP indicates the proportion of information that is conveyed accurately. In human evaluations on challenging real-world long speech datasets covering diverse topics, the proposed method significantly outperforms other available systems. For example, in the Chinese-to-English direction CLASI reaches a VIP score of 81.3%, which is on par with the 80% level the researchers use to denote highly qualified human interpreters. This promising result indicates a bright future for SiST.
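As a rough illustration of how a proportion-style metric such as VIP could be computed, the sketch below assumes that human evaluators give each segment a binary judgment (information conveyed accurately or not). That annotation scheme, the sample judgments, and the names `HUMAN_BENCHMARK` and `valid_information_proportion` are assumptions for illustration; the paper's exact protocol may differ.

```python
from typing import Sequence

# The 80% level the article cites for highly qualified human interpreters.
HUMAN_BENCHMARK = 0.80


def valid_information_proportion(judgments: Sequence[bool]) -> float:
    """Fraction of segments that evaluators marked as accurately conveyed.
    Assumes one binary human judgment per segment."""
    if not judgments:
        raise ValueError("at least one judged segment is required")
    return sum(judgments) / len(judgments)


if __name__ == "__main__":
    # Made-up judgments for four segments, purely for demonstration.
    sample = [True, True, False, True]
    vip = valid_information_proportion(sample)
    print(f"VIP = {vip:.1%}, meets benchmark: {vip >= HUMAN_BENCHMARK}")
```

Because the score counts conveyed information rather than word overlap, a translation that compresses or paraphrases the source can still score well, which is the motivation the article gives for the metric.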

Results in both the Chinese-to-English and English-to-Chinese directions were much better than those of commercial systems, but the team notes that more languages should be covered in the future. In the presented implementation of CLASI, each translation round triggers the full action sequence. Since the model can translate accurately without any external knowledge in simple translation scenarios, some of these actions are optional there, and the model could be trained to skip the extra steps in the future.

The Valid Information Proportion (VIP) metric is proposed to improve human evaluation, which also underscores the need for more reliable automated quality and latency measurements in the future. The evidence further points to the potential of reinforcement learning from human feedback (RLHF) to enhance LLM performance. While CLASI outperforms prior state-of-the-art systems, more research is needed on multi-modal reward models and on RL approaches for SiST. Other promising areas of study include multi-modal integration, such as end-to-end video-to-video or speech-to-speech generation.


