MarkTechPost@AI · August 1, 12:09
TransEvalnia: A Prompting-Based System for Fine-Grained, Human-Aligned Translation Evaluation Using LLMs

As large language models (LLMs) advance rapidly in translation, traditional evaluation metrics such as BLEU are proving inadequate. Users need finer-grained, more interpretable evaluations to judge a translation's accuracy, terminology use, and audience suitability. A research team at Sakana.ai has developed TransEvalnia, an innovative translation evaluation and ranking system. Leveraging LLM prompting, it provides detailed feedback along MQM dimensions and both scores and ranks translations. TransEvalnia performs strongly across multiple language pairs and tasks, even outperforming leading existing models, and its assessments align closely with human judgment. The system also addresses position bias in ranking, and all data, reasoning outputs, and code have been released, charting a new direction for translation quality evaluation.

🎯 **A new LLM-driven paradigm for translation evaluation**: Traditional metrics such as BLEU can no longer meet the growing demand for high-quality, fine-grained translation evaluation. TransEvalnia uses LLM prompting to go beyond simple numerical scores, providing detailed feedback on key dimensions such as accuracy, terminology, and audience suitability, making the evaluation process more transparent and insightful.

🏆 **Beyond traditional metrics, aligned with human judgment**: TransEvalnia is competitive with, and sometimes better than, leading MT evaluation models across multiple language pairs (e.g., English-Japanese, Chinese-English) and tasks. Built on LLMs such as Claude 3.5 and Qwen-2.5, its evaluations agree closely with human ratings, demonstrating its strength in capturing subtle differences and providing professional-grade feedback.

⚖️ **Tackling challenges and bias in evaluation**: Beyond accuracy, the system actively addresses position bias in ranking. Strategies such as interleaving effectively reduce this bias and improve reliability. The research team has also released all data, reasoning outputs, and code, enabling further research and applications in this area.

📜 **Fine-grained evaluation with multi-dimensional scoring**: TransEvalnia breaks each translation into smaller spans and evaluates it span by span, scoring every span on a 1-5 Likert scale and adding special dimensions, such as emotional tone, for poetic texts. This detailed approach, which distills the essence of the MQM framework, yields a more comprehensive and actionable judgment of translation quality.

💡 **Transparency and explainability are key**: Unlike traditional methods that produce a score without explanation, TransEvalnia generates rationales to support its assessments, so users can understand why a score was given, spot potential errors, and make better-informed decisions. This emphasis on transparency is an important direction for the future of machine translation evaluation.

Translation systems powered by LLMs have become so advanced that they can outperform human translators in some cases. As LLMs improve, especially in complex tasks such as document-level or literary translation, it becomes increasingly challenging to make further progress and to accurately evaluate that progress. Traditional automated metrics, such as BLEU, are still used but fail to explain why a score is given. With translation quality reaching near-human levels, users require evaluations that extend beyond numerical metrics, providing reasoning across key dimensions, such as accuracy, terminology, and audience suitability. This transparency enables users to assess evaluations, identify errors, and make more informed decisions. 

While BLEU has long been the standard for evaluating machine translation (MT), its usefulness is fading as modern systems now rival or outperform human translators. Newer metrics such as BLEURT, COMET, and MetricX fine-tune powerful language models to assess translation quality more accurately. Large models such as GPT and PaLM2 can now offer zero-shot or structured evaluations, even generating MQM-style feedback, and techniques such as pairwise comparison have further improved alignment with human judgments. Recent studies show that asking models to explain their choices improves decision quality, yet such rationale-based methods remain underused in MT evaluation despite their growing potential.

Researchers at Sakana.ai have developed TransEvalnia, a translation evaluation and ranking system that uses prompting-based reasoning to assess translation quality. It provides detailed feedback using selected MQM dimensions, ranks translations, and assigns scores on a 5-point Likert scale, including an overall rating. The system performs competitively with, or even better than, the leading MT-Ranker model across several language pairs and tasks, including English-Japanese, Chinese-English, and more. Tested with LLMs like Claude 3.5 and Qwen-2.5, its judgments aligned well with human ratings. The team also tackled position bias and has released all data, reasoning outputs, and code for public use. 
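
To make this concrete, the sketch below shows what a prompting-based, MQM-inspired evaluation call could look like. It is not the paper's actual prompt or code: the dimension list, the prompt wording, and the model identifier are illustrative assumptions (the team reports testing with Claude 3.5 Sonnet, so the Anthropic client is used here).

```python
import anthropic

# Illustrative MQM-inspired dimensions; the paper's exact dimension
# set and prompt wording are not reproduced here.
DIMENSIONS = ["accuracy", "terminology", "audience suitability", "clarity"]

PROMPT_TEMPLATE = """You are an expert translation evaluator.
Source ({src_lang}): {source}
Translation ({tgt_lang}): {translation}

Evaluate the translation span by span. For each dimension below,
give a 1-5 Likert score with a short rationale, then an overall
1-5 score for the whole translation.
Dimensions:
{dims}"""

def evaluate_translation(source: str, translation: str,
                         src_lang: str = "English",
                         tgt_lang: str = "Japanese") -> str:
    """Return the LLM's free-text rationale plus Likert scores."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    prompt = PROMPT_TEMPLATE.format(
        source=source, translation=translation,
        src_lang=src_lang, tgt_lang=tgt_lang,
        dims="\n".join(f"- {d}" for d in DIMENSIONS),
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model ID, not from the paper
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```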

The methodology centers on evaluating translations across key quality aspects, including accuracy, terminology, audience suitability, and clarity. For poetic texts like haikus, emotional tone replaces standard grammar checks. Translations are broken down and assessed span by span, scored on a 1–5 scale, and then ranked. To reduce bias, the study compares three evaluation strategies: single-step, two-step, and a more reliable interleaving method. A “no-reasoning” method is also tested but lacks transparency and is prone to bias. Finally, human experts reviewed selected translations to compare their judgments with those of the system, offering insights into its alignment with professional standards. 
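
The interleaving method itself is the paper's own procedure; as a simpler illustration of the underlying idea, that a pairwise verdict should not depend on which translation is shown first, here is a generic order-swap consistency check. The `judge` callable is a hypothetical stand-in for an LLM comparison prompt.

```python
from typing import Callable, Optional

def rank_pair(judge: Callable[[str, str, str], str],
              source: str, trans_a: str, trans_b: str) -> Optional[str]:
    """Compare two translations in both presentation orders.

    `judge(source, first, second)` is assumed to return "first" or
    "second". The verdict is kept only when it is order-invariant;
    an order-dependent answer is flagged as inconsistent (None).
    """
    verdict_ab = judge(source, trans_a, trans_b)  # A shown first
    verdict_ba = judge(source, trans_b, trans_a)  # B shown first
    winner_ab = "A" if verdict_ab == "first" else "B"
    winner_ba = "B" if verdict_ba == "first" else "A"
    return winner_ab if winner_ab == winner_ba else None
```

A verdict that flips when the order is swapped is precisely the position bias the paper targets; its interleaved strategy is reported to reduce such flips most reliably.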

The researchers evaluated translation ranking systems using datasets with human scores, comparing their TransEvalnia models (Qwen and Sonnet) with MT-Ranker, COMET-22/23, XCOMET-XXL, and MetricX-XXL. On WMT-2024 en-es, MT-Ranker performed best, likely due to rich training data. However, in most other datasets, TransEvalnia matched or outperformed MT-Ranker; for example, Qwen’s no-reasoning approach led to a win on WMT-2023 en-de. Position bias was analyzed using inconsistency scores, where interleaved methods often had the lowest bias (e.g., 1.04 on Hard en-ja). Human raters gave Sonnet the highest overall Likert scores (4.37–4.61), with Sonnet’s evaluations correlating well with human judgment (Spearman’s R~0.51–0.54). 
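
For readers who want to reproduce the correlation statistic on their own data, the snippet below computes Spearman's R on toy numbers and shows one common way to express position bias as an order-swap flip rate. Note that the paper's inconsistency score is defined differently (its reported values, such as 1.04, are not fractions), so this is a sketch of the general idea only.

```python
from scipy.stats import spearmanr

# Toy data only: six hypothetical system vs. human Likert scores.
system_scores = [4.5, 3.0, 4.0, 2.5, 5.0, 3.5]
human_scores = [4.0, 3.5, 4.5, 2.0, 4.5, 3.0]

rho, p_value = spearmanr(system_scores, human_scores)
print(f"Spearman's R = {rho:.2f} (p = {p_value:.3f})")

# One common position-bias measure (not the paper's exact metric):
# the fraction of pairs whose winner flips when the order is swapped.
def flip_rate(winners_ab: list, winners_ba: list) -> float:
    flips = sum(a != b for a, b in zip(winners_ab, winners_ba))
    return flips / len(winners_ab)

print(flip_rate(["A", "B", "A", "A"], ["A", "A", "A", "B"]))  # 0.5
```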

In conclusion, TransEvalnia is a prompting-based system for evaluating and ranking translations using LLMs like Claude 3.5 Sonnet and Qwen. The system provides detailed scores across key quality dimensions, inspired by the MQM framework, and selects the better translation among options. It often matches or outperforms MT-Ranker on several WMT language pairs, although MetricX-XXL leads on WMT due to fine-tuning. Human raters found Sonnet’s outputs to be reliable, and scores showed a strong correlation with human judgments. Fine-tuning Qwen improved performance notably. The team also explored solutions to position bias, a persistent challenge in ranking systems, and shared all evaluation data and code. 


Check out the Paper here.

