MarkTechPost@AI 05月14日 12:10
This AI Paper Investigates Test-Time Scaling of English-Centric RLMs for Enhanced Multilingual Reasoning and Domain Generalization
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文介绍了一项关于推理语言模型(RLM)的研究,该研究探讨了如何通过增加测试时的计算量,特别是通过扩展推理链,来提升以英语为中心的RLM的多语言推理能力。研究团队使用基于Qwen2.5-Instruct架构的模型,并在英语STEM推理样本上进行了微调。实验结果表明,增加思考tokens数量可以显著提高模型在非英语语言中的准确性,尤其是在高资源语言中。然而,这种提升并未推广到文化常识或人文等领域,表明需要进一步研究平衡的多语言训练和领域适应。

💡 研究重点:评估增加测试时计算量(通过扩展推理链)如何影响英语中心RLM的多语言推理能力。

🌍 实验设置:使用基于Qwen2.5-Instruct架构的模型,在1000个英语STEM推理样本上进行微调,并在MGSM和Global-MMLU等基准测试中跨多种语言进行测试。

📈 主要发现:增加思考tokens数量显著提升了模型在非英语语言中的准确性,特别是对于高资源语言,例如中文和英语,但对于低资源语言(如斯瓦希里语或泰卢固语)效果不佳。

🗣️ “引用-思考”行为:模型引用prompt中的非英语短语,并以英语进行推理,表明模型利用其多语言理解能力来解释非英语输入,而无需直接翻译。

📚 局限性:在STEM相关任务中表现良好,但在文化常识或人文等领域,增加思考tokens数量有时会降低性能,表明存在“过度思考”的问题。

Reasoning language models, or RLMs, are increasingly used to simulate step-by-step problem-solving by generating long, structured reasoning chains. These models break down complex questions into simpler parts and build logical steps to reach answers. This chain-of-thought (CoT) approach has proven effective in improving output quality, especially in mathematical and logical tasks. Despite multilingual capabilities in many modern large models, the focus of research and training has remained largely centered on English, leaving a gap in understanding how well these reasoning skills translate to other languages.

One major challenge is that most RLMs are fine-tuned on English data, which limits their ability to reason effectively in other languages. This becomes especially problematic for low-resource languages that have limited training examples. The models may default to English thinking patterns, producing lower-quality outputs when prompted in another language. Furthermore, differences in language structure can cause reasoning errors, particularly when a model trained in one language is expected to infer logic in another without adequate linguistic alignment.

Current techniques employ zero-shot or few-shot prompting strategies to manage these limitations, often using English as a pivot language. Some efforts involve presenting prompts in the same language as the query to preserve linguistic consistency. However, small models have minimal benefits due to limited capacity, and even large models show inconsistent performance when reasoning in low-resource languages. Despite multilingual pretraining, the gap between the training and reasoning language continues to hinder accurate multilingual reasoning.

The Brown University and MBZUAI research team focused on evaluating how increasing test-time computation, particularly through extended reasoning chains, can affect the multilingual reasoning abilities of English-centric RLMs. They investigated using s1 models based on the Qwen2.5-Instruct architecture and fine-tuned on 1,000 English STEM reasoning samples. These models were tested across various languages using benchmarks like MGSM and Global-MMLU to answer four core questions: the effectiveness of crosslingual test-time scaling, language-mixing behaviors, performance under language-forcing, and cross-domain generalization.

In-depth experiments showed that models with more parameters significantly benefited from increased test-time thinking tokens. The 14B s1 model, when scaled to 8,000 thinking tokens, achieved an average accuracy of 81% across non-English languages in MGSM. It outperformed models like Qwen2.5-14B-Instruct by +23.1% in French and +41.6% in Swahili. Even though the model was trained only in English, its performance surpassed that of larger models such as DeepSeek’s R1-Distill-Qwen-32B in several high-resource languages. The study also found that reasoning in high-resource languages like Chinese and English is more efficient, requiring fewer tokens and delivering better results than in low-resource languages like Swahili or Telugu.

A key observation was the “quote-and-think” behavior, where the model quoted non-English phrases from prompts and reasoned in English. This consistent pattern across languages like Japanese and Russian suggested that the model used its multilingual understanding to interpret non-English input without direct translation. Language-forcing experiments further confirmed that forcing reasoning in high-resource languages yielded better results, while strict reasoning in low-resource languages led to significant accuracy drops and computational inefficiencies.

Despite strong results in STEM-related tasks, performance gains did not transfer to domains like cultural commonsense or humanities. In benchmarks like FORK, increasing thinking tokens sometimes reduced performance, indicating overthinking. The study concludes that while test-time scaling enhances multilingual reasoning in high-resource languages, it does not generalize effectively to out-of-domain tasks or low-resource languages, indicating the need for further research on balanced multilingual training and domain adaptation.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.

Here’s a brief overview of what we’re building at Marktechpost:

The post This AI Paper Investigates Test-Time Scaling of English-Centric RLMs for Enhanced Multilingual Reasoning and Domain Generalization appeared first on MarkTechPost.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

推理语言模型 多语言推理 测试时扩展 Qwen2.5-Instruct 领域泛化
相关文章