Language transfer enables the use of language models trained in one or more languages to initialize a new language model in another language. WECHSEL is a cross-lingual transfer method that initializes the embedding parameters of a language model in a target language from the embedding parameters of an existing model in a source language, enabling more efficient training in the new language.

Training Large Language Models (LLMs) requires significant computational resources, and most existing models are primarily trained on English text. This creates substantial challenges for training LLMs in other languages, primarily due to high computational costs and insufficient training data in these languages. To address these challenges, we need methods that facilitate the training of LLMs in new languages while minimizing environmental impact. One promising approach is language transfer, which initializes a model in a target language from a model trained on a source language, typically English.

Language transfer approaches can be categorized into two main types:

- **Mono-lingual or cross-lingual transfer** uses a model trained on a single source language to initialize a new model in a target language that is different from the source language, for example transferring a model trained on English to German.
- **Multi-lingual transfer** uses a model trained on one or more source languages to initialize a model in a target language, which can be one of the source languages or a different one, for instance transferring a model trained on English and German to either German or French.

## WECHSEL

The paper [Min22W] introduces WECHSEL, a method to transfer trained models to new languages. It requires as input:

- a tokenizer in the source language,
- a pre-trained language model in the source language,
- a tokenizer in the target language,
- a bilingual dictionary between the source and target languages,
- two monolingual fastText embeddings [Boj17E], one for the source and one for the target language. These can be obtained in one of two ways: downloaded as pre-trained embeddings, or trained from scratch on text in the respective language.

Figure 1: Summary of the WECHSEL method, with inputs in blue, intermediate results in orange and outputs in green.

## Algorithm

Figure 1 is a diagram giving a high-level summary of the method and its components. The method proceeds as follows (a code sketch follows the list):

1. Use the tokenizers to split the words in the bilingual dictionary into subwords (tokens).
2. Use the fastText embeddings to compute subword embeddings as the sum of the embeddings of their n-grams,
   $$\mathbf{u}_x = \sum_{g \in \mathbb{G}(x)} \mathbf{w}_g$$
   where $\mathbb{G}(x)$ is the set of n-grams occurring in the subword $x$ and $\mathbf{w}_g$ is the embedding of the n-gram $g$. The embeddings of subwords in which no known n-gram occurs are set to zero.
3. Align the subword embeddings using the bilingual dictionary and the Orthogonal Procrustes method [Sch66G, Art16L],
   $$\underset{W}{\text{argmin}} \; \lVert \mathbf{U}^t W - \mathbf{U}^s \rVert_F^2$$
   where $\lVert \cdot \rVert_F$ is the Frobenius norm and $W$ is required to be an orthogonal matrix ($W^T W = I$).
4. Compute the cosine similarity $s_{x,y}$ between every subword $x$ in the target language and every subword $y$ in the source language,
   $$s_{x,y} = \frac{\mathbf{u}_x^{t} \cdot \mathbf{u}_y^{s}}{\lVert \mathbf{u}_x^t \rVert \, \lVert \mathbf{u}_y^s \rVert}.$$
5. Initialize the embeddings of the target model as a weighted average of the embeddings of the source model, using the cosine similarities between aligned subword embeddings as weights,
   $$\mathbf{e}_x^t = \frac{\sum_{y \in \mathcal{J}_x} \exp(s_{x,y} / \tau) \cdot \mathbf{e}_y^s}{\sum_{y^{\prime} \in \mathcal{J}_x} \exp(s_{x,y^{\prime}} / \tau)}$$
   where $\mathcal{J}_x$ is the set of the $k$ nearest neighbouring subwords of $x$ in the source language and $\tau$ is a temperature. Target subwords whose fastText embedding was set to zero are instead initialized from a normal distribution $\mathcal{N}(\mathbb{E}[\mathbf{E}^s], \mathrm{Var}[\mathbf{E}^s])$.
6. Copy the non-embedding parameters of the source model to the target model.
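To make the procedure concrete, here is a minimal NumPy/SciPy sketch of the initialization. It is not the authors' released implementation: the names `tok_src`/`tok_tgt` (Hugging Face tokenizers), `ft_src`/`ft_tgt` (loaded fastText models), `emb_src` (the source model's input embedding matrix) and `dictionary` (a list of `(source word, target word)` pairs) are assumptions for illustration, and the way dictionary pairs are turned into matched matrices for Procrustes is a simplification of the paper's exact pairing scheme.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes


def subword_fasttext_embeddings(tokenizer, ft_model):
    """fastText vector for every subword in the tokenizer's vocabulary,
    ordered by token id. fastText sums character n-gram vectors internally,
    so subwords without any known n-gram come out as zero vectors."""
    vocab = tokenizer.get_vocab()                      # token -> id
    tokens = sorted(vocab, key=vocab.get)
    return np.stack([ft_model.get_word_vector(t) for t in tokens])


def dictionary_matrix(words, tokenizer, ft_model):
    """Embed each dictionary word as the mean fastText vector of its subwords
    (a simplification of the paper's pairing scheme)."""
    rows = []
    for w in words:
        subs = tokenizer.tokenize(w) or [w]
        rows.append(np.mean([ft_model.get_word_vector(s) for s in subs], axis=0))
    return np.stack(rows)


def wechsel_init(emb_src, tok_src, tok_tgt, ft_src, ft_tgt, dictionary,
                 k=10, tau=0.1, seed=0):
    rng = np.random.default_rng(seed)

    # Steps 1-2: fastText embeddings for every source / target subword.
    u_src = subword_fasttext_embeddings(tok_src, ft_src)   # (|V_s|, d_ft)
    u_tgt = subword_fasttext_embeddings(tok_tgt, ft_tgt)   # (|V_t|, d_ft)

    # Step 3: Orthogonal Procrustes on the dictionary pairs, rotating the
    # target fastText space onto the source fastText space.
    a = dictionary_matrix([t for _, t in dictionary], tok_tgt, ft_tgt)
    b = dictionary_matrix([s for s, _ in dictionary], tok_src, ft_src)
    w_rot, _ = orthogonal_procrustes(a, b)
    u_tgt = u_tgt @ w_rot

    # Step 4: cosine similarity of every target subword to every source subword.
    def normed(m):
        return m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-8)
    sim = normed(u_tgt) @ normed(u_src).T                   # (|V_t|, |V_s|)

    # Step 5: softmax-weighted average of the source *model* embeddings over
    # the k most similar source subwords of each target subword.
    emb_tgt = np.empty((sim.shape[0], emb_src.shape[1]))
    for x in range(sim.shape[0]):
        nn = np.argpartition(-sim[x], k)[:k]
        weights = np.exp(sim[x, nn] / tau)
        emb_tgt[x] = weights @ emb_src[nn] / weights.sum()

    # Target subwords with an all-zero fastText vector get a random draw
    # matching the source embedding statistics.
    zero_rows = ~np.any(u_tgt, axis=1)
    emb_tgt[zero_rows] = rng.normal(emb_src.mean(), emb_src.std(),
                                    size=(zero_rows.sum(), emb_src.shape[1]))
    return emb_tgt
```

The hyperparameters `k` and `tau` are illustrative defaults rather than the paper's exact values, and the random initialization uses the standard deviation of the source embeddings as the scale of the normal distribution.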
## Experiments

For their experiments, the authors used RoBERTa and GPT-2 models trained on English, transferring them to four medium-resource languages (French, German, Chinese and Swahili) and four low-resource languages (Sundanese, Scottish Gaelic, Uyghur and Malagasy). They employed automatically generated bilingual dictionaries from MUSE [Con18W] for French, German and Chinese, and a bilingual dictionary from FreeDict for Swahili. For the low-resource languages, they used bilingual dictionaries scraped from Wiktionary, stored in their repository.

The authors compared their method to two other approaches:

- **FullRand** randomly initializes the target model and trains it from scratch.
- **TransInner** randomly initializes the embedding parameters and copies the non-embedding parameters. It trains only the embedding parameters for a fixed number of steps while freezing the remaining parameters, then trains the entire model.

For all methods and models, they trained on 65.5 billion tokens, significantly fewer than the models compared against (e.g., CamemBERT on 419.4 billion tokens, GBERT-Base on 255.6 billion tokens, and BERT-Base-Chinese on 131.1 billion tokens). All models were trained for 250k steps with consistent hyperparameters across languages. WECHSEL-RoBERTa was evaluated by fine-tuning on XNLI for NLI performance and on the balanced train-dev-test split of WikiANN for NER performance. WECHSEL-GPT-2 was evaluated by perplexity (PPL) on a held-out set from the training corpus (see the sketch after the result tables below).

## Results

From Figure 2, we observe that WECHSEL significantly improves cross-lingual parameter transfer and outperforms models of comparable size trained from scratch, with up to 64x less training effort. As shown in Table 1 and Table 2, models initialized with WECHSEL generally outperform models trained from scratch and those initialized with TransInner across all languages examined. Notably, close relatedness of the source and target languages is not a prerequisite for effective transfer. For instance, on NLI, WECHSEL improves absolute accuracy by 7.15%, 6.31%, 6.94% and 4.71% over models trained from scratch for French, German, Chinese and Swahili, respectively.

Table 1: Results from fine-tuning RoBERTa models. Accuracy is reported for NLI on XNLI and micro F1 score for NER on WikiANN. Results are averaged over 3 runs. Scores are reported before training (Score@0), after 10% of the total number of steps (Score@25k) and after training (Score@250k). Results from fine-tuning prior monolingual models and XLM-R (Score (more training)) are also reported. For each language, the best results in every column are underlined. The overall best results, including the comparison with existing monolingual/multilingual models of comparable size, are shown in bold.

Figure 2: Test scores over training steps from fine-tuning RoBERTa models on NLI (using XNLI) and NER (using WikiANN), and perplexity of GPT-2 models on the held-out set over training steps. The evaluation is done every 12.5k steps.

Table 2: Results of training GPT-2 models. The perplexity before training (PPL@0), after 10% of the total number of steps (PPL@25k) and after training (PPL@250k) is reported.
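The held-out perplexity reported for the GPT-2 models can be reproduced in spirit with a short script. The sketch below is a generic recipe, not the authors' evaluation code; `model_dir` (a causal LM checkpoint directory) and `heldout_texts` (an iterable of held-out strings) are placeholder names, and it assumes the Hugging Face transformers library.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def heldout_perplexity(model_dir, heldout_texts, max_length=512, device="cpu"):
    """Exponential of the average per-token negative log-likelihood
    over a held-out corpus."""
    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir).to(device).eval()

    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in heldout_texts:
            enc = tok(text, return_tensors="pt",
                      truncation=True, max_length=max_length).to(device)
            n_pred = enc["input_ids"].size(1) - 1   # tokens the model predicts
            if n_pred <= 0:
                continue                            # skip single-token lines
            # passing labels=input_ids makes the model shift them internally
            # and return the mean cross-entropy over the predicted tokens
            out = model(**enc, labels=enc["input_ids"])
            total_nll += out.loss.item() * n_pred
            total_tokens += n_pred
    return math.exp(total_nll / total_tokens)
```

Perplexity is the exponential of the average per-token negative log-likelihood, so accumulating token-weighted losses before exponentiating keeps documents of different lengths comparable.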
Despite its increased efficiency compared to training from scratch, the WECHSEL method has several inherent weaknesses. These include its complexity and a strong reliance on bilingual dictionaries for aligning the embedding vectors. This reliance becomes a significant issue when the dictionaries contain mistakes, since their quality directly impacts the alignment and, consequently, the model initialization. In the experiments, the authors used automatically extracted dictionaries for three of the languages, which introduced omissions and errors. For instance, in the French-English dictionary, several mistakes can be identified:

- **Incorrect word mappings**: some entries map a word to a wrong or non-existent translation. For example, on line 80, feasible is mapped to viable, whereas it should be mapped to faisable; on line 48641, friend is mapped to padna, which, according to online French dictionaries, is not a French word.
- **Inconsistent order**: the order of the languages is inconsistent. On most lines English precedes French (e.g., line 94), while on some lines French precedes English (e.g., line 81).

These inaccuracies in the dictionaries can significantly affect the performance and reliability of the resulting language models.

Additionally, a crucial gap in the research is the lack of investigation into the method's effect on social biases within the models. Although the authors acknowledge the potential risk of such biases, they neither explore their extent nor propose solutions to mitigate them. Addressing these concerns is essential for improving the reliability and the ethical considerations of using WECHSEL for transferring language models.
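As an illustration of the kind of check that could catch such problems early, the snippet below scans a MUSE-style dictionary file (one whitespace-separated word pair per line, English expected first) and flags malformed or possibly swapped lines. It is a hypothetical helper, not part of WECHSEL; `english_vocab` is an assumed set of known English words.

```python
def check_bilingual_dictionary(path, english_vocab):
    """Flag dictionary lines that are malformed or whose language order
    looks swapped (the non-English word appearing in the first column)."""
    suspicious = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            parts = line.split()
            if len(parts) != 2:
                suspicious.append((lineno, line.strip(), "not a word pair"))
                continue
            left, right = parts
            # expected order: English word first, target-language word second
            if left.lower() not in english_vocab and right.lower() in english_vocab:
                suspicious.append((lineno, line.strip(), "languages possibly swapped"))
    return suspicious
```

A heuristic like this will not detect outright wrong translations such as feasible mapped to viable, but it cheaply surfaces swapped or malformed lines before they feed into the alignment step.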