Language transfer enables the use of language models trained in one or more languages to initialize a new language model in another language. WECHSEL is a cross-lingual transfer method that initializes the embedding parameters of a language model in a target language from the embedding parameters of an existing model in a source language, enabling more efficient training in the new language.

Training Large Language Models (LLMs) requires significant computational resources, and most existing models are primarily trained on English text. This creates substantial challenges for training LLMs in other languages, primarily due to high computational costs and insufficient training data in these languages. To address these challenges, we need methods that facilitate the training of LLMs in new languages while minimizing environmental impact. One promising approach is language transfer, which initializes a model in a target language from a model trained on a source language, typically English.

Language transfer approaches can be categorized into two main types:

- **Mono-lingual or cross-lingual transfer** uses a model trained on a single source language to initialize a new model in a target language that is different from the source language, for example transferring a model trained on English to German.
- **Multi-lingual transfer** uses a model trained on one or more source languages to initialize a model in a target language, which can be one of the source languages or a different one, for instance transferring a model trained on English and German to either German or French.

## WECHSEL

The paper [Min22W] introduces WECHSEL, a method to transfer trained models to new languages. It requires as input:

- a tokenizer in the source language,
- a pre-trained language model in the source language,
- a tokenizer in the target language,
- a bilingual dictionary between the source and target languages,
- two monolingual fastText embeddings [Boj17E], one for the source and one for the target language. These can be obtained in one of two ways: downloaded as pre-trained embeddings, or trained from scratch on text in the respective language.

Figure 1: Summary of the WECHSEL method, with inputs in blue, intermediate results in orange and outputs in green.

## Algorithm

Figure 1 is a diagram giving a high-level summary of the method and its components. The method proceeds as follows (a code sketch follows the list):

1. Use the tokenizers to split the words in the bilingual dictionary into subwords (tokens).
2. Use the fastText embeddings to compute subword embeddings as the sum of the embeddings of their n-grams,
   $$\mathbf{u}_x = \sum_{g \in \mathbb{G}(x)} \mathbf{w}_g$$
   where $\mathbb{G}(x)$ is the set of n-grams occurring in the subword $x$ and $\mathbf{w}_g$ is the embedding of the n-gram $g$. The embeddings of subwords in which no known n-gram occurs are set to zero.
3. Align the subword embeddings using the bilingual dictionary and the Orthogonal Procrustes method [Sch66G, Art16L],
   $$\underset{W}{\text{argmin}} \; \lVert \mathbf{U}^t W - \mathbf{U}^s \rVert_F^2$$
   where $\lVert \cdot \rVert_F$ is the Frobenius norm and $W$ is required to be an orthogonal matrix ($W^T W = I$).
4. Compute the cosine similarity $s_{x,y}$ between every subword $x$ in the target language and every subword $y$ in the source language,
   $$s_{x,y} = \frac{\mathbf{u}_x^{t} \cdot \mathbf{u}_y^{s}}{\lVert \mathbf{u}_x^t \rVert \, \lVert \mathbf{u}_y^s \rVert}.$$
5. Initialize the embeddings of the target model as a weighted average of the embeddings of the source model, using the cosine similarities between aligned subword embeddings as weights,
   $$\mathbf{e}_x^t = \frac{\sum_{y \in \mathcal{J}_x} \exp(s_{x,y} / \tau) \cdot \mathbf{e}_y^s}{\sum_{y^{\prime} \in \mathcal{J}_x} \exp(s_{x,y^{\prime}} / \tau)}$$
   where $\mathcal{J}_x$ is the set of the $k$ nearest neighbouring subwords of $x$ in the source language and $\tau$ is a temperature. Target subwords whose fastText embedding was set to zero are instead initialized from a normal distribution $\mathcal{N}(\mathbb{E}[\mathbf{E}^s], \mathrm{Var}[\mathbf{E}^s])$.
6. Copy the non-embedding parameters of the source model to the target model.
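To make the procedure concrete, here is a minimal NumPy/SciPy sketch of the initialization. It is not the authors' released implementation: the names `tok_src`/`tok_tgt` (Hugging Face tokenizers), `ft_src`/`ft_tgt` (loaded fastText models), `emb_src` (the source model's input embedding matrix) and `dictionary` (a list of `(source word, target word)` pairs) are assumptions for illustration, and the way dictionary pairs are turned into matched matrices for Procrustes is a simplification of the paper's exact pairing scheme.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes


def subword_fasttext_embeddings(tokenizer, ft_model):
    """fastText vector for every subword in the tokenizer's vocabulary,
    ordered by token id. fastText sums character n-gram vectors internally,
    so subwords without any known n-gram come out as zero vectors."""
    vocab = tokenizer.get_vocab()                      # token -> id
    tokens = sorted(vocab, key=vocab.get)
    return np.stack([ft_model.get_word_vector(t) for t in tokens])


def dictionary_matrix(words, tokenizer, ft_model):
    """Embed each dictionary word as the mean fastText vector of its subwords
    (a simplification of the paper's pairing scheme)."""
    rows = []
    for w in words:
        subs = tokenizer.tokenize(w) or [w]
        rows.append(np.mean([ft_model.get_word_vector(s) for s in subs], axis=0))
    return np.stack(rows)


def wechsel_init(emb_src, tok_src, tok_tgt, ft_src, ft_tgt, dictionary,
                 k=10, tau=0.1, seed=0):
    rng = np.random.default_rng(seed)

    # Steps 1-2: fastText embeddings for every source / target subword.
    u_src = subword_fasttext_embeddings(tok_src, ft_src)   # (|V_s|, d_ft)
    u_tgt = subword_fasttext_embeddings(tok_tgt, ft_tgt)   # (|V_t|, d_ft)

    # Step 3: Orthogonal Procrustes on the dictionary pairs, rotating the
    # target fastText space onto the source fastText space.
    a = dictionary_matrix([t for _, t in dictionary], tok_tgt, ft_tgt)
    b = dictionary_matrix([s for s, _ in dictionary], tok_src, ft_src)
    w_rot, _ = orthogonal_procrustes(a, b)
    u_tgt = u_tgt @ w_rot

    # Step 4: cosine similarity of every target subword to every source subword.
    def normed(m):
        return m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-8)
    sim = normed(u_tgt) @ normed(u_src).T                   # (|V_t|, |V_s|)

    # Step 5: softmax-weighted average of the source *model* embeddings over
    # the k most similar source subwords of each target subword.
    emb_tgt = np.empty((sim.shape[0], emb_src.shape[1]))
    for x in range(sim.shape[0]):
        nn = np.argpartition(-sim[x], k)[:k]
        weights = np.exp(sim[x, nn] / tau)
        emb_tgt[x] = weights @ emb_src[nn] / weights.sum()

    # Target subwords with an all-zero fastText vector get a random draw
    # matching the source embedding statistics.
    zero_rows = ~np.any(u_tgt, axis=1)
    emb_tgt[zero_rows] = rng.normal(emb_src.mean(), emb_src.std(),
                                    size=(zero_rows.sum(), emb_src.shape[1]))
    return emb_tgt
```

The hyperparameters `k` and `tau` are illustrative defaults rather than the paper's exact values, and the random initialization uses the standard deviation of the source embeddings as the scale of the normal distribution.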
## Experiments

For their experiments, the authors used RoBERTa and GPT-2 models trained on English, transferring them to four medium-resource languages (French, German, Chinese and Swahili) and four low-resource languages (Sundanese, Scottish Gaelic, Uyghur and Malagasy). They employed automatically generated bilingual dictionaries from MUSE [Con18W] for French, German and Chinese, and a bilingual dictionary from FreeDict for Swahili. For the low-resource languages, they used bilingual dictionaries scraped from Wiktionary, stored in their repository.

The authors compared their method to two other approaches:

- **FullRand** randomly initializes the target model and trains it from scratch.
- **TransInner** randomly initializes the embedding parameters and copies the non-embedding parameters. It trains only the embedding parameters for a fixed number of steps while freezing the remaining parameters, then trains the entire model.

For all methods and models, they trained on 65.5 billion tokens, significantly fewer than the models compared against (e.g., CamemBERT on 419.4 billion tokens, GBERT-Base on 255.6 billion tokens, and BERT-Base-Chinese on 131.1 billion tokens). All models were trained for 250k steps with consistent hyperparameters across languages. WECHSEL-RoBERTa was evaluated by fine-tuning on XNLI for NLI performance and on the balanced train-dev-test split of WikiANN for NER performance. WECHSEL-GPT-2 was evaluated by perplexity (PPL) on a held-out set from the training corpus (see the sketch after the result tables below).

## Results

From Figure 2, we observe that WECHSEL significantly improves cross-lingual parameter transfer and outperforms models of comparable size trained from scratch, with up to 64x less training effort. As shown in Table 1 and Table 2, models initialized with WECHSEL generally outperform models trained from scratch and those initialized with TransInner across all languages examined. Notably, close relatedness of the source and target languages is not a prerequisite for effective transfer. For instance, on NLI, WECHSEL improves absolute accuracy by 7.15%, 6.31%, 6.94% and 4.71% over models trained from scratch for French, German, Chinese and Swahili, respectively.

Table 1: Results from fine-tuning RoBERTa models. Accuracy is reported for NLI on XNLI and micro F1 score for NER on WikiANN. Results are averaged over 3 runs. Scores are reported before training (Score@0), after 10% of the total number of steps (Score@25k) and after training (Score@250k). Results from fine-tuning prior monolingual models and XLM-R (Score (more training)) are also reported. For each language, the best results in every column are underlined. The overall best results, including the comparison with existing monolingual/multilingual models of comparable size, are shown in bold.

Figure 2: Test scores over training steps from fine-tuning RoBERTa models on NLI (using XNLI) and NER (using WikiANN), and perplexity of GPT-2 models on the held-out set over training steps. The evaluation is done every 12.5k steps.

Table 2: Results of training GPT-2 models. The perplexity before training (PPL@0), after 10% of the total number of steps (PPL@25k) and after training (PPL@250k) is reported.
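The held-out perplexity reported for the GPT-2 models can be reproduced in spirit with a short script. The sketch below is a generic recipe, not the authors' evaluation code; `model_dir` (a causal LM checkpoint directory) and `heldout_texts` (an iterable of held-out strings) are placeholder names, and it assumes the Hugging Face transformers library.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def heldout_perplexity(model_dir, heldout_texts, max_length=512, device="cpu"):
    """Exponential of the average per-token negative log-likelihood
    over a held-out corpus."""
    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir).to(device).eval()

    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in heldout_texts:
            enc = tok(text, return_tensors="pt",
                      truncation=True, max_length=max_length).to(device)
            n_pred = enc["input_ids"].size(1) - 1   # tokens the model predicts
            if n_pred <= 0:
                continue                            # skip single-token lines
            # passing labels=input_ids makes the model shift them internally
            # and return the mean cross-entropy over the predicted tokens
            out = model(**enc, labels=enc["input_ids"])
            total_nll += out.loss.item() * n_pred
            total_tokens += n_pred
    return math.exp(total_nll / total_tokens)
```

Perplexity is the exponential of the average per-token negative log-likelihood, so accumulating token-weighted losses before exponentiating keeps documents of different lengths comparable.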
Despite its increased efficiency compared to training from scratch, the WECHSEL method has several inherent weaknesses. These include its complexity and a strong reliance on bilingual dictionaries for aligning the embedding vectors. This reliance becomes a significant issue when the dictionaries contain mistakes, since their quality directly impacts the alignment and, consequently, the model initialization. In the experiments, the authors used automatically extracted dictionaries for three of the languages, which introduced omissions and errors. For instance, in the French-English dictionary, several mistakes can be identified:

- **Incorrect word mappings**: some entries map a word to a wrong or non-existent translation. For example, on line 80, feasible is mapped to viable, whereas it should be mapped to faisable; on line 48641, friend is mapped to padna, which, according to online French dictionaries, is not a French word.
- **Inconsistent order**: the order of the languages is inconsistent. On most lines English precedes French (e.g., line 94), while on some lines French precedes English (e.g., line 81).

These inaccuracies in the dictionaries can significantly affect the performance and reliability of the resulting language models.

Additionally, a crucial gap in the research is the lack of investigation into the method's effect on social biases within the models. Although the authors acknowledge the potential risk of such biases, they neither explore their extent nor propose solutions to mitigate them. Addressing these concerns is essential for improving the reliability and the ethical considerations of using WECHSEL for transferring language models.
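As an illustration of the kind of check that could catch such problems early, the snippet below scans a MUSE-style dictionary file (one whitespace-separated word pair per line, English expected first) and flags malformed or possibly swapped lines. It is a hypothetical helper, not part of WECHSEL; `english_vocab` is an assumed set of known English words.

```python
def check_bilingual_dictionary(path, english_vocab):
    """Flag dictionary lines that are malformed or whose language order
    looks swapped (the non-English word appearing in the first column)."""
    suspicious = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            parts = line.split()
            if len(parts) != 2:
                suspicious.append((lineno, line.strip(), "not a word pair"))
                continue
            left, right = parts
            # expected order: English word first, target-language word second
            if left.lower() not in english_vocab and right.lower() in english_vocab:
                suspicious.append((lineno, line.strip(), "languages possibly swapped"))
    return suspicious
```

A heuristic like this will not detect outright wrong translations such as feasible mapped to viable, but it cheaply surfaces swapped or malformed lines before they feed into the alignment step.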