The CLP-Transfer method introduces a novel approach for cross-lingual language transfer by leveraging token overlap and a small pre-trained model with the desired tokenizer, simplifying the transfer process without the need for fastText embeddings or bilingual dictionaries. Despite its practical advantages, the method’s performance on downstream tasks is limited, highlighting areas for future research and evaluation.

In [Min22W], we explored WECHSEL, a cross-lingual transfer method that leverages fastText embeddings and bilingual dictionaries to initialize a language model in a target language from trained models in one or more source languages. In this post, we examine Cross-Lingual and Progressive Transfer (CLP-Transfer), introduced by [Ost23E]. This method instead uses a smaller language model in the target language with the desired tokenizer for cross-lingual transfer.

## Key Assumptions of CLP-Transfer

### 1. Vocabulary Overlap

Given tokenizers with vocabularies $V_s$ and $V_t$ for the source and target languages, respectively, CLP-Transfer assumes a significant token overlap, i.e., $|V_s \cap V_t| \gg 0$. Table 1 shows the result of an empirical experiment demonstrating that this assumption holds across various tokenizers and languages.

The embeddings of overlapping tokens are copied from the source model to the target model:

$$\mathbf{u}_t := \mathbf{u}_s, \quad \text{if} \quad v \in V_s \cap V_t,$$

where $\mathbf{u}_s$ and $\mathbf{u}_t$ denote the source and target embeddings of token $v$. These overlapping tokens serve as anchors for computing the embeddings of non-overlapping tokens.

Table 1: Number of overlapping vocabulary tokens between different tokenizers, normalized by the source vocabulary size. The tokenizers are English GPT-2, Arabic GPT-2, Finnish GPT-2, multilingual BLOOM, multilingual XGLM, and the authors’ German tokenizer.
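The overlap check and the copying of the anchor embeddings can be expressed in a few lines. The following is a minimal sketch using Hugging Face `transformers`, not the authors’ code; the checkpoint name and the tokenizer path are placeholders for whatever source model and target-language tokenizer you work with.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

source_name = "gpt2"                                 # illustrative English source model
target_tokenizer_path = "path/to/target-tokenizer"   # placeholder: your target-language tokenizer

source_tok = AutoTokenizer.from_pretrained(source_name)
target_tok = AutoTokenizer.from_pretrained(target_tokenizer_path)

source_vocab = source_tok.get_vocab()   # token string -> id
target_vocab = target_tok.get_vocab()
overlap = set(source_vocab) & set(target_vocab)
print(f"overlap: {len(overlap)} tokens "
      f"({len(overlap) / len(source_vocab):.1%} of the source vocabulary)")

# Copy the embeddings of overlapping tokens from the source model; the remaining
# rows stay randomly initialized for now and are filled in by the CLP-Transfer
# step sketched after the next subsection.
source_model = AutoModelForCausalLM.from_pretrained(source_name)
source_emb = source_model.get_input_embeddings().weight.detach()

target_emb = torch.empty(len(target_vocab), source_emb.shape[1])
torch.nn.init.normal_(target_emb, std=0.02)

for token in overlap:
    target_emb[target_vocab[token]] = source_emb[source_vocab[token]]
```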
### 2. Comparable Token Embeddings

Token embeddings are assumed to be comparable across models of different sizes that share the same tokenizer, i.e., they can be substituted for one another in computations. Figure 1 shows the results of an experiment comparing token embeddings across various sizes of English OPT models. For each token $v$, the set $N_v$ of its $k = 10$ nearest neighbors was computed, and the overlap of these neighborhoods across model sizes, $N_v^{(\text{large})} \cap N_v^{(\text{small})}$, was measured. This measure is normalized and computed for all tokens in the vocabulary.

Figure 1: Similarity of token embeddings of different OPT model sizes, measured as the overlap of the $k = 10$ nearest neighbors for all tokens in the vocabulary.

This finding justifies using a small helper model: the cosine similarities between its token embeddings serve as a proxy for those of the larger target model. The embedding of each non-overlapping token is initialized as a weighted combination of the source embeddings of the anchor tokens:

$$\mathbf{u}^{\text{(large)}}_t :=
\sum_{\widehat{v} \in V_s \cap V_t}
\delta(\mathbf{u}_t, \widehat{\mathbf{u}}_t) \, \widehat{\mathbf{u}}^{\text{(large)}}_s,
\quad \text{if} \quad v \notin V_s \cap V_t,$$

where the weight function $\delta$ transfers the spatial structure of the small model’s embedding space to the large model. It is the cosine similarity computed with the small model’s embeddings, normalized over all anchor tokens:

$$\delta(\mathbf{u}_t, \widehat{\mathbf{u}}_t) :=
\frac{\cos\left(\mathbf{u}^{\text{(small)}}_t, \widehat{\mathbf{u}}^{\text{(small)}}_t\right)}
{\sum_{\widehat{v}^{\prime} \in V_s \cap V_t}
\cos\left(\mathbf{u}^{\text{(small)}}_t, \widehat{\mathbf{u}}^{\prime\text{(small)}}_t\right)}$$

The remaining model parameters are copied over as in WECHSEL [Min22W].
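A minimal sketch of this initialization, continuing the placeholder setup from the previous sketch and assuming a small helper model that uses the target tokenizer (again not the authors’ implementation; their official code is linked at the end of the post):

```python
# Minimal sketch of the CLP-Transfer embedding initialization. `source_emb` is the
# embedding matrix of the large source model, `helper_emb` that of the small
# target-language helper model (which uses the target tokenizer); vocabularies map
# token strings to ids.
import torch
import torch.nn.functional as F


def clp_init(source_emb: torch.Tensor,   # (|V_s|, d) large source model embeddings
             helper_emb: torch.Tensor,   # (|V_t|, d_small) small helper model embeddings
             source_vocab: dict,         # token -> id in V_s
             target_vocab: dict,         # token -> id in V_t
             ) -> torch.Tensor:
    overlap = sorted(set(source_vocab) & set(target_vocab))
    anchor_src = torch.tensor([source_vocab[t] for t in overlap])
    anchor_tgt = torch.tensor([target_vocab[t] for t in overlap])

    target_emb = torch.empty(len(target_vocab), source_emb.shape[1])

    # Overlapping tokens (the anchors): copy the source embeddings directly.
    target_emb[anchor_tgt] = source_emb[anchor_src]

    # Non-overlapping tokens: delta-weighted sum of the anchors' *source* embeddings,
    # with cosine similarities measured in the *helper* model's embedding space.
    missing = torch.tensor([i for t, i in target_vocab.items() if t not in source_vocab])
    sims = F.normalize(helper_emb[missing], dim=-1) @ F.normalize(helper_emb[anchor_tgt], dim=-1).T
    weights = sims / sims.sum(dim=-1, keepdim=True)   # delta(u_t, û_t), rows sum to 1
    target_emb[missing] = weights @ source_emb[anchor_src]
    return target_emb
```

Since the helper embeddings only supply similarity weights, the helper model’s hidden size does not need to match the source model’s. A practical implementation may also want to restrict each weighted sum to the most similar anchors or handle negative cosine similarities explicitly.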
## Experiments

### Models

For their experiments, the authors used the following models:

- GPT-2:
  - Source: English GPT-2-XL model with 1.5B parameters.
  - Helper (small): GPT-2-base model with 117M parameters, initialized with WECHSEL and then further trained.
  - Target: GPT-2 models with parameter sizes from 117M to 1.5B.
- BLOOM:
  - Source: Multilingual BLOOM model with 7.1B parameters.
  - Helper (small): German BLOOM model with 1.5B parameters.
  - Target: BLOOM models with parameter sizes from 1.5B to 6.4B.

### Datasets

- GPT-2:
  - Data: Web-crawled data from the German subset of OSCAR v2019, similar to WECHSEL [Min22W].
  - Training set: First 4GB of the data, approximately 30.8B tokens.
  - Validation set: Next 0.4GB of the data.
- BLOOM:
  - Data: Web-crawled content from the German subset of OSCAR v22.01 (excluding headers, footers, noisy, and adult content) and the GC4 corpus (including only head and middle parts).
  - Deduplication: Removed content duplicated from CommonCrawl.
  - Additional data: German court decisions from Open Legal Data.
  - Training set: Approximately 50.4B tokens.

### Evaluation

The trained models were evaluated by:

- Perplexity on the GPT-2 validation dataset.
- Zero-shot performance on German downstream tasks:
  - Sentiment analysis from GermEval 2017.
  - Hate speech classification from GermEval 2018.
  - News topic classification from GNAD10.
  - Paraphrase identification from PAWSX.
  - Natural language inference from XNLI.
  - Stance detection from X-Stance.

### Methods

The authors compared their method against three others:

- From-scratch training (referred to as FullRand in WECHSEL): randomly initializes the target model and trains it from scratch.
- WECHSEL.
- Random: uniformly chooses a class from the downstream classification tasks without using a model.

Additionally, the authors compared their monolingual German models against multilingual models trained on German data, namely XGLM and mGPT.

## Results

Table 2: Evaluation results on German downstream tasks in a zero-shot setting. The average score excludes the OSCAR validation perplexity (PPL). Smaller models are on par with or worse than the random baseline. The BLOOM-CLP 6.4B model achieves the best results on average but not on all benchmarks.

### Perplexity Evaluation

As shown in Table 2, all models initialized with CLP-Transfer achieve the best evaluation perplexity on the OSCAR dataset. This is further illustrated in Figure 2 and Figure 3.

Figure 2: GPT-2-XL German (1.5B parameters). Validation perplexity with respect to the number of tokens, comparing from-scratch training, WECHSEL, and CLP-Transfer. CLP-Transfer achieves the same perplexity as from-scratch training with fewer tokens.

GPT-2-XL: Figure 2 shows that the GPT-2-XL model initialized with CLP-Transfer reaches the final perplexity of from-scratch training after training on only ~50% of the total number of tokens (dashed line). The difference between CLP-Transfer and WECHSEL is noticeable, favoring CLP-Transfer.

BLOOM: Figure 3 shows that the BLOOM-6B-German model initialized with CLP-Transfer reaches the final perplexity of from-scratch training after training on only ~20% of the total number of tokens (dashed line).

Figure 3: BLOOM-6B-German. Validation perplexity with respect to the number of tokens, comparing from-scratch training and CLP-Transfer. CLP-Transfer achieves the same perplexity as from-scratch training with fewer tokens.
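As a side note, the validation perplexity used in these comparisons can be computed with a few lines of standard tooling. The sketch below uses Hugging Face `transformers` with a placeholder checkpoint and placeholder texts; it is not the authors’ evaluation script.

```python
# Minimal sketch: token-level validation perplexity of a causal LM.
# Checkpoint and texts are placeholders, not the models/data from the paper.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

validation_texts = ["Ein Beispielsatz.", "Noch ein Satz für die Validierung."]  # placeholder data

total_nll, total_predictions = 0.0, 0
with torch.no_grad():
    for text in validation_texts:
        enc = tokenizer(text, return_tensors="pt")
        # With labels == input_ids, the model returns the mean cross-entropy over
        # the shifted next-token predictions (sequence length - 1 of them).
        out = model(**enc, labels=enc["input_ids"])
        n_pred = enc["input_ids"].shape[1] - 1
        total_nll += out.loss.item() * n_pred
        total_predictions += n_pred

print(f"perplexity: {math.exp(total_nll / total_predictions):.2f}")
```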
### Downstream Task Performance

The results on downstream tasks are generally disappointing, as the authors themselves note. The models initialized with CLP-Transfer do not perform significantly better than the random baseline and perform even worse on some of the tasks. The GPT-2-XL-CLP model does not achieve the best results on any dataset and performs significantly worse on the hate speech classification task from GermEval 2018.

The seemingly strange discrepancy between the good perplexity scores on the validation set and the poor results on the downstream tasks can be explained by several factors:

- Dataset splits: Both the training and validation sets come from the same dataset (OSCAR), with splits based on the order of the data rather than its content. This can lead to good perplexity scores that do not necessarily translate into improved performance on downstream tasks.
- Perplexity as a proxy: Perplexity is merely a proxy measure and does not necessarily correlate with performance on specific tasks.

Additionally, the authors mention two other points that contribute to the poor performance on downstream tasks:

- Lack of fine-tuning and prompt engineering: The models were neither fine-tuned nor subjected to prompt engineering, which is crucial for achieving good performance, especially given the model sizes and the number of training tokens used.
- Dataset quality: The quality of the evaluation datasets is variable. Some, such as PAWSX, contain poorly translated samples, leading to less meaningful results.

Although CLP-Transfer does not necessarily achieve better results than WECHSEL on downstream tasks, it is significantly simpler: it requires neither training fastText embeddings nor aligning them with a bilingual dictionary, which makes it more straightforward to implement.

On the other hand, CLP-Transfer relies on token overlap and on the existence of a small pre-trained model that uses the desired tokenizer. This requirement can be a limitation in cases where:

- A custom tokenizer is used for which no pre-trained models are available.
- The desired tokenizer has minimal overlap with existing ones (e.g., between languages that use different alphabets, such as English and Arabic).

The authors evaluated their method only on decoder-only language models. It would be more comprehensive to include encoder-only and encoder-decoder language models in future evaluations to better understand the method’s versatility. Moreover, all tokenizers in the study were trained with Byte-Pair Encoding (BPE); it would be worthwhile to investigate how well CLP-Transfer works with tokenizers trained using other methods such as Unigram or WordPiece.

The authors have made the pre-trained model checkpoints and the source code publicly available on the Hugging Face Hub (BLOOM-CLP 6.4B, GPT2-XL-CLP 1.5B) and on GitHub, respectively. This facilitates reproducibility and further research by the community.

A web-based demo of the German BLOOM-CLP model with 6.4B parameters was initially provided by the authors, but it is no longer available at the time of this publication.
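For reference, the released checkpoints can be loaded with the standard `transformers` API. The model identifier below is a placeholder; the exact names are given on the linked model cards.

```python
# Minimal sketch: load a released checkpoint from the Hugging Face Hub and generate text.
# The identifier is a placeholder -- take the exact name from the linked model cards.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "<bloom-clp-german-checkpoint>"  # placeholder, e.g. the BLOOM-CLP 6.4B model card

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Die Hauptstadt von Deutschland ist", max_new_tokens=20)[0]["generated_text"])
```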