The CLP-Transfer method introduces a novel approach for cross-lingual language transfer by leveraging token overlap and a small pre-trained model with the desired tokenizer, simplifying the transfer process without the need for fastText embeddings or bilingual dictionaries. Despite its practical advantages, the method’s performance on downstream tasks is limited, highlighting areas for future research and evaluation.

In [Min22W], we explored WECHSEL, a cross-lingual transfer method that leverages fastText embeddings and bilingual dictionaries to initialize a language model in a target language from trained models in one or more source languages. In this post, we examine Cross-Lingual and Progressive Transfer (CLP-Transfer), introduced by [Ost23E]. This method instead uses a smaller language model in the target language with the desired tokenizer for cross-lingual transfer.

## Key Assumptions of CLP-Transfer

### 1. Vocabulary Overlap

Given tokenizers with vocabularies $V_s$ and $V_t$ for the source and target languages, respectively, CLP-Transfer assumes a significant token overlap, i.e., $|V_s \cap V_t| \gg 0$. Table 1 shows the result of an empirical experiment demonstrating that this assumption holds across various tokenizers and languages.

The embeddings of overlapping tokens are copied from the source model to the target model:

$$\mathbf{u}_t := \mathbf{u}_s, \quad \text{if} \quad v \in V_s \cap V_t,$$

where $\mathbf{u}_s$ and $\mathbf{u}_t$ denote the source and target embeddings of token $v$. These overlapping tokens serve as anchors for computing the embeddings of non-overlapping tokens.

Table 1: Number of overlapping vocabulary tokens between different tokenizers, normalized by the source vocabulary size. The tokenizers are English GPT-2, Arabic GPT-2, Finnish GPT-2, multilingual BLOOM, multilingual XGLM, and the authors’ German tokenizer.
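The overlap check and the copying of the anchor embeddings can be expressed in a few lines. The following is a minimal sketch using Hugging Face `transformers`, not the authors’ code; the checkpoint name and the tokenizer path are placeholders for whatever source model and target-language tokenizer you work with.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

source_name = "gpt2"                                 # illustrative English source model
target_tokenizer_path = "path/to/target-tokenizer"   # placeholder: your target-language tokenizer

source_tok = AutoTokenizer.from_pretrained(source_name)
target_tok = AutoTokenizer.from_pretrained(target_tokenizer_path)

source_vocab = source_tok.get_vocab()   # token string -> id
target_vocab = target_tok.get_vocab()
overlap = set(source_vocab) & set(target_vocab)
print(f"overlap: {len(overlap)} tokens "
      f"({len(overlap) / len(source_vocab):.1%} of the source vocabulary)")

# Copy the embeddings of overlapping tokens from the source model; the remaining
# rows stay randomly initialized for now and are filled in by the CLP-Transfer
# step sketched after the next subsection.
source_model = AutoModelForCausalLM.from_pretrained(source_name)
source_emb = source_model.get_input_embeddings().weight.detach()

target_emb = torch.empty(len(target_vocab), source_emb.shape[1])
torch.nn.init.normal_(target_emb, std=0.02)

for token in overlap:
    target_emb[target_vocab[token]] = source_emb[source_vocab[token]]
```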
### 2. Comparable Token Embeddings

Token embeddings are assumed to be comparable across models of different sizes that share the same tokenizer, i.e., they can be substituted for one another in computations. Figure 1 shows the results of an experiment comparing token embeddings across various sizes of English OPT models. For each token $v$, the set $N_v$ of its $k = 10$ nearest neighbors was computed, and the overlap of these neighborhoods across model sizes, $N_v^{(\text{large})} \cap N_v^{(\text{small})}$, was measured. This measure is normalized and computed for all tokens in the vocabulary.

Figure 1: Similarity of token embeddings of different OPT model sizes, measured as the overlap of the $k = 10$ nearest neighbors for all tokens in the vocabulary.

This finding justifies using a small helper model: the cosine similarities between its token embeddings serve as a proxy for those of the larger target model. The embedding of each non-overlapping token is initialized as a weighted combination of the source embeddings of the anchor tokens:

$$\mathbf{u}^{\text{(large)}}_t :=
\sum_{\widehat{v} \in V_s \cap V_t}
\delta(\mathbf{u}_t, \widehat{\mathbf{u}}_t) \, \widehat{\mathbf{u}}^{\text{(large)}}_s,
\quad \text{if} \quad v \notin V_s \cap V_t,$$

where the weight function $\delta$ transfers the spatial structure of the small model’s embedding space to the large model. It is the cosine similarity computed with the small model’s embeddings, normalized over all anchor tokens:

$$\delta(\mathbf{u}_t, \widehat{\mathbf{u}}_t) :=
\frac{\cos\left(\mathbf{u}^{\text{(small)}}_t, \widehat{\mathbf{u}}^{\text{(small)}}_t\right)}
{\sum_{\widehat{v}^{\prime} \in V_s \cap V_t}
\cos\left(\mathbf{u}^{\text{(small)}}_t, \widehat{\mathbf{u}}^{\prime\text{(small)}}_t\right)}$$

The remaining model parameters are copied over as in WECHSEL [Min22W].
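A minimal sketch of this initialization, continuing the placeholder setup from the previous sketch and assuming a small helper model that uses the target tokenizer (again not the authors’ implementation; their official code is linked at the end of the post):

```python
# Minimal sketch of the CLP-Transfer embedding initialization. `source_emb` is the
# embedding matrix of the large source model, `helper_emb` that of the small
# target-language helper model (which uses the target tokenizer); vocabularies map
# token strings to ids.
import torch
import torch.nn.functional as F


def clp_init(source_emb: torch.Tensor,   # (|V_s|, d) large source model embeddings
             helper_emb: torch.Tensor,   # (|V_t|, d_small) small helper model embeddings
             source_vocab: dict,         # token -> id in V_s
             target_vocab: dict,         # token -> id in V_t
             ) -> torch.Tensor:
    overlap = sorted(set(source_vocab) & set(target_vocab))
    anchor_src = torch.tensor([source_vocab[t] for t in overlap])
    anchor_tgt = torch.tensor([target_vocab[t] for t in overlap])

    target_emb = torch.empty(len(target_vocab), source_emb.shape[1])

    # Overlapping tokens (the anchors): copy the source embeddings directly.
    target_emb[anchor_tgt] = source_emb[anchor_src]

    # Non-overlapping tokens: delta-weighted sum of the anchors' *source* embeddings,
    # with cosine similarities measured in the *helper* model's embedding space.
    missing = torch.tensor([i for t, i in target_vocab.items() if t not in source_vocab])
    sims = F.normalize(helper_emb[missing], dim=-1) @ F.normalize(helper_emb[anchor_tgt], dim=-1).T
    weights = sims / sims.sum(dim=-1, keepdim=True)   # delta(u_t, û_t), rows sum to 1
    target_emb[missing] = weights @ source_emb[anchor_src]
    return target_emb
```

Since the helper embeddings only supply similarity weights, the helper model’s hidden size does not need to match the source model’s. A practical implementation may also want to restrict each weighted sum to the most similar anchors or handle negative cosine similarities explicitly.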
## Experiments

### Models

For their experiments, the authors used the following models:

- GPT-2:
  - Source: English GPT-2-XL model with 1.5B parameters.
  - Helper (small): GPT-2-base model with 117M parameters, initialized with WECHSEL and then further trained.
  - Target: GPT-2 models with parameter sizes from 117M to 1.5B.
- BLOOM:
  - Source: Multilingual BLOOM model with 7.1B parameters.
  - Helper (small): German BLOOM model with 1.5B parameters.
  - Target: BLOOM models with parameter sizes from 1.5B to 6.4B.

### Datasets

- GPT-2:
  - Data: Web-crawled data from the German subset of OSCAR v2019, similar to WECHSEL [Min22W].
  - Training set: First 4GB of the data, approximately 30.8B tokens.
  - Validation set: Next 0.4GB of the data.
- BLOOM:
  - Data: Web-crawled content from the German subset of OSCAR v22.01 (excluding headers, footers, noisy, and adult content) and the GC4 corpus (including only head and middle parts).
  - Deduplication: Removed content duplicated from CommonCrawl.
  - Additional data: German court decisions from Open Legal Data.
  - Training set: Approximately 50.4B tokens.

### Evaluation

The trained models were evaluated by:

- Perplexity on the GPT-2 validation dataset.
- Zero-shot performance on German downstream tasks:
  - Sentiment analysis from GermEval 2017.
  - Hate speech classification from GermEval 2018.
  - News topic classification from GNAD10.
  - Paraphrase identification from PAWSX.
  - Natural language inference from XNLI.
  - Stance detection from X-Stance.

### Methods

The authors compared their method against three others:

- From-scratch training (referred to as FullRand in WECHSEL): randomly initializes the target model and trains it from scratch.
- WECHSEL.
- Random: uniformly chooses a class from the downstream classification tasks without using a model.

Additionally, the authors compared their monolingual German models against multilingual models trained on German data, namely XGLM and mGPT.

## Results

Table 2: Evaluation results on German downstream tasks in a zero-shot setting. The average score excludes the OSCAR validation perplexity (PPL). Smaller models are on par with or worse than the random baseline. The BLOOM-CLP 6.4B model achieves the best results on average but not on all benchmarks.

### Perplexity Evaluation

As shown in Table 2, all models initialized with CLP-Transfer achieve the best evaluation perplexity on the OSCAR dataset. This is further illustrated in Figure 2 and Figure 3.

Figure 2: GPT-2-XL German (1.5B parameters). Validation perplexity with respect to the number of tokens, comparing from-scratch training, WECHSEL, and CLP-Transfer. CLP-Transfer achieves the same perplexity as from-scratch training with fewer tokens.

GPT-2-XL: Figure 2 shows that the GPT-2-XL model initialized with CLP-Transfer reaches the final perplexity of from-scratch training after training on only ~50% of the total number of tokens (dashed line). The difference between CLP-Transfer and WECHSEL is noticeable, favoring CLP-Transfer.

BLOOM: Figure 3 shows that the BLOOM-6B-German model initialized with CLP-Transfer reaches the final perplexity of from-scratch training after training on only ~20% of the total number of tokens (dashed line).

Figure 3: BLOOM-6B-German. Validation perplexity with respect to the number of tokens, comparing from-scratch training and CLP-Transfer. CLP-Transfer achieves the same perplexity as from-scratch training with fewer tokens.
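As a side note, the validation perplexity used in these comparisons can be computed with a few lines of standard tooling. The sketch below uses Hugging Face `transformers` with a placeholder checkpoint and placeholder texts; it is not the authors’ evaluation script.

```python
# Minimal sketch: token-level validation perplexity of a causal LM.
# Checkpoint and texts are placeholders, not the models/data from the paper.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

validation_texts = ["Ein Beispielsatz.", "Noch ein Satz für die Validierung."]  # placeholder data

total_nll, total_predictions = 0.0, 0
with torch.no_grad():
    for text in validation_texts:
        enc = tokenizer(text, return_tensors="pt")
        # With labels == input_ids, the model returns the mean cross-entropy over
        # the shifted next-token predictions (sequence length - 1 of them).
        out = model(**enc, labels=enc["input_ids"])
        n_pred = enc["input_ids"].shape[1] - 1
        total_nll += out.loss.item() * n_pred
        total_predictions += n_pred

print(f"perplexity: {math.exp(total_nll / total_predictions):.2f}")
```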
### Downstream Task Performance

The results on downstream tasks are generally disappointing, as the authors themselves note. The models initialized with CLP-Transfer do not perform significantly better than the random baseline and perform even worse on some of the tasks. The GPT-2-XL-CLP model does not achieve the best results on any dataset and performs significantly worse on the hate speech classification task from GermEval 2018.

The seemingly strange discrepancy between the good perplexity scores on the validation set and the poor results on the downstream tasks can be explained by several factors:

- Dataset splits: Both the training and validation sets come from the same dataset (OSCAR), with splits based on the order of the data rather than its content. This can lead to good perplexity scores that do not necessarily translate into improved performance on downstream tasks.
- Perplexity as a proxy: Perplexity is merely a proxy measure and does not necessarily correlate with performance on specific tasks.

Additionally, the authors mention two other points that contribute to the poor performance on downstream tasks:

- Lack of fine-tuning and prompt engineering: The models were neither fine-tuned nor subjected to prompt engineering, which is crucial for achieving good performance, especially given the model sizes and the number of training tokens used.
- Dataset quality: The quality of the evaluation datasets is variable. Some, such as PAWSX, contain poorly translated samples, leading to less meaningful results.

Although CLP-Transfer does not necessarily achieve better results than WECHSEL on downstream tasks, it is significantly simpler: it requires neither training fastText embeddings nor aligning them with a bilingual dictionary, which makes it more straightforward to implement.

On the other hand, CLP-Transfer relies on token overlap and on the existence of a small pre-trained model that uses the desired tokenizer. This requirement can be a limitation in cases where:

- A custom tokenizer is used for which no pre-trained models are available.
- The desired tokenizer has minimal overlap with existing ones (e.g., between languages that use different alphabets, such as English and Arabic).

The authors evaluated their method only on decoder-only language models. It would be more comprehensive to include encoder-only and encoder-decoder language models in future evaluations to better understand the method’s versatility. Moreover, all tokenizers in the study were trained with Byte-Pair Encoding (BPE); it would be worthwhile to investigate how well CLP-Transfer works with tokenizers trained using other methods such as Unigram or WordPiece.

The authors have made the pre-trained model checkpoints and the source code publicly available on the Hugging Face Hub (BLOOM-CLP 6.4B, GPT2-XL-CLP 1.5B) and on GitHub, respectively. This facilitates reproducibility and further research by the community.

A web-based demo of the German BLOOM-CLP model with 6.4B parameters was initially provided by the authors, but it is no longer available at the time of this publication.
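For reference, the released checkpoints can be loaded with the standard `transformers` API. The model identifier below is a placeholder; the exact names are given on the linked model cards.

```python
# Minimal sketch: load a released checkpoint from the Hugging Face Hub and generate text.
# The identifier is a placeholder -- take the exact name from the linked model cards.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "<bloom-clp-german-checkpoint>"  # placeholder, e.g. the BLOOM-CLP 6.4B model card

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Die Hauptstadt von Deutschland ist", max_new_tokens=20)[0]["generated_text"])
```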