MarkTechPost@AI — August 17, 2024
aiXplain Researchers Develop Innovative Approaches for Arabic Prompt Instruction Following with LLMs

To address the scarcity of Arabic prompt datasets, researchers introduce two innovative methods for improving the performance of Arabic LLMs.

🎯 Why Arabic prompt datasets matter: Training LLMs requires large datasets of prompts paired with specific user requests and correct responses. Such datasets are scarce for Arabic, which limits the applicability of LLMs in non-English-speaking regions.

💡 Innovation one: Translate existing English prompt datasets into Arabic and subject the output to rigorous quality assessment, retaining roughly 20% of the high-quality translated prompts to form a dataset of about 20 million prompts.

✨ Innovation two: Create new prompts directly from existing Arabic NLP datasets, using a prompt-sourcing tool to generate more than 67.4 million prompts from 78 publicly available datasets.

🚀 Model fine-tuning: Fine-tune the Qwen2 7B model on the newly created prompts; the resulting model excels at handling Arabic prompts, outperforming other state-of-the-art models.

Large language models require large datasets of prompts paired with particular user requests and correct responses for training; this is what enables human-like understanding and generation of text in response to varied questions. Immense effort has gone into developing such datasets for English, while other languages, notably Arabic, have received far less attention. This imbalance in data availability severely restricts the applicability of LLMs in non-English-speaking regions and marks a critical gap in the NLP domain.

The research challenge this paper addresses is the lack of high-quality Arabic prompt datasets for training LLMs to perform well in Arabic. Without such data, LLMs cannot effectively understand and generate Arabic text, making them far less useful to Arabic-speaking users. The problem matters because Arabic is among the most widely spoken languages in the world, yet it remains under-resourced, meaning present AI technologies underserve a large fraction of humanity. The difficulty is compounded by the complexity of Arabic itself: its rich morphology and many dialects make it hard to build templates that portray the language appropriately. Creating a large, high-quality Arabic dataset is therefore essential to extend the usefulness of LLMs to a wider audience.

Current approaches to prompt-dataset generation are mostly oriented toward English and rely either on manual prompt creation or on tools that generate prompts from existing datasets. For example, PromptSource and Super-NaturalInstructions have made millions of prompts available for English-language LLMs. However, these methods have not been adapted at any wide scale to other languages, so the resources for training LLMs in non-English languages remain considerably lacking. The resulting scarcity of prompt datasets in languages like Arabic has hampered the ability of LLMs to excel in them, underlining the need for more focused dataset-creation efforts.

Researchers from aiXplain Inc. have introduced two innovative methods for creating large-scale Arabic prompt datasets to address this issue. The first method translates existing English prompt datasets into Arabic with an automatic translation system, followed by a rigorous quality-assessment process. It relies on state-of-the-art machine translation technology and quality-estimation tools to ensure the translated prompts maintain high accuracy; applying these techniques, the researchers retained approximately 20% of the translated prompts, yielding a dataset of around 20 million high-quality Arabic prompts. The second method creates new prompts directly from existing Arabic NLP datasets, using a prompt-sourcing tool to generate prompts for 78 publicly available Arabic datasets covering tasks such as question answering, summarization, and hate-speech detection. Over 67.4 million prompts were created through this process, significantly expanding the resources available for training Arabic LLMs.
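At the core of the second method is template-based prompt generation. The short sketch below illustrates the idea under stated assumptions: each dataset example is rendered through several hand-written templates, one prompt per template. The template strings and the example fields are hypothetical placeholders, not the paper's actual templates.

```python
# A PromptSource-style generator: each dataset example is rendered through
# several hand-written templates, producing one prompt per template.
# Templates and example fields are illustrative, not from the paper.

QA_TEMPLATES = [
    "Context: {context}\nQuestion: {question}\nAnswer:",
    "Answer the question using the passage below.\nPassage: {context}\nQuestion: {question}",
]

def render_prompts(example: dict, templates: list[str]) -> list[str]:
    """Apply every template to a single dataset example."""
    return [template.format(**example) for template in templates]

# One (placeholder) example from a hypothetical Arabic QA dataset:
example = {"context": "...", "question": "..."}
for prompt in render_prompts(example, QA_TEMPLATES):
    print(prompt, "\n---")
```

In PromptSource itself, templates are written in a Jinja-like syntax and stored alongside dataset metadata; the plain format strings above stand in for that machinery.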

The translation-based approach follows an end-to-end data-processing pipeline: English prompts are tokenized into sentences, which a neural machine translation model then translates into Arabic. A reference-free machine translation quality-estimation model scores each translated sentence, and a prompt is retained only if it meets a set quality threshold, keeping the final dataset highly accurate. A random sample of prompts is also verified manually to raise the dataset's quality further. The second approach generates prompts directly: PromptSource is used to create multiple templates for every task in the Arabic datasets, yielding diverse, contextually relevant prompts well suited to training effective language models.
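Here is a minimal sketch of the translate-then-filter pipeline described above, under stated assumptions: the EN→AR translator and the reference-free quality estimator shown (a Marian model and COMET-Kiwi) are plausible stand-ins, not necessarily the systems the paper used, and the threshold value is illustrative. The sentence-level tokenization step is omitted for brevity.

```python
# Translate-then-filter sketch with off-the-shelf models. COMET-Kiwi is a
# gated model on the Hugging Face Hub; the paper's actual MT system, QE
# model, and retention threshold are not specified here.
from transformers import pipeline
from comet import download_model, load_from_checkpoint

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ar")
qe_model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

QUALITY_THRESHOLD = 0.80  # illustrative cutoff; the paper's value may differ

def translate_and_filter(english_prompts: list[str]) -> list[str]:
    """Translate prompts to Arabic, score them without references, keep the best."""
    arabic = [out["translation_text"] for out in translator(english_prompts)]
    qe_input = [{"src": en, "mt": ar} for en, ar in zip(english_prompts, arabic)]
    scores = qe_model.predict(qe_input, batch_size=8, gpus=0).scores  # CPU inference
    return [ar for ar, score in zip(arabic, scores) if score >= QUALITY_THRESHOLD]
```

The key design point is that quality estimation is referenceless: no gold Arabic translation exists for these prompts, so the filter must judge quality from the source-translation pair alone.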

The researchers then used these newly created prompts to fine-tune an open 7-billion-parameter LLM, the Qwen2 7B model. Tested against several benchmarks, the fine-tuned model showed significantly improved handling of Arabic prompts, outperforming a state-of-the-art 70-billion-parameter instruction-tuned model, Llama3 70B. Specifically, the Qwen2 7B model fine-tuned on just 800,000 prompts achieved a ROUGE-L score of 0.184, while the model fine-tuned on 8 million prompts reached 0.224. These results highlight the effectiveness of the newly developed prompt datasets and show that fine-tuning on larger datasets yields better model performance.
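As a side note, computing ROUGE-L on Arabic text takes one extra step that a short sketch makes clear. The snippet below, using the Hugging Face evaluate library, is an assumption about tooling rather than the paper's actual evaluation code; the placeholder predictions and references are not from the paper.

```python
# Computing ROUGE-L with the Hugging Face `evaluate` library. The whitespace
# tokenizer is needed because rouge_score's default tokenizer keeps only
# Latin letters and digits, which would erase Arabic text entirely.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["النص الذي أنتجه النموذج"]  # model outputs (placeholders)
references = ["النص المرجعي الصحيح"]  # gold responses (placeholders)

result = rouge.compute(
    predictions=predictions,
    references=references,
    tokenizer=lambda text: text.split(),  # whitespace split keeps Arabic tokens
)
print(result["rougeL"])
```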

In a nutshell, this research tackles a serious gap: the lack of Arabic prompt datasets for training large language models. By introducing two new ways to create such datasets, it opens up the resources available for training Arabic LLMs. Fine-tuning the Qwen2 7B model on the newly generated prompts produces a model that outperforms existing models and sets a strong standard for Arabic LLMs. More broadly, the work points to the need for robust, scalable methods of dataset creation in languages other than English.


Check out the Paper. All credit for this research goes to the researchers of this project.



