MarkTechPost@AI, September 15, 2024
Small but Mighty: The Enduring Relevance of Small Language Models in the Age of LLMs

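In recent years, large language models (LLMs) have made remarkable progress in natural language processing, but their enormous scale also brings high computational costs and energy consumption. Research shows that small language models (SLMs) can rival LLMs on many tasks while offering greater efficiency and interpretability. This article examines the key roles SLMs play in the LLM era, including collaboration and competition with LLMs, and their applications in data curation, model evaluation, domain adaptation, prompt learning, and knowledge distillation.

😄 **SLMs collaborating with LLMs:** SLMs play an important role in data curation, helping LLMs select high-quality data subsets, which addresses limited data availability and improves model performance. SLMs can assess content quality with small classifiers and compute perplexity scores with proxy language models. In instruction tuning, SLMs help curate smaller, higher-quality datasets that effectively align LLMs with human preferences. For example, Model-oriented Data Selection (MoDS) and the LESS framework demonstrate how SLMs can select influential data for LLMs, optimizing instruction tuning and achieving strong alignment with fewer examples.

😊 **SLMs competing with LLMs:** SLMs excel at specific tasks, helped by techniques such as knowledge distillation (KD), which uses an LLM’s knowledge to improve an SLM’s performance. By transferring an LLM’s knowledge to a smaller model, KD makes powerful AI easier to access and deploy. KD methods fall into white-box and black-box approaches. White-box distillation uses the teacher LLM’s internal states, output distributions, and intermediate features to train the student model transparently. Black-box distillation typically uses the teacher LLM to generate a dataset for fine-tuning the student model. These techniques have been applied successfully to improve reasoning, enhance zero-shot performance, and tackle various domain-specific tasks, demonstrating KD’s versatility in creating cost-effective yet capable models.

😉 **Advantages of SLMs in specific scenarios:** SLMs have the edge in computation-constrained environments, task-specific settings, and situations that require interpretability. In computation-constrained environments, their smaller size and lower compute requirements make them a better fit than LLMs. In task-specific settings, SLMs can be optimized for the task at hand and so outperform LLMs in both performance and efficiency. Where interpretability is required, it gives SLMs an advantage in fields such as healthcare and finance.

🤩 **Future trends for SLMs:** As the LLM field continues to evolve, SLMs will play an increasingly important role. Research on and applications of SLMs will keep expanding, providing more effective, more economical, and more interpretable solutions to a wide range of practical problems.

🥳 **Application scenarios for SLMs:** SLMs have a wide range of applications across many fields, for example:

* **Natural language processing:** SLMs can be used for NLP tasks such as text classification, machine translation, text summarization, question answering, and dialogue systems.
* **Computer vision:** SLMs can be used for computer vision tasks such as image classification, object detection, image segmentation, and video analysis.
* **Speech recognition:** SLMs can be used for speech tasks such as speech-to-text, speech recognition, and speech synthesis.
* **Machine learning:** SLMs can be used for machine learning tasks such as regression, classification, clustering, and dimensionality reduction.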

Large Language Models (LLMs) have revolutionized natural language processing in recent years. The pre-train and fine-tune paradigm, exemplified by models like ELMo and BERT, has evolved into prompt-based reasoning used by the GPT family. These approaches have shown exceptional performance across various tasks, including language generation, understanding, and domain-specific applications. The theory of emergent abilities suggests that increasing model size enhances certain reasoning capabilities, leading to the development of increasingly large models. LLMs have gained widespread popularity, with ChatGPT reaching approximately 180 million users by March 2024.

Despite LLMs’ advances toward artificial general intelligence, their size leads to exponential increases in computational costs and energy consumption. This has sparked interest in smaller language models (SLMs) like Phi-3.8B and Gemma-2B, which achieve comparable performance with fewer parameters. Researchers from Imperial College London and Soda, Inria Saclay present an analysis of HuggingFace downloads showing that smaller models, especially BERT-base, remain highly popular in practical settings. This surprising trend highlights the continued relevance of SLMs and raises important questions about their role in the LLM era, a topic previously overlooked in research. The persistence of smaller models challenges assumptions about the dominance of large-scale AI.

Small Models (SMs) are defined relative to larger models, with no fixed parameter threshold. SMs are compared to LLMs across four dimensions: accuracy, generality, efficiency, and interpretability. While LLMs excel in accuracy and generality, SMs offer advantages in efficiency and interpretability. SMs can achieve comparable results through techniques like knowledge distillation and often outperform LLMs in specialized tasks. They require fewer resources, making them suitable for real-time applications and resource-constrained environments. SMs are also more interpretable, which is crucial in fields like healthcare and finance. This study examines the role of SMs in the LLM era from two perspectives: collaboration with LLMs and competition against them.

SMs play a crucial role in enhancing LLMs through data curation. For pre-training data, SMs help select high-quality subsets from large datasets, addressing the challenge of finite data availability and improving model performance. Techniques include using small classifiers to assess content quality and proxy language models to calculate perplexity scores. In instruction tuning, SMs assist in curating smaller, high-quality datasets that can effectively align LLMs with human preferences. Methods like Model-oriented Data Selection (MoDS) and the LESS framework demonstrate how SMs can select influential data for LLMs, optimizing the instruction tuning process and achieving strong alignment with fewer examples.
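As a rough illustration of perplexity-based data selection, the sketch below scores candidate texts with a tiny proxy model and keeps the lowest-perplexity fraction. It is a minimal sketch under stated assumptions: a smoothed unigram model stands in for the small proxy language model, and the function names and the `keep_fraction` parameter are hypothetical, not taken from MoDS, LESS, or the paper.

```python
import math
from collections import Counter

def train_unigram(reference_texts):
    """Fit an add-one-smoothed unigram model on trusted reference text.

    Assumption: this toy model stands in for a small proxy language model.
    """
    counts = Counter(tok for text in reference_texts for tok in text.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens

    def prob(token):
        return (counts.get(token, 0) + 1) / (total + vocab_size)

    return prob

def perplexity(text, prob):
    """Per-token perplexity of `text` under the proxy model."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")  # empty text is treated as worst quality
    log_likelihood = sum(math.log(prob(tok)) for tok in tokens)
    return math.exp(-log_likelihood / len(tokens))

def select_low_perplexity(candidates, prob, keep_fraction=0.5):
    """Keep the fraction of candidates with the lowest perplexity."""
    ranked = sorted(candidates, key=lambda text: perplexity(text, prob))
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

reference = ["the model learns from clean data",
             "the data is clean and the model is small"]
proxy = train_unigram(reference)
candidates = ["the model is clean", "zxqv blorp fnarg wibble"]
selected = select_low_perplexity(candidates, proxy, keep_fraction=0.5)
print(selected)
```

In practice the proxy would be a trained neural language model and the cutoff would be tuned on held-out data; the ranking-and-filtering structure is the part this sketch is meant to show.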

The weak-to-strong paradigm addresses challenges in aligning superhuman LLMs with human values. As LLMs surpass human capabilities in complex tasks, evaluating their outputs becomes increasingly difficult. This paradigm uses smaller models to supervise larger ones, allowing strong models to generalize beyond their weaker supervisors’ limitations. Recent variants include using diverse specialized weak teachers, incorporating reliability estimation, and applying weak models during inference. Techniques like Aligner and Weak-to-Strong Search further enhance alignment by learning correctional residuals or maximizing log-likelihood differences. This approach extends beyond language models to vision foundation models, offering a promising solution for aligning advanced AI systems with human preferences.

Model ensembling strategies utilize both large and small language models to optimize inference efficiency and cost-effectiveness. Two main approaches are model cascading and model routing. Model cascading sequentially uses models of varying complexity, with smaller models handling simpler queries and larger models addressing more complex tasks. Techniques like AutoMix use self-verification and confidence assessment to determine when to escalate queries. Model routing dynamically directs input to the most appropriate models in a pool. Methods like OrchestraLLM and RouteLLM use efficient routers to select optimal models without accessing their outputs. Speculative decoding further enhances efficiency by using a smaller auxiliary model to generate initial predictions, which are then verified by a larger model.
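The cascading idea described above can be sketched as a loop that tries models from cheapest to most expensive and stops at the first sufficiently confident answer. This is a minimal sketch, not AutoMix or any published method: the `cascade` function, the confidence `threshold`, and the two stand-in models are hypothetical, and real systems would derive confidence from self-verification or a learned estimator.

```python
from typing import Callable, List, Tuple

# Assumption: a model takes a query and returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]

def cascade(query: str, models: List[Model], threshold: float = 0.8) -> Tuple[str, int]:
    """Try models in order of cost; return the first confident answer.

    Returns the answer and the index of the model that produced it.
    """
    if not models:
        raise ValueError("at least one model is required")
    answer = ""
    for index, model in enumerate(models):
        answer, confidence = model(query)
        if confidence >= threshold:
            return answer, index
    # No model was confident: fall back to the last (largest) model's answer.
    return answer, len(models) - 1

# Hypothetical stand-ins for a small and a large model.
def small_model(query: str) -> Tuple[str, float]:
    # Pretend the small model is only confident on short queries.
    confidence = 0.9 if len(query.split()) <= 5 else 0.3
    return f"small:{query}", confidence

def large_model(query: str) -> Tuple[str, float]:
    return f"large:{query}", 0.95

easy = cascade("what is 2+2", [small_model, large_model])
hard = cascade("explain the trade-offs between model cascading and model routing",
               [small_model, large_model])
print(easy, hard)
```

Routing differs in that a separate router picks one model up front instead of escalating after seeing an answer, so the cheaper models are not necessarily called first.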

Model-based evaluation approaches use smaller models to assess the performance of LLMs, addressing the limitations of traditional methods like BLEU and ROUGE. Techniques such as BERTScore and BARTScore employ smaller models to compute semantic similarity and evaluate texts from various perspectives. Some methods use natural language inference models to estimate uncertainty in LLM responses. In addition, proxy models can predict LLM performance, reducing computational costs during model selection. These approaches enhance the evaluation of open-ended text generation by LLMs, capturing nuanced semantic meaning and compositional diversity that traditional metrics often miss.
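To make the embedding-based scoring idea concrete, the sketch below computes a greedy-matching F1 between candidate and reference token vectors, in the style of BERTScore. It is only an illustration under assumptions: the two-dimensional `TOY_EMBEDDINGS` table is invented for the example and replaces the contextual embeddings a real small model would produce, and importance weighting and baseline rescaling are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_match_f1(candidate_vecs, reference_vecs):
    """Match each token to its most similar counterpart, then combine into F1."""
    if not candidate_vecs or not reference_vecs:
        return 0.0
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(max(cosine(c, r) for r in reference_vecs)
                    for c in candidate_vecs) / len(candidate_vecs)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in candidate_vecs)
                 for r in reference_vecs) / len(reference_vecs)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy embeddings; a real metric would use a small encoder model.
TOY_EMBEDDINGS = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "car": [0.0, 1.0],
}

def embed(tokens):
    return [TOY_EMBEDDINGS[t] for t in tokens]

same = greedy_match_f1(embed(["cat"]), embed(["cat"]))
paraphrase = greedy_match_f1(embed(["kitten"]), embed(["cat"]))
unrelated = greedy_match_f1(embed(["car"]), embed(["cat"]))
print(same, paraphrase, unrelated)
```

Unlike exact n-gram overlap, the near-synonym still receives a high score here, which is the behaviour the paragraph above attributes to model-based metrics.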

Domain adaptation techniques for LLMs use smaller models to enhance performance in specific domains. White-Box Adaptation methods, like CombLM and IPA, adjust token distributions of frozen LLMs using small, domain-specific models. These approaches modify only the parameters of small experts, allowing LLMs to adapt to specific tasks. Black-Box Adaptation, suitable for API-only services, uses small domain-specific models to guide LLMs through textual knowledge. Retrieval Augmented Generation (RAG) extracts relevant information from external sources, while approaches like BLADE and Knowledge Card use small expert models to generate domain-specific knowledge. These techniques enable LLMs to perform optimally in specialized domains without extensive retraining or access to internal parameters.

RAG enhances LLMs by integrating external knowledge sources to overcome limitations in domain-specific expertise and up-to-date information. RAG methods use lightweight retrievers to extract relevant information from various sources, effectively reducing hallucinations in generated content. These sources can be categorized into three types: textual documents (e.g., Wikipedia, cross-lingual text, domain-specific corpora), structured knowledge (knowledge bases, databases), and other sources (code, tools, images). RAG approaches employ diverse retrieval techniques, including sparse BM25 and dense BERT-based models for textual sources, entity linkers and query executors for structured knowledge, and specialized retrievers for other sources. By utilizing these external resources, RAG significantly enhances LLMs’ performance across various tasks and domains.
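As an example of the sparse retrieval step mentioned above, here is a small self-contained BM25 scorer. It is a sketch under assumptions: the class and method names are hypothetical, tokenization is plain whitespace splitting, the IDF uses the non-negative `log((N - df + 0.5) / (df + 0.5) + 1)` variant, and `k1` and `b` are set to commonly used defaults rather than values from the paper.

```python
import math
from collections import Counter

class BM25:
    """Minimal BM25 ranker over whitespace-tokenized documents."""

    def __init__(self, documents, k1=1.5, b=0.75):
        if not documents:
            raise ValueError("documents must not be empty")
        self.k1, self.b = k1, b
        self.docs = [doc.lower().split() for doc in documents]
        self.doc_len = [len(d) for d in self.docs]
        # Guard against a corpus of empty documents.
        self.avg_len = (sum(self.doc_len) / len(self.docs)) or 1.0
        self.term_freqs = [Counter(d) for d in self.docs]
        # Document frequency: number of documents containing each term.
        self.doc_freq = Counter(term for d in self.docs for term in set(d))

    def idf(self, term):
        n = len(self.docs)
        df = self.doc_freq.get(term, 0)
        return math.log((n - df + 0.5) / (df + 0.5) + 1)

    def score(self, query, index):
        tf = self.term_freqs[index]
        # Length normalization: longer-than-average documents are penalized.
        norm = 1 - self.b + self.b * self.doc_len[index] / self.avg_len
        total = 0.0
        for term in query.lower().split():
            f = tf.get(term, 0)
            total += self.idf(term) * f * (self.k1 + 1) / (f + self.k1 * norm)
        return total

    def top_k(self, query, k=1):
        """Return indices of the k highest-scoring documents."""
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]

corpus = ["small models are efficient",
          "large models need many gpus",
          "bm25 is a sparse retriever"]
retriever = BM25(corpus)
best = retriever.top_k("sparse retriever", k=1)
print(corpus[best[0]])
```

In a RAG pipeline the retrieved passages would then be placed in the LLM prompt alongside the user question; that generation step is outside the scope of this sketch.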

Prompt-based learning exploits LLMs’ ability to adapt to new scenarios with minimal or no labelled data through carefully crafted prompts. This approach relies on In-Context Learning (ICL), which incorporates demonstration examples within natural language templates without updating model parameters. Small models can be employed to enhance prompts and improve larger models’ performance. Techniques like Uprise and DaSLaM use lightweight retrievers or small models to optimize prompts, break down complex problems, or generate pseudo labels. These methods significantly reduce manual prompt engineering efforts and improve performance across various reasoning tasks. Further, small models can be used to verify or rewrite LLM outputs, achieving performance gains without fine-tuning the larger models.

LLMs can sometimes generate repeated, untruthful, or toxic content. To address these deficiencies, two main approaches using smaller models have emerged: contrastive decoding and small model plug-ins. Contrastive decoding utilizes the differences between a larger “expert” model and a smaller “amateur” model to improve output quality. This technique has been successfully applied to reduce repetition, mitigate hallucinations, enhance reasoning capabilities, and protect user privacy. Small model plug-ins, on the other hand, involve fine-tuning specialized smaller models to address specific LLM shortcomings. These plug-ins can help with issues like handling out-of-vocabulary words, detecting hallucinations, or calibrating confidence scores. Both approaches offer cost-effective ways to improve LLM performance without the need for extensive fine-tuning of the larger models.
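The core of contrastive decoding can be sketched for a single decoding step: among tokens the expert finds plausible, pick the one whose log-probability exceeds the amateur's by the largest margin. This is a minimal sketch assuming the commonly described formulation with a plausibility cutoff; the function name, the `alpha` default, and the example distributions are hypothetical.

```python
import math

def contrastive_pick(expert_probs, amateur_probs, alpha=0.1):
    """Choose the next token by contrasting expert and amateur distributions.

    Only tokens with expert probability >= alpha * max expert probability are
    considered (plausibility constraint). Among those, the score is
    log p_expert - log p_amateur.
    """
    cutoff = alpha * max(expert_probs.values())
    best_token, best_score = None, -math.inf
    for token, p_expert in expert_probs.items():
        if p_expert <= 0 or p_expert < cutoff:
            continue  # implausible under the expert, skip
        # Floor the amateur probability to avoid log(0).
        p_amateur = max(amateur_probs.get(token, 0.0), 1e-12)
        score = math.log(p_expert) - math.log(p_amateur)
        if score > best_score:
            best_token, best_score = token, score
    return best_token

# Invented next-token distributions for illustration only.
expert = {"the": 0.5, "Paris": 0.4, "a": 0.099, "zzz": 0.001}
amateur = {"the": 0.6, "Paris": 0.05, "a": 0.3, "zzz": 0.00001}

with_constraint = contrastive_pick(expert, amateur, alpha=0.1)
without_constraint = contrastive_pick(expert, amateur, alpha=0.0)
print(with_constraint, without_constraint)
```

The second call shows why the cutoff matters in this toy setup: without it, a token that both models consider very unlikely can win purely because the amateur rates it even lower.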

Knowledge Distillation (KD) offers an effective solution to enhance smaller models’ performance using the knowledge of LLMs. This approach involves training a smaller student model to replicate the behaviour of a larger teacher model, making powerful AI more accessible and deployable. KD methods can be categorized into white-box and black-box approaches. White-box distillation uses internal states, output distributions, and intermediate features of the teacher LLM to train the student model transparently. Black-box distillation typically generates a dataset using the teacher LLM for fine-tuning the student model. These techniques have been successfully applied to improve reasoning capabilities, enhance zero-shot performance, and tackle various domain-specific tasks, demonstrating KD’s versatility in creating cost-effective yet powerful models across multiple applications.
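For the white-box case, where the teacher's output distribution is available, one widely used objective is the temperature-scaled KL divergence between teacher and student distributions. The sketch below computes that loss for a single example in plain Python. It is an illustration under assumptions: the function names and the temperature value are hypothetical, and a real training loop would use an autodiff framework and usually mix this term with a standard label loss.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_norm for x in scaled]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must share a vocabulary size")
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2

teacher = [2.0, 1.0, 0.1]
loss_same = distillation_loss(teacher, teacher)
loss_close = distillation_loss(teacher, [1.9, 1.0, 0.2])
loss_far = distillation_loss(teacher, [0.1, 1.0, 2.0])
print(loss_same, loss_close, loss_far)
```

Black-box distillation needs none of this machinery: the teacher's generated text is simply used as fine-tuning data for the student.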

LLMs offer an efficient solution for data synthesis, addressing the limitations of human-created data and the need for task-specific smaller models. This approach focuses on two key areas: Training Data Generation and Data Augmentation. In Training Data Generation, LLMs like ChatGPT create datasets from scratch, which are then used to train smaller, task-specific models. This method has been successfully applied to various tasks, including text classification, clinical text mining, and hate speech detection. Data Augmentation involves using LLMs to modify existing data points, increasing diversity for training smaller models. Techniques include paraphrasing, query rewriting, and generating additional samples for tasks such as personality detection and dialogue understanding. These approaches significantly enhance the performance and robustness of smaller models while maintaining efficiency in inference.

Smaller models prove advantageous in three key scenarios: computation-constrained environments, task-specific environments, and situations requiring interpretability.

LLMs, despite their impressive capabilities, face significant challenges in computation-constrained environments due to their substantial computational demands. Scaling model size leads to exponential increases in training time, inference latency, and energy consumption, making LLMs impractical for many academic researchers, businesses with limited resources, and edge or mobile devices. However, not all tasks require such large models. For many tasks that are not knowledge-intensive or don’t demand complex reasoning, smaller models can be equally effective. Research shows diminishing returns from increasing model sizes, particularly in tasks like text similarity and classification. In information retrieval, where faster inference speed is crucial, lightweight models like Sentence-BERT remain widely used. This has led to a growing shift towards smaller, more efficient models like Phi-3.8B, MiniCPM, and Gemma-2B, driven by the need for accessibility, efficiency, and democratization of AI technologies.

In task-specific environments, smaller models often prove more effective and efficient than LLMs. This is particularly true in domains with limited available data or specialized requirements. Domain-specific tasks in fields like biomedicine and law benefit from fine-tuned smaller models, which can outperform general LLMs. For tabular learning, where datasets are typically smaller and structured, tree-based models often compete effectively with larger deep-learning models. Short text tasks, such as classification and phrase representation, don’t require extensive background knowledge, making smaller models particularly effective. Further, in niche areas like machine-generated text detection, spreadsheet representation, and information extraction, specialized smaller models can surpass larger ones. These scenarios highlight the advantages of developing lightweight, task-specific models, offering promising returns in specialized domains where data scarcity or unique requirements make large-scale pretraining unfeasible.

Interpretability in machine learning aims to provide human-understandable explanations of a model’s internal reasoning process. Smaller and simpler models generally offer better interpretability compared to larger, more complex ones. Industries like healthcare, finance, and law often prefer more interpretable models because their decisions must be understandable to non-experts. In high-stakes decision-making contexts, easily auditable and explainable models are typically favored. When choosing LLMs or SMs, it’s crucial to balance model complexity with the need for human understanding, making appropriate trade-offs based on the specific application and requirements.

This study analyzes the relationship between LLMs and SMs from two perspectives: collaboration and competition. LLMs and SMs can work together to balance performance and efficiency. They also compete in specific scenarios, such as computation-constrained environments, task-specific applications, and situations requiring high interpretability. Careful evaluation of trade-offs between LLMs and SMs is crucial when selecting models for specific tasks. While LLMs offer superior performance, SMs have advantages in accessibility, simplicity, cost-effectiveness, and interoperability. This research aims to provide insights for practitioners and encourage further study on resource optimization and cost-effective system development.


The post Small but Mighty: The Enduring Relevance of Small Language Models in the Age of LLMs appeared first on MarkTechPost.
