MarkTechPost@AI · 2 days ago, 07:50
This AI Paper Introduces PARSCALE (Parallel Scaling): A Parallel Computation Method for Efficient and Scalable Language Model Deployment

PARSCALE is a new scaling method for language models that improves performance by increasing the model's parallel computation during training and inference rather than by growing model size or output length. The method applies multiple learnable transformations to the input so that the model executes several forward passes in parallel and dynamically aggregates their outputs. PARSCALE keeps the model's original parameter count and increases computational diversity, making it an adaptable solution across tasks and model architectures without requiring specialized datasets or changes to the training protocol. Experiments show that PARSCALE matches the performance of larger models while reducing memory and latency.

💡 PARSCALE achieves parallel computation by appending several distinct, learnable prefixes to the same input, producing multiple parallel versions; the model processes these simultaneously and aggregates their outputs via a dynamic weighted sum computed by a multilayer perceptron (a minimal sketch follows these highlights).

🔑 PARSCALE uses prefix tuning, distinguishing each parallel stream through its own key-value cache for efficient memory reuse. This design scales without modifying the core architecture and can be applied to frozen pretrained models by training only the new prefix and aggregation parameters.

📊 After training on 42 billion tokens, models with P = 8 performed on par with models of up to 4.4 billion parameters while requiring far less memory and latency. On downstream tasks, PARSCALE improved GSM8K by up to 34% and MMLU by 23%.
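To make the aggregation step concrete, here is a minimal PyTorch-style sketch, not the authors' code: an MLP scores the P parallel streams (here fed their concatenated hidden states, which is an assumption about the exact MLP input) and the per-token outputs are combined with softmax weights. All class and variable names are illustrative.

```python
# Minimal sketch of PARSCALE-style dynamic aggregation (illustrative names only).
import torch
import torch.nn as nn

class DynamicAggregator(nn.Module):
    def __init__(self, hidden_size: int, num_streams: int):
        super().__init__()
        # MLP that maps the concatenated stream states to one weight per stream.
        self.weight_mlp = nn.Sequential(
            nn.Linear(hidden_size * num_streams, hidden_size),
            nn.SiLU(),
            nn.Linear(hidden_size, num_streams),
        )

    def forward(self, stream_hidden: torch.Tensor) -> torch.Tensor:
        # stream_hidden: (P, batch, seq_len, hidden) -- one entry per parallel stream.
        P, B, T, H = stream_hidden.shape
        concat = stream_hidden.permute(1, 2, 0, 3).reshape(B, T, P * H)
        weights = self.weight_mlp(concat).softmax(dim=-1)   # (B, T, P)
        weights = weights.permute(2, 0, 1).unsqueeze(-1)     # (P, B, T, 1)
        # Dynamic weighted sum of the P streams, computed per token.
        return (weights * stream_hidden).sum(dim=0)          # (B, T, H)
```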

Over time, the pursuit of better language model performance has pushed researchers to scale models up, typically by increasing the number of parameters or extending their computational capacity. As a result, the development and deployment of language models now heavily depend on the availability of substantial computational resources and memory.

Despite these advances, increasing model size or generating more tokens to enhance reasoning capabilities leads to significant challenges. Parameter-scaling methods such as dense scaling and mixture-of-experts scaling, which increase the number of trainable weights, demand much larger memory resources. Inference-time scaling, on the other hand, requires models to generate longer sequences or conduct multiple reasoning steps, which adds latency and slows deployment. While effective, these approaches are not adaptable across all scenarios and fail to address deployment efficiency in low-resource settings such as mobile devices or embedded systems.

Researchers from Zhejiang University and Alibaba Group proposed a new approach termed PARSCALE, which stands for Parallel Scaling. This method shifts focus from increasing model size or output length to increasing the model’s parallel computations during training and inference. By applying multiple learnable transformations to the input, the model executes several forward passes in parallel and aggregates their outputs dynamically. PARSCALE retains the model’s original parameter count and boosts computational diversity, making it an adaptable solution for various tasks and model architectures without requiring specialized datasets or changes in training protocols.

At the technical level, PARSCALE appends several distinct, learnable prefixes to the same input, producing multiple parallel versions. The model processes these simultaneously, and the outputs are aggregated using a dynamic weighted sum calculated by a multilayer perceptron. This structure introduces only about 0.2% extra parameters per stream, a minor addition compared to full parameter scaling. The model uses prefix tuning to distinguish each parallel stream via unique key-value caches, allowing for efficient memory reuse. The approach also benefits from GPU-friendly parallelization, which helps keep latency low despite the additional computation. This design ensures scalability without modifying the core architecture and enables application even to frozen pretrained models by training only the new prefix and aggregation parameters.
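The sketch below illustrates how this could be wrapped around an existing model. It assumes a Hugging Face-style base model that accepts `inputs_embeds` and returns `last_hidden_state` (an assumption, not confirmed by the article), represents each stream's prefix as learnable soft input tokens rather than per-layer key-value caches (a simplification of the prefix tuning the paper uses), and uses a pared-down version of the aggregation MLP shown earlier. All names are hypothetical, not from the released implementation.

```python
# Hedged sketch: prefix-based parallel streams over a (possibly frozen) base model.
import torch
import torch.nn as nn

class ParallelScaledLM(nn.Module):
    def __init__(self, base_model: nn.Module, hidden_size: int,
                 vocab_size: int, num_streams: int = 8, prefix_len: int = 16):
        super().__init__()
        self.base_model = base_model
        self.num_streams = num_streams
        # One learnable prefix (soft tokens) per parallel stream.
        self.prefixes = nn.Parameter(
            torch.randn(num_streams, prefix_len, hidden_size) * 0.02)
        # Small MLP producing one weight per stream for the dynamic weighted sum.
        self.agg_mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.SiLU(),
            nn.Linear(hidden_size, 1))
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def freeze_base(self):
        # Parameter-efficient setting: only prefixes, aggregator, and head train.
        for p in self.base_model.parameters():
            p.requires_grad_(False)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden)
        B, T, H = input_embeds.shape
        P, L = self.num_streams, self.prefixes.shape[1]
        # Replicate the input P times and prepend a distinct prefix to each copy.
        x = input_embeds.unsqueeze(0).expand(P, B, T, H)
        pre = self.prefixes.unsqueeze(1).expand(P, B, L, H)
        x = torch.cat([pre, x], dim=2).reshape(P * B, L + T, H)
        # One batched forward pass covers all P streams in parallel.
        h = self.base_model(inputs_embeds=x).last_hidden_state[:, L:, :]
        h = h.reshape(P, B, T, H)
        # Dynamic weighted sum over streams, weights computed per token.
        w = self.agg_mlp(h).softmax(dim=0)        # (P, B, T, 1)
        return self.lm_head((w * h).sum(dim=0))   # (B, T, vocab)
```

Because `freeze_base()` leaves gradients only on the prefixes, the aggregation MLP, and the output head, this corresponds to the parameter-efficient setting described above, where a frozen pretrained model is adapted by training the new prefix and aggregation parameters alone.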

The researchers conducted extensive experiments on models ranging from 0.5B to 4.4B parameters with the number of parallel streams P set from 1 to 8. When trained on 42 billion tokens, models with P = 8 matched the performance of models with up to 4.4 billion parameters while requiring significantly less memory and latency. Specifically, on a 1.6B model, PARSCALE incurred a 22× smaller memory increase and a 6× smaller latency increase than parameter scaling at the same performance. On downstream tasks, PARSCALE yielded up to a 34% improvement on GSM8K and 23% on MMLU. Coding performance improved significantly: models with 1.6B parameters and P = 8 achieved results comparable to those of a 4.4B-parameter model. The method also proved effective during post-training and parameter-efficient fine-tuning, maintaining high performance even when the core model parameters remained unchanged.

This paper introduced a strategy that rethinks how language models can be scaled. Instead of inflating model size or inference steps, it focuses on efficiently reusing existing computation. The researchers’ approach addresses time and memory inefficiencies while maintaining or improving performance. This demonstrates a compelling shift in scaling methods and sets a direction for deploying advanced models in constrained environments using parallel computation effectively.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.


