MarkTechPost@AI October 31, 2024
CodeFavor: A Machine Learning Framework that Trains Pairwise Preference Models with Synthetic Code Preferences Generated from Code Evolution like Code Commits and Code Critiques

CODEFAVOR is a framework for training code preference models, accompanied by the CODEPREFBENCH evaluation benchmark. It tackles the difficulty of assessing the quality of LLM-generated code, using several methods to generate data and to train and evaluate models, covering code correctness, efficiency, security, and other aspects.

🎯 CODEFAVOR adopts a pairwise modeling approach, predicting the preference between a pair of code candidates based on user-specified criteria. It uses decoder-based transformers to learn code preferences, processing inputs that contain an instruction, two code candidates, and a specific criterion.

💻 The framework introduces two novel synthetic data generation methods: Commit-Instruct, which turns pre- and post-commit code snippets into preference pairs, and Critic-Evol, which generates preference data by revising faulty code samples.

📋 The CODEPREFBENCH benchmark comprises 1,364 carefully curated preference tasks for evaluating code correctness, efficiency, security, and general developer preferences.

👨‍💻 The human annotation effort shows that the developer team is highly experienced yet struggles to assess efficiency and security. Evaluation of existing LLMs shows that CODEFAVOR significantly improves the performance of smaller models.

Large Language Models (LLMs) have revolutionized software development by enabling code completion, functional code generation from instructions, and complex code modifications for bug fixes and feature implementations. While these models excel at generating code from natural language instructions, significant challenges persist in evaluating the quality of LLM-generated code. The critical aspects requiring assessment include code correctness, efficiency, security vulnerabilities, adherence to best practices, and alignment with developer preferences. The evaluation process becomes particularly complex when balancing these multiple quality dimensions simultaneously. Despite their crucial role in optimizing LLM performance and ensuring that generated code meets real-world development standards, the systematic study of code preferences and the development of effective preference models remain largely unexplored.

Preference optimization has emerged as a crucial step in aligning LLMs with desired outcomes, employing both offline and online algorithms to enhance model performance. Previous approaches have primarily relied on collecting preference data through paired comparisons of preferred and rejected responses. These methods typically gather data through human annotations, LLM feedback, code execution results, or existing preference models. While some techniques have explored training LLM-as-a-Judge systems, these approaches have largely focused on natural language generation rather than specialized code generation. The existing methods face particular challenges in the code domain, where preference principles are more specialized and complex, involving technical aspects like efficiency and security that are significantly more difficult to evaluate than general language preferences. The labeling process for code preferences presents unique challenges that existing approaches have not adequately addressed.

The researchers from the University of Illinois Urbana-Champaign and AWS AI Labs have developed CODEFAVOR, a robust framework for training code preference models, alongside CODEPREFBENCH, a comprehensive evaluation benchmark. CODEFAVOR implements a pairwise modeling approach to predict preferences between code pairs based on user-specified criteria. The framework introduces two innovative synthetic data generation methods: Commit-Instruct, which transforms pre- and post-commit code snippets into preference pairs, and Critic-Evol, which generates preference data by improving faulty code samples using a critic LLM. The evaluation framework, CODEPREFBENCH, comprises 1,364 carefully curated preference tasks that assess various aspects, including code correctness, efficiency, security, and general developer preferences. This dual approach addresses both the technical challenge of building effective preference models and the empirical question of understanding how human annotators and LLMs align in their code preferences.
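To make the pairwise setup concrete, here is a minimal sketch of what one preference task might look like as data, together with a helper that renders it into a single prompt. The field names and the `build_prompt` template are illustrative assumptions, not the exact format used by the authors.

```python
from dataclasses import dataclass

@dataclass
class PreferenceSample:
    """One pairwise code preference task (field names are illustrative)."""
    instruction: str   # what the code is supposed to do
    candidate_a: str   # first code candidate
    candidate_b: str   # second code candidate
    criterion: str     # e.g. "correctness", "efficiency", "security"
    preferred: str     # "A" or "B" (the label used for training/evaluation)

def build_prompt(sample: PreferenceSample) -> str:
    """Render a sample into a single structured prompt for a decoder-based preference model."""
    return (
        f"Instruction:\n{sample.instruction}\n\n"
        f"Code A:\n{sample.candidate_a}\n\n"
        f"Code B:\n{sample.candidate_b}\n\n"
        f"Which candidate is better with respect to {sample.criterion}? "
        "Answer with A or B."
    )

example = PreferenceSample(
    instruction="Return the n-th Fibonacci number.",
    candidate_a="def fib(n):\n    return fib(n - 1) + fib(n - 2) if n > 1 else n",
    candidate_b="def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a",
    criterion="efficiency",
    preferred="B",
)
print(build_prompt(example))
```

Putting the criterion directly into the prompt is what would let a single preference model judge along different dimensions (correctness, efficiency, security) without retraining, matching the user-specified-criteria framing above.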

The CODEFAVOR framework implements a pairwise modeling approach using decoder-based transformers for learning code preferences. The model processes input comprising an instruction, two code candidates, and a specific criterion, formatted in a structured prompt. The framework offers two distinct output designs: a classification approach that makes a binary prediction through a single next-token probability comparison, and a generative approach that provides natural language explanations for its preference decisions. The two synthetic data generation methods work as follows: Commit-Instruct processes raw code commits through a three-step pipeline of reasoning, filtering, and rephrasing, while Critic-Evol generates preference data through a three-stage process of fault sampling, critique filtering, and code revision. In the Commit-Instruct pipeline, a critic LLM analyzes commits to transform them into training samples, while Critic-Evol uses the interaction between a weaker draft model and a stronger critic model to generate synthetic preference pairs.
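The classification-style output design can be sketched with off-the-shelf tooling: feed the formatted prompt to a decoder-only model and compare the probability the model assigns to the two label tokens at the next position. The checkpoint name, label tokens, and the reuse of `build_prompt` from the earlier sketch are placeholder assumptions, not the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: any instruction-tuned decoder-only model works for this sketch.
MODEL_NAME = "mistralai/Mistral-Nemo-Instruct-2407"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def classify_preference(prompt: str) -> str:
    """Return 'A' or 'B' by comparing the logits of the two label tokens
    at the next-token position (the classification-style output design)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Assumes 'A' and 'B' each map to a single token in this tokenizer.
    token_a = tokenizer.encode("A", add_special_tokens=False)[0]
    token_b = tokenizer.encode("B", add_special_tokens=False)[0]
    return "A" if next_token_logits[token_a] > next_token_logits[token_b] else "B"

# Usage: render a preference task (e.g. with build_prompt from the previous sketch)
# and ask the model which candidate better satisfies the stated criterion.
# print(classify_preference(build_prompt(example)))
```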

The researchers have conducted a comprehensive evaluation of code preference models, including insights from human developer annotations as well as comparisons between existing LLMs and the proposed CODEFAVOR framework.

The human annotation efforts reveal several key insights. The developer team consists of experienced programmers, with two-thirds holding computer science degrees and 95% having over 2 years of coding experience. The developers exhibit high confidence in their annotations, particularly for code correctness, though they struggle more with evaluating efficiency and security aspects. The annotation process is time-consuming, with each task taking an average of 7.8 minutes per developer.

In terms of accuracy, human developers excel at identifying correct code, achieving an 84.9% solve rate. However, their performance drops for efficiency (74.9%) and is weakest for security (59.7%), as they struggle to accurately assess non-functional code properties that may require specialized expertise.

The researchers then evaluate a range of existing LLMs, including large-scale models like Llama-3.1-405B-Instruct and smaller models like Gemma-2-9B-Instruct. While the larger models generally outperform the smaller ones, the CODEFAVOR framework is able to significantly improve the performance of the smaller models, in some cases even surpassing the larger critic models.

Specifically, CODEFAVOR improves the overall performance of the smaller 7-12B models by 9.3-28.8% relative to their baseline performance. For code correctness, CODEFAVOR boosts the smaller models by 8.8-28.7%, allowing them to surpass the performance of the critic model (Llama-3-70B-Instruct) by up to 12%. Similar improvements are observed for efficiency and security preferences.

Importantly, the CODEFAVOR models not only demonstrate strong performance but also offer significant cost advantages. While human annotation costs an estimated $6.1 per task, the CODEFAVOR classification model fine-tuned on Mistral Nemo Instruct is roughly five orders of magnitude cheaper than human annotation and about 34 times less expensive than the Llama-3-70B-Instruct critic model, while achieving comparable or better preference results.

The researchers have introduced CODEFAVOR, a robust framework for training pairwise code preference models using synthetic data generated from code commits and LLM critiques. They curated CODEPREFBENCH, a benchmark of 1,364 code preference tasks, to investigate the alignment between human and LLM preferences across correctness, efficiency, and security. CODEFAVOR significantly boosts the ability of smaller instruction-following models to learn code preferences, achieving on-par performance with larger models at a fraction of the cost. The study offers insights into the challenges of aligning code generation preferences across multiple dimensions.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
