MarkTechPost@AI, October 10, 2024
CodeMMLU: A Comprehensive Multi-Choice Benchmark for Assessing Code Understanding in Large Language Models

CodeMMLU is a comprehensive multiple-choice question-answering benchmark proposed to evaluate the depth of software and code understanding in LLMs. It addresses the problems of traditional evaluation methods, underscores the critical relationship between code understanding and effective generation, is notable for its comprehensiveness and diversity, and is significant for advancing AI-assisted software development.

🎯 CodeMMLU aims to evaluate the code-understanding ability of LLMs, remedying the shortcomings of traditional methods and offering deeper insight into models; it assesses a model's ability to reason about code rather than merely generate it.

🌟 CodeMMLU is comprehensive, containing more than 10,000 questions; the dataset is broad and unbiased, drawing on diverse sources and a wide range of software knowledge.

🌈 CodeMMLU is diverse, covering a variety of tasks, domains, and languages, including general QA, code generation, and defect detection, and spanning more than 10 programming languages.

📋 CodeMMLU is divided into two categories, knowledge-based test sets and real-world programming problems; the knowledge-based subset covers topics ranging from high-level software principles to low-level programming language syntax, and it also includes several multiple-choice question types to test key coding skills.

💡 Experiments show a strong correlation between performance on CodeMMLU's knowledge-based tasks and real-world coding challenges, though limitations remain, such as the inability to fully test a model's ability to write code creatively.

Code Large Language Models (CodeLLMs) have predominantly focused on open-ended code generation tasks, often neglecting the critical aspect of code understanding and comprehension. Traditional evaluation methods are also prone to becoming outdated and are susceptible to data leakage, leading to unreliable assessments. Moreover, practical applications of CodeLLMs reveal limitations such as bias and hallucination.

To resolve these problems, a group of researchers from FPT Software AI Center (Vietnam), Hanoi University of Science and Technology, and VNU-HCM University of Science has proposed CodeMMLU, a comprehensive multiple-choice question-answer benchmark designed to evaluate the depth of software and code understanding in LLMs. Unlike traditional benchmarks, CodeMMLU assesses models’ ability to reason about code rather than merely generate it, providing deeper insights into their grasp of complex software concepts and systems. By underscoring the crucial relationship between code understanding and effective generation, CodeMMLU is a vital resource for advancing AI-assisted software development, aiming to create more reliable and capable coding assistants.

CodeMMLU offers a robust and easily evaluable methodology with two key features:

Comprehensiveness: the benchmark comprises more than 10,000 questions curated from diverse sources, providing broad, unbiased coverage of software knowledge.

Diversity: it spans a wide range of tasks, domains, and more than ten programming languages, from general QA to code generation and defect detection.

CodeMMLU highlights the impact of factors such as model size, model family, and prompting techniques. It provides essential information to the community on effectively utilizing LLMs for specific tasks and domains in software engineering.

It is divided into two primary categories: first, knowledge-based test sets containing syntactic and semantic tasks, and second, real-world programming problems. The knowledge-based subset covers many topics, from high-level software design principles to low-level programming language syntax. Several programming-related MCQs are collected from high-quality platforms such as GeeksforGeeks and W3Schools.

The knowledge-based subset is further split into a syntactic set, which focuses on programming language grammar such as iteration formats and common library usage, and a semantic set, which targets algorithms, OOP, and data structures. A deep learning model filters out low-quality or irrelevant questions, such as duplicates or trivial items, and the remaining questions are further refined through a combination of manual review and deep learning methods.
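
The article does not describe this filtering step in detail; as a rough illustration, a minimal sketch of embedding-based near-duplicate removal (the sentence-transformers model, similarity threshold, and data format here are assumptions, not details from the CodeMMLU pipeline) might look like this:

```python
# Hypothetical sketch: filtering near-duplicate MCQs with sentence embeddings.
# The model name, threshold, and sample questions are illustrative assumptions,
# not the actual CodeMMLU filtering setup.
from sentence_transformers import SentenceTransformer
import numpy as np

def filter_near_duplicates(questions: list[str], threshold: float = 0.95) -> list[str]:
    """Keep a question only if it is not too similar to an already kept one."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = model.encode(questions, normalize_embeddings=True)

    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        # Cosine similarity reduces to a dot product for normalized vectors.
        if all(np.dot(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [questions[i] for i in kept]

if __name__ == "__main__":
    sample = [
        "What does the 'break' statement do in a C for-loop?",
        "In C, what is the effect of 'break' inside a for-loop?",
        "Which data structure offers O(1) average lookup by key?",
    ]
    print(filter_near_duplicates(sample))
```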

The benchmark’s multiple-choice question types test essential coding skills, including code completion, code repair, defect detection, and fill-in-the-blank.
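
To make the task format concrete, here is a hypothetical sketch of how one such item could be presented to a model and scored; the example question, field names, and the `query_model` stub are invented for illustration and are not taken from the CodeMMLU release:

```python
# Hypothetical sketch of scoring an LLM on a CodeMMLU-style MCQ item.
# The item below and query_model are stand-ins; the real benchmark's data
# schema and prompting setup may differ.
import re

item = {
    "task": "defect_detection",
    "question": "Which option fixes the off-by-one error in:\n"
                "for (int i = 0; i <= n; i++) sum += a[i];",
    "choices": {
        "A": "Change '<=' to '<'",
        "B": "Change 'i++' to 'i--'",
        "C": "Initialize sum to 1",
        "D": "No change is needed",
    },
    "answer": "A",
}

def build_prompt(item: dict) -> str:
    choices = "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
    return (
        f"{item['question']}\n\n{choices}\n\n"
        "Answer with a single letter (A, B, C, or D)."
    )

def extract_choice(response: str):
    # Pull the first standalone choice letter out of the model's reply.
    match = re.search(r"\b([ABCD])\b", response.upper())
    return match.group(1) if match else None

def query_model(prompt: str) -> str:
    # Stand-in for an actual LLM call (API or local model).
    return "A"

prediction = extract_choice(query_model(build_prompt(item)))
print("correct" if prediction == item["answer"] else "incorrect")
```

In practice, answer extraction is often made more robust by scoring the log-probability of each choice letter rather than regex-matching free-form text, but the matching approach keeps the sketch short.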

Experiments revealed a strong correlation between performance on knowledge-based tasks and real-world coding challenges. Specifically, a Pearson correlation of r = 0.61 between model rankings on the knowledge test set and their performance on real-world problems, derived from the accuracy of 43 LLMs across 10 model families, indicated a moderate alignment and demonstrated that models with a deeper understanding of software principles consistently excel in real-world coding tasks. LLM accuracy also fluctuates across different answer-option permutations (Δ = 36.66), demonstrating how sensitive models can be to the structure and order of answer choices.
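
For readers unfamiliar with how such a figure is computed, a minimal sketch using `scipy.stats.pearsonr` is shown below; the per-model accuracies are made-up placeholders, not the actual benchmark results:

```python
# Minimal sketch: correlating knowledge-test accuracy with real-world task
# accuracy across models. The numbers are invented for illustration; the
# paper reports r = 0.61 over 43 LLMs from 10 model families.
from scipy.stats import pearsonr

# Accuracy on the knowledge-based subset vs. on real-world problems (per model).
knowledge_acc = [62.1, 55.4, 48.9, 71.3, 44.0, 58.7]
real_world_acc = [57.8, 51.2, 40.5, 66.9, 43.1, 52.4]

r, p_value = pearsonr(knowledge_acc, real_world_acc)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```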

In conclusion, CodeMMLU’s results show a strong correlation between software knowledge and real-world task performance. CodeMMLU provides more accurate and detailed rankings of LLMs, particularly among open-source models. Focusing on understanding rather than mere generation gives a more nuanced and comprehensive assessment of model capabilities across a wide range of software knowledge and real-world programming tasks. However, there are limitations: multiple-choice questions cannot fully test a model’s ability to write code creatively, and the benchmark could still include more specialized areas of software development to assess a model’s versatility. In future work, the researchers plan to add more complex tasks and refine the balance between real-world scenarios and theoretical knowledge.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

