MarkTechPost@AI · June 23, 15:10
VERINA: Evaluating LLMs on End-to-End Verifiable Code Generation with Formal Proofs

VERINA is a benchmark for evaluating how well LLMs perform verifiable code generation. It contains 189 programming challenges covering three key tasks: code generation, specification generation, and proof generation, each with a detailed problem description, code, specification, proof, and test suite. The study shows that while LLMs perform relatively well at code generation, proof generation remains a challenge. VERINA provides a new standard for assessing LLM capability in verifiable code generation.

💡LLMs perform strongly at code generation, but because of their probabilistic nature they cannot provide formal guarantees for the code they produce; the generated code often contains bugs, which can become a productivity bottleneck.

✨VERINA (Verifiable Code Generation Arena) is a high-quality benchmark for evaluating verifiable code generation. It comprises 189 programming challenges, each with a detailed problem description, code, specification, proof, and test suite, all written in Lean.

🌱The VERINA dataset has two subsets: VERINA-BASIC and VERINA-ADV. VERINA-BASIC contains 108 problems derived from human-written Dafny code; VERINA-ADV contains 81 more advanced coding problems drawn from student submissions in a theorem-proving course.

🧐Evaluating nine state-of-the-art LLMs on VERINA shows that code generation achieves the highest success rates, followed by specification generation, while proof generation remains the most challenging, with pass@1 rates below 3.6% for all models.

🚩VERINA emphasizes simple, standalone tasks that are well suited to benchmarking but not fully representative of complex real-world verification projects. In future work, the specification generation metrics could be improved by incorporating more capable provers.

LLM-Based Code Generation Faces a Verification Gap

LLMs have shown strong performance in programming and are widely adopted in tools like Cursor and GitHub Copilot to boost developer productivity. However, due to their probabilistic nature, LLMs cannot provide formal guarantees for the code generated. The generated code often contains bugs, and when LLM-based code generation is adopted, these issues can become a productivity bottleneck. Developing suitable benchmarks to track progress in verifiable code generation is important but challenging, as it involves three interconnected tasks: code generation, specification generation, and proof generation. Current benchmarks fall short, as they lack support for all three tasks, quality control, robust metrics, and modular design.

Existing Benchmarks Lack Comprehensive Support for Verifiability

Benchmarks like HumanEval and MBPP have driven strong progress in LLM-based code generation, but they do not address formal specifications or proofs. Many verification-focused efforts target only one or two of the tasks and assume the remaining elements are provided by humans. DafnyBench and miniCodeProps are designed for proof generation, while AutoSpec and SpecGen infer specifications and proofs from human-written code. Interactive theorem provers such as Lean are a promising target for verifiable code generation with LLMs, as they support constructing proofs with intermediate steps. However, existing verification benchmarks in Lean, such as miniCodeProps and FVAPPS, have limited task coverage and quality control.

Introducing VERINA: A Holistic Benchmark for Code, Spec, and Proof Generation

Researchers from the University of California and Meta FAIR have proposed VERINA (Verifiable Code Generation Arena), a high-quality benchmark for evaluating verifiable code generation. It consists of 189 programming challenges with detailed problem descriptions, code, specifications, proofs, and test suites, all formatted in Lean. VERINA is constructed with careful quality control, drawing problems from sources such as MBPP, LiveCodeBench, and LeetCode to cover a range of difficulty levels. All samples are manually reviewed and refined to ensure clear natural language descriptions, precise formal specifications, and accurate code implementations. Each sample includes test suites covering positive and negative scenarios, with 100% line coverage of the code implementation and full pass rates on the ground truth specifications.
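To make the sample format concrete, the following is a minimal, hypothetical Lean 4 sketch in the spirit of a VERINA entry rather than an actual benchmark item: an implementation, a specification stated as a proposition, and a proof that the implementation satisfies the specification. The names (`myMax`, `myMaxSpec`) and tactic choices are illustrative assumptions, and the proof assumes a recent Lean 4 toolchain where the `split` and `omega` tactics are available.

```lean
-- Hypothetical mini-task in the spirit of a VERINA sample (not an actual
-- benchmark item): code, specification, and proof, all in Lean 4.

-- Code: return the larger of two natural numbers.
def myMax (a b : Nat) : Nat :=
  if a ≥ b then a else b

-- Specification: the result bounds both inputs and equals one of them.
def myMaxSpec (a b r : Nat) : Prop :=
  a ≤ r ∧ b ≤ r ∧ (r = a ∨ r = b)

-- Proof: the implementation satisfies the specification.
theorem myMax_satisfies_spec (a b : Nat) : myMaxSpec a b (myMax a b) := by
  unfold myMaxSpec myMax
  -- Case-split on the `if`, then close each branch with linear arithmetic.
  split <;> omega
```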

Structure and Composition of the VERINA Dataset

VERINA consists of two subsets with different difficulty levels: VERINA-BASIC and VERINA-ADV. VERINA-BASIC contains 108 problems translated from human-written Dafny code: 49 problems from MBPP-DFY50 and 59 additional instances from CloverBench, translated using OpenAI o3-mini with few-shot prompting and followed by manual inspection. VERINA-ADV contains 81 more advanced coding problems from student submissions in a theorem-proving course, where students sourced problems from platforms like LeetCode and LiveCodeBench and then formalized their solutions in Lean. VERINA also applies rigorous quality assurance, including detailed problem descriptions, full line coverage of the code by positive tests, and complete test pass rates on the ground truth specifications.

Performance Insights: LLM Evaluation on VERINA Highlights Key Challenges

The evaluation of nine state-of-the-art LLMs on VERINA reveals a clear difficulty hierarchy. Code generation achieves the highest success rates, followed by specification generation, while proof generation remains the most challenging, with pass@1 rates below 3.6% for all models. VERINA-ADV is harder than VERINA-BASIC on all three tasks, highlighting that increased problem complexity significantly impacts the performance of verifiable code generation. Iterative proof refinement with o4-mini improves the proof success rate from 7.41% to 22.22% on the simpler VERINA-BASIC problems after 64 iterations, though gains are limited on VERINA-ADV. Providing ground truth specifications enhances code generation, indicating that formal specifications can effectively constrain and direct the synthesis process.

Conclusion: VERINA Sets a New Standard in Verifiable Code Evaluation

In conclusion, the researchers introduced VERINA, an advance in benchmarking verifiable code generation. It offers 189 carefully curated examples with detailed task descriptions, high-quality code and specifications in Lean, and extensive test suites with full line coverage. However, the dataset is still relatively small for fine-tuning, and scaling it up would require automated annotation with LLM assistance. VERINA emphasizes simple, standalone tasks suitable for benchmarking, which are not fully representative of complex real-world verification projects. The specification generation metrics could be improved in the future by incorporating more capable provers, including those based on LLMs or SMT solvers, to handle complex soundness and completeness relationships effectively.
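As a rough illustration of what such soundness and completeness relationships might look like, the sketch below (our own reading, not VERINA's actual metric definition) states them in Lean 4 as implications between a generated specification and the ground-truth specification; checking them for a given problem amounts to discharging these proof obligations, which is where stronger LLM- or SMT-based provers would help.

```lean
-- Illustrative sketch (not VERINA's actual metric): soundness and completeness
-- of a generated specification relative to the ground-truth specification.

-- A specification relates an input of type α to an acceptable output of type β.
def Spec (α β : Type) : Type :=
  α → β → Prop

-- Soundness: everything the generated spec accepts is also accepted by the
-- ground truth (the generated spec is not over-permissive).
def sound {α β : Type} (gen gt : Spec α β) : Prop :=
  ∀ (x : α) (y : β), gen x y → gt x y

-- Completeness: everything the ground truth accepts is also accepted by the
-- generated spec (no required behaviour is missed).
def complete {α β : Type} (gen gt : Spec α β) : Prop :=
  ∀ (x : α) (y : β), gt x y → gen x y
```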


Check out the Paper, Dataset Card, and GitHub Page. All credit for this research goes to the researchers of this project.

