LessWrong (少点错误) · June 5, 02:17
Linkpost: Predicting Empirical AI Research Outcomes with Language Models

 

This post introduces the first benchmark for predicting which AI research ideas will succeed. The benchmark frames the task as comparing two research ideas and predicting which will perform better on a set of benchmark tasks. The researchers built a system combining a fine-tuned GPT-4.1 model with a paper-retrieval agent, and in the NLP domain it outperformed human experts. The system reached 77% accuracy on the full test set and also demonstrated predictive power on novel, unpublished ideas. The results suggest this approach can accelerate AI research and provide a reward signal for improving idea-generation models.

💡 The study builds the first benchmark for predicting the success of AI research ideas, aiming to accelerate empirical AI research. Given two research ideas, the benchmark asks which will perform better on specific benchmark tests.

🔬 The study collected research ideas and experimental results from conference papers: 1,585 human-verified idea pairs for testing and 6,000 pairs for training. The system combines a fine-tuned GPT-4.1 model with a paper-retrieval agent.

🏆 In the NLP domain, the system beat human experts at predicting which research ideas succeed, reaching 64.4% accuracy versus 48.9% for the experts. On the full test set the system achieved 77% accuracy, while off-the-shelf frontier language models such as o3 performed no better than random guessing, even with the same retrieval augmentation.

🤖 The system was also evaluated on unpublished novel ideas, including ideas generated by an AI ideation agent. It achieved 63.6% accuracy, suggesting its potential as a reward model for improving idea-generation models.

Published on June 4, 2025 6:14 PM GMT

Abstract (emphasis mine):

Many promising-looking ideas in AI research fail to deliver, but their validation takes substantial human labor and compute. Predicting an idea's chance of success is thus crucial for accelerating empirical AI research, a skill that even expert researchers can only acquire through substantial experience. We build the first benchmark for this task and compare LMs with human experts. Concretely, given two research ideas (e.g., two jailbreaking methods), we aim to predict which will perform better on a set of benchmarks. We scrape ideas and experimental results from conference papers, yielding 1,585 human-verified idea pairs published after our base model's cut-off date for testing, and 6,000 pairs for training. We then develop a system that combines a fine-tuned GPT-4.1 with a paper retrieval agent, and we recruit 25 human experts to compare with. In the NLP domain, our system beats human experts by a large margin (64.4% v.s. 48.9%). On the full test set, our system achieves 77% accuracy, while off-the-shelf frontier LMs like o3 perform no better than random guessing, even with the same retrieval augmentation. We verify that our system does not exploit superficial features like idea complexity through extensive human-written and LM-designed robustness tests. Finally, we evaluate our system on unpublished novel ideas, including ideas generated by an AI ideation agent. Our system achieves 63.6% accuracy, demonstrating its potential as a reward model for improving idea generation models. Altogether, our results outline a promising new direction for LMs to accelerate empirical AI research.
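The evaluation protocol the abstract describes — given a pair of ideas, predict the winner, and score accuracy against ground truth — can be sketched as a small harness. This is a hypothetical illustration of the task format, not the paper's code; the function names, the `(idea_a, idea_b, winner)` tuple layout, and the toy pairs are all my assumptions.

```python
import random

def pairwise_accuracy(predict, pairs):
    """Fraction of idea pairs where the predictor picks the true winner.

    `pairs` is a list of (idea_a, idea_b, winner) tuples, where `winner`
    is "a" or "b" according to the benchmark's ground truth (i.e., which
    idea actually performed better in published experiments).
    """
    correct = sum(predict(a, b) == winner for a, b, winner in pairs)
    return correct / len(pairs)

def random_baseline(idea_a, idea_b):
    """A coin flip: the ~50% level that, per the abstract, even
    off-the-shelf frontier LMs fail to beat on this benchmark."""
    return random.choice(["a", "b"])

# Toy, invented pairs just to exercise the harness.
pairs = [
    ("jailbreak method A", "jailbreak method B", "a"),
    ("decoding trick C", "decoding trick D", "b"),
]
acc = pairwise_accuracy(random_baseline, pairs)
```

Under this framing, the paper's fine-tuned-GPT-4.1-plus-retrieval system would play the role of `predict`, scoring 0.77 on the 1,585-pair test set versus roughly 0.5 for the baseline above.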

I didn't read the paper in detail, but it may carry both good news and bad news. The good news: models do not acquire better expertise in AI research simply by virtue of being SOTA; specialized fine-tuning is needed. The bad news: this could become the start of a self-improvement loop — better AI-researcher models save labor and compute by picking better ideas earlier, enabling faster creation of better models.



