Trending
Articles related to "Pre-PPO"
This AI Paper from ByteDance Introduces a Hybrid Reward System Combining Reasoning Task Verifiers (RTV) and a Generative Reward Model (GenRM) to Mitigate Reward Hacking
MarkTechPost@AI 2025-04-01T19:30:49.000000Z