Articles related to "LLM evaluation"
StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation
cs.AI updates on arXiv.org
2025-07-30T04:46:13.000000Z
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles
cs.AI updates on arXiv.org
2025-07-22T04:34:38.000000Z
MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark
cs.AI updates on arXiv.org
2025-06-30T04:14:30.000000Z
ArabLegalEval: A Multitask AI Benchmark Dataset for Assessing the Arabic Legal Knowledge of LLMs
MarkTechPost@AI
2024-08-19T09:49:42.000000Z