Data Augmentation for Spoken Grammatical Error Correction

cs.AI updates on arXiv.org, July 28, 12:43


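This paper proposes a method for automatically generating audio-text pairs that contain grammatical errors and disfluencies, and designs a series of evaluation metrics. The aim is to augment the original corpus and give spoken grammatical error correction datasets with more types of errors, without altering the language assessment scores of second-language learners.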

arXiv:2507.19374v1 Announce Type: cross

Abstract: While there exist strong benchmark datasets for grammatical error correction (GEC), high-quality annotated spoken datasets for Spoken GEC (SGEC) are still under-resourced. In this paper, we propose a fully automated method to generate audio-text pairs with grammatical errors and disfluencies. Moreover, we propose a series of objective metrics that can be used to evaluate the generated data and choose the more suitable dataset for SGEC. The goal is to generate an augmented dataset that maintains the textual and acoustic characteristics of the original data while providing new types of errors. This augmented dataset should augment and enrich the original corpus without altering the language assessment scores of the second language (L2) learners. We evaluate the use of the augmented corpus both for written GEC (the text part) and for SGEC (the audio-text pairs). Our experiments are conducted on the S&I Corpus, the first publicly available speech dataset with grammar error annotations.
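To make the idea of injecting grammatical errors and disfluencies more concrete, here is a toy, text-only sketch in Python. It is not the paper's method: the abstract does not describe the pipeline, so the filler list, the error rules, the probabilities, and the function names `augment` and `reference` are all hypothetical choices made for illustration. The audio half of each pair (for example, synthesising speech for the noisy transcript) is not shown.

```python
import random

# Hypothetical filler words; the abstract does not list the disfluencies used.
FILLERS = ["um", "uh"]

# Hypothetical toy rules mapping a correct form to a learner-style error.
ERROR_RULES = {"goes": "go", "an": "a", "were": "was"}


def augment(sentence, rng, p_error=0.5, p_disfluency=0.3):
    """Return (noisy_tokens, edits) for one clean, whitespace-tokenised sentence."""
    noisy, edits = [], []
    for token in sentence.split():
        # Optionally insert a filler or repeat the upcoming word (a disfluency).
        if rng.random() < p_disfluency:
            inserted = rng.choice(FILLERS + [token])
            edits.append(("disfluency", len(noisy), inserted))
            noisy.append(inserted)
        # Optionally swap a word for its erroneous counterpart.
        if token in ERROR_RULES and rng.random() < p_error:
            edits.append(("grammar", len(noisy), token))
            noisy.append(ERROR_RULES[token])
        else:
            noisy.append(token)
    return noisy, edits


def reference(noisy, edits):
    """Undo the recorded edits to recover the clean token sequence."""
    tokens = list(noisy)
    # Apply in reverse so earlier positions stay valid after deletions.
    for kind, pos, original in reversed(edits):
        if kind == "disfluency":
            del tokens[pos]
        else:
            tokens[pos] = original
    return tokens
```

Because every injected change is recorded with its position, each noisy transcript comes with its own correction annotations, which is the property an augmented GEC corpus needs. A call such as `augment("she goes to an office", random.Random(0))` returns one noisy variant together with its edit list; a different seed gives a different variant.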


