MarkTechPost@AI — March 24, 02:25
A Unified Acoustic-to-Speech-to-Language Embedding Space Captures the Neural Basis of Natural Language Processing in Everyday Conversations

This article describes a study of the brain's language-processing mechanisms that combines electrocorticography (ECoG) recordings with a multimodal speech-to-text model to build a unified computational framework. The framework links linguistic structure at the acoustic, speech, and word levels, enabling investigation of the neural basis of everyday conversation. The study found that the model accurately predicts neural activity during both speech production and comprehension, offering a new perspective on human language processing and highlighting the role of statistical learning and high-dimensional embedding spaces in language acquisition.

🗣️ The study proposes a unified computational framework connecting acoustic, speech, and word-level linguistic structure to investigate the neural basis of everyday conversation.

🧠 Using electrocorticography, the researchers recorded 100 hours of natural speech and extracted acoustic, speech, and language embeddings to predict neural activity.

💡 The model accurately predicts neural activity during both speech production and comprehension, revealing the hierarchical nature of language processing in the brain.

⏱️ Speech embeddings better predict activity in perceptual and articulatory regions, while language embeddings excel at predicting activity in higher-order language areas.

🚀 The study underscores the role of statistical learning and high-dimensional embedding spaces in language acquisition and points toward future directions for language models.

Language processing in the brain presents a challenge due to its inherently complex, multidimensional, and context-dependent nature. Psycholinguists have attempted to construct well-defined symbolic features and processes for domains, such as phonemes for speech analysis and part-of-speech units for syntactic structures. Despite acknowledging some cross-domain interactions, research has focused on modeling each linguistic subfield in isolation through controlled experimental manipulations. This divide-and-conquer strategy shows limitations, as a significant gap has emerged between natural language processing and formal psycholinguistic theories. These models and theories struggle to capture the subtle, non-linear, context-dependent interactions occurring within and across levels of linguistic analysis.

Recent advances in LLMs have dramatically improved conversational language processing, summarization, and generation. These models excel in handling syntactic, semantic, and pragmatic properties of written text and in recognizing speech from acoustic recordings. Multimodal, end-to-end models represent a significant theoretical advancement over text-only models by providing a unified framework for transforming continuous auditory input into speech and word-level linguistic dimensions during natural conversations. Unlike traditional approaches, these deep acoustic-to-speech-to-language models shift to multidimensional vectorial representations where all elements of speech and language are embedded into continuous vectors across a population of simple computing units by optimizing straightforward objectives.

Researchers from Hebrew University, Google Research, Princeton University, Maastricht University, Massachusetts General Hospital and Harvard Medical School, New York University School of Medicine, and Harvard University have presented a unified computational framework that connects acoustic, speech, and word-level linguistic structures to investigate the neural basis of everyday conversations in the human brain. They used electrocorticography to record neural signals across 100 hours of natural speech production and comprehension as participants engaged in open-ended, real-life conversations. The team extracted three types of embeddings from Whisper, a multimodal speech-to-text model: low-level acoustic embeddings, mid-level speech embeddings, and contextual word embeddings. Their model predicts neural activity at each level of the language-processing hierarchy across hours of previously unseen conversations.

The internal workings of the Whisper acoustic-to-speech-to-language model are examined to model and predict neural activity during daily conversations. Three types of embeddings are extracted from the model for every word patients speak or hear: acoustic embeddings from the auditory input layer, speech embeddings from the final speech encoder layer, and language embeddings from the decoder’s final layers. For each embedding type, electrode-wise encoding models are constructed to map the embeddings to neural activity during speech production and comprehension. The encoding models show a remarkable alignment between human brain activity and the model’s internal population code, accurately predicting neural responses across hundreds of thousands of words in conversational data.
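The electrode-wise encoding models described above are, at their core, regularized linear maps from per-word embeddings to neural activity, evaluated by how well predictions correlate with held-out recordings. A minimal sketch of that analysis, using simulated data in place of the study's Whisper embeddings and ECoG high-gamma signals (which we do not have here), might look like this:

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def encoding_score(X_train, Y_train, X_test, Y_test, lam=1.0):
    """Fit one linear encoding model per electrode, then score each
    electrode by the Pearson correlation between predicted and
    held-out activity."""
    W = fit_ridge(X_train, Y_train, lam)      # (embed_dims, n_electrodes)
    pred = X_test @ W
    pred_c = pred - pred.mean(axis=0)
    true_c = Y_test - Y_test.mean(axis=0)
    num = (pred_c * true_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (true_c ** 2).sum(axis=0))
    return num / den                           # (n_electrodes,)

# Toy stand-in data: 500 words x 64-dim embeddings, 10 electrodes whose
# activity is a noisy linear function of the embeddings.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))
W_true = rng.standard_normal((64, 10))
Y = X @ W_true + 0.5 * rng.standard_normal((500, 10))

# Train on the first 400 words, evaluate on the held-out 100.
r = encoding_score(X[:400], Y[:400], X[400:], Y[400:])
print(r.shape)  # one correlation per electrode: (10,)
```

In the actual study the predictors would be Whisper-derived embeddings per word and the targets electrode activity; the train/test split over unseen conversations plays the role of the held-out slice here.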

The Whisper model’s acoustic, speech, and language embeddings show exceptional predictive accuracy for neural activity across hundreds of thousands of words during speech production and comprehension throughout the cortical language network. During speech production, hierarchical processing is observed: articulatory areas (preCG, postCG, STG) are better predicted by speech embeddings, while higher-level language areas (IFG, pMTG, AG) align with language embeddings. The encoding models also show temporal specificity, with performance peaking more than 300 ms before word onset during production and around 300 ms after onset during comprehension; in both cases, speech embeddings better predict activity in perceptual and articulatory areas, and language embeddings excel in higher-order language areas.
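The temporal-specificity finding comes from fitting separate encoding models at a sweep of lags relative to word onset and locating the lag where prediction peaks. A toy version of that lag sweep, with a simulated signal whose neural response trails the embedding signal by a known delay (standing in for the study's millisecond bins), can illustrate the idea:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 2000, 8
X = rng.standard_normal((T, d))    # stand-in embedding time series
w = rng.standard_normal(d)
true_lag = 3                       # "neural" signal trails X by 3 bins
y = np.roll(X @ w, true_lag) + 0.3 * rng.standard_normal(T)

def lag_corr(X, y, lag, lam=1.0):
    """Fit a ridge encoding model on X shifted by `lag` bins and
    score it by the correlation of prediction with y."""
    T = X.shape[0]
    Xs = np.roll(X, lag, axis=0)
    # drop wrap-around edges introduced by np.roll
    Xs_v, y_v = Xs[abs(lag):T - abs(lag)], y[abs(lag):T - abs(lag)]
    w_hat = np.linalg.solve(Xs_v.T @ Xs_v + lam * np.eye(X.shape[1]),
                            Xs_v.T @ y_v)
    pred = Xs_v @ w_hat
    return np.corrcoef(pred, y_v)[0, 1]

lags = list(range(-5, 6))
scores = [lag_corr(X, y, L) for L in lags]
best = lags[int(np.argmax(scores))]
print(best)  # recovers true_lag = 3
```

In the study, the analogous sweep over lags before and after word onset is what reveals pre-onset peaks during production and post-onset peaks during comprehension.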

In summary, the acoustic-to-speech-to-language model offers a unified computational framework for investigating the neural basis of natural language processing. This integrated approach marks a paradigm shift toward non-symbolic models based on statistical learning and high-dimensional embedding spaces. As these models evolve to process natural speech better, their alignment with cognitive processes may similarly improve. Some advanced models, such as GPT-4o, incorporate a visual modality alongside speech and text, while others integrate embodied articulation systems mimicking human speech production. The rapid improvement of these models supports a shift toward a unified linguistic paradigm that emphasizes the role of usage-based statistical learning in language acquisition as it unfolds in real-life contexts.


Check out the Paper and the Google Blog. All credit for this research goes to the researchers of this project.
