TechCrunch News, March 21
OpenAI upgrades its transcription and voice-generating AI models

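OpenAI is bringing new transcription and voice-generating AI models to its API, claiming they improve on its previous releases. The models fit into the company’s broader “agentic” vision of systems that accomplish tasks on behalf of users. The new speech model is more expressive and steerable, and the new transcription models capture varied speech better and with higher accuracy, though transcription in some languages remains a problem. OpenAI does not plan to make the new transcription models openly available.

💬 OpenAI launches new transcription and voice-generating AI models that it says improve on its previous versions

🎙 The new speech model is more expressive and steerable, and developers can direct it in natural language

📋 The new transcription models capture varied speech better and more accurately, though some languages remain problematic

🚫 The new transcription models will not be openly released because their larger size makes them unsuited to running locally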

OpenAI is bringing new transcription and voice-generating AI models to its API that the company claims improve upon its previous releases.

For OpenAI, the models fit into its broader “agentic” vision: building automated systems that can independently accomplish tasks on behalf of users. The definition of “agent” might be in dispute, but OpenAI Head of Product Olivier Godement described one interpretation as a chatbot that can speak with a business’s customers.

“We’re going to see more and more agents pop up in the coming months,” Godement told TechCrunch during a briefing. “And so the general theme is helping customers and developers leverage agents that are useful, available, and accurate.”

OpenAI claims that its new text-to-speech model, “gpt-4o-mini-tts,” not only delivers more nuanced and realistic-sounding speech but is more “steerable” than its previous-gen speech-synthesizing models. Developers can instruct gpt-4o-mini-tts on how to say things in natural language — for example, “speak like a mad scientist” or “use a serene voice, like a mindfulness teacher.”

Here’s a “true crime-style,” weathered voice:

And here’s a sample of a female “professional” voice:

Jeff Harris, a member of the product staff at OpenAI, told TechCrunch that the goal is to let developers tailor both the voice “experience” and “context.”

“In different contexts, you don’t just want a flat, monotonous voice,” Harris continued. “If you’re in a customer support experience and you want the voice to be apologetic because it’s made a mistake, you can actually have the voice have that emotion in it […] Our big belief, here, is that developers and users want to really control not just what is spoken, but how things are spoken.”
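As a rough illustration of the split Harris describes between what is spoken and how it is spoken, the sketch below builds (but does not send) a request body for gpt-4o-mini-tts. The endpoint URL, the `input`, `voice`, and `instructions` field names, and the `alloy` voice are assumptions drawn from OpenAI's public API reference rather than from this article, and `build_speech_request` is a hypothetical helper. Check the current documentation before relying on any of them.

```python
import json

# Assumed endpoint, based on OpenAI's public API reference (not stated in the article).
SPEECH_ENDPOINT = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text: str, style: str, voice: str = "alloy") -> dict:
    """Return a JSON-serializable body for a text-to-speech request.

    `text` is what should be spoken; `style` is a free-form natural-language
    instruction for how to speak it, e.g. "speak like a mad scientist".
    This helper is hypothetical and only assembles the payload.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    return {
        "model": "gpt-4o-mini-tts",  # model name taken from the article
        "voice": voice,              # assumed default voice
        "input": text,               # what is spoken
        "instructions": style,       # how it is spoken (assumed field name)
    }


if __name__ == "__main__":
    body = build_speech_request(
        "Sorry about that. I've corrected your order.",
        "Use a serene voice, like a mindfulness teacher.",
    )
    # Print the payload that would be POSTed to SPEECH_ENDPOINT.
    print(json.dumps(body, indent=2))
```

Keeping the spoken text and the style instruction in separate fields means the same sentence can be rendered apologetic, cheerful, or neutral by changing only `style`.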

As for OpenAI’s new speech-to-text models, “gpt-4o-transcribe” and “gpt-4o-mini-transcribe,” they effectively replace the company’s long-in-the-tooth Whisper transcription model. Trained on “diverse, high-quality audio datasets,” the new models can better capture accented and varied speech, OpenAI claims, even in chaotic environments.

They’re also less likely to hallucinate, Harris added. Whisper notoriously tended to fabricate words — and even whole passages — in conversations, introducing everything from racial commentary to imagined medical treatments into transcripts.

“[T]hese models are much improved versus Whisper on that front,” Harris said. “Making sure the models are accurate is completely essential to getting a reliable voice experience, and accurate [in this context] means that the models are hearing the words precisely [and] aren’t filling in details that they didn’t hear.”

Your mileage may vary depending on the language being transcribed, however.

According to OpenAI’s internal benchmarks, gpt-4o-transcribe, the more accurate of the two transcription models, has a “word error rate” approaching 30% for Indic and Dravidian languages like Tamil, Telugu, Malayalam, and Kannada. In other words, the model makes roughly three errors (wrong, missing, or extra words) for every 10 words spoken in those languages.
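To make the metric concrete, here is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of reference words. The function name and the example sentences are illustrative assumptions. OpenAI's benchmark code is not described in the article and may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: hypothesis has an extra word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "please send the signed report to the finance team today"
    hypothesis = "please send a signed report to finance team today now"
    # One substitution, one deletion, and one insertion against 10 reference words.
    print(word_error_rate(reference, hypothesis))
```

In this made-up example, three errors against ten reference words give a rate of 0.3, the same order as the roughly 30% figure reported for these languages.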

The results from OpenAI’s internal speech recognition benchmarks. Image Credits: OpenAI

In a break from tradition, OpenAI doesn’t plan to make its new transcription models openly available. The company historically released new versions of Whisper for commercial use under an MIT license.

Harris said that gpt-4o-transcribe and gpt-4o-mini-transcribe are “much bigger than Whisper” and thus not good candidates for an open release.

“[T]hey’re not the kind of model that you can just run locally on your laptop, like Whisper,” he continued. “[W]e want to make sure that if we’re releasing things in open source, we’re doing it thoughtfully, and we have a model that’s really honed for that specific need. And we think that end-user devices are one of the most interesting cases for open-source models.”
