MarkTechPost@AI August 11, 2024
Parler-TTS Released: A Fully Open-Sourced Text-to-Speech Model with Advanced Speech Synthesis for Complex and Lightweight Applications

Parler-TTS has emerged as a robust text-to-speech (TTS) library, offering two powerful models: Parler-TTS Large v1 and Parler-TTS Mini v1. Both models are trained on an impressive 45,000 hours of audio data, enabling them to generate high-quality, natural-sounding speech with remarkable control over various features. Users can manipulate aspects such as gender, background noise, speaking rate, pitch, and reverberation through simple text prompts, providing unprecedented flexibility in speech generation.

The Parler-TTS Large v1 model boasts 2.2 billion parameters, making it a formidable tool for complex speech synthesis tasks. On the other hand, Parler-TTS Mini v1 serves as a lightweight alternative, offering similar capabilities in a more compact form. Both models are part of the broader Parler-TTS project, which aims to provide the community with comprehensive TTS training resources and dataset pre-processing code, fostering innovation and development in the field of speech synthesis.
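
For readers who want to try either checkpoint, the sketch below follows the basic usage pattern documented in the Parler-TTS repository. The checkpoint identifiers (`parler-tts/parler-tts-mini-v1`, `parler-tts/parler-tts-large-v1`) and the exact generation call should be verified against the project's README; treat this as an illustrative outline rather than a definitive recipe.

```python
# Minimal sketch of loading and running Parler-TTS, based on the repository's documented usage.
# Install (per the repo): pip install git+https://github.com/huggingface/parler-tts.git
import torch
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Swap in "parler-tts/parler-tts-large-v1" for the 2.2-billion-parameter model.
model_name = "parler-tts/parler-tts-mini-v1"
model = ParlerTTSForConditionalGeneration.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Hey, how are you doing today?"  # the text to be spoken
description = (  # free-form text that conditions the voice
    "A female speaker delivers her words at a moderate speed and pitch. "
    "The recording is very clear audio, with the voice sounding close up."
)

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio, model.config.sampling_rate)
```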

One of the standout features of both Parler-TTS models is their ability to ensure speaker consistency across generations. The models have been trained on 34 distinct speakers, each characterized by name (e.g., Jon, Lea, Gary, Jenna, Mike, Laura). This feature allows users to specify a particular speaker in their text descriptions, enabling the generation of consistent voice outputs across multiple instances. For example, users can create a description like “Jon’s voice is monotone yet slightly fast in delivery” to maintain a specific speaker’s characteristics.
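
As a rough illustration of speaker consistency, the snippet below reuses the model and tokenizer from the loading sketch above and names one of the training speakers in the description; the wording of the description is free-form and can be adapted.

```python
# Reusing `model`, `tokenizer`, and `device` from the loading sketch above.
# Naming a training speaker (e.g. Jon) in the description keeps the voice
# consistent across separate generate() calls.
description = "Jon's voice is monotone yet slightly fast in delivery, with very clear audio."

prompts = ["Welcome back to the show.", "Today we are looking at open-source speech synthesis."]
for i, prompt in enumerate(prompts):
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    sf.write(f"jon_{i}.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)
```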

The Parler-TTS project stands out from other TTS models due to its commitment to open-source principles. All datasets, pre-processing tools, training code, and model weights are released publicly under permissive licenses. This approach enables the community to build upon and extend the work, fostering the development of even more powerful TTS models. The project’s ecosystem includes the Parler-TTS repository for model training and fine-tuning, the Data-Speech repository for dataset annotation, and the Parler-TTS organization for accessing annotated datasets and future checkpoints.

To optimize the quality and characteristics of generated speech, Parler-TTS offers several useful tips for users. One key technique is to include specific terms in the text description to control audio clarity. For instance, incorporating the phrase “very clear audio” will prompt the model to generate the highest quality audio output. Conversely, using “very noisy audio” will introduce higher levels of background noise, allowing for more diverse and realistic speech environments when needed.
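
A small sketch of this clarity control, again reusing the objects from the loading example: the two descriptions below differ only in the "very clear audio" versus "very noisy audio" wording, and the speaker name and output file names are illustrative.

```python
# Reusing `model`, `tokenizer`, and `device` from the loading sketch above.
prompt = "The quarterly results will be announced on Friday."

descriptions = {
    "clean": "Lea speaks at a moderate pace in very clear audio, with no background noise.",
    "noisy": "Lea speaks at a moderate pace in very noisy audio, as if recorded on a busy street.",
}

for label, description in descriptions.items():
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    sf.write(f"{label}.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)
```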

Punctuation plays a crucial role in controlling the prosody of generated speech. Users can utilize this feature to add nuance and natural pauses to the output. For example, strategically placing commas in the input text will result in small breaks in the generated speech, mimicking the natural rhythm and flow of human conversation. This simple yet effective method allows for greater control over the pacing and emphasis of the generated audio.
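
The snippet below sketches the effect of punctuation on pacing: the two prompts carry the same words, but the comma-laden version should come out with small pauses. It reuses the model and tokenizer from the earlier examples, and the description text is illustrative.

```python
# Reusing `model`, `tokenizer`, and `device` from the loading sketch above.
# Commas in the prompt (the text to be spoken) tend to produce small pauses in the output.
description = "Gary speaks at a moderate pace in very clear audio."
prompts = {
    "flat": "So today we will cover pricing features and the release timeline.",
    "paused": "So, today, we will cover pricing, features, and the release timeline.",
}

for label, prompt in prompts.items():
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    sf.write(f"pacing_{label}.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)
```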

The remaining speech features, such as gender, speaking rate, pitch, and reverberation, can be directly manipulated through the text prompt. This level of control allows users to fine-tune the generated speech to match specific requirements or preferences. By carefully crafting the input description, users can achieve a wide range of voice characteristics, from a slow, deep masculine voice to a rapid, high-pitched feminine one, with varying degrees of reverberation to simulate different acoustic environments.
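
As a final sketch, the description below tries to steer several of these attributes at once (a slow, deep masculine voice with noticeable reverberation). As with the earlier snippets, the exact phrasing is free-form and the result depends on how the model interprets it.

```python
# Reusing `model`, `tokenizer`, and `device` from the loading sketch above.
# Gender, speaking rate, pitch, and reverberation are all steered through the description text.
description = (
    "Gary speaks slowly in a deep, low-pitched masculine voice. "
    "The recording sounds distant and echoey, with noticeable reverberation."
)
prompt = "Somewhere beyond the ridge, the storm was already turning back."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
sf.write("deep_reverb.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)
```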


Parler-TTS emerges as a cutting-edge text-to-speech library, featuring two models: Large v1 and Mini v1. Trained on 45,000 hours of audio, these models generate high-quality speech with controllable features. The library offers speaker consistency across 34 voices and embraces open-source principles, fostering community innovation. Users can optimize output by specifying audio clarity, using punctuation for prosody control, and manipulating speech characteristics through text prompts. With its comprehensive ecosystem and user-friendly approach, Parler-TTS represents a significant advancement in speech synthesis technology, providing powerful tools for both complex tasks and lightweight applications.


Check out the GitHub and Demo. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.

Don’t forget to join our 48k+ ML SubReddit

Find Upcoming AI Webinars here


