Generating Video Subtitles with Whisper: From Audio Extraction to Batch Processing

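Generating subtitles is a core need in many video processing tasks. This article walks you through using OpenAI's Whisper model to generate subtitles (in SRT format) for video files, such as the TV series Normal People or the film In the Mood for Love. We start by extracting the audio, work up to subtitle generation, and then provide a Python script for batch processing. We also look at how to handle non-English audio (such as Chinese) and how to improve subtitle quality.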

Prerequisites

Before you begin, make sure the following tools are installed:

1. FFmpeg: used to extract audio from video.

Installation

macOS (Homebrew):

brew install ffmpeg

Debian/Ubuntu:

sudo apt-get install ffmpeg

Fedora:

sudo dnf install ffmpeg

2. Python 3.8+: used to run the scripts and Whisper.

Install Python from python.org.

3. Whisper: OpenAI's speech-to-text model.

pip install openai-whisper

4. uv (optional): used to manage Python project environments.

pip install uv

5. Video files: have your video files ready in MP4 or MKV format (for example Normal People or In the Mood for Love).

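Step 1: Extract the Audio

The first step is to extract the audio from the video file. We use FFmpeg to copy the video's audio stream into an AAC file. Because the stream is copied rather than re-encoded, this assumes the source audio is already AAC, which is commonly the case for MP4 files.

Example command

Extract the audio for Season 1, Episode 1 of Normal People: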

ffmpeg -i /path/to/Normal.People.S01E01.mp4 -vn -acodec copy /path/to/audio/Normal.People.S01E01.aac

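-i: the input video file.

-vn: disable video, so only the audio is written.

-acodec copy: copy the audio stream as-is, without re-encoding.

/path/to/audio/Normal.People.S01E01.aac: the output audio file.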

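Notes

/path/to/audio/ must exist before you run the command; FFmpeg does not create the output directory.

/path/to/ is a placeholder; replace it with the actual location of your files.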
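The stream copy in Step 1 only yields a valid .aac file when the source audio really is AAC. The following is a minimal sketch of one way to handle other codecs, assuming you have already looked up the codec name of the audio stream (for example with ffprobe). The function name build_extract_command and the COPY_EXTENSIONS table are hypothetical helpers for illustration, not part of FFmpeg.

```python
from pathlib import Path

# Hypothetical mapping from audio codec names to a file extension that can
# hold the copied stream; extend it for the codecs in your own library.
COPY_EXTENSIONS = {"aac": ".aac", "mp3": ".mp3", "ac3": ".ac3", "flac": ".flac"}


def build_extract_command(video_path, audio_dir, codec):
    """Return an ffmpeg argument list that extracts the audio track.

    The stream is copied when the codec has a known extension; otherwise it
    is re-encoded to AAC so that the output file is still valid.
    """
    video_path = Path(video_path)
    extension = COPY_EXTENSIONS.get(codec)
    if extension is not None:
        audio_codec = "copy"
    else:
        extension, audio_codec = ".aac", "aac"
    output_path = Path(audio_dir) / (video_path.stem + extension)
    return [
        "ffmpeg", "-i", str(video_path), "-vn",
        "-acodec", audio_codec, str(output_path),
    ]


command = build_extract_command(
    "/path/to/Normal.People.S01E01.mp4", "/path/to/audio", "aac"
)
print(" ".join(command))
```

The returned list can be passed directly to subprocess.run. Re-encoding is slower than copying, so the sketch only falls back to it for codecs that are not in the table.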

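Step 2: Generate Subtitles

Use the Whisper model to convert the audio file into an SRT subtitle file. Whisper comes in several model sizes (tiny, base, small, medium, large, and turbo); turbo is fast, which makes it a good choice for quick tests.

Example command

Generate subtitles for the extracted audio: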

whisper /path/to/audio/Normal.People.S01E01.aac --model turbo --output_format srt --output_dir /path/to/generated_subs/

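--model turbo: use the turbo model.

--output_format srt: write the result as an SRT file.

--output_dir: the directory where the output is written.

/path/to/generated_subs/Normal.People.S01E01.srt: the resulting subtitle file, named after the input audio file.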

Example output

The first few generated subtitles might look like this:

1
00:00:00,000 --> 00:00:24,000
It's a simple game. You have 15 players. Give one of them the ball.
Get it into the net.

2
00:00:24,000 --> 00:00:26,000
Very simple. Isn't it?
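To make the SRT layout concrete, here is a small sketch that renders a list of segments as SRT text. It assumes segments shaped like the ones in Whisper's transcription result (dictionaries with start and end in seconds plus a text field); the helper names format_timestamp and segments_to_srt are made up for this illustration.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Render segments (dicts with start, end, text) as SRT text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        # Each cue is: running number, timing line, text, then a blank line.
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


segments = [
    {"start": 0.0, "end": 24.0, "text": " It's a simple game."},
    {"start": 24.0, "end": 26.0, "text": " Very simple. Isn't it?"},
]
srt_text = segments_to_srt(segments)
print(srt_text)
```

Each cue consists of a running number, a timing line with a comma before the milliseconds, the text, and a blank separator line.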

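Step 3: Batch Processing Script

Generating subtitles by hand for many videos is inefficient. The following Python script processes every video file in a directory automatically, extracting the audio and then generating subtitles.

Full script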

import argparse
import os
import subprocess


def extract_audio(input_dir, output_dir):
    """Extract audio from video files in input_dir and save to output_dir."""
    # Create the output directory if it does not exist yet.
    os.makedirs(output_dir, exist_ok=True)
    for filename in os.listdir(input_dir):
        if filename.endswith(('.mp4', '.mkv')):
            input_path = os.path.join(input_dir, filename)
            audio_filename = os.path.splitext(filename)[0] + '.aac'
            output_path = os.path.join(output_dir, audio_filename)
            # -vn drops the video; -acodec copy keeps the audio stream as-is.
            command = [
                'ffmpeg', '-i', input_path, '-vn', '-acodec', 'copy', output_path
            ]
            print(f"Extracting audio: {command}")
            try:
                subprocess.run(command, check=True)
            except subprocess.CalledProcessError as e:
                # Report the failure and continue with the next file.
                print(f"Error extracting audio from {filename}: {e}")


def generate_subtitles(input_dir, output_dir):
    """Generate subtitles for audio files using Whisper."""
    os.makedirs(output_dir, exist_ok=True)
    for filename in os.listdir(input_dir):
        if filename.endswith('.aac'):
            input_path = os.path.join(input_dir, filename)
            command = [
                'whisper', input_path, '--model', 'turbo',
                '--output_format', 'srt', '--output_dir', output_dir
            ]
            print(f"Generating subtitles: {command}")
            try:
                subprocess.run(command, check=True)
            except subprocess.CalledProcessError as e:
                print(f"Error generating subtitles for {filename}: {e}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Extract audio and generate subtitles.")
    parser.add_argument("input_dir", help="Directory containing video files.")
    parser.add_argument("audio_dir", help="Directory to save extracted audio files.")
    parser.add_argument("subtitle_dir", help="Directory to save generated subtitles.")
    args = parser.parse_args()
    extract_audio(args.input_dir, args.audio_dir)
    generate_subtitles(args.audio_dir, args.subtitle_dir)

Usage

Save the script as generate_subtitles.py, then run:

python generate_subtitles.py /path/to/videos /path/to/audio /path/to/generated_subs
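When the batch script is rerun on a large directory, it can help to skip videos that already have subtitles. The sketch below shows one way to compute that list, assuming subtitles are named after the video file with an .srt extension, as in the steps above. The function pending_videos is a hypothetical helper, and the file names in the demo are placeholders.

```python
import tempfile
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".mkv"}


def pending_videos(video_dir, subtitle_dir):
    """Return the videos in video_dir that have no matching .srt file."""
    subtitle_dir = Path(subtitle_dir)
    pending = []
    for video in sorted(Path(video_dir).iterdir()):
        if video.suffix.lower() not in VIDEO_EXTENSIONS:
            continue  # ignore anything that is not a video file
        if not (subtitle_dir / (video.stem + ".srt")).exists():
            pending.append(video)
    return pending


# Small demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    videos = Path(root, "videos")
    subs = Path(root, "subs")
    videos.mkdir()
    subs.mkdir()
    for name in ("E01.mp4", "E02.mkv", "notes.txt"):
        (videos / name).touch()
    (subs / "E01.srt").touch()  # E01 has already been processed
    todo = [path.name for path in pending_videos(videos, subs)]

print(todo)
```

Only the paths returned by such a helper would then be passed on to the extraction and transcription steps.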

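Step 4: Improve Subtitle Quality

The generated subtitles may have the following problems. A fix is suggested for each.

Problem 1: Inaccurate timestamps

Solution: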

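--max_line_width 50: limit each subtitle line to 50 characters (Whisper only applies this together with --word_timestamps True).

--max_line_count 2: limit each subtitle to 2 lines (also requires --word_timestamps True).

Timestamps can also be adjusted afterwards with pysrt (pip install pysrt):

import pysrt

subs = pysrt.open('subtitles.srt')
for sub in subs:
    # start.ordinal is the start time in milliseconds; start.seconds would
    # only be the seconds field (0-59) of the timestamp.
    if sub.start.ordinal < 18_000:
        # Delay cues that start within the first 18 seconds by 18 seconds.
        sub.shift(seconds=18)
subs.save('adjusted_subtitles.srt')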
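If the whole subtitle track is offset by a constant amount, every cue needs the same shift. As a dependency-free illustration of that idea, the sketch below rewrites the timing lines of SRT text with the standard library only. The names shift_timestamp and shift_srt are hypothetical, and the sketch assumes well-formed HH:MM:SS,mmm timestamps.

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def shift_timestamp(match, offset_ms):
    """Return the matched timestamp moved by offset_ms, clamped at zero."""
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    total = ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis + offset_ms
    total = max(total, 0)
    hours, rest = divmod(total, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def shift_srt(srt_text, offset_ms):
    """Shift every timestamp on the timing lines of srt_text by offset_ms."""
    shifted = []
    for line in srt_text.splitlines():
        if "-->" in line:  # only timing lines contain the arrow
            line = TIMESTAMP.sub(lambda m: shift_timestamp(m, offset_ms), line)
        shifted.append(line)
    return "\n".join(shifted)


cue = "1\n00:00:24,000 --> 00:00:26,000\nVery simple. Isn't it?"
delayed = shift_srt(cue, 18_000)
print(delayed)
```

A negative offset moves cues earlier; timestamps that would fall before zero are clamped to 00:00:00,000.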

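Problem 2: Subtitles are too long

Solution:

Split the text into sentences with NLTK (example code):

import nltk
from nltk.tokenize import sent_tokenize

# Download the sentence tokenizer data; newer NLTK releases may ask for
# 'punkt_tab' instead of 'punkt'.
nltk.download('punkt')


def split_long_subtitle(text):
    """Split a long subtitle into individual sentences."""
    return sent_tokenize(text)


long_text = "It's a simple game. You have 15 players. Give one of them the ball."
sentences = split_long_subtitle(long_text)
# Output: ["It's a simple game.", 'You have 15 players.', 'Give one of them the ball.']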
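Splitting the text is only half of the job: each resulting sentence also needs its own time range. One simple heuristic, sketched here as an assumption rather than anything Whisper or NLTK provides, is to divide the original cue's duration in proportion to the length of each sentence. The function split_cue is a hypothetical helper.

```python
def split_cue(start, end, sentences):
    """Split one cue into one cue per sentence.

    The original duration (in seconds) is divided in proportion to the
    number of characters in each sentence.
    """
    sentences = [s for s in sentences if s]
    if not sentences:
        return []
    total_chars = sum(len(s) for s in sentences)
    duration = end - start
    cues = []
    cursor = start
    for sentence in sentences:
        share = duration * len(sentence) / total_chars
        cues.append((round(cursor, 3), round(cursor + share, 3), sentence))
        cursor += share
    return cues


parts = split_cue(0.0, 24.0, [
    "It's a simple game.",
    "You have 15 players.",
    "Give one of them the ball.",
])
for cue_start, cue_end, text in parts:
    print(f"{cue_start:7.3f} -> {cue_end:7.3f}  {text}")
```

This is only an approximation, since speech rate is not constant. Word-level timestamps give more precise boundaries when they are available.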

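Problem 3: Inconsistent punctuation

Solution:

--append_punctuations ".,!?": tells Whisper which punctuation marks to attach to the preceding word (only used together with --word_timestamps True).

Missing sentence-final punctuation can be patched with spaCy's sentence segmentation:

import spacy

nlp = spacy.load("en_core_web_sm")
text = "It's a simple game You have 15 players"
doc = nlp(text)
# spaCy does not restore punctuation by itself; this uses its sentence
# boundaries and appends a period to each sentence that lacks end punctuation.
sentences = [sent.text.strip() for sent in doc.sents]
punctuated_text = " ".join(
    s if s.endswith((".", "!", "?")) else s + "." for s in sentences
)
# Intended output: It's a simple game. You have 15 players.
# (whether the split lands there depends on the model's sentence segmentation)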
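Some punctuation inconsistencies are purely cosmetic, such as stray spaces before a comma or a missing final period. A minimal cleanup pass for English text, using only regular expressions, might look like the sketch below. The function tidy_punctuation is a hypothetical helper, and its rules are deliberately simple.

```python
import re


def tidy_punctuation(text):
    """Normalize spacing around punctuation and ensure a sentence ending."""
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove spaces that precede a punctuation mark.
    text = re.sub(r"\s+([.,!?])", r"\1", text)
    # Add a period when the text does not end with sentence punctuation.
    if text and text[-1] not in ".!?":
        text += "."
    return text


print(tidy_punctuation("It's a simple game ,  isn't it ?"))
print(tidy_punctuation("Very simple"))
```

Rules like these do not apply to Chinese text, which uses full-width punctuation and no spaces between words.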

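Step 5: Handling Non-English Audio (Such as Chinese)

Example command

Generate Chinese subtitles (--task transcribe keeps the source language; rerun with --task translate for English output):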

whisper /path/to/In.the.Mood.for.Love.mp4 --model large --output_format srt --output_dir /path/to/generated_subs --language zh --task transcribe
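Keeping transcription and translation as two separate passes can be scripted by building one command per task. The sketch below assumes the same CLI options as the example command; build_whisper_command is a hypothetical helper, and the zh and en subdirectories are an assumption made so that the two passes, which produce the same file name, do not overwrite each other.

```python
def build_whisper_command(media_path, output_dir, task, language="zh", model="large"):
    """Return a whisper CLI argument list for one pass over media_path."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task}")
    return [
        "whisper", media_path,
        "--model", model,
        "--output_format", "srt",
        "--output_dir", output_dir,
        "--language", language,
        "--task", task,
    ]


media = "/path/to/In.the.Mood.for.Love.mp4"
# Pass 1 writes Chinese subtitles; pass 2 writes English subtitles.
chinese_pass = build_whisper_command(media, "/path/to/generated_subs/zh", "transcribe")
english_pass = build_whisper_command(media, "/path/to/generated_subs/en", "translate")
for command in (chinese_pass, english_pass):
    print(" ".join(command))
```

Each list can be handed to subprocess.run(command, check=True), in the same way as in the batch script.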

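Optimization tips

Use the large model: larger models are generally more accurate on non-English speech.

Specify the dialect: for Cantonese, for example, pass:

--language yue

Preprocess the audio: reduce background noise before transcription, for example: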

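ffmpeg -i input.mp4 -af "afftdn" -vn -acodec aac output.aac

(The audio is re-encoded to AAC here because FFmpeg cannot combine an audio filter with -acodec copy.)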

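Notes

Performance: larger models are slower and need more memory; a GPU speeds things up considerably.

File formats: Whisper reads its input through FFmpeg, so most common audio and video formats work.

Debugging: to see progress and diagnostic output, pass:

--verbose True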

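Summary

With FFmpeg and Whisper, generating good-quality subtitles for video is straightforward. The batch script automates audio extraction and subtitle generation, and the techniques for adjusting timestamps, subtitle length, and punctuation improve the result further. For non-English audio such as Chinese, the key points are using the large model, preprocessing the audio, and keeping transcription and translation as separate passes.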
