Whisper Command-Line Explained [2]

This article takes a detailed look at the Whisper command-line tool, covering model selection, device configuration, output-format settings, and more. It explains how to use the tool for speech transcription and translation, and how to tune the various parameters for the best results. Advanced features such as word-level timestamps and keyword highlighting are also covered, to help you get more out of your audio files.

🗣️ **Model selection**: the `--model` flag picks a Whisper model, e.g. `--model large`; models come in several sizes that trade accuracy against speed.

⚙️ **Device configuration**: the `--device` flag sets the device inference runs on, e.g. `--device cuda` for GPU acceleration.

💾 **Output format**: the `--output_format` flag selects the output format, e.g. `--output_format srt`; several formats are supported, including plain text and subtitles.

🗣️ **Task selection**: the `--task` flag chooses between transcription and translation, e.g. `--task translate` to translate the audio into English.

⏱️ **Timestamps**: enabling `--word_timestamps` produces word-level timestamps, which makes editing and proofreading easier (a combined example follows this list).
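
Putting these flags together, a typical invocation might look like the sketch below. The file name `audio.wav` is a placeholder, and `--device cuda` assumes an NVIDIA GPU is available; drop it to run on CPU.

```bash
# Transcribe audio.wav with the large model on a GPU, writing an SRT
# subtitle file that includes word-level timestamps.
whisper audio.wav \
  --model large \
  --device cuda \
  --task transcribe \
  --output_format srt \
  --word_timestamps True
```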

1. The full help text

```
(pp2) livingbody@192 workspace % whisper --help
usage: whisper [-h] [--model MODEL] [--model_dir MODEL_DIR] [--device DEVICE] [--output_dir OUTPUT_DIR]
               [--output_format {txt,vtt,srt,tsv,json,all}] [--verbose VERBOSE]
               [--task {transcribe,translate}]
               [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]
               [--temperature TEMPERATURE] [--best_of BEST_OF] [--beam_size BEAM_SIZE]
               [--patience PATIENCE] [--length_penalty LENGTH_PENALTY]
               [--suppress_tokens SUPPRESS_TOKENS] [--initial_prompt INITIAL_PROMPT]
               [--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT] [--fp16 FP16]
               [--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]
               [--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD]
               [--logprob_threshold LOGPROB_THRESHOLD] [--no_speech_threshold NO_SPEECH_THRESHOLD]
               [--word_timestamps WORD_TIMESTAMPS] [--prepend_punctuations PREPEND_PUNCTUATIONS]
               [--append_punctuations APPEND_PUNCTUATIONS] [--highlight_words HIGHLIGHT_WORDS]
               [--max_line_width MAX_LINE_WIDTH] [--max_line_count MAX_LINE_COUNT]
               [--max_words_per_line MAX_WORDS_PER_LINE] [--threads THREADS]
               [--clip_timestamps CLIP_TIMESTAMPS]
               [--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD]
               audio [audio ...]

positional arguments:
  audio                 audio file(s) to transcribe

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         name of the Whisper model to use (default: turbo)
  --model_dir MODEL_DIR
                        the path to save model files; uses ~/.cache/whisper by default (default: None)
  --device DEVICE       device to use for PyTorch inference (default: cpu)
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        directory to save the outputs (default: .)
  --output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
                        format of the output file; if not specified, all available formats will be
                        produced (default: all)
  --verbose VERBOSE     whether to print out the progress and debug messages (default: True)
  --task {transcribe,translate}
                        whether to perform X->X speech recognition ('transcribe') or X->English
                        translation ('translate') (default: transcribe)
  --language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
                        language spoken in the audio, specify None to perform language detection
                        (default: None)
  --temperature TEMPERATURE
                        temperature to use for sampling (default: 0)
  --best_of BEST_OF     number of candidates when sampling with non-zero temperature (default: 5)
  --beam_size BEAM_SIZE
                        number of beams in beam search, only applicable when temperature is zero
                        (default: 5)
  --patience PATIENCE   optional patience value to use in beam decoding, as in
                        https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to
                        conventional beam search (default: None)
  --length_penalty LENGTH_PENALTY
                        optional token length penalty coefficient (alpha) as in
                        https://arxiv.org/abs/1609.08144, uses simple length normalization by default
                        (default: None)
  --suppress_tokens SUPPRESS_TOKENS
                        comma-separated list of token ids to suppress during sampling; '-1' will
                        suppress most special characters except common punctuations (default: -1)
  --initial_prompt INITIAL_PROMPT
                        optional text to provide as a prompt for the first window. (default: None)
  --condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT
                        if True, provide the previous output of the model as a prompt for the next
                        window; disabling may make the text inconsistent across windows, but the model
                        becomes less prone to getting stuck in a failure loop (default: True)
  --fp16 FP16           whether to perform inference in fp16; True by default (default: True)
  --temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK
                        temperature to increase when falling back when the decoding fails to meet either
                        of the thresholds below (default: 0.2)
  --compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
                        if the gzip compression ratio is higher than this value, treat the decoding as
                        failed (default: 2.4)
  --logprob_threshold LOGPROB_THRESHOLD
                        if the average log probability is lower than this value, treat the decoding as
                        failed (default: -1.0)
  --no_speech_threshold NO_SPEECH_THRESHOLD
                        if the probability of the <|nospeech|> token is higher than this value AND the
                        decoding has failed due to `logprob_threshold`, consider the segment as silence
                        (default: 0.6)
  --word_timestamps WORD_TIMESTAMPS
                        (experimental) extract word-level timestamps and refine the results based on
                        them (default: False)
  --prepend_punctuations PREPEND_PUNCTUATIONS
                        if word_timestamps is True, merge these punctuation symbols with the next word
                        (default: "'“¿([{-)
  --append_punctuations APPEND_PUNCTUATIONS
                        if word_timestamps is True, merge these punctuation symbols with the previous
                        word (default: "'.。,,!!??::”)]}、)
  --highlight_words HIGHLIGHT_WORDS
                        (requires --word_timestamps True) underline each word as it is spoken in srt and
                        vtt (default: False)
  --max_line_width MAX_LINE_WIDTH
                        (requires --word_timestamps True) the maximum number of characters in a line
                        before breaking the line (default: None)
  --max_line_count MAX_LINE_COUNT
                        (requires --word_timestamps True) the maximum number of lines in a segment
                        (default: None)
  --max_words_per_line MAX_WORDS_PER_LINE
                        (requires --word_timestamps True, no effect with --max_line_width) the maximum
                        number of words in a segment (default: None)
  --threads THREADS     number of threads used by torch for CPU inference; supercedes
                        MKL_NUM_THREADS/OMP_NUM_THREADS (default: 0)
  --clip_timestamps CLIP_TIMESTAMPS
                        comma-separated list start,end,start,end,... timestamps (in seconds) of clips to
                        process, where the last end timestamp defaults to the end of the file (default:
                        0)
  --hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD
                        (requires --word_timestamps True) skip silent periods longer than this threshold
                        (in seconds) when a possible hallucination is detected (default: None)
```

2. Breaking down the options

2.1 Model-related parameters

The help text contains three model-related options: the model name, the directory model files are saved to, and the device the model runs on:

```
  --model MODEL         name of the Whisper model to use (default: turbo)
  --model_dir MODEL_DIR
                        the path to save model files; uses ~/.cache/whisper by default (default: None)
  --device DEVICE       device to use for PyTorch inference (default: cpu)
```
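
For instance, the following sketch loads the small model from a custom cache directory and runs it on CPU; `./models` and `audio.wav` are hypothetical names used for illustration.

```bash
# Cache model weights under ./models instead of ~/.cache/whisper,
# and force inference onto the CPU.
whisper audio.wav --model small --model_dir ./models --device cpu
```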

2.2 Input and output

There are three options here: the output directory, the output format, and verbosity (whether progress and debug messages are printed):

```
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        directory to save the outputs (default: .)
  --output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
                        format of the output file; if not specified, all available formats will be
                        produced (default: all)
  --verbose VERBOSE     whether to print out the progress and debug messages (default: True)
```
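
As a quick sketch (with `./subs` and `audio.wav` as placeholder names), the following writes only an SRT file into a dedicated directory and silences the per-segment progress output:

```bash
# Produce just subs/audio.srt rather than all output formats.
whisper audio.wav --output_dir ./subs --output_format srt --verbose False
```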

2.3 Task-related parameters

This is the largest group. It covers the task type (plain speech-to-text vs. translation), the language setting, sampling and decoding parameters, and more (the long `--language` choices list is abbreviated below; the full list appears in the help output above):

```
  --task {transcribe,translate}
                        whether to perform X->X speech recognition ('transcribe') or X->English
                        translation ('translate') (default: transcribe)
  --language {af,am,ar,...,Yoruba}
                        language spoken in the audio, specify None to perform language detection
                        (default: None)
  --temperature TEMPERATURE
                        temperature to use for sampling (default: 0)
  --best_of BEST_OF     number of candidates when sampling with non-zero temperature (default: 5)
  --beam_size BEAM_SIZE
                        number of beams in beam search, only applicable when temperature is zero
                        (default: 5)
  --patience PATIENCE   optional patience value to use in beam decoding, as in
                        https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to
                        conventional beam search (default: None)
  --length_penalty LENGTH_PENALTY
                        optional token length penalty coefficient (alpha) as in
                        https://arxiv.org/abs/1609.08144, uses simple length normalization by default
                        (default: None)
  --suppress_tokens SUPPRESS_TOKENS
                        comma-separated list of token ids to suppress during sampling; '-1' will
                        suppress most special characters except common punctuations (default: -1)
  --initial_prompt INITIAL_PROMPT
                        optional text to provide as a prompt for the first window. (default: None)
  --condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT
                        if True, provide the previous output of the model as a prompt for the next
                        window; disabling may make the text inconsistent across windows, but the model
                        becomes less prone to getting stuck in a failure loop (default: True)
  --fp16 FP16           whether to perform inference in fp16; True by default (default: True)
  --temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK
                        temperature to increase when falling back when the decoding fails to meet either
                        of the thresholds below (default: 0.2)
  --compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
                        if the gzip compression ratio is higher than this value, treat the decoding as
                        failed (default: 2.4)
  --logprob_threshold LOGPROB_THRESHOLD
                        if the average log probability is lower than this value, treat the decoding as
                        failed (default: -1.0)
  --no_speech_threshold NO_SPEECH_THRESHOLD
                        if the probability of the <|nospeech|> token is higher than this value AND the
                        decoding has failed due to `logprob_threshold`, consider the segment as silence
                        (default: 0.6)
  --word_timestamps WORD_TIMESTAMPS
                        (experimental) extract word-level timestamps and refine the results based on
                        them (default: False)
  --prepend_punctuations PREPEND_PUNCTUATIONS
                        if word_timestamps is True, merge these punctuation symbols with the next word
                        (default: "'“¿([{-)
  --append_punctuations APPEND_PUNCTUATIONS
                        if word_timestamps is True, merge these punctuation symbols with the previous
                        word (default: "'.。,,!!??::”)]}、)
  --highlight_words HIGHLIGHT_WORDS
                        (requires --word_timestamps True) underline each word as it is spoken in srt and
                        vtt (default: False)
  --max_line_width MAX_LINE_WIDTH
                        (requires --word_timestamps True) the maximum number of characters in a line
                        before breaking the line (default: None)
  --max_line_count MAX_LINE_COUNT
                        (requires --word_timestamps True) the maximum number of lines in a segment
                        (default: None)
  --max_words_per_line MAX_WORDS_PER_LINE
                        (requires --word_timestamps True, no effect with --max_line_width) the maximum
                        number of words in a segment (default: None)
  --threads THREADS     number of threads used by torch for CPU inference; supercedes
                        MKL_NUM_THREADS/OMP_NUM_THREADS (default: 0)
  --clip_timestamps CLIP_TIMESTAMPS
                        comma-separated list start,end,start,end,... timestamps (in seconds) of clips to
                        process, where the last end timestamp defaults to the end of the file (default:
                        0)
  --hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD
                        (requires --word_timestamps True) skip silent periods longer than this threshold
                        (in seconds) when a possible hallucination is detected (default: None)
```
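
For example, a sketch combining the task-level flags (`ja.wav` is a placeholder for a Japanese recording):

```bash
# Translate Japanese speech directly into English text, using the default
# greedy decoding (temperature 0) with a beam size of 5.
whisper ja.wav --task translate --language Japanese --temperature 0 --beam_size 5
```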

3. Running a demo

3.1 Chinese speech recognition

The basic command format is shown below; let's see how it does on a Mac.

```bash
whisper /path/to/audio/file --model /path/to/custom/model --language Chinese
whisper zh.wav --model tiny.pt --language Chinese
```

Output:

```
(pp2) livingbody@192 sound4 % whisper zh.wav --model tiny.pt
/Users/livingbody/miniconda3/envs/pp2/lib/python3.9/site-packages/whisper/transcribe.py:126: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Chinese
[00:00.000 --> 00:04.480] 我認為跑步最重要的就是給我帶來了身體健康
```

As you can see, the tiny model does not handle Simplified Chinese well: the transcript comes out in Traditional characters, so a further conversion step is needed if you want to use it.
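
One common way to do that conversion is OpenCC. Below is a minimal sketch assuming the `opencc` command-line tool is installed with its bundled `t2s.json` (Traditional-to-Simplified) profile; the file names are placeholders:

```bash
# Convert the Traditional Chinese transcript Whisper produced into
# Simplified Chinese. zh.txt is Whisper's txt output for zh.wav.
opencc -c t2s.json -i zh.txt -o zh_simplified.txt
```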

3.2 English speech recognition

```bash
whisper en.wav --model tiny.pt
```

Output:

```
(pp2) livingbody@192 sound4 % whisper en.wav --model tiny.pt
/Users/livingbody/miniconda3/envs/pp2/lib/python3.9/site-packages/whisper/transcribe.py:126: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:03.000]  I knocked at the door on the ancient side of the building.
```

The English output is quite accurate. The tiny model weighs in at only about 70 MB, which makes it a good fit for deployment on devices like a Raspberry Pi.
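
For that kind of constrained device, flags from the help above can keep resource usage predictable. A sketch, assuming CPU-only inference and a placeholder `audio.wav`:

```bash
# Limit torch to 4 CPU threads and disable fp16, which is unsupported
# on CPU anyway (as the warning in the runs above shows).
whisper audio.wav --model tiny --device cpu --threads 4 --fp16 False
```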
