There are 6 samples at -4 dB, 62 at -5 dB, 286 at -6 dB, etc. In other
words, raising the volume by +4 dB does not cause any clipping, raising it
by +5 dB causes clipping for 6 samples, etc.

@anchor{whisper}
@section whisper

It runs automatic speech recognition using OpenAI's Whisper model.

It requires the whisper.cpp library (https://github.com/ggml-org/whisper.cpp)
as a prerequisite. After installing the library, it can be enabled using
@code{./configure --enable-whisper} (a setup sketch is given before the
examples below).

The filter accepts the following options:

@table @option
@item model
The file path of the downloaded whisper.cpp model (mandatory).

@item language
The language to use for transcription ('auto' for auto-detect).
Default value: @code{"auto"}

@item queue
The maximum duration of audio that will be queued into the filter before
processing it with whisper. With a small value, the audio stream is
processed more often, but the transcription quality is lower and the
required processing power is higher. A large value (e.g. 10-20 seconds)
produces more accurate results using less CPU (comparable to using the
whisper-cli tool), but the transcription latency is higher, making it
unsuitable for processing real-time streams. Consider combining the
vad_model option with a large queue value.
Default value: @code{"3"}

@item use_gpu
Whether GPU support should be enabled.
Default value: @code{"true"}

@item gpu_device
The GPU device index to use.
Default value: @code{"0"}

@item destination
If set, the transcription output will be sent to the specified file or URL
(use one of the FFmpeg AVIO protocols); otherwise, the output will be logged
as info messages. The output is also set in the "lavfi.whisper.text" frame
metadata (see the example after this table). If the destination is a file
and it already exists, it will be overwritten.

@item format
The destination format string; it can be "text" (only the transcribed text
will be sent to the destination), "srt" (subtitle format) or "json".
Default value: @code{"text"}

@item vad_model
Path to the VAD model file. If set, the filter will load an additional voice
activity detection module (https://github.com/snakers4/silero-vad) that will
be used to fragment the audio queue; set this option to a valid model path
obtained from the whisper.cpp repository
(e.g. "../whisper.cpp/models/ggml-silero-v5.1.2.bin") and increase the queue
parameter to a higher value (e.g. 20).

@item vad_threshold
The VAD threshold to use.
Default value: @code{"0.5"}

@item vad_min_speech_duration
The minimum VAD speech duration.
Default value: @code{"0.1"}

@item vad_min_silence_duration
The minimum VAD silence duration.
Default value: @code{"0.5"}
@end table
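Since the transcribed text is also attached to the output frames as
"lavfi.whisper.text" metadata, it can be inspected with the @code{ametadata}
filter. A minimal sketch (the model path and input file are illustrative):
@example
ffmpeg -i input.mp4 -vn -af "whisper=model=../whisper.cpp/models/ggml-base.en.bin\
:queue=3,ametadata=mode=print:key=lavfi.whisper.text" -f null -
@end example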
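The prerequisite library and at least one model must be obtained before the
filter can be used. The following is only a sketch of a typical setup (the
exact steps depend on the platform and the whisper.cpp version; see the
whisper.cpp documentation for authoritative instructions):
@example
# Build and install whisper.cpp (CMake-based), then download a model
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build
sudo cmake --install build
sh ./models/download-ggml-model.sh base.en
# Afterwards, reconfigure FFmpeg with: ./configure --enable-whisper
@end example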
@subsection Examples

@itemize
@item
Run a transcription with srt file generation:
@example
ffmpeg -i input.mp4 -vn -af "whisper=model=../whisper.cpp/models/ggml-base.en.bin\
:language=en\
:queue=3\
:destination=output.srt\
:format=srt" -f null -
@end example

@item
Run a transcription and send the output in JSON format to an HTTP service:
@example
ffmpeg -i input.mp4 -vn -af "whisper=model=../whisper.cpp/models/ggml-base.en.bin\
:language=en\
:queue=3\
:destination=http\\://localhost\\:3000\
:format=json" -f null -
@end example

@item
Transcribe the microphone input using the VAD option:
@example
ffmpeg -loglevel warning -f pulse -i default \
-af 'highpass=f=200,lowpass=f=3000,whisper=model=../whisper.cpp/models/ggml-medium.bin\
:language=en\
:queue=10\
:destination=-\
:format=json\
:vad_model=../whisper.cpp/models/ggml-silero-v5.1.2.bin' -f null -
@end example
@end itemize

@c man end AUDIO FILTERS

@chapter Audio Sources