According to NVIDIA's developer website, you can use the GPU to speed up the rendering of FFmpeg filters:
Create high-performance end-to-end hardware-accelerated video processing, 1:N encoding and 1:N transcoding pipeline using built-in filters in FFmpeg
Ability to add your own custom high-performance CUDA filters using the shared CUDA context implementation in FFmpeg
The problem I am having now: how do I use the GPU to speed up the processing of multiple FFmpeg filters?
For example:
```
ffmpeg -loop 1 -i dog.jpg -filter_complex "scale=iw*4:-1,zoompan=z='zoom+0.002':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':s=720x960" -pix_fmt yuv420p -vcodec libx264 -preset ultrafast -y -r:v 25 -t 5 -crf 28 dog.mp4
```
When it comes to hardware acceleration in FFmpeg, you can expect the following implementations by type:
1. Hardware-accelerated encoders: In the case of NVIDIA, NVENC is supported and implemented via the h264_nvenc and the hevc_nvenc wrappers. See this answer on how to tune them, and any limitations you may run into depending on the generation of hardware you're on.
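For instance, a basic single-output NVENC encode might look like this (the file names and bitrate are illustrative):

```
# Software decode, hardware (NVENC) encode; audio is passed through untouched.
ffmpeg -y -i input.mp4 -c:v h264_nvenc -preset slow -b:v 5M -c:a copy output.mp4
```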
2. Hardware-accelerated filters: Filters that perform duties such as scaling and post-processing (deinterlacing, etc) are available in FFmpeg, and some implementations are hardware-accelerated. For NVIDIA, the following filters can take advantage of hardware-acceleration:
(a). scale_cuda: This is a scaling filter analogous to the generic scale filter, implemented in CUDA. Its dependency is the ffnvcodec project, whose headers are also needed to enable the NVENC-based encoders. When the ffnvcodec headers are present, the filters that depend on them (scale_cuda and yadif_cuda) are enabled automatically. In production, it may be wise to deprecate this filter in favor of scale_npp, as scale_cuda has a very limited set of options.

(b). scale_npp: This is a scaling filter implemented via NVIDIA's Performance Primitives. Its primary dependency is the CUDA SDK, and it must be explicitly enabled by passing the --enable-libnpp, --enable-cuda-nvcc and --enable-nonfree flags to ./configure at compile time when building FFmpeg from source. Use this filter in place of scale_cuda wherever possible, as sketched below.

(c). yadif_cuda: This is a deinterlacer, implemented in CUDA. Its dependency, as stated above, is the ffnvcodec package of headers.
(d). All OpenCL-based filters: All NVENC-capable GPUs supported by both the mainline NVIDIA driver and the CUDA SDK implement OpenCL support. I started this section with this clarification because there's news in the wind that NVIDIA will be deprecating mobile Kepler GPUs in their mainline driver, relegating them to Legacy support status. For this reason, if you're on such a platform, take this into consideration.
To enable these filters, pass --enable-opencl to FFmpeg's ./configure script at build time. Note that this requires the OpenCL headers to be present on your system, a requirement that can be safely satisfied by your package manager on whatever Linux distribution you're on. On other operating systems, your mileage may vary. To see all OpenCL-based filters, run:
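```
# Lists every filter whose name or description mentions OpenCL.
ffmpeg -hide_banner -filters | grep opencl
```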
A few notable examples are unsharp_opencl, avgblur_opencl, etc. See this wiki section for more options.

A note pertaining to performance with OpenCL filters: take into account any overhead that filter-chain mechanisms such as hwupload and hwdownload may introduce into your pipeline, as uploading textures to and from system memory and the accelerator in question will affect performance, as will format conversion operations (via the format filter) where needed. In this case, it may be beneficial to take advantage of the hwmap filter, deriving contexts where applicable. For instance, VAAPI has a mechanism that allows for OpenCL device derivation and reverse mapping via hwmap, if the cl_intel_va_api_media_sharing OpenCL extension is present. This is typically provided by the Beignet ICD, and is absent in others, such as the newer Neo OpenCL driver.

3. Hardware-accelerated decoders (and their associated wrappers): Depending on your input source and the capabilities of your NVIDIA GPU (based on generation), you may also tap into hardware-accelerated decoding via either CUVID or NVDEC. These methods differ in how they handle textures in flight on the accelerator, and it is wise to evaluate other factors, such as VRAM utilization, when they are in use. Typically, you can take advantage of the CUVID-based hwaccels for operations such as deinterlacing, if so desired. See their usage via:
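For example, to inspect the H.264 CUVID wrapper and its options (including its built-in deinterlacer):

```
ffmpeg -hide_banner -h decoder=h264_cuvid
```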
However, beware that handling MBAFF encoded content with these decoders, where double deinterlacing is required, is not advisable as NVIDIA has not yet implemented MBAFF support in the backend. Take a look at this thread for more on the same.
In closing: it is wise to evaluate where and when hardware-accelerated offloading (filtering, encoding and decoding) offers an advantage or an acceptable trade-off (in quality, feature support and reliability) in your pipeline prior to deployment in production. This vendor-neutral approach applies when deciding what parts of your pipeline to offload and when, and it applies to NVIDIA's solutions as much as any other vendor's.
For more information, refer to the hardware acceleration entry in FFmpeg's wiki.
Warning: Be sure to lower the decoder's thread count to 1. These hwaccels, particularly cuvid (and the nvdec wrapper), do not implement threading support. In fact, they'll throw warnings at you if the thread count exceeds 16.
Pass -threads 1 to ffmpeg before the input, for example:
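A minimal sketch (the file names and bitrate are illustrative):

```
# The first -threads (before -i) limits the decoder to one thread;
# the second applies to the encoder and muxer instead.
ffmpeg -y -threads 1 -hwaccel cuda -i input.mp4 -c:v h264_nvenc -b:v 5M -threads 4 -c:a copy output.mp4
```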
The argument position of -threads is important: before the input, it sets the thread count for the decoder to 1; after the input, it sets the thread count used by FFmpeg's encoders and muxers (where threading is supported) to the configured value.

Samples demonstrating the use of hardware-accelerated filtering, encoding and decoding based on the notes above:
1. Demonstrate the use of 1:N encoding with NVENC:
The following assumption is made: the test-bed has only one NVENC-capable GPU present, a simple GTX 1070. For this reason I'm limited to two simultaneous NVENC sessions, and that is taken into account in the snippets below. Be warned that cases needing to utilize multiple NVENC-capable GPUs will need the command line(s) modified as appropriate.
My sample files are in ~/Desktop/src. I'll be working with a sample file, inspected as shown below (the file name here is illustrative):
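```
ffprobe -hide_banner ~/Desktop/src/sample.mp4
```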
With that information, we can tell that the input file is deinterlaced (progressive) and encoded at 59.94 FPS. In the examples below, I'll target the same frame rate, using a closed GOP, assuming a fixed keyframe distance of 2 seconds (set by -g 120, where -r is 60).

I can run this encoder sample as sketched below:
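One possible form of such a 1:N command, decoding once on the GPU and producing two NVENC outputs (the file names, target sizes and bitrates are illustrative):

```
# Decode once, split the CUDA frames, scale each branch with NPP,
# then feed one NVENC session per output (two sessions in total).
ffmpeg -y -hwaccel cuda -hwaccel_output_format cuda -i ~/Desktop/src/sample.mp4 \
  -filter_complex "[0:v]split=2[a][b];[a]scale_npp=1920:1080[a1];[b]scale_npp=1280:720[b1]" \
  -map "[a1]" -c:v h264_nvenc -b:v 8M -g 120 -r 60 -an out-1080p.mp4 \
  -map "[b1]" -c:v h264_nvenc -b:v 4M -g 120 -r 60 -an out-720p.mp4
```

This stays within the two-session NVENC limit noted above: the input is decoded once, split on the GPU, and each branch consumes one NVENC session.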
2. Use the nvdec hwaccel paired with the yadif_cuda deinterlacer:
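A sketch of such a command (the input name and bitrate are illustrative; -hwaccel cuda selects NVDEC on current FFmpeg builds):

```
# NVDEC decode, CUDA-resident frames, GPU deinterlace, NVENC encode.
ffmpeg -y -hwaccel cuda -hwaccel_output_format cuda -i interlaced.ts \
  -vf "yadif_cuda=0:-1:1" -c:v h264_nvenc -b:v 8M -g 120 -r 60 -c:a copy progressive.mp4
```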
You can use an extra filter before the yadif_cuda deinterlacer, hwupload_cuda, in cases where hardware-accelerated decode is undesirable. When you call up the hwupload_cuda filter, it automatically creates a device of type cuda, converts all in-flight textures to the CUDA format and uploads them to the shared CUDA hardware context, from which the downstream yadif_cuda filter can operate. However, if you pass the option -hwaccel_output_format cuda, you can skip this extra hwupload_cuda filter. This is the preferred method for maximum throughput.

The options specified for the yadif_cuda filter (the 0:-1:1 above) are:

(a). Set the deinterlacing mode to send one frame for each frame.
(b). Set the assumed picture type parity as automatic.
(c). Only deinterlace frames marked as interlaced.
You can confirm this by running:
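```
ffmpeg -hide_banner -h filter=yadif_cuda
```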
You can also attempt double de-interlacing (wherein the de-interlacer sends one frame per field, instead of one frame per frame) by applying the deinterlacer options below; note the filter options passed in yadif_cuda=1:-1:1. A sketch under the same assumptions as above:
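```
# Mode 1 emits one frame per field, doubling the output frame rate,
# so no -r is forced on the output here.
ffmpeg -y -hwaccel cuda -hwaccel_output_format cuda -i interlaced.ts \
  -vf "yadif_cuda=1:-1:1" -c:v h264_nvenc -b:v 8M -g 120 -c:a copy progressive-2x.mp4
```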
):However, be cautious with this option as it may fail at some specific frame rates. In my testing, using NTSC interlaced content at 29.970 FPS resulted in failure when attempting a double deinterlace. Your mileage may vary.
3. Demonstrating the use of an OpenCL filter with the NVIDIA GPU:
The filter we will use in this case is tonemap_opencl. The sample file in use has HDR metadata embedded and, using the NVENC encoders, will be encoded to a pair of outputs with tone-mapping applied. The sample file used is from this URL. The filter's usage options can be listed from your build's help, as shown below:
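```
# Shows all tonemap_opencl options supported by this build.
ffmpeg -hide_banner -h filter=tonemap_opencl
```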
From ffprobe:
Now let us apply the tonemap_opencl filter to the previous command, switching to the new input file and timing the command. A sketch (the input name, OpenCL device selection and encoder settings are illustrative; the format=p010,hwupload,...,hwdownload chain follows the pattern documented in FFmpeg's wiki):
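```
# Upload to OpenCL, tone-map HDR10 (hable curve) down to SDR nv12,
# download, then split the result into two NVENC encodes.
time ffmpeg -y -init_hw_device opencl=ocl -filter_hw_device ocl -i hdr-sample.mkv \
  -filter_complex "[0:v]format=p010,hwupload,tonemap_opencl=tonemap=hable:format=nv12,hwdownload,format=nv12,split=2[a][b]" \
  -map "[a]" -c:v h264_nvenc -b:v 10M -an sdr-h264.mp4 \
  -map "[b]" -c:v hevc_nvenc -b:v 8M -an sdr-hevc.mp4
```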
According to FFmpeg, that took:
For more on tone-mapping, see this excellent write-up.
You will need to compile your own FFmpeg build using NVIDIA's extensions (see https://developer.nvidia.com/ffmpeg for instructions), as the standard binary does not include these capabilities.
Possible solution. Untested, so let me know of any errors...
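An untested sketch, adapting the asker's command (the encoder flags are illustrative; note that h264_nvenc does not accept -crf, so constant quality is requested with -rc vbr -cq instead):

```
# zoompan still runs on the CPU; the encode is offloaded to NVENC.
# If the JPEG input cannot be hardware-decoded, ffmpeg falls back to
# software decoding with a warning.
ffmpeg -loop 1 -hwaccel cuda -i dog.jpg \
  -filter_complex "scale=iw*4:-1,zoompan=z='zoom+0.002':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':s=720x960" \
  -pix_fmt yuv420p -vcodec h264_nvenc -preset fast -r:v 25 -t 5 -rc vbr -cq 28 -b:v 0 -y dog.mp4
```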
Where you...

(a). Initialize hardware acceleration with -hwaccel cuda (note that -hwaccel applies to decoding; there is no hwaccel named NVENC, as NVENC is engaged through the encoder choice).

(b). Set the codec as -vcodec h264_nvenc.