Why sliced thread affect so much on realtime encod

I'm using ffmpeg libx264 to encode a 720p screen captured from x11 in realtime with a fps of 30. when I use -tune zerolatency paramenter, the average encode time per-frame can be as large as 12ms with profile baseline.

After a study of the ffmpeg x264 source code, I found that the key parameter leading to such long encode time is sliced-threads which enabled by -tune zerolatency. After disabled using -x264-params sliced-threads=0 the encode time can be as low as 2ms

And with sliced-threads disabled, the CPU usage will be 40%, while only 20% when enabled.

Can someone explain the details about this sliced-thread? Especially in realtime encoding(assume no frame is buffered to be encoded. only encode when a frame is captured).

The documentation shows that frame-based threading has better throughput than slice-based. It also notes that the latter doesn't scale well due to parts of the encoder that are serial.

Speedup vs. encoding threads for the veryfast profile (non-realtime):

threads  speedup       psnr
      slice frame   slice  frame
x264 --preset veryfast --tune psnr --crf 30
 1:   1.00x 1.00x  +0.000 +0.000
 2:   1.41x 2.29x  -0.005 -0.002
 3:   1.70x 3.65x  -0.035 +0.000
 4:   1.96x 3.97x  -0.029 -0.001
 5:   2.10x 3.98x  -0.047 -0.002
 6:   2.29x 3.97x  -0.060 +0.001
 7:   2.36x 3.98x  -0.057 -0.001
 8:   2.43x 3.98x  -0.067 -0.001
 9:         3.96x         +0.000
10:         3.99x         +0.000
11:         4.00x         +0.001
12:         4.00x         +0.001

The main difference seems to be that frame threading adds frame latency as is needs different frames to work on, while in the case of slice-based threading all threads work on the same frame. In realtime encoding it would need to wait for more frames to arrive to fill the pipeline as opposed to offline.

Normal threading, also known as frame-based threading, uses a clever staggered-frame system for parallelism. But it comes at a cost: as mentioned earlier, every extra thread requires one more frame of latency. Slice-based threading has no such issue: every frame is split into slices, each slice encoded on one core, and then the result slapped together to make the final frame. Its maximum efficiency is much lower for a variety of reasons, but it allows at least some parallelism without an increase in latency.

From: Diary of an x264 Developer

Sliceless threading: example with 2 threads. Start encoding frame #0. When it's half done, start encoding frame #1. Thread #1 now only has access to the top half of its reference frame, since the rest hasn't been encoded yet. So it has to restrict the motion search range. But that's probably ok (unless you use lots of threads on a small frame), since it's pretty rare to have such long vertical motion vectors. After a little while, both threads have encoded one row of macroblocks, so thread #1 still gets to use motion range = +/- 1/2 frame height. Later yet, thread #0 finishes frame #0, and moves on to frame #2. Thread #0 now gets motion restrictions, and thread #1 is unrestricted.

From: http://web.archive.org/web/20150307123140/http://akuvian.org/src/x264/sliceless_threads.txt

Therefore it makes sense to enable sliced-threads with -tune zereolatency as you need to send a frame as soon as possible rather then encode them efficiently (performance and quality wise).

Using too many threads on the contrary can impact performance as the overhead to maintain them can exceed the potential gains.