I need to perform 'Inter-frame Prediction' and 'Motion Compensation' of a set of 30 frames for video processing in Matlab. I am working with Mother-daughter frames.
What I have done so far is to take the very first frame and divide it into
- 8x8 blocks
- performed DCT
- quantized it
- dequantized it
- performed inverse DCT.
I know that no motion estimation is required for the first frame. From the second frame onwards, the reconstructed frame one is used as the reference for frame two, and so on. For motion estimation I need to implement the 'Full-search Block Matching Algorithm'.
Question 1: What is meant by reconstruction of a frame? Is it the quantization and DCT steps I have listed above?
Question 2: What is 'Full-search Block Matching Algorithm'?
I'm going to assume that you are referring to the MPEG family of video compression standards (MPEG-1, MPEG-2, H.264, etc.). Let's answer each question one at a time:
Question #1 - Frame Reconstruction
For a single frame, the forward transformation basically consists of decomposing the frame into 8 x 8 non-overlapping blocks, performing an 8 x 8 DCT on each block, quantizing the DCT coefficients, and then performing some further steps such as zig-zag ordering, run-length encoding, etc.
Basically, your frame is represented as a compressed sequence of bits. A reconstruction of the frame goes in the reverse order, so you almost have it right. This consists of decoding the sequence and undoing the zig-zag ordering, then de-quantizing each block, then applying the IDCT. The reason this is called "reconstruction" is that the frame was represented in a different format, and you are converting it back to what it was (approximately) before compression.
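Here is a minimal MATLAB sketch of that round trip, assuming the Image Processing Toolbox (for blockproc, dct2 and idct2) and a simple uniform quantizer with a made-up step size q standing in for the real MPEG quantization matrices; the file name is hypothetical:

    % Forward transform: 8x8 blocks -> DCT -> quantize.
    % Reconstruction: dequantize -> IDCT, reassembled block by block.
    f = im2double(imread('mother_daughter_001.png'));  % hypothetical file
    q = 0.02;                                          % made-up step size

    forward = @(b) round(dct2(b.data) / q);   % DCT + quantize each block
    inverse = @(b) idct2(b.data * q);         % dequantize + IDCT

    coeffs = blockproc(f, [8 8], forward);    % quantized coefficients
    recon  = blockproc(coeffs, [8 8], inverse);  % reconstructed frame

Note that the zig-zag ordering and run-length encoding are omitted here; those steps are lossless, so they don't affect what the reconstructed frame looks like.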
One thing that you may already know is that the quantization step is the reason why this methodology is lossy. This means that you won't be able to get the original frame back, but you can get something as close as possible to it. The advantage is that lossy algorithms give you high compression ratios, which means the size of the video will be smaller and it can more easily be transmitted.
In fact, if you do a forward transformation of one frame, then a reverse transformation, and compare the two frames pixel by pixel, you will see some subtle differences, but nothing to write home about. The parameters and design behind the compression have been tuned so that the average person's visual system won't notice much difference between the original and the reconstructed frame.
So why lossy, you may ask? The reason is that the MPEG standards favour making the video highly compressible and easily transmittable over preserving its exact quality. This is also because quality has always been a subjective measure, even when numerical measures of image quality (PSNR, for instance) exist.
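Continuing the sketch above, you can quantify that difference with PSNR; this assumes f and recon are double images in [0, 1] (newer versions of the Image Processing Toolbox also provide a built-in psnr function):

    % Mean squared error and PSNR between original and reconstruction.
    mse     = mean((f(:) - recon(:)).^2);
    psnr_db = 10 * log10(1 / mse);   % peak signal value is 1 for doubles
    fprintf('PSNR: %.2f dB\n', psnr_db);

Values for a mildly quantized frame commonly land in the 30-50 dB range, at which point the two frames are hard to tell apart by eye.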
So, the moral of this story is that a reconstruction undoes the forward transformation performed to compress the video frame. The result will not be exactly the same as the original frame, but it will be close enough that a normal human being won't complain.
Question #2 - Full-search Block Matching Algorithm
The basic idea behind motion estimation is that we don't want to transmit every frame in full, in order to reduce transmission bandwidth. If you know the basics of the MPEG family of video compression standards, there are three classes of encoded frames in your video:
I-Frames - These are what are known as intra-coded frames. These frames have the full compression algorithm performed on them (DCT, quantization, etc.). We don't make a video that consists entirely of I-Frames, as that would make the size of the video quite large. Instead, I-Frames are used as reference points, and difference frames are sent after them, where a motion vector is transmitted for each block. More to follow.
P-Frames - Instead of sending another I-Frame, we send a predicted frame, or P-Frame, instead. For each block from a reference I-Frame, the P-Frame essentially tells us where that block best moved from one frame to the next. These are what are known as motion vectors, one per block. The rationale is that video is usually captured at such a high frame rate that successive frames exhibit very little difference, so most blocks should stay the same or move very little. Eventually the scene will change drastically, or there will be so much motion that even a high frame rate can't adequately capture it with P-Frames alone. This is commonly seen when you're watching MPEG video with a lot of high motion - you'll see a lot of "blockiness", and that blockiness is explained by this fact. When that happens, the encoder inserts another I-Frame as a quick refresher and continues from there. Most video files therefore have the frames encoded as one I-Frame, then a bunch of P-Frames, then another I-Frame followed by more P-Frames, and so on. (A sketch of the motion-compensation step that rebuilds a frame from these vectors follows this list.)
B-Frames - These are what are known as bi-directionally predicted frames. These frames use information from both the frame (or frames) ahead and the frame (or frames) behind. How exactly these work is beyond the scope of this post, but I mention them briefly to be self-contained.
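As promised, here is a minimal MATLAB sketch of motion compensation: given a reconstructed reference frame and one motion vector per 8 x 8 block, it rebuilds the predicted frame. The function name and the per-block arrays mvy/mvx are my own hypothetical names, and I assume the displaced blocks stay inside the frame:

    % motion_compensate.m -- hypothetical helper, not a standard function.
    % ref      : reconstructed reference frame (grayscale, double)
    % mvy, mvx : per-block vertical/horizontal displacements (in pixels)
    % pred     : frame predicted purely by moving reference blocks around
    function pred = motion_compensate(ref, mvy, mvx)
        B = 8;
        pred = zeros(size(ref));
        for r = 1:B:size(ref, 1)
            for c = 1:B:size(ref, 2)
                br = (r - 1) / B + 1;          % block row index
                bc = (c - 1) / B + 1;          % block column index
                sr = r + mvy(br, bc);          % displaced block origin
                sc = c + mvx(br, bc);
                pred(r:r+B-1, c:c+B-1) = ref(sr:sr+B-1, sc:sc+B-1);
            end
        end
    end

The difference between the actual frame and pred (the residual) is what gets DCT-coded and quantized, which is why P-Frames are so much cheaper to transmit than I-Frames.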
As such, one possible sequence of encoded frames follows this format:
IPPPBPPPIPPPBPPPI...
This all depends on how your encoder is set up, though, so we'll leave that aside.
How is all of this useful, you might ask? Because your question about the Full-search Block Matching Algorithm deals exactly with how P-Frames are constructed. For a given block in an I-Frame, where is the best location that this block could have moved to in the next frame? To answer this, we look at blocks in the next frame and find the one most similar to the block in the I-Frame, typically scoring each candidate with a simple error measure such as the sum of absolute differences (SAD). You are probably asking yourself: whoa... aren't there a lot of blocks to search through? The answer is yes. The Full-search Block Matching Algorithm basically searches the entire frame for the best matching block. This is quite computationally intensive, so most encoders actually limit the exhaustive search to a moderately sized window around the block's original location. A full-frame search would give you the best results, but it takes too long and is definitely not worth it: since we assume the video was captured at a high frame rate, most blocks don't move very far between frames. A sketch of the search follows below.
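Here is a minimal MATLAB sketch of that search for a single block, using SAD as the matching criterion. The function name is my own hypothetical one, and p is the search range in pixels (a true full-frame search would make p as large as the frame, but +/-7 or +/-15 is typical):

    % full_search.m -- hypothetical helper, not a standard function.
    % ref  : reference (previous reconstructed) frame
    % cur  : current frame
    % r, c : top-left corner of the 8x8 block in the current frame
    % p    : search range; candidates lie within +/-p pixels of (r, c)
    function [dy, dx] = full_search(ref, cur, r, c, p)
        B = 8;
        blk  = cur(r:r+B-1, c:c+B-1);
        best = Inf;  dy = 0;  dx = 0;
        for v = -p:p
            for u = -p:p
                sr = r + v;  sc = c + u;
                if sr < 1 || sc < 1 || ...
                   sr + B - 1 > size(ref, 1) || sc + B - 1 > size(ref, 2)
                    continue;                  % candidate leaves the frame
                end
                cand = ref(sr:sr+B-1, sc:sc+B-1);
                sad  = sum(abs(blk(:) - cand(:)));
                if sad < best                  % keep the best match so far
                    best = sad;  dy = v;  dx = u;
                end
            end
        end
    end

Running this for every 8 x 8 block of the current frame gives you the mvy/mvx arrays used in the motion-compensation sketch above.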
I hope this has answered your questions!