I have a question about one thing. According to ISO-14496-12, moov/trak/mdia/minf/stbl/stsd should contain a format-specific box, e.g. the avc1 box described in ISO-14496-15 or the mp4v box described in ISO-14496-14. But it also contains fields from the VideoSampleDescription in the QuickTime File Format specification, such as 'version', 'revision_level', 'vendor', etc.
Could anyone explain this issue?
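The stsd (Sample Description Box) can be treated as a box that contains other boxes: after its own header and an entry count, it holds a sequence of Sample Entries, and each Sample Entry is itself just a normal box with a length and a 4 char code: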
4 bytes - length in total
4 bytes - 4 char code of sample description table (stsd)
4 bytes - version & flags
4 bytes - number of sample entries (num_sample_entries)
[
4 bytes - length of sample entry (len_sample_entry)
4 bytes - 4 char code of sample entry
('len_sample_entry' - 8) bytes of data
] (repeated 'num_sample_entries' times)
(4 bytes - optional 0x00000000 as end-of-box marker)
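
The layout above can be walked with a few lines of Python. This is only a minimal sketch, not a full parser: the function name `parse_stsd` is made up for illustration, and it assumes the whole stsd box is already in memory, that all fields are big-endian, and that every length is a plain 32-bit value (the special sizes 0 and 1 are not handled).

```python
import struct

def parse_stsd(buf):
    """Return a list of (fourcc, payload) tuples for each sample entry.

    `buf` is assumed to hold one complete stsd box, starting at its
    4-byte length field. Hypothetical helper, for illustration only.
    """
    # 4 bytes total length + 4 bytes type code
    size, box_type = struct.unpack_from(">I4s", buf, 0)
    if box_type != b"stsd":
        raise ValueError("not an stsd box")

    # 4 bytes version & flags + 4 bytes number of sample entries
    _version_flags, num_sample_entries = struct.unpack_from(">II", buf, 8)

    offset = 16
    entries = []
    for _ in range(num_sample_entries):
        # Each sample entry starts like any other box: length, then 4 char code
        len_sample_entry, fourcc = struct.unpack_from(">I4s", buf, offset)
        if len_sample_entry < 8 or offset + len_sample_entry > size:
            raise ValueError("bad sample entry length")
        # The remaining (len_sample_entry - 8) bytes are format-specific data
        payload = buf[offset + 8 : offset + len_sample_entry]
        entries.append((fourcc.decode("ascii"), payload))
        offset += len_sample_entry

    # Any bytes left over (e.g. the optional 0x00000000 marker) are ignored
    return entries
```

For a video track, the first element of the returned list would typically have a 4 char code such as avc1, and the payload would be the bytes to interpret according to the format-specific definition of that sample entry.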