I'm trying to find out the most efficient way to create a long byte string (or bytearray) by concatenating multiple shorter strings when the length of the whole string is known beforehand. I made this script and came up with these results:
```python
import time

MSG = b'test message'
COUNT = 30000

def bytes_list_test():
    tStart = time.perf_counter()  # time.clock() was removed in Python 3.8
    l = []
    for i in range(COUNT):
        l.append(MSG)
    bs = b''.join(l)
    print('byte list time:', time.perf_counter() - tStart)

def bytearray_test():
    tStart = time.perf_counter()
    ba = bytearray()
    for i in range(COUNT):
        for c in MSG:
            ba.append(c)
    print('array time:', time.perf_counter() - tStart)

def initialized_bytearray_test():
    tStart = time.perf_counter()
    ba = bytearray([0x00] * len(MSG) * COUNT)
    for i in range(COUNT):
        ba[i*len(MSG):i*len(MSG)+len(MSG)] = MSG
    print('initialized array time:', time.perf_counter() - tStart)

bytes_list_test()
bytearray_test()
initialized_bytearray_test()
```
Results:
byte list time: 0.0076534920117410365
array time: 0.08107178658246994
initialized array time: 0.08843219671325642
A few questions:
1) Is creating a list of bytes and using the join() method the way to go as implied by the results?
2) Why is using a list of bytes so much faster than using a bytearray which seems like it's designed for this type of thing?
3) You would think the initialized array would be faster than the uninitialized array because the initialized array doesn't have to be resized (Note that it did occasionally perform better, but not by much and inconsistently so). Is it not faster because of the slicing operation?
The first function creates a list of pointers to the same object (NOT a list of bytes); `join` then performs one memory allocation and `COUNT` calls to `memcpy`.
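To see concretely that the list stores references rather than copies, here is a quick sanity check (not part of the original benchmark):

```python
MSG = b'test message'
COUNT = 30000

# Building the list only appends pointers; every element is the
# very same bytes object, not a copy.
l = [MSG for _ in range(COUNT)]
assert all(x is MSG for x in l)

# join() allocates the final buffer once, then copies each chunk in.
bs = b''.join(l)
assert len(bs) == len(MSG) * COUNT
```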
You can make the first function several times faster (5x in my test) by dropping the temporary list and using `itertools.repeat`:
```python
import itertools

def bytes_list_test_opt():
    tStart = time.perf_counter()
    bs = b''.join(itertools.repeat(MSG, COUNT))
    print('byte list opt time:', time.perf_counter() - tStart)
```
or, in this particular case, simply use the `*` operator of `bytes` objects, which does exactly that:
```python
bs = MSG * COUNT
```
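All three variants produce byte-for-byte identical output, which is easy to verify:

```python
import itertools

MSG = b'test message'
COUNT = 30000

joined = b''.join([MSG] * COUNT)                    # list + join
repeated = b''.join(itertools.repeat(MSG, COUNT))   # no temporary list
multiplied = MSG * COUNT                            # repetition operator

assert joined == repeated == multiplied
```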
The second function repeatedly iterates over `MSG`, stores the data byte by byte, and has to repeatedly reallocate memory as the bytearray grows.
You can make the second function almost as fast as the original (unoptimized) first one by replacing the byte-by-byte iteration with a single call to `extend`:
```python
def bytearray_test_opt():
    tStart = time.perf_counter()
    ba = bytearray()
    for i in range(COUNT):
        ba.extend(MSG)
    print('array opt time:', time.perf_counter() - tStart)
```
After this modification, the second function is slower than the first only because of the additional reallocations (~15% in my test).
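The reallocations stay cheap because `bytearray` over-allocates geometrically as it grows (a CPython implementation detail, not a language guarantee). One way to observe this is `sys.getsizeof`, which reflects the allocated buffer rather than just the bytes in use:

```python
import sys

ba = bytearray()
sizes = set()
for _ in range(1000):
    ba.extend(b'test message')
    sizes.add(sys.getsizeof(ba))

# If every extend() triggered a reallocation we could see up to 1000
# distinct sizes; geometric over-allocation keeps the count far lower.
print(len(sizes))
```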
The third function uses `bytearray`'s slice assignment, which accepts an arbitrary iterable and seems to do the same byte-by-byte iteration without recognizing that it could simply `memcpy` the bytes into place. This looks like a defect in the standard library that could be fixed.
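The generality of slice assignment is easy to demonstrate: it accepts any iterable of integers, not just bytes-like objects, which is why the operation is inherently more general than a plain `memcpy`:

```python
ba = bytearray(5)

# Any iterable of ints in range(256) is accepted, even a generator...
ba[0:5] = (i * 10 for i in range(5))
assert ba == bytearray([0, 10, 20, 30, 40])

# ...and a bytes source goes through the same assignment interface.
ba[0:5] = b'hello'
assert bytes(ba) == b'hello'
```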
As you can see from the previous optimization, allocations take a very small amount of time compared to byte-by-byte copying, so preallocating has no visible impact here. You could save some time by doing fewer index calculations, but it won't help much either:
```python
def initialized_bytearray_test_opt():
    tStart = time.perf_counter()
    L = len(MSG)
    ba = bytearray(L * COUNT)
    ofs = 0
    for i in range(COUNT):
        ba[ofs : ofs + L] = MSG
        ofs += L
    print('initialized array opt time:', time.perf_counter() - tStart)
```
The final timings from my machine:
byte list time: 0.004823000000000001
byte list opt time: 0.0008649999999999977
array time: 0.043324
array opt time: 0.005505999999999997
initialized array time: 0.05936899999999999
initialized array opt time: 0.040164000000000005
P.S. Use the `timeit` module to perform measurements like this; it provides much higher accuracy.
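For example, the first benchmark could be rewritten with `timeit` like this (the statement and the run counts are just illustrative):

```python
import timeit

setup = "MSG = b'test message'; COUNT = 30000"
stmt = "b''.join([MSG] * COUNT)"

# repeat() runs the statement in several batches; the minimum of the
# batch times is usually the most stable figure to report.
best = min(timeit.repeat(stmt, setup=setup, number=100, repeat=5))
print(f'best batch (100 runs): {best:.6f}s')
```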