I need to write the bytes of an IEnumerable<byte> to a file. I can convert it to an array and use the Write(byte[]) method:
using (var stream = File.Create(path))
stream.Write(bytes.ToArray());
But since IEnumerable doesn't provide the collection's item count, using ToArray is not recommended unless it's absolutely necessary. So I can just iterate the IEnumerable and use WriteByte(byte) in each iteration:
using (var stream = File.Create(path))
foreach (var b in bytes)
stream.WriteByte(b);
I wonder which one will be faster when writing lots of data.
I guess Write(byte[]) sets the buffer according to the array size, so it would be faster when working with arrays. My question is: when I just have an IEnumerable<byte> that holds MBs of data, which approach is better? Converting it to an array and calling Write(byte[]), or iterating it and calling WriteByte(byte) for each byte?
Enumerating over a large stream of bytes adds tons of overhead to something that is normally cheap: copying bytes from one buffer to the next.
Normally, LINQ-style overhead does not matter much, but when it comes to processing 100 million bytes per second on a normal hard drive, you will notice severe overhead. This is not premature optimization: we can foresee that this will be a performance hotspot, so we should optimize it eagerly.
So when copying bytes around, you probably should not rely on abstractions like IEnumerable and IList at all. Pass around arrays or ArraySegment<byte>s, which also carry an Offset and a Count; that frees you from slicing arrays too often.
One thing that is a deadly sin with high-throughput IO is calling a method per byte, such as reading byte-wise and writing byte-wise. This kills performance because those methods have to be called hundreds of millions of times per second. I have experienced that myself.
Always process entire buffers of at least 4096 bytes at a time. Depending on what media you are doing IO with, you can use much larger buffers (64k, 256k, or even megabytes).
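As a minimal sketch of that advice applied to the question's IEnumerable<byte> (the helper name WriteBuffered and the 64 KB default are my own illustrative choices, not something the framework provides):

using System.Collections.Generic;
using System.IO;

static class EnumerableWriter
{
    // Fills a reusable buffer from the sequence and writes it out whenever it
    // is full, so the stream sees a few large writes instead of one method
    // call per byte.
    public static void WriteBuffered(Stream stream, IEnumerable<byte> bytes, int bufferSize = 64 * 1024)
    {
        var buffer = new byte[bufferSize];
        int count = 0;
        foreach (var b in bytes)
        {
            buffer[count++] = b;
            if (count == buffer.Length)
            {
                stream.Write(buffer, 0, count);
                count = 0;
            }
        }
        // Write whatever is left over in the buffer.
        if (count > 0)
            stream.Write(buffer, 0, count);
    }
}

Used as using (var stream = File.Create(path)) EnumerableWriter.WriteBuffered(stream, bytes); this keeps the per-byte work down to an array store, without ever materializing the whole sequence the way ToArray does.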
You should profile which version is faster. The FileStream class has an internal buffer that decouples the Read() and Write() methods a bit from the actual file system accesses.
If you don't specify a buffer size in the FileStream constructor, it uses something like 4096 bytes of buffer by default. That buffer will combine many of your WriteByte() calls into one write to the underlying file. The only question is whether the overhead of the WriteByte() calls will exceed the overhead of the Enumerable.ToArray() call. The latter will definitely use more memory, but you always have to deal with this sort of trade-off.
FYI: the current .NET 4 implementation of Enumerable.ToArray() grows an array, doubling its size whenever necessary. Each time it grows, all values are copied over. Also, once all items have been stored, the contents are copied again into an array of the final size. For IEnumerable<T> instances that actually implement ICollection<T>, the code takes advantage of that fact to start with the correct array size and lets the collection do the copying instead.
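For illustration, here is a rough sketch of that behaviour. It is not the actual framework source, just a standalone helper method (assumes using System; and using System.Collections.Generic;) showing the doubling, the final right-sizing copy, and the ICollection<T> fast path described above:

static T[] ToArraySketch<T>(IEnumerable<T> source)
{
    // Fast path: a real collection knows its count up front, so we can
    // allocate the final array immediately and let the collection copy itself.
    if (source is ICollection<T> collection)
    {
        var result = new T[collection.Count];
        collection.CopyTo(result, 0);
        return result;
    }

    // Slow path: grow a scratch array by doubling whenever it runs out of room.
    var buffer = new T[4];
    int count = 0;
    foreach (var item in source)
    {
        if (count == buffer.Length)
            Array.Resize(ref buffer, buffer.Length * 2);  // copies all elements
        buffer[count++] = item;
    }

    // One final copy into an array of exactly the right size.
    if (count != buffer.Length)
        Array.Resize(ref buffer, count);
    return buffer;
}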