How to parse a text file in C# and be io bound?

2019-01-12 04:24发布

站内文章 / C#

25 0

做个烂人

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

It is known that if you read data from disc you are IO bound and you can process/parse the read data much faster than you can read it from disc.

But this common wisdom (myth?) is not reflected by my tests. When I do read a text file with a double and and int in each line separated by a space I am much slower than my physical disc speed (factor 6). The text file looks like this

1,1 0
2,1 1
3,1 2

Update I have included the PInvoke performance when I do a ReadFile with the complete buffer in one read to get the "real" performance.

ReadFile performance - ReadFileIntoByteBuffer
StringReader.ReadLine performance - CountLines
StringReader.Readline unsafe perf - ParseLinesUnsafe
StringReader.Read unsafe char buf - ParseLinesUnsafeCharBuf
StringReader.ReadLine + Parsing performance - ParseLines

The results are

Did native read 179,0MB in                    0,4s, 484,2MB/s
Did read 10.000.000 lines in                  1,6s, 112,7MB/s
Did parse and read unsafe 179,0MB in          2,3s,  76,5MB/s
Did parse and read unsafe char buf 179,0MB in 2,8s,  63,5MB/s
Did read and parse 179,0MB in                 9,3s,  19,3MB/s

Although I did try to skip the string construction overhead in ParseLinesUnsafeCharBuf it is still quite a lot slower than the version which allocates a new string every time. It is still much better than the original 20 MB with the easiest solutiono but I do think .NET should be able to do better. If the remoe the logic to parse the strings I do get 258,8 MB/s which is very good and near native speed. But I do not see a way using unsafe code to make my parsing much simpler. I do have to deal with incomplete lines which makes it quite complex.

Update It is clear from the numbers that a simple string.split does cost already way too much. But the StringReader does also cost quite a lot. How would a highly optimized solution look like that gets closer to the real disc speed? I have tried many ways with unsafe code and char buffers but the performance gain was perhaps 30% but nothing in the order of magnitudes I would need. I would be ok with 100MB/s parsing speed. That should be achievable with managed code or am I wrong?

Is it not possible with C# to parse faster than I can read from my hard disc? It is a Intel Postville X25M. The CPU is and older Intel Dual Core. I have 3 GB RAM Windows 7 .NET 3.5 SP1 and .NET 4.

But I did see the same results on normal hard discs as well. Linear reading speed can be up to 400MB/s with todays hard discs. Does this imply that I should restructure my applications to read the data on demand when it is actually needed instead of reading it eagerly into memory at the cost of higher GC times due to the increased object graph which make the GC cycles much longer.

I have noticed that if I my managed application usese more than 500MB of memory it becomes much less responsive. A major contributing factor seems the complexity of the object graph. It might therefore be better to read the data when needed. At least this is my conclusion of the current data.

Here is the code

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using System.Diagnostics;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;
using System.ComponentModel;

namespace IOBound
{
    class Program
    {
        static void Main(string[] args)
        {
            string data = @"C:\Source\IOBound\NumericData.txt";
            if (!File.Exists(data))
            {
                CreateTestData(data);
            }

            int MB = (int) (new FileInfo(data).Length/(1024*1024));

            var sw = Stopwatch.StartNew();
            uint bytes = ReadFileIntoByteBuffer(data);
            sw.Stop();
            Console.WriteLine("Did native read {0:F1}MB in {1:F1}s, {2:F1}MB/s",
                bytes/(1024*1024), sw.Elapsed.TotalSeconds, MB / sw.Elapsed.TotalSeconds);

            sw = Stopwatch.StartNew();
            int n = CountLines(data);
            sw.Stop();
            Console.WriteLine("Did read {0:N0} lines in {1:F1}s, {2:F1}MB/s",
                n, sw.Elapsed.TotalSeconds, MB / sw.Elapsed.TotalSeconds);

            sw = Stopwatch.StartNew();
            ParseLinesUnsafe(data);
            sw.Stop();
            Console.WriteLine("Did parse and read unsafe {0:F1}MB in {1:F1}s, {2:F1}MB/s",
                MB, sw.Elapsed.TotalSeconds, MB / sw.Elapsed.TotalSeconds);

            sw = Stopwatch.StartNew();
            ParseLinesUnsafeCharBuf(data);
            sw.Stop();
            Console.WriteLine("Did parse and read unsafe char buf {0:F1}MB in {1:F1}s, {2:F1}MB/s",
                MB, sw.Elapsed.TotalSeconds, MB / sw.Elapsed.TotalSeconds);

            sw = Stopwatch.StartNew();
            ParseLines(data);
            sw.Stop();
            Console.WriteLine("Did read and parse {0:F1}MB in {1:F1}s, {2:F1}MB/s",
                MB, sw.Elapsed.TotalSeconds, MB / sw.Elapsed.TotalSeconds);

        }

        private unsafe static uint ReadFileIntoByteBuffer(string data)
        {
            using(var stream = new FileStream(data, FileMode.Open))
            {
                byte[] buf = new byte[200 * 1024 * 1024];
                fixed(byte* pBuf = &buf[0])
                {
                    uint dwRead = 0;
                    if (ReadFile(stream.SafeFileHandle, pBuf, 200 * 1000 * 1000, out dwRead, IntPtr.Zero) == 0)
                    {
                        throw new Win32Exception();
                    }
                    return dwRead;
                }

            }
        }

        private static int CountLines(string data)
        {
            using (var reader = new StreamReader(data))
            {
                string line;
                int count = 0;
                while ((line = reader.ReadLine()) != null)
                {
                    count++;
                }

                return count;
            }
        }

        unsafe private static void ParseLinesUnsafeCharBuf(string data)
        {
            var dobules = new List<double>();
            var ints = new List<int>();

            using (var reader = new StreamReader(data))
            {
                double d = 0;
                long a = 0, b = 0;
                int i = 0;
                char[] buffer = new char[10*1000*1000];
                int readChars = 0;
                int startIdx = 0;

                fixed(char *ln = buffer)
                {
                    while ((readChars = reader.Read(buffer, startIdx, buffer.Length - startIdx)) != 0)
                    {
                        char* pEnd = ln + readChars + startIdx;
                        char* pCur = ln;
                        char* pLineStart = null;

                        while (pCur != pEnd)
                        {
                            a = 0;
                            b = 0;

                            while (pCur != pEnd && *pCur == '\r' || *pCur == '\n')
                            {
                                pCur++;
                            }
                            pLineStart = pCur;

                            while(pCur != pEnd && char.IsNumber(*pCur))
                            {
                                a = a * 10 + (*pCur++ - '0');
                            }
                            if (pCur == pEnd || *pCur == '\r')
                            {
                                goto incompleteLine;
                            }

                            if (*pCur++ == ',')
                            {
                                long div = 1;
                                while (pCur != pEnd && char.IsNumber(*pCur))
                                {
                                    b += b * 10 + (*pCur++ - '0');
                                    div *= 10;
                                }
                                if (pCur == pEnd || *pCur == '\r')
                                {
                                    goto incompleteLine;
                                }
                                d = a + ((double)b) / div;
                            }
                            else
                            {
                                goto skipRest;
                            }

                            while (pCur != pEnd && char.IsWhiteSpace(*pCur))
                            {
                                pCur++;
                            }
                            if (pCur == pEnd || *pCur == '\r')
                            {
                                goto incompleteLine;
                            }

                            i = 0;
                            while (pCur != pEnd && char.IsNumber(*pCur))
                            {
                                i = i * 10 + (*pCur++ - '0');
                            }
                            if (pCur == pEnd)
                            {
                                goto incompleteLine;
                            }

                            dobules.Add(d);
                            ints.Add(i);

                            continue;

incompleteLine:
                            startIdx = (int)(pEnd - pLineStart);
                            Buffer.BlockCopy(buffer, (int)(pLineStart - ln) * 2, buffer, 0, 2 * startIdx);
                            break;
skipRest:
                            while (pCur != pEnd && *pCur != '\r')
                            {
                                pCur++;   
                            }
                            continue;
                        }
                    }
                }
            }
        }

        unsafe private static void ParseLinesUnsafe(string data)
        {
            var dobules = new List<double>();
            var ints = new List<int>();

            using (var reader = new StreamReader(data))
            {
                string line;
                double d=0;
                long a = 0, b = 0;
                int ix = 0;
                while ((line = reader.ReadLine()) != null)
                {
                    int len = line.Length;
                    fixed (char* ln = line)
                    {
                        while (ix < len && char.IsNumber(ln[ix]))
                        { 
                            a = a * 10 + (ln[ix++] - '0');
                        }

                        if (ln[ix] == ',')
                        {
                            ix++;
                            long div = 1;
                            while (ix < len && char.IsNumber(ln[ix]))
                            {
                                b += b * 10 + (ln[ix++] - '0');
                                div *= 10;
                            }
                            d = a + ((double)b) / div;
                        }

                        while (ix < len && char.IsWhiteSpace(ln[ix]))
                        {
                            ix++;
                        }

                        int i = 0;
                        while (ix < len && char.IsNumber(ln[ix]))
                        { 
                            i = i * 10 + (ln[ix++] - '0');
                        }

                        dobules.Add(d);
                        ints.Add(ix);
                    }
                }
            }
        }



        private static void ParseLines(string data)
        {
            var dobules = new List<double>();
            var ints = new List<int>();

            using (var reader = new StreamReader(data))
            {
                string line;
                char[] sep  = new char[] { ' ' };
                while ((line = reader.ReadLine()) != null)
                {
                    var parts = line.Split(sep);
                    if (parts.Length == 2)
                    {
                        dobules.Add( double.Parse(parts[0]));
                        ints.Add( int.Parse(parts[1]));
                    }
                }
            }
        }

        static void CreateTestData(string fileName)
        {
            FileStream fstream = new FileStream(fileName, FileMode.Create);
            using (StreamWriter writer = new StreamWriter(fstream, Encoding.UTF8))
            {
                for (int i = 0; i < 10 * 1000 * 1000; i++)
                {
                    writer.WriteLine("{0} {1}", 1.1d + i, i);
                }
            }
        }

        [DllImport("kernel32.dll", SetLastError = true)]
        unsafe static extern uint ReadFile(SafeFileHandle hFile, [Out] byte* lpBuffer, uint nNumberOfBytesToRead, out uint lpNumberOfBytesRead, IntPtr lpOverlapped);

    }
}

回答1:

So there are a couple of problems here. Others have already commented about windows' IO caching as well as the actual hardware cache so I'm going to leave that alone.

The other issue is that your measuring the combined operations of read() + parse(), and comparing that to the speed of just read(). Essentially you need to conscious of the fact that A + B will always be more than A (assuming non-negative).

So to find out if you are IO bound you need to find out how long does it take to read the file. You've done that. On my machine your test is running at about 220ms for reading the file.

Now you need measure how long does it take to parse that many different strings. This is a little trickier to isolate. So let's just say we leave them together and subtract the time it takes to read from the parse time. Further we aren't trying to measure what your doing with the data, but just the parsing, so throw out the List and List and let's just parse. Running this on my machine gives it about 1000ms, less the 220ms for reading, your parse code takes about 780ms per 1 million rows.

So why is it so slow (3-4x slower than the read)? Again let's eliminate some stuff. Commenting out the int.Parse and the double.Parse and run again. That's a lot better 460ms less the read time of 220, we are now at 240ms to parse. Of course the 'parse' is only calling string.Split(). Hrmmm looks like string.Split will cost you as much as the disk IO, not surprisingly considering how .NET deals with strings.

So can C# parse as fast or faster than reading from the disk? Well yes, it can, but your going to have to get nasty. You see int.Parse and double.Parse suffer from the fact that they are culture aware. Due to this and the fact that these parse routines deal with many formats they are somewhat expensive at the magnitude of your example. I mean to say we are parsing a double and and int every microsecond (one-millionth of a second) which isn't bad normally.

So to match the speed of the disk-read (and thus be IO bound) we need to rewrite how you process a text line. Here is a nasty example, but it works for your example...

int len = line.Length;
fixed (char* ln = line)
{
    double d;
    long a = 0, b = 0;
    int ix = 0;
    while (ix < len && char.IsNumber(ln[ix]))
        a = a * 10 + (ln[ix++] - '0');
    if (ln[ix] == '.')
    {
        ix++;
        long div = 1;
        while (ix < len && char.IsNumber(ln[ix]))
        {
            b += b * 10 + (ln[ix++] - '0');
            div *= 10;
        }
        d = a + ((double)b)/div;
    }

    while (ix < len && char.IsWhiteSpace(ln[ix]))
        ix++;

    int i = 0;
    while (ix < len && char.IsNumber(ln[ix]))
        i = i * 10 + (ln[ix++] - '0');
}

Running this crappy code produces a runtime of about 450ms, or roughly 2n of the read time. So, pretending for a moment that you thought the above code fragment was acceptable (which god I hope you don't), you could have one thread reading strings and another parsing and you would be close to being IO bound. Put two threads on parsing and you will be IO bound. Should you do this is another question all together.

So let's go back to your original question:

It is known that if you read data from disc you are IO bound and you can process/parse the read data much faster than you can read it from disc.

But this common wisdom (myth?)

Well no I would not call this a myth. In fact I would debate that your original code is still IO Bound. You happen to be running your test in isolation so the impact is small 1/6th the time spent reading from the device. But consider what would happen if that disk is busy? What if your antivirus scanner is churning through every file? Simply put your program would slow down with the increased HDD activity, and it could become IO Bound.

IMHO, the reason for this "common wisdom" is this:

It's easier to get IO bound on writes than on reads.

Writing to the device takes longer and is generally more expensive than producing the data. If you want to see IO Bound in action look at your "CreateTestData" method. Your CreateTestData method takes 2x as long to write the data to disk than just calling String.Format(...). And this is with full caching. Turn caching off (FileOptions.WriteThrough) and try it again... now CreateTestData is 3x-4x slower. Try it for yourself with the following methods:

static int CreateTestData(string fileName)
{
    FileStream fstream = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None, 4096, FileOptions.WriteThrough);
    using (StreamWriter writer = new StreamWriter(fstream, Encoding.UTF8))
    {
        for (int i = 0; i < linecount; i++)
        {
            writer.WriteLine("{0} {1}", 1.1d + i, i);
        }
    }
    return linecount;
}
static int PrintTestData(string fileName)
{
    for (int i = 0; i < linecount; i++)
    {
        String.Format("{0} {1}", 1.1d + i, i);
    }
    return linecount;
}

This is just for starters, if you really want to get IO bound you start using direct IO. See the documentation on CreateFile using FILE_FLAG_NO_BUFFERING. Writing gets much slower as you start to bypass hardware caches and wait for IO completion. This is one major reason why a traditional database is very slow to write to. They must force the hardware to complete the write and wait upon it. Only then can they call a transaction 'committed', the data is in the file on the physical device.

UPDATED

Ok Alois, it appears your just looking for how fast can you go. To go any faster you need to stop dealing with strings and characters and remove the allocations to go faster. The following code improves upon the line/character parser above by about an order of magnitude (adding about 30ms over just counting lines) while allocating only a single buffer on the heap.

WARNING You need to realize I'm demonstrating that it can be done fast. I'm not advising you to go down this road. This code has some serious limitations and/or bugs. Like what happens when you hit a double in the form of "1.2589E+19"? Frankly I think you should stick with your original code and not worry about trying to optimize it this much. Either that or change the file format to binary instead of text (see BinaryWriter). If you are using binary, you can use a variation of the following code with BitConvert.ToDouble/ToInt32 and it would be even faster.

private static unsafe int ParseFast(string data)
{
    int count = 0, valid = 0, pos, stop, temp;
    byte[] buffer = new byte[ushort.MaxValue];

    const byte Zero = (byte) '0';
    const byte Nine = (byte) '9';
    const byte Dot = (byte)'.';
    const byte Space = (byte)' ';
    const byte Tab = (byte) '\t';
    const byte Line = (byte) '\n';

    fixed (byte *ptr = buffer)
    using (Stream reader = File.OpenRead(data))
    {
        while (0 != (temp = reader.Read(buffer, valid, buffer.Length - valid)))
        {
            valid += temp;
            pos = 0;
            stop = Math.Min(buffer.Length - 1024, valid);
            while (pos < stop)
            {
                double d;
                long a = 0, b = 0;
                while (pos < valid && ptr[pos] >= Zero && ptr[pos] <= Nine)
                    a = a*10 + (ptr[pos++] - Zero);
                if (ptr[pos] == Dot)
                {
                    pos++;
                    long div = 1;
                    while (pos < valid && ptr[pos] >= Zero && ptr[pos] <= Nine)
                    {
                        b += b*10 + (ptr[pos++] - Zero);
                        div *= 10;
                    }
                    d = a + ((double) b)/div;
                }
                else
                    d = a;

                while (pos < valid && (ptr[pos] == Space || ptr[pos] == Tab))
                    pos++;

                int i = 0;
                while (pos < valid && ptr[pos] >= Zero && ptr[pos] <= Nine)
                    i = i*10 + (ptr[pos++] - Zero);

                DoSomething(d, i);

                while (pos < stop && ptr[pos] != Line)
                    pos++;
                while (pos < stop && !(ptr[pos] >= Zero && ptr[pos] <= Nine))
                    pos++;
            }

            if (pos < valid)
                Buffer.BlockCopy(buffer, pos, buffer, 0, valid - pos);
            valid -= pos;
        }
    }
    return count;
}

回答2:

As other answers/comments have mentioned, you are probably reading the file from cache anyway so the disk/SDRAM speed is not limiting factor.

When you parse the file, you are doing many more allocations from the heap (when you split the string, and when you add the auto-boxed parsed values to your lists) than when you just count lines. It is the cost of these heap allocations that is probably making the difference in performance between the two passes.

You can parse faster than you can read from the disk, but parsers optimized for speed are vary careful about memory usage.

回答3:

Just couple of suggestions (if you are willing to live with more complicated implementation)...

You could use LinkedList instead of List to avoid re-allocation/copy when Add-ing.
Replace Split with handwritten code that searches for separator and then uses Substring to extract the "fields". You only need to search for a single separator and number of fields is known in advance, so Split is too general for you.
Use Read instead of ReadLine so you can re-use the same buffer and avoid allocating new string for every line.
If you are really performance-conscientious, separate parsing into concurrent Tasks. While you are at it, put the file reading into its own task as well.

回答4:

There is exactly one "common wisdom" about performance - decide how fast you want process to be and measure everything, act on the data collected and repeat.

As of now your code does probably 10x memory allocations for each byte read (there are many layers of readers, string readers, and like). If you want absolute speed you likely need to eliminate most of the re-allocations. I would first rewrite code to use single reader all the way and measure if performance is good enough.

回答5:

IF you work with big files the fastest you can get is AFAIK the MemoryMappedFile class (new in .NET 4).

By this you basically use the OS FileSystem cache-manager memory pages directly... if this is not fast enough then you would need to write your own filesystem...

As for the GC the behaviour can be customized via app.config - for example:

<runtime>
   <gcServer enabled="true" />
</runtime>

For the GC customization options see http://msdn.microsoft.com/en-us/library/6bs4szyc.aspx - esp. <gcConcurrent> / <gcServer> .

回答6:

Throwing a profiler at it, it looks like the majority of the time is indeed spent parsing the file.

I tried your sample passing in a MemoryStream with the already read byte[] instead of a path to the ParseLines method, and the difference between parsing from a file path and parsing from in memory bytes was negligible.

In otherwords it's the processing that's done, not the reading that took up a significant amount of time.

I threw this profiler at it: http://code.google.com/p/slimtune/

And from there I can see that of the ParseLines method when called on a memory stream had the following timings:

25.34% in System.Io.StreamReader.ReadLine()
23.86% in System.Double.Parse
21.72% in System.Number.ParseInt32
20.91% in System.String.Split
--Some other not very significant methods--

So what this tells me is that even reading lines from a memory stream is somewhat slow, as are most string operations.

回答7:

I suspect caching is skewing the measurements you are making. The hard disk has RAM that it uses to cache recently accessed data, and so does the operating system.

For a better test, you would need to reboot between tests. I suggest a complete power down to ensure the disk's RAM is also cleared.