.NET System.OutOfMemoryException on String.Split()

Posted 2020-07-11 06:42

I am using C# to read a ~120 MB plain-text CSV file. Initially I parsed it by reading line by line, but recently determined that reading the entire file contents into memory first was several times faster. Parsing is already quite slow because the CSV has commas embedded inside quoted fields, which means I have to use a regex split. This is the only pattern I have found that works reliably:

string[] fields = Regex.Split(line,
    @",(?!(?<=(?:^|,)\s*\x22(?:[^\x22]|\x22\x22|\\\x22)*,)(?:[^\x22]|\x22\x22|\\\x22)*\x22\s*(?:,|$))");
// from http://regexlib.com/REDetails.aspx?regexp_id=621
// (kept on one line: splitting a verbatim string across lines would embed
// the line break and indentation in the pattern itself)

In order to do the parsing after reading the entire contents into memory, I do a string split on the newline character to get an array containing each line. However, when I do this on the 120 MB file, I get a System.OutOfMemoryException. Why does it run out of memory so quickly when my computer has 4 GB of RAM? Is there a better way to quickly parse a complicated CSV?

9 Answers
等我变得足够好
#2 · 2020-07-11 06:45

As other posters have said, the OutOfMemoryException is thrown because the runtime cannot find a contiguous chunk of memory of the requested size, not because your total RAM is exhausted.

However, you say that parsing line by line was several times slower than reading it all in at once and then doing your processing. That only makes sense if you were using the naive approach of blocking reads, e.g. (in pseudocode):

while (!file.EndOfStream)      // file is a StreamReader
{
    string line = file.ReadLine();
    ProcessLine(line);
}

You should instead use streaming, where your stream is filled by Write() calls from a separate thread that reads the file, so the file read is not blocked by whatever your ProcessLine() does, and vice versa. That should be on par with the performance of reading the entire file at once and then processing it.
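For example, here is a minimal sketch of that producer/consumer pattern; the BlockingCollection, its capacity of 1024 lines, and the processLine callback are illustrative choices on my part, not something from the original answer:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class CsvPipeline
{
    static void ProcessFile(string path, Action<string> processLine)
    {
        // Bounded queue: the reader blocks when the parser falls behind,
        // so memory use stays flat instead of buffering the whole file.
        using var lines = new BlockingCollection<string>(boundedCapacity: 1024);

        var reading = Task.Run(() =>
        {
            foreach (var line in File.ReadLines(path))
                lines.Add(line);
            lines.CompleteAdding();        // signal end of file to the consumer
        });

        foreach (var line in lines.GetConsumingEnumerable())
            processLine(line);             // runs while the task keeps reading

        reading.Wait();                    // surface any exception from the read
    }
}

The body of the ProcessLine() call from the loop above plugs into the processLine callback unchanged.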

何必那么认真
#3 · 2020-07-11 06:45

You should probably try the CLR Profiler to determine your actual memory usage. It might be that there are memory limits other than your system RAM; for example, if this is an IIS application, your memory is limited by the application pool's settings.

With that profiling information you might find that you need a more scalable technique, such as the streaming approach to the CSV file that you originally attempted.
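Before reaching for a full profiler, a quick in-process check can at least confirm how much memory the process is holding when things blow up; this snippet is illustrative and not from the original answer:

using System;
using System.Diagnostics;

// Rough numbers only; a real profiler shows where the allocations come from.
long managedBytes = GC.GetTotalMemory(forceFullCollection: false);
long workingSet = Process.GetCurrentProcess().WorkingSet64;
Console.WriteLine($"Managed heap: {managedBytes / 1024 / 1024} MB, " +
                  $"working set: {workingSet / 1024 / 1024} MB");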

放我归山
#4 · 2020-07-11 06:47

You may not be able to allocate a single object with that much contiguous memory, nor should you expect to be able to. Streaming is the ordinary way to do this, but you're right that it might be slower (although I don't think it usually should be that much slower).

As a compromise, you could try reading a larger portion of the file (but still not the whole thing) at once, with a function like StreamReader.ReadBlock(), and processing each portion in turn.
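A rough sketch of that chunked approach; the 1 MB chunk size and the carry-over handling for lines that straddle a chunk boundary are my own choices, and it assumes there are no line breaks inside quoted fields:

using System;
using System.IO;

const int ChunkSize = 1 << 20;            // 1 MB of characters per read
var buffer = new char[ChunkSize];
var carry = string.Empty;                 // partial line left over from the previous chunk

using var reader = new StreamReader("data.csv");
int read;
while ((read = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
{
    string chunk = carry + new string(buffer, 0, read);
    int lastNewline = chunk.LastIndexOf('\n');
    if (lastNewline < 0) { carry = chunk; continue; }   // no complete line yet

    foreach (var line in chunk.Substring(0, lastNewline).Split('\n'))
        ProcessLine(line.TrimEnd('\r'));
    carry = chunk.Substring(lastNewline + 1);
}
if (carry.Length > 0)
    ProcessLine(carry);                   // last line had no trailing newline

static void ProcessLine(string line) { /* run the field split here */ }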

甜甜的少女心
#5 · 2020-07-11 06:50

You should read a chunk into a buffer and work on that. Then read another chunk and so on.

There are many libraries out there that will do this efficiently for you. I maintain one called CsvHelper. There are a lot of edge cases that you need to handle, such as when a comma or line ending is in the middle of a field.
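For reference, a minimal streaming read with CsvHelper might look like this; the API shown matches recent versions of the library, so check its documentation for the version you use:

using System.Globalization;
using System.IO;
using CsvHelper;                          // https://joshclose.github.io/CsvHelper/

using var reader = new StreamReader("data.csv");
using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);

while (csv.Read())                        // one record at a time; memory stays flat
{
    var first = csv.GetField(0);          // quoted commas and embedded line endings
    // ... handle the remaining fields    // are handled by the parser
}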

霸刀☆藐视天下
#6 · 2020-07-11 06:52

Don't roll your own parser unless you have to. I've had luck with this one:

A Fast CSV Reader

If nothing else, you can look under the hood and see how someone else does it.
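From memory, usage looks roughly like the snippet below; the LumenWorks namespace and member names are taken from the CodeProject article and may differ in the version you download:

using System;
using System.IO;
using LumenWorks.Framework.IO.Csv;        // the "Fast CSV Reader" package

using var csv = new CsvReader(new StreamReader("data.csv"), true /* hasHeaders */);
int fieldCount = csv.FieldCount;
while (csv.ReadNextRecord())
{
    for (int i = 0; i < fieldCount; i++)
        Console.Write(csv[i] + " ");      // the indexer returns the parsed field
    Console.WriteLine();
}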

做自己的国王
#7 · 2020-07-11 06:54

I agree with most everybody here: you need to use streaming.

I don't know if anybody has said so yet, but you should look at an extension method.

And I know, for sure, hands down, that the best CSV splitting technique on .NET/CLR is this one.

That technique generated 10+ GB of XML output from input CSV for me, including extensive input filtering and all, faster than anything else I've seen.
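The original link is dead, so purely as an illustration of the extension-method idea (this is hypothetical, not the technique the answer referred to), a lazy line-splitting extension might be shaped like this:

using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

static class CsvExtensions
{
    // Hypothetical sketch: yields one field array per line so the whole
    // file never has to sit in memory at once.
    public static IEnumerable<string[]> SplitCsvLines(this TextReader reader, Regex fieldSplitter)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            yield return fieldSplitter.Split(line);
    }
}

// Usage: foreach (var fields in new StreamReader("data.csv").SplitCsvLines(splitRegex)) { ... }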
