What is the fastest way to read a 10 GB file from the disk?

Posted 2019-04-05 07:01

We need to read a 10 GB text file, e.g. a FIX engine log, count the different types of messages, and run some statistics on it. We are on 32-bit Linux with 4 Intel CPUs, coding in Perl, but the language doesn't really matter.

I have found some interesting tips in Tim Bray's WideFinder project. However, we've found that using memory mapping is inherently limited by the 32 bit architecture.

We tried using multiple processes, which works faster if we process the file in parallel with 4 processes on 4 CPUs. Adding multi-threading slows it down, maybe because of the cost of context switching. We tried changing the size of the thread pool, but that is still slower than the simple multi-process version.
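
For illustration, the multi-process split looks roughly like the sketch below (the file name, the FIX pattern, and the per-chunk work are placeholders, not our exact code):

    use strict;
    use warnings;

    my $file    = 'fix_engine.log';           # placeholder name
    my $workers = 4;
    my $size    = -s $file or die "cannot stat $file";
    my $chunk   = int($size / $workers) + 1;

    my @pids;
    for my $i (0 .. $workers - 1) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                                  # child: one byte range
            my ($start, $end) = ($i * $chunk, ($i + 1) * $chunk);
            open my $fh, '<', $file or die "open: $!";
            if ($start > 0) {
                seek $fh, $start - 1, 0;
                <$fh>;                                    # finish the line that straddles the boundary
            }
            my $count = 0;
            while (tell($fh) < $end and defined(my $line = <$fh>)) {
                $count++ if $line =~ /\x{01}35=8\x{01}/;  # e.g. count execution reports
            }
            print "worker $i: $count\n";
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid($_, 0) for @pids;                             # parent waits for all children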

The memory-mapping part is not very stable: sometimes it takes 80 seconds and sometimes 7 seconds on a 2 GB file, maybe because of page faults or something related to virtual memory usage. In any case, mmap cannot scale beyond 4 GB on a 32-bit architecture.

We tried Perl's IPC::Mmap and Sys::Mmap. We looked into MapReduce as well, but the problem is really I/O bound; the processing itself is fast enough.

So we decided to try optimizing the basic I/O by tuning the buffer size, buffering type, etc.
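
By that we mean something along these lines (a sketch only; the file name, block size, and counting pattern are placeholders): pull large blocks with sysread and split lines ourselves instead of reading line at a time.

    use strict;
    use warnings;

    my $file  = 'fix_engine.log';        # placeholder name
    my $block = 1024 * 1024;             # try 256 KB, 1 MB, 4 MB, ...
    open my $fh, '<:raw', $file or die "open $file: $!";

    my ($tail, $count) = ('', 0);
    while (sysread($fh, my $buf, $block)) {
        $buf  = $tail . $buf;
        # Keep the trailing partial line for the next block.
        $tail = ($buf =~ s/([^\n]*)\z//) ? $1 : '';
        $count += grep { /35=8/ } split /\n/, $buf;   # e.g. count execution reports
    }
    $count++ if $tail =~ /35=8/;         # last line may lack a newline
    close $fh;
    print "matches: $count\n";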

If anyone knows of an existing project where this problem was solved efficiently, in any language or on any platform, can you point to a useful link or suggest a direction?

13 answers
地球回转人心会变
#2 · 2019-04-05 07:33

Parse the file once, reading line by line. Put the results in a table in a decent database. Run as many queries as you wish. Feed the beast regularly with new incoming data.
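
A rough sketch of the loading pass, using Perl's DBI with SQLite (the table layout, file name, and the FIX tags pulled out are just examples):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=fixlog.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do('CREATE TABLE IF NOT EXISTS messages (msg_type TEXT, ts TEXT)');
    my $ins = $dbh->prepare('INSERT INTO messages (msg_type, ts) VALUES (?, ?)');

    open my $fh, '<', 'fix_engine.log' or die "open: $!";
    while (my $line = <$fh>) {
        # Example extraction: FIX tag 35 (MsgType) and tag 52 (SendingTime).
        my ($type) = $line =~ /(?:^|\x{01})35=([^\x{01}]+)/;
        my ($time) = $line =~ /(?:^|\x{01})52=([^\x{01}]+)/;
        $ins->execute($type, $time) if defined $type;
    }
    close $fh;
    $dbh->commit;

    # Then run whatever statistics you like, e.g.:
    # SELECT msg_type, COUNT(*) FROM messages GROUP BY msg_type;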

Realize that manipulating a 10 GB file, transferring it across the network (even a local one), exploring complicated solutions, etc. all take time.

别忘想泡老子
#3 · 2019-04-05 07:34

I wish I knew more about the content of your file, but knowing nothing other than that it is text, this sounds like an excellent MapReduce kind of problem.

PS: the fastest read of any file is a sequential one. cat file > /dev/null should show the speed at which the file can be read at all.
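
If you want to see what that ceiling looks like from inside Perl, a rough sketch (the file name is assumed; the block size is a tuning knob) that reads sequentially in large blocks and reports throughput:

    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    my $file  = 'fix_engine.log';               # assumed name
    my $block = 4 * 1024 * 1024;                # 4 MB reads
    open my $fh, '<:raw', $file or die "open $file: $!";

    my $t0    = [gettimeofday];
    my $bytes = 0;
    while (my $n = sysread $fh, my $buf, $block) {
        $bytes += $n;
    }
    close $fh;

    my $secs = tv_interval($t0);
    printf "%.1f MB in %.2f s = %.1f MB/s\n",
           $bytes / 2**20, $secs, $bytes / 2**20 / $secs;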

乱世女痞
#4 · 2019-04-05 07:36

Basically, you need to divide and conquer: if you have a network of computers, copy the 10 GB file to as many client PCs as possible and have each client PC read a different offset range of the file. For an added bonus, have each PC use multiple threads on top of the distributed reading.
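
A hedged sketch of the coordinating side, assuming the file has already been copied to each client as /data/fix_engine.log and a hypothetical count_chunk.pl script (taking start and end byte offsets) exists on every host; the host names are made up:

    use strict;
    use warnings;

    my @hosts = qw(client1 client2 client3 client4);   # made-up host names
    my $size  = 10 * 1024**3;                          # size of the copied file, in bytes
    my $chunk = int($size / @hosts) + 1;

    my @pids;
    for my $i (0 .. $#hosts) {
        my ($start, $end) = ($i * $chunk, ($i + 1) * $chunk);
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {
            # Each ssh runs in parallel; the client script prints its own counts.
            exec 'ssh', $hosts[$i],
                 "perl count_chunk.pl /data/fix_engine.log $start $end";
            die "exec failed: $!";
        }
        push @pids, $pid;
    }
    waitpid($_, 0) for @pids;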

一夜七次
#5 · 2019-04-05 07:43

The problem doesn't actually say whether the order of processing matters. So divide the file into equal parts, say 1 GB each; since you have multiple CPUs, multiple threads won't be a problem, so read each part with a separate thread. And if the machine has more than 10 GB of RAM, the entire contents can sit in RAM while the threads read it.
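
A rough sketch of the threaded version (the file name and FIX pattern are just examples; whether this beats the multi-process approach you already tried will depend on your Perl build and your I/O):

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my $file = 'fix_engine.log';            # placeholder name
    my $nthr = 4;
    my $size = -s $file or die "cannot stat $file";
    my $total :shared = 0;

    my @thr = map {
        my ($start, $end) = (int($size * $_ / $nthr), int($size * ($_ + 1) / $nthr));
        threads->create(sub {
            open my $fh, '<', $file or die "open: $!";
            if ($start > 0) {
                seek $fh, $start - 1, 0;
                <$fh>;                                     # finish any straddling line
            }
            my $count = 0;
            while (tell($fh) < $end and defined(my $line = <$fh>)) {
                $count++ if $line =~ /\x{01}35=8\x{01}/;   # e.g. count execution reports
            }
            close $fh;
            lock($total);
            $total += $count;
        });
    } 0 .. $nthr - 1;

    $_->join for @thr;
    print "total matches: $total\n";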

在下西门庆
#6 · 2019-04-05 07:44

Perhaps you've already read this forum thread, but if not:

http://www.perlmonks.org/?node_id=512221

It describes using Perl to do it line-by-line, and the users seem to think Perl is quite capable of it.

Oh, is it possible to process the file from a RAID array? If you have several mirrored disks, the read speed can be improved. Contention for the disk may be why your multi-threaded attempt doesn't work.

Best of luck.

一纸荒年 Trace。
#7 · 2019-04-05 07:45

I have a co-worker who sped up his FIX reading by going to 64-bit Linux. If it's something worthwhile, drop a little cash to get some fancier hardware.
