I have a file with a lot of lines (say 1 billion). A script is iterating through all those lines to compare them against another data set.
Since this is running on 1 thread/1 core at the moment, I'm wondering if I could start multiple forks, each processing a part of the file simultaneously.
The only solution that has come to my mind so far is the `sed` Unix command.
With `sed` it's possible to read "slices" of a file (line x to line y).
So, a couple of forks could each process the output of a corresponding `sed`. However, the problem is that Ruby would load the whole `sed` output into RAM first.
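For example, a slice from line x to line y would look something like this (the file name is made up):

```
# Print lines 1,000,000 through 2,000,000, then stop reading the file.
sed -n '1000000,2000000p;2000000q' huge_file.txt
```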
Are there better solutions for this than sed, or is there a way to "stream" the sed output into Ruby?
You can do this with `fork` or threads. In both cases, you'll have to write something that manages them and determines how many sub-processes are needed and how many lines of the file each is supposed to process.

For that first piece of code, you'd want to scan the file and determine how many lines it contains. You could do that using the following command if you're on *nix or Mac OS:
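Something like this, where `huge_file.txt` stands in for your file:

```
# Prints the line count followed by the file name.
wc -l huge_file.txt
```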
or by simply opening the file and incrementing a counter as you read lines. Ruby is pretty fast at doing this, but on a file containing 1 billion lines, `wc` is likely to be the better choice. Divide the line count by the number of sub-processes you want to manage:
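A minimal sketch of that step, assuming the file name and a worker count of 4:

```ruby
line_count       = `wc -l < huge_file.txt`.to_i  # shell out to wc for the count
num_workers      = 4                             # however many sub-processes you want
lines_per_worker = (line_count.to_f / num_workers).ceil
```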
Then start your processes, telling them where to start processing, and for how many lines:
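A sketch of that last step; the file name is an assumption, and the lazy enumerator keeps each child from slurping its whole slice into RAM at once:

```ruby
path             = "huge_file.txt"               # assumed file name
num_workers      = 4
lines_per_worker = (`wc -l < #{path}`.to_i.to_f / num_workers).ceil

pids = num_workers.times.map do |i|
  fork do
    File.foreach(path)                 # streams the file line by line
        .lazy
        .drop(i * lines_per_worker)    # skip ahead to this child's slice
        .take(lines_per_worker)        # stop at the end of the slice
        .each do |line|
          # compare `line` against the other data set here
        end
  end
end
pids.each { |pid| Process.wait(pid) }  # wait for all children to finish
```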
That's untested, but is where I'd start.
What you are asking for won't actually help you.
First, to jump to line n of a file, you have to read through everything before it in order to count the line breaks. For example:
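The original timings aren't reproduced here, but the experiment was of this shape (file name assumed):

```
# First run: not instant, because sed has to scan 5 million lines first.
$ time sed -n '5000000{p;q}' huge_file.txt

# Second run: much faster, because the file is now cached in RAM.
$ time sed -n '5000000{p;q}' huge_file.txt
```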
Note how the `sed` command wasn't instant: it had to read through the initial part of the file to figure out where the 5 millionth line was. That is why running it a second time is so much faster for me; my computer cached the file into RAM.

Even if you do pull this off (by splitting the file manually), you will get poor IO performance if you are constantly jumping between different parts of a file (or between files) to read the next line.
It would be better to process every nth line in a separate thread (or process) instead. That allows you to use multiple CPU cores while still getting good IO performance, and it can easily be done with the parallel gem.
Example use (my computer has 4 cores):
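A sketch of what that looked like; `huge_file.txt` and the per-line work are placeholders (install the gem with `gem install parallel`):

```ruby
require 'parallel'

processes = 4                # one worker per core
path      = 'huge_file.txt'  # assumed file name

# Each process streams the whole file sequentially but only handles
# every 4th line, so IO stays sequential while the work is spread
# across all the cores.
Parallel.each(0...processes, in_processes: processes) do |worker|
  File.foreach(path).each_with_index do |line, index|
    next unless index % processes == worker
    # compare `line` against the other data set here
  end
end
```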
The second version (using 4 processes) completed in 29.81% of the time of the original, nearly 4 times faster.