I have a file with a lot of lines (say 1 billion).
A script is iterating through all those lines to compare them against another data set.
Since this is running on 1 thread/1 core at the moment, I'm wondering if I could start multiple forks, each processing a part of the file simultaneously.
The only solution that came to my mind so far is the sed Unix command. With sed it's possible to read "slices" of a file (lines x through y), so a couple of forks could each process the output of a corresponding sed. The problem, however, is that Ruby would load the whole sed output into RAM first.
Are there better solutions for this than sed, or is there a way to "stream" the sed output into Ruby?
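For illustration, this is the kind of slicing I have in mind (a sketch; the file name, line numbers, and the compare helper are made up). Backticks slurp the entire output into one string, which is exactly the RAM problem:

# read lines 1,000,001 through 2,000,000, then stop scanning the file
slice = `sed -n '1000001,2000000p; 2000000q' large_file.txt`
slice.each_line { |line| compare(line) } # compare is a stand-in for my actual check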
What you are asking for won't actually help you.
Firstly, to jump to line n of a file, you first have to read everything before it, to count the line breaks. For example:
$ ruby -e '(1..10000000).each { |i| puts "This is line number #{i}"}' > large_file.txt
$ du -h large_file.txt
266M large_file.txt
$ purge # mac os x command - clears any in memory disk caches in use
$ time sed -n -e "5000000p; 5000000q" large_file.txt
This is line number 5000000
sed -n -e "5000000p; 5000000q" large_file.txt 0.52s user 0.13s system 28% cpu 2.305 total
$ time sed -n -e "5000000p; 5000000q" large_file.txt
This is line number 5000000
sed -n -e "5000000p; 5000000q" large_file.txt 0.49s user 0.05s system 99% cpu 0.542 total
Note how the sed command wasn't instant: it had to read through the initial part of the file to figure out where the 5-millionth line was. That is also why running it a second time is so much faster for me: my computer had cached the file into RAM.
Even if you do pull this off (by splitting the file manually), you will get poor IO performance if you are constantly jumping between different parts of a file (or of several files) to read the next line.
What would be better is to process every nth line on a separate thread (or process) instead. This allows the use of multiple CPU cores while still keeping good IO performance, and it can easily be done with the parallel gem.
Example use (my computer has 4 cores; the line * 10000 in the block is just busywork to simulate a CPU-heavy comparison):
$ ruby -e '(1..100000).each { |i| puts "This is line number #{i}"}' > large_file.txt # use a smaller file to speed up the tests
$ time ruby -r parallel -e "Parallel.each(File.open('large_file.txt').each_line, in_processes: 4) { |line| puts line if (line * 10000) =~ /9999/ }"
This is line number 9999
This is line number 19999
This is line number 29999
This is line number 39999
This is line number 49999
This is line number 59999
This is line number 69999
This is line number 79999
This is line number 89999
This is line number 99990
This is line number 99991
This is line number 99992
This is line number 99993
This is line number 99994
This is line number 99995
This is line number 99996
This is line number 99997
This is line number 99999
This is line number 99998
ruby -r parallel -e 55.84s user 10.73s system 400% cpu 16.613 total
$ time ruby -r parallel -e "Parallel.each(File.open('large_file.txt').each_line, in_processes: 1) { |line| puts line if (line * 10000) =~ /9999/ }"
This is line number 9999
This is line number 19999
This is line number 29999
This is line number 39999
This is line number 49999
This is line number 59999
This is line number 69999
This is line number 79999
This is line number 89999
This is line number 99990
This is line number 99991
This is line number 99992
This is line number 99993
This is line number 99994
This is line number 99995
This is line number 99996
This is line number 99997
This is line number 99998
This is line number 99999
ruby -r parallel -e 47.04s user 7.46s system 97% cpu 55.738 total
The first run (using 4 processes) completed in 29.8% of the time of the single-process run, making it about 3.4 times faster.
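Applied back to the original problem, the same pattern might look something like this (a sketch, not tested against your data; lookup.txt and the membership check stand in for whatever your comparison data set is):

require 'set'
require 'parallel'

# build the comparison set once in the parent; on Unix the forked
# workers inherit it, so it isn't rebuilt or copied per line
lookup = Set.new(File.foreach('lookup.txt').map(&:chomp))

# workers pull lines from the shared enumerator as they become free
Parallel.each(File.foreach('large_file.txt'), in_processes: 4) do |line|
  puts line if lookup.include?(line.chomp)
end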
You can do this with fork or threads. In both cases, you'll have to write something that manages the workers, and determines how many sub-processes are needed and how many lines of the file each is supposed to process.
So, for that first piece of code, you'd want to scan the file and determine how many lines it contains. You could do that using the following command if you're on *nix or Mac OS:
lc = `wc -l path/to/file`.to_i
or by simply opening the file and incrementing a counter as you read lines. Ruby is pretty fast at doing this, but on a file containing a billion lines, wc might be the better choice:
lc = 0
File.foreach('path/to/file') { lc += 1 }
Divide that by the number of sub-processes you want to manage, rounding up so the workers cover every line between them:
NUM_OF_PROCESSES = 5
lines_per_process = (lc / NUM_OF_PROCESSES.to_f).ceil
Then start your workers (threads in this sketch), telling each where to start processing and how many lines to handle:
children = [] # Thread is part of core Ruby; no require needed
1.step(lc, lines_per_process) do |start_line|
  children << Thread.new do
    end_line = start_line + lines_per_process - 1
    cur_line = 0
    File.foreach('path/to/file') do |li|
      cur_line += 1
      next if cur_line < start_line # skip lines before this worker's chunk
      break if cur_line > end_line  # stop reading once the chunk is done
      # ... do something with the lines read
    end
  end
end
# wait for them to finish
children.each(&:join)
That's untested, but is where I'd start.
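One caveat worth noting: on MRI, threads won't run Ruby code on multiple cores at once because of the global interpreter lock, so for CPU-bound work the fork route is the one that actually buys parallelism. A minimal fork-based sketch of the same chunking, under the same assumptions as the threaded version:

pids = []
1.step(lc, lines_per_process) do |start_line|
  pids << fork do
    end_line = start_line + lines_per_process - 1
    cur_line = 0
    File.foreach('path/to/file') do |li|
      cur_line += 1
      next if cur_line < start_line # skip lines before this child's chunk
      break if cur_line > end_line  # stop reading once the chunk is done
      # ... do something with the lines read
    end
  end
end
# wait for all the child processes to exit
pids.each { |pid| Process.wait(pid) }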