Dealing with large CSV files (20G) in Ruby

Posted 2019-03-16 18:09

I am working on a little problem and would like some advice on how to solve it: given a CSV file with an unknown number of columns and rows, output a list of columns with their values and the number of times each value is repeated, without using any library.

If the file is small this shouldn't be a problem, but when it is a few gigabytes I get NoMemoryError: failed to allocate memory. Is there a way to create a hash and read from disk instead of loading the file into memory? You can do that in Perl with tied hashes.

EDIT: Will IO#foreach load the file into memory? How about File.open(filename).each?
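
For reference, this is the IO.foreach form I have in mind (just a sketch of the call, with a made-up filename):

IO.foreach("data.csv") do |line|
  # work with one line here
end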

Tags: ruby parsing csv
2 answers
何必那么认真
#2 · 2019-03-16 18:40

Do you read the whole file at once? Reading it on a per-line basis, e.g. using ruby -pe, ruby -ne, or $stdin.each, should reduce memory usage because lines that have already been processed can be garbage collected.

data = {}
$stdin.each do |line|
  # Process line, store results in the data hash.
end

Save it as script.rb and pipe the huge CSV file into this script's standard input:

ruby script.rb < data.csv

If you don't feel like reading from standard input, only a small change is needed:

data = {}
File.open("data.csv").each do |line|
  # Process line, store results in the data hash.
end
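
To make the "# Process line" step concrete for the original counting task, the body could look roughly like this. It is only a sketch: it assumes plain comma-separated fields with no quoting or embedded commas, and it keys columns by position since the file's layout is unknown.

data = Hash.new { |h, col| h[col] = Hash.new(0) }   # column index => { value => count }

File.open("data.csv").each do |line|
  line.chomp.split(",").each_with_index do |value, col|
    data[col][value] += 1   # tally this value under its column index
  end
end

# data[0] now maps each value seen in the first column to its number of occurrences.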
闹够了就滚
#3 · 2019-03-16 18:45

Read the file one line at a time, discarding each line as you go:

open("big.csv") do |csv|
  csv.each_line do |line|
    values = line.split(",")
    # process the values
  end
end

Using this method, you should never run out of memory.
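
One caveat: a plain line.split(",") mis-handles quoted fields that contain commas. If the csv library that ships with Ruby is acceptable despite the "no library" constraint, CSV.foreach also reads one row at a time and takes care of quoting (a sketch with an assumed filename):

require "csv"

CSV.foreach("big.csv") do |row|
  # row is an array of the fields in one record, with quoting already handled
end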
