I am working on a small problem and would like some advice on how to solve it: given a CSV file with an unknown number of columns and rows, output a list of columns with their values and the number of times each value was repeated, without using any library.
If the file is small this isn't a problem, but when it is a few gigabytes I get NoMemoryError: failed to allocate memory. Is there a way to create a hash that is read from disk instead of loading the whole file into memory? You can do that in Perl with tied hashes.
EDIT: Will IO.foreach load the file into memory? How about File.open(filename).each?
Do you read the whole file at once? Reading it line by line, i.e. using ruby -pe, ruby -ne or $stdin.each, should reduce memory usage, because lines that have already been processed can be garbage collected.
Save the script as script.rb and pipe the huge CSV file into its standard input.
If you don't feel like reading from standard input, a small change is needed.
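A minimal sketch of what such a script.rb could look like; this is not the answerer's original code. It assumes fields are separated by plain commas with no quoted or escaped commas, identifies columns by position, and uses a hypothetical method name, count_values. Taking any IO object as the argument means the same method works for standard input and for an opened file.

```ruby
# script.rb (sketch)
# Counts how often each value occurs in each column, reading one line at a time.
# Assumption: fields are separated by plain commas, with no quoted commas.
def count_values(io)
  # counts[column_index][value] => number of occurrences
  counts = Hash.new { |hash, column| hash[column] = Hash.new(0) }
  io.each_line do |line|
    # The -1 limit keeps trailing empty fields instead of dropping them.
    line.chomp.split(",", -1).each_with_index do |value, column|
      counts[column][value] += 1
    end
  end
  counts
end
```

Ending the file with `pp count_values($stdin)` and running `ruby script.rb < huge.csv` (huge.csv being a placeholder name) would cover the piped case. For a named file, `File.open(path) { |file| pp count_values(file) }` is one possible form of the small change. Only the current line and the counts are held in memory, though the counts hash itself still grows with the number of distinct values.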
Read the file one line at a time, discarding each line as you go:
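A minimal sketch of this approach, under a few assumptions that are not in the original post: the first line is a header row with the column names, fields are separated by plain commas with no quoted commas, and column_counts is a hypothetical helper name. File.foreach yields one line at a time rather than reading the whole file.

```ruby
# Counts the values of each column, keyed by the column name from the header row.
# Assumptions: first line is a header; plain comma-separated fields, no quoting.
def column_counts(path)
  headers = nil
  counts = Hash.new { |hash, name| hash[name] = Hash.new(0) }
  # File.foreach yields the file line by line instead of loading it all at once.
  File.foreach(path) do |line|
    fields = line.chomp.split(",", -1)
    if headers.nil?
      headers = fields # remember the column names, count nothing for this line
      next
    end
    fields.each_with_index do |value, index|
      # Fall back to the position when a row has more fields than the header.
      counts[headers[index] || index][value] += 1
    end
  end
  counts
end
```

Each line becomes eligible for garbage collection once its block iteration ends; only the header and the running counts are kept.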
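Using this method, memory use grows with the number of distinct values rather than with the size of the file, so you should not run out of memory unless most values are unique.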