How to load a huge file into a String or list in Racket

Posted 2019-07-26 20:30

Question:

I have a HUGE file that I need to do operations on. Huge as in approx. half a million words.

I just want to read it into a list or String so I can do things with it later.

I know I could load it into a string using file->string, or use file->list or file->lines, but these seem to take way too long.

Is this the right way to load it into a list?

(define my-list (with-input-from-file "myFile.txt" read))

Whenever I run my program I just get the first line printed out. Seems to work for smaller files though.

Answer 1:

I'm going to assume that by half a million words, you mean your file is about 5 GB.

If this is the case, you really don't want to read the whole thing into memory. Sure, the whole thing will technically fit into the RAM many computers have (although certainly not all), but it'll also take a while to load. With an SSD this will take about 10 seconds, which might be 100% fine depending on your application, but it certainly isn't speedy for a standard desktop app. If you're reading it from an HDD, it'll take a good 60 seconds. And that presumes the file isn't fragmented on disk; if it is, it'll be even slower.

Both of those figures are best-case minimums, and in practice loading a 5 GB file entirely into RAM is going to be slow at best. (Although in some very rare circumstances it is what you want, generally when you are doing high-performance computing.)

A better idea, as @Carcigenicate suggested, is to stream the file into your program lazily, so that you don't have the long pause. To do this, I recommend either in-input-port-bytes or in-bytes-lines. Both produce sequences that you can then use to process your data: the first gives you one byte at a time, and the second gives you one line of bytes at a time, in each case until you reach EOF. You can do this in a for loop:

(call-with-input-file "file.txt"
  (lambda (f)
    ;; Step through the port one byte at a time, counting as we go.
    (for/fold ([counter 0])
              ([i (in-input-port-bytes f)])
      (+ counter 1))))

The above example is a slow way to calculate the number of bytes in a file. But it shows how you can use in-input-port-bytes.

There are other functions that produce characters or strings rather than bytes from a file: in-input-port-chars, in-lines, etc.
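As a minimal sketch of the same lazy approach with in-lines, the following counts whitespace-separated words one line at a time, so only the current line is held in memory. The name count-words is made up for illustration and is not part of Racket, and splitting on whitespace is an assumption about what counts as a "word".

```racket
#lang racket

;; count-words is a hypothetical helper, not a Racket built-in.
;; It reads the file line by line and sums the number of
;; whitespace-separated tokens on each line.
(define (count-words path)
  (call-with-input-file path
    (lambda (in)
      (for/sum ([line (in-lines in)])
        (length (string-split line))))))
```

Calling (count-words "file.txt") returns the total as a number. A file with no line breaks at all would still be read as a single line, so this only helps when lines are reasonably short.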



Answer 2:

I have a strong feeling that your problem isn't reading the string in, but rather printing it out.

Specifically, reading a file of this size appears to take me approximately 0.03 seconds.

I generated a file using this program:

#lang racket

(define str
  "Beebe a reeble to one niner big druppy bonker watz. ")

(with-output-to-file "/tmp/foo.txt"
  (λ ()
    (for ([i (in-range (/ 500000 10))])
      (displayln str))))    

Then, I read it in like this:

#lang racket

(define a (time (file->string "/tmp/foo.txt")))

... and produced this output:

cpu time: 30 real time: 30 gc time: 17

... indicating 30 milliseconds.

Note that because I wrapped the file->string in a define, I was not printing the whole thing out. That would take a long long time.
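If the goal is a list of words rather than one string, a rough sketch along the same lines is to split the string on whitespace and print only a short summary instead of the whole list. The names file->words and show-summary are hypothetical helpers, not Racket built-ins, and this assumes the file fits comfortably in memory.

```racket
#lang racket

;; file->words is a hypothetical helper: read the whole file as one
;; string, then split it on whitespace into a list of words.
(define (file->words path)
  (string-split (file->string path)))

;; show-summary is also hypothetical: it prints the word count and at
;; most the first five words, rather than the entire list.
(define (show-summary path)
  (define words (file->words path))
  (printf "~a words; first few: ~a\n"
          (length words)
          (take words (min 5 (length words)))))
```

For the file generated above, (show-summary "/tmp/foo.txt") should report 500000 words, since the program writes 50000 lines of ten words each.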