Reading a block of lines in a file using PHP

Suppose I have a 100GB text file containing millions of lines. How can I read this file in blocks of lines using PHP?

I can't use file_get_contents() because the file is too large. fgets() reads the file line by line, which will likely take longer to finish reading the whole file.

If I use fread($fp, 5030), where 5030 is the maximum number of bytes to read, could there be a case where it doesn't read a whole line (for example, it stops in the middle of a line) because it has reached the maximum length?

Tags: php file fgets fread

5 answers

Explosion°爆炸

#2 · 2019-01-27 01:35

I think you have to use fread($fp, $somesize) and check manually whether you have found the end of the line; if not, read another chunk.

Hope this helps.


在下西门庆

#3 · 2019-01-27 01:42

Quoting the question: "I can't use file_get_contents() because the file is too large. fgets() reads the file line by line, which will likely take longer to finish reading the whole file."

I don't see why you shouldn't be able to use fgets():

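$blocksize = 50; // block size, in number of lines
$fh = fopen('bigfile.txt', 'r'); // placeholder path; open your own file here
do {
  $lines = array();
  // Collect up to $blocksize lines; fgets() returns false at EOF or on error.
  while (count($lines) < $blocksize && ($line = fgets($fh)) !== false) {
    $lines[] = $line;
  }
  if ($lines) {
    doSomethingWithLines($lines); // your own processing function
  }
} while ($line !== false);
fclose($fh);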

Reading 100GB will take time anyway.


爱情/是我丢掉的垃圾

#4 · 2019-01-27 01:45

I know this is an old question, but I think there is value in a new answer for anyone who finds this question later.

I agree that reading 100GB takes time. That is exactly why I think we should look for the most efficient way to read it, so that it takes as little time as possible, instead of thinking "who cares how long it takes if it is already a lot". So, let's find the lowest time possible.

Another solution:

1. Cache a chunk of raw data: use fread() to read a chunk of the file into a cache.

2. Read line by line: read lines from the cache until you reach the end of the cache or the end of the data.

3. Read the next chunk and repeat: take the unprocessed last part of the chunk (the part where you were still looking for the line delimiter) and move it to the front. Then read a chunk of the size you defined minus the size of the unprocessed data, and place it right after that unprocessed part. You now have a new complete chunk. Repeat the line-by-line reading and this process until the file has been read completely.

You should use a cache chunk larger than the longest line you expect.

The bigger the cache size the faster you read, but the more memory you use.
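The steps above can be sketched roughly as follows. This is only an illustration, not the answerer's own code: the function name readLinesChunked() and the default $chunkSize of 1 MiB are assumptions, lines are assumed to end in "\n", and for simplicity it always reads a full chunk and prepends the leftover tail instead of reading "chunk size minus leftover".

```php
<?php
// Sketch: yield lines from a large file by reading fixed-size chunks with fread().
// readLinesChunked() and $chunkSize are illustrative names, not from the answer.
function readLinesChunked(string $path, int $chunkSize = 1048576): Generator
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $carry = ''; // unprocessed tail of the previous chunk (an incomplete line)
    try {
        while (!feof($fp)) {
            $data = fread($fp, $chunkSize);
            if ($data === false) {
                break; // read error
            }
            $buffer = $carry . $data;
            $lastNewline = strrpos($buffer, "\n");
            if ($lastNewline === false) {
                // No complete line in the buffer yet: keep accumulating.
                $carry = $buffer;
                continue;
            }
            // Everything after the last newline is an incomplete line.
            $carry = substr($buffer, $lastNewline + 1);
            foreach (explode("\n", substr($buffer, 0, $lastNewline)) as $line) {
                yield $line;
            }
        }
        if ($carry !== '') {
            yield $carry; // final line without a trailing newline
        }
    } finally {
        fclose($fp);
    }
}
```

A caller would iterate with something like foreach (readLinesChunked('/path/to/big.txt') as $line) { ... }. Because it is a generator, only one chunk plus the leftover tail is held in memory at a time.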


我欲成王，谁敢阻挡

#5 · 2019-01-27 01:49

The fread() approach sounds like a reasonable solution. You can detect whether you've reached the end of a line by checking whether the final character in the string is a newline character ("\n"; note that PHP only interprets this escape inside double quotes, so '\n' in single quotes is a backslash followed by the letter n). If it isn't, you can either read some more characters and append them to your existing string, or trim characters from your string back to the last newline and then use fseek() to adjust your position in the file.
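As a rough sketch of the "trim back and fseek" variant, assuming a seekable file handle and "\n" line endings: readBlockOfLines() below is a hypothetical helper name, and it falls back to fgets() when a single line is longer than the requested length.

```php
<?php
// Sketch: read one block of complete lines, rewinding past any partial line.
// readBlockOfLines() is a hypothetical helper name.
function readBlockOfLines($fp, int $length): string
{
    $chunk = fread($fp, $length);
    if ($chunk === false || $chunk === '') {
        return ''; // nothing left to read (or a read error)
    }
    if (substr($chunk, -1) === "\n" || feof($fp)) {
        return $chunk; // ends on a line boundary, or is the end of the file
    }
    $lastNewline = strrpos($chunk, "\n");
    if ($lastNewline === false) {
        // $length is shorter than the current line: read the rest of that line.
        return $chunk . (string) fgets($fp);
    }
    // Move the file pointer back to just after the last complete line.
    $extra = strlen($chunk) - ($lastNewline + 1);
    fseek($fp, -$extra, SEEK_CUR);
    return substr($chunk, 0, $lastNewline + 1);
}
```

A caller could loop with while (($block = readBlockOfLines($fp, 5030)) !== '') { ... } and split each block on "\n" if individual lines are needed.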

Side point: Are you aware that reading a 100GB file will take a very long time?


做自己的国王

#6 · 2019-01-27 01:57

I would recommend implementing the reading of a single line within a function, hiding the implementation details of that specific step from the rest of your code: the processing function should not need to care how the line was retrieved. You can then implement your first version using fgets() and try other methods if you notice that it is too slow. It could very well be that the initial implementation is too slow, but the point is: you won't know until you've benchmarked.
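One way this separation might look, as a minimal sketch: readLines() is a hypothetical name, and this first version simply wraps fgets() in a generator and strips the trailing line ending.

```php
<?php
// Sketch: hide how lines are retrieved behind a single function.
// readLines() is a hypothetical name; this first version just uses fgets().
function readLines(string $path): Generator
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        // fgets() returns false at end of file or on error.
        while (($line = fgets($fp)) !== false) {
            yield rtrim($line, "\r\n"); // drop the trailing line ending
        }
    } finally {
        fclose($fp);
    }
}
```

The processing code would then only iterate, for example foreach (readLines('/path/to/big.txt') as $line) { ... }, so the body of readLines() can later be swapped for a chunked fread() version and timed with microtime(true) without touching the caller.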
