Suppose I have a 100GB text file containing millions of lines of text. How can I read this file in blocks of lines using PHP?
I can't use file_get_contents() because the file is too large. fgets() also reads the text line by line, which will likely take longer to finish reading the whole file.
If I use fread($fp, 5030), where 5030 is the number of bytes to read, would there be a case where it doesn't read a whole line (that is, it stops in the middle of a line) because it has reached the maximum length?
"I can't use file_get_contents() because the file is too large. fgets() also reads the text line by line, which will likely take longer to finish reading the whole file."
I don't see why you shouldn't be able to use fgets():
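$blocksize = 50; // block size, in number of lines
$fh = fopen('big.txt', 'r'); // 'big.txt' is a placeholder path
do {
    $lines = array();
    // fgets() returns false at end of file, so stop collecting there.
    while (count($lines) < $blocksize && ($line = fgets($fh)) !== false) {
        $lines[] = $line;
    }
    if ($lines) {
        doSomethingWithLines($lines);
    }
} while (count($lines) === $blocksize); // a short block means end of file
fclose($fh);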
Reading 100GB will take time anyway.
The fread approach sounds like a reasonable solution. You can detect whether you've reached the end of a line by checking whether the final character in the string is a newline character ("\n"). If it isn't, then you can either read some more characters and append them to your existing string, or you can trim characters from your string back to the last newline and then use fseek to adjust your position in the file.
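As a rough sketch of the "trim back to the last newline, then fseek" variant described above: readLineBlock() is a made-up helper name, and the code assumes a seekable file opened with fopen() and "\n" line endings. It is only one way this could look, not a tuned implementation.

```php
<?php
// Read up to $length bytes, but return only complete lines.
// readLineBlock() is a hypothetical helper name, not a PHP built-in.
function readLineBlock($fp, int $length)
{
    $chunk = fread($fp, $length);
    if ($chunk === false || $chunk === '') {
        return false; // nothing left to read
    }
    // If the chunk stops mid-line and more data follows, cut it back to
    // the last newline and move the file pointer back by the cut part.
    if (substr($chunk, -1) !== "\n" && !feof($fp)) {
        $pos = strrpos($chunk, "\n");
        if ($pos !== false) {
            $extra = strlen($chunk) - ($pos + 1);
            fseek($fp, -$extra, SEEK_CUR);
            $chunk = substr($chunk, 0, $pos + 1);
        }
        // No newline at all means the line is longer than $length;
        // this sketch then returns the partial line unchanged.
    }
    return $chunk;
}
```

Calling readLineBlock($fp, 5030) in a loop until it returns false would then yield blocks that each end on a line boundary, as long as no single line exceeds the requested length.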
Side point: Are you aware that reading a 100GB file will take a very long time?
I think you have to use fread($fp, somesize) and check manually whether you have found the end of the line; if not, read another chunk.
Hope this helps.
I would recommend implementing the reading of a single line within a function, hiding the implementation details of that specific step from the rest of your code - the processing function must not care how the line was retrieved. You can then implement your first version using fgets() and then try other methods if you notice that it is too slow. It could very well be that the initial implementation is too slow, but the point is: you won't know until you've benchmarked.
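To illustrate that idea, here is a minimal sketch. The names nextLine() and benchmarkRead() are invented for this example; the first version simply wraps fgets(), and the timing uses microtime(true), which is only a coarse measurement.

```php
<?php
// nextLine() is a hypothetical wrapper: callers only ask for "the next
// line", so the fgets() call inside can be swapped out later.
function nextLine($fh)
{
    $line = fgets($fh);
    return $line === false ? null : rtrim($line, "\r\n");
}

// Time one full pass over the file with the current implementation.
function benchmarkRead(string $path): array
{
    $fh = fopen($path, 'r');
    $start = microtime(true);
    $count = 0;
    while (($line = nextLine($fh)) !== null) {
        $count++; // real processing would go here
    }
    fclose($fh);
    return ['lines' => $count, 'seconds' => microtime(true) - $start];
}
```

Running benchmarkRead() on a representative sample of the file before and after replacing the body of nextLine() gives a like-for-like comparison, because the calling code stays the same.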
I know this is an old question, but I think there is value for a new answer for anyone that finds this question eventually.
I agree that reading 100GB takes time. That is why I also agree that we need to find the most efficient way to read it, so that it takes as little time as possible, instead of just thinking "who cares how long it takes if it is already a lot". So, let's find the lowest possible time.
Another solution:
Cache a chunk of raw data
Use fread to read a chunk of the file into a cache.
Read line by line
Read line by line from the cache until the end of the cache or the end of the data is reached.
Read the next chunk and repeat
Take the unprocessed last part of the chunk (the part where you were still looking for the line delimiter) and move it to the front. Then read a chunk of the size you defined minus the size of the unprocessed data and put it right after that unprocessed part. You now have a new complete chunk.
Repeat the line-by-line read and this refill step until the file has been read completely.
You should use a cache chunk larger than the longest line you expect.
The bigger the cache size the faster you read, but the more memory you use.
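The steps above could be sketched roughly as follows. This is an illustration under assumptions, not a benchmarked implementation: readLinesChunked() and $chunkSize are names made up for the example, lines are split on "\n" only, and a generator is used so the caller can loop over lines without holding the file in memory.

```php
<?php
// readLinesChunked() is a hypothetical name; $chunkSize is the cache
// size in bytes and should exceed the longest expected line.
function readLinesChunked(string $path, int $chunkSize = 1048576): Generator
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $carry = ''; // unprocessed tail of the previous chunk
    while (true) {
        // Refill: keep the tail in front and top the cache up to
        // $chunkSize (at least 1 byte, in case a line is oversized).
        $want = max(1, $chunkSize - strlen($carry));
        $data = fread($fp, $want);
        if ($data === false || $data === '') {
            break; // end of data
        }
        $cache = $carry . $data;
        $start = 0;
        // Hand out every complete line currently in the cache.
        while (($pos = strpos($cache, "\n", $start)) !== false) {
            yield substr($cache, $start, $pos - $start);
            $start = $pos + 1;
        }
        $carry = substr($cache, $start); // incomplete last line, if any
    }
    fclose($fp);
    if ($carry !== '') {
        yield $carry; // final line without a trailing newline
    }
}
```

A caller would iterate with foreach over readLinesChunked('big.txt'), where 'big.txt' is a placeholder path. If a line turns out longer than the cache, this sketch still works but falls back to very small reads, which is one more reason to size the cache generously.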