I'm trying to read some large text files (between 50MB and 200MB) and do simple text replacement: essentially, the XML I have hasn't been properly escaped in a few regular cases. Here's a simplified version of the function:
<?php
function cleanFile($file1, $file2) {
    $input_file  = fopen($file1, "r");
    $output_file = fopen($file2, "w");
    while (!feof($input_file)) {
        $buffer = trim(fgets($input_file, 4096));
        if (substr($buffer, 0, 6) == '<text>' AND substr($buffer, 0, 15) != '<text><![CDATA[') {
            $buffer = str_replace('<text>', '<text><![CDATA[', $buffer);
            $buffer = str_replace('</text>', ']]></text>', $buffer);
        }
        fputs($output_file, $buffer . "\n");
    }
    fclose($input_file);
    fclose($output_file);
}
?>
What I don't get is that, for the largest of the files (around 150MB), PHP's memory usage goes off the chart (around 2GB) before it fails. I thought this was the most memory-efficient way to read a large file. Is there some method I'm missing that would be easier on memory? Perhaps some setting that's keeping things in memory when they should be collected?
In other words, it's not working and I don't know why, and as far as I can tell I'm not doing anything incorrectly. Any direction for me to go? Thanks for any input.
PHP isn't really designed for this. Offload the work to a different process and call it or start it from PHP. I suggest using Python or Perl.
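For example, one way to "start it from PHP" is to shell out to a stream-oriented tool and let PHP only orchestrate. A minimal sketch, assuming a Unix-like host with sed on the PATH; the sed pattern is illustrative only and omits the question's check for lines that are already CDATA-wrapped:

<?php
// Sketch: hand the per-line rewriting to sed, keep PHP as the launcher.
// $file1 / $file2 are the same input and output paths as in the question.
$pattern = 's|<text>\(.*\)</text>|<text><![CDATA[\1]]></text>|';
$cmd = sprintf(
    'sed -e %s %s > %s',
    escapeshellarg($pattern),
    escapeshellarg($file1),
    escapeshellarg($file2)
);
exec($cmd, $out, $status);   // $status is 0 on success
?>

The point is that the external process streams the file line by line and its memory use stays flat regardless of file size.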
From my meagre understanding of PHP's garbage collection, the following might help:
- unset $buffer when you are done writing it out to disk, explicitly telling the GC to clean it up.
- Move the if block into another function, so the GC runs when that function exits.
The reasoning behind these recommendations is that I suspect the garbage collector is not freeing up memory because everything is done inside a single function, and the GC is garbage.
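A minimal sketch of those two suggestions applied to the function from the question (whether this actually changes the memory profile is exactly what's in doubt here):

<?php
// Sketch only: per-line work moved into a helper so its locals go out of scope,
// and the buffer explicitly unset after it has been written out.
function cleanLine($buffer) {
    if (substr($buffer, 0, 6) == '<text>' AND substr($buffer, 0, 15) != '<text><![CDATA[') {
        $buffer = str_replace('<text>', '<text><![CDATA[', $buffer);
        $buffer = str_replace('</text>', ']]></text>', $buffer);
    }
    return $buffer;
}

function cleanFile($file1, $file2) {
    $input_file  = fopen($file1, "r");
    $output_file = fopen($file2, "w");
    while (!feof($input_file)) {
        $buffer = cleanLine(trim(fgets($input_file, 4096)));
        fputs($output_file, $buffer . "\n");
        unset($buffer);   // hint to the GC that this line is no longer needed
    }
    fclose($input_file);
    fclose($output_file);
}
?>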
I expect this to fail in many cases. You are reading in chunks of 4096 bytes; who is to say the cut-off won't fall in the middle of a <text> element? In that case your str_replace would not work. Have you considered using a regular expression?