Parsing Large Text Files with PHP Without Killing Memory

Posted 2019-04-15 00:24

I'm trying to read some large text files (between 50 MB and 200 MB) and do simple text replacement (essentially, the XML I have hasn't been properly escaped in a few regular cases). Here's a simplified version of the function:

<?php
function cleanFile($file1, $file2) {
  $input_file  = fopen($file1, "r");
  $output_file = fopen($file2, "w");

  while (!feof($input_file)) {
    $buffer = trim(fgets($input_file, 4096));
    // Wrap the contents of <text> elements in CDATA, skipping lines
    // that are already wrapped
    if (substr($buffer, 0, 6) == '<text>' && substr($buffer, 0, 15) != '<text><![CDATA[') {
      $buffer = str_replace('<text>', '<text><![CDATA[', $buffer);
      $buffer = str_replace('</text>', ']]></text>', $buffer);
    }
    fputs($output_file, $buffer . "\n");
  }

  fclose($input_file);
  fclose($output_file);
}
?>

What I don't get is that for the largest of the files, around 150 MB, PHP's memory usage goes off the chart (around 2 GB) before failing. I thought this was the most memory-efficient way to read large files. Is there some method I'm missing that would use less memory? Perhaps some setting that keeps things in memory when they should be collected?

In other words, it's not working, I don't know why, and as far as I know I'm not doing anything incorrectly. Any direction for me to go? Thanks for any input.

3 Answers
姐就是有狂的资本
#2 · 2019-04-15 00:45

PHP isn't really designed for this kind of work. Offload the work to a different process and call it, or start it, from PHP. I suggest using Python or Perl.
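As a sketch of that offloading idea, a streaming tool like `sed` can do the same replacement line by line in constant memory (the file names here are hypothetical, and this assumes the input is line-oriented like the question's sample):

```shell
# Hypothetical file names. sed streams one line at a time, so memory stays flat.
# Lines already containing a CDATA section are left alone; the rest get wrapped.
sed '/<!\[CDATA\[/!{s/<text>/<text><![CDATA[/g; s|</text>|]]></text>|g;}' \
    input.xml > output.xml
```

From PHP, a command like this could then be kicked off with `shell_exec()` or `proc_open()`.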

We Are One
#3 · 2019-04-15 00:47

From my meagre understanding of PHP's garbage collection, the following might help:

  1. `unset($buffer)` when you are done writing it out to disk, explicitly telling the GC to clean it up.
  2. put the if block in another function, so the GC runs when that function exits.

The reasoning behind these recommendations is that I suspect the garbage collector is not freeing up memory because everything is done inside a single function, and the GC is garbage.

小情绪 Triste *
#4 · 2019-04-15 00:55

I expect this to fail in many cases. You are reading in chunks of up to 4096 bytes; who's to say the cut-off won't fall in the middle of a `<text>` tag? In that case your `str_replace` would not work.

Have you considered using a regular expression?
