Parsing Large Text Files with PHP Without Killing Memory

Posted 2019-04-15 00:24

I'm trying to read some large text files (between 50 MB and 200 MB) and do simple text replacement (essentially, the XML I have hasn't been properly escaped in a few regular cases). Here's a simplified version of the function:

<?php
function cleanFile($file1, $file2) {
  $input_file  = fopen($file1, "r");
  $output_file = fopen($file2, "w");

  while (!feof($input_file)) {
    $buffer = trim(fgets($input_file, 4096));
    // Wrap the contents of <text> elements in CDATA, skipping lines
    // that are already wrapped
    if (substr($buffer, 0, 6) == '<text>' && substr($buffer, 0, 15) != '<text><![CDATA[') {
      $buffer = str_replace('<text>', '<text><![CDATA[', $buffer);
      $buffer = str_replace('</text>', ']]></text>', $buffer);
    }
    fputs($output_file, $buffer . "\n");
  }

  fclose($input_file);
  fclose($output_file);
}
?>

What I don't get is that for the largest of the files, around 150 MB, PHP's memory usage goes off the chart (around 2 GB) before failing. I thought this was the most memory-efficient way to read large files. Is there some method I'm missing that would use less memory? Perhaps some setting that keeps things in memory when they should be collected?

In other words, it's not working, I don't know why, and as far as I know I'm not doing anything incorrectly. Any direction for me to go? Thanks for any input.

3 Answers
姐就是有狂的资本
#2 · 2019-04-15 00:45

PHP isn't really designed for this kind of work. Offload the work to a different process and call it, or start it, from PHP. I suggest using Python or Perl.
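As a sketch of that offloading idea, a streaming tool like `sed` can do the same replacement line by line in constant memory (the file names here are hypothetical, and this assumes the input is line-oriented like the question's sample):

```shell
# Hypothetical file names. sed streams one line at a time, so memory stays flat.
# Lines already containing a CDATA section are left alone; the rest get wrapped.
sed '/<!\[CDATA\[/!{s/<text>/<text><![CDATA[/g; s|</text>|]]></text>|g;}' \
    input.xml > output.xml
```

From PHP, a command like this could then be kicked off with `shell_exec()` or `proc_open()`.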

We Are One
#3 · 2019-04-15 00:47

From my meagre understanding of PHP's garbage collection, the following might help:

  1. `unset($buffer)` when you are done writing it out to disk, explicitly telling the GC to clean it up.
  2. put the if block in another function, so the GC runs when that function exits.

The reasoning behind these recommendations is that I suspect the garbage collector is not freeing up memory because everything is done inside a single function, and the GC is garbage.

小情绪 Triste *
#4 · 2019-04-15 00:55

I expect this to fail in many cases. You are reading in chunks of up to 4096 bytes; who's to say the cut-off won't fall in the middle of a `<text>` tag? In that case your `str_replace` would not work.

Have you considered using a regular expression?
