Process a very big CSV file without timeout and memory errors

Posted 2019-01-10 13:35

Question:

At the moment I'm writing an import script for a very big CSV file. The problem is that most of the time it stops after a while because of a timeout, or it throws a memory error.

My idea was to parse the CSV file in steps of 100 lines and, after every 100 lines, re-call the script automatically. I tried to achieve this with header('Location: ...') and passing the current line via GET, but it didn't work out the way I wanted.

Is there a better way to do this, or does someone have an idea how to get rid of the memory error and the timeout?
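For illustration, this is roughly the pattern I tried (the offset GET parameter, the file name and the chunk size of 100 are placeholders of mine, not working code):

<?php
// Rough sketch of the "re-call the script every 100 lines" idea
$offset = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;
$chunkSize = 100;

$file = new SplFileObject('import.csv');
$file->setFlags(SplFileObject::READ_CSV);
$file->seek($offset); // continue where the previous run stopped

$processed = 0;
while (!$file->eof() && $processed < $chunkSize) {
    $row = $file->current();
    $file->next();
    $processed++;
    // ... import $row here ...
}

if (!$file->eof()) {
    // restart the script for the next chunk
    header('Location: import.php?offset=' . ($offset + $processed));
    exit;
}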

Answer 1:

I've used fgetcsv to read a 120 MB CSV in a streaming manner (is that correct English?). It reads line by line, and I then inserted every line into a database. That way only one line is held in memory on each iteration. The script still needed 20 minutes to run. Maybe I'll try Python next time... Don't try to load a huge CSV file into an array; that really would consume a lot of memory.

// WDI_GDF_Data.csv (120.4MB) are the World Bank collection of development indicators:
// http://data.worldbank.org/data-catalog/world-development-indicators
if(($handle = fopen('WDI_GDF_Data.csv', 'r')) !== false)
{
    // get the first row, which contains the column-titles (if necessary)
    $header = fgetcsv($handle);

    // loop through the file line-by-line
    while(($data = fgetcsv($handle)) !== false)
    {
        // resort/rewrite data and insert into DB here
        // try to use conditions sparingly here, as those will cause slow-performance

        // I don't know if this is really necessary, but it couldn't harm;
        // see also: http://php.net/manual/en/features.gc.php
        unset($data);
    }
    fclose($handle);
}
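For the "insert into DB here" part, a prepared statement that is reused for every row keeps the per-row overhead small. A minimal sketch, assuming a mysqli connection and a placeholder table indicators with three placeholder columns (the bound array indexes are also just examples):

<?php
// Sketch: per-row insert with one reused prepared statement.
// Table/column names and the bound columns are placeholders.
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');
$stmt = $mysqli->prepare('INSERT INTO indicators (country, indicator, value) VALUES (?, ?, ?)');

if (($handle = fopen('WDI_GDF_Data.csv', 'r')) !== false) {
    $header = fgetcsv($handle); // skip the title row

    while (($data = fgetcsv($handle)) !== false) {
        // rebind the current row and execute; only one row lives in memory at a time
        $stmt->bind_param('ssd', $data[0], $data[2], $data[4]);
        $stmt->execute();
    }
    fclose($handle);
}
$stmt->close();
$mysqli->close();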


Answer 2:

I find uploading the file and inserting it with MySQL's LOAD DATA LOCAL query a fast solution, e.g.:

    $sql = "LOAD DATA LOCAL INFILE '/path/to/file.csv' 
        REPLACE INTO TABLE table_name FIELDS TERMINATED BY ',' 
        ENCLOSED BY '\"' LINES TERMINATED BY '\r\n' IGNORE 1 LINES";
    $result = $mysqli->query($sql);
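Note that LOAD DATA LOCAL only works if the client allows it; with mysqli you would typically enable the option before connecting. A sketch, assuming the connection is created here and the server also has local_infile enabled:

<?php
// Sketch: allow LOCAL INFILE on the client side before connecting.
// The MySQL server must also have local_infile enabled, otherwise the query is rejected.
$mysqli = mysqli_init();
$mysqli->options(MYSQLI_OPT_LOCAL_INFILE, true);
$mysqli->real_connect('localhost', 'user', 'pass', 'mydb');

// ... then run the LOAD DATA LOCAL INFILE query from above via $mysqli->query($sql)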


Answer 3:

If you don't care how long it takes and how much memory it needs, you can simply increase the limits for this script. Just add the following lines to the top of your script:

ini_set('memory_limit', '512M');
ini_set('max_execution_time', '180');

With the function memory_get_usage() you can find out how much memory your script needs, in order to choose a good value for memory_limit.

You might also want to have a look at fgets(), which allows you to read a file line by line. I am not sure whether that takes less memory, but I really think it will work. Even in this case, though, you will have to raise max_execution_time to a higher value.
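A minimal fgets() loop might look like this (the file name is a placeholder); printing memory_get_usage() after the loop tells you roughly what memory_limit you actually need:

<?php
// Sketch: read the file line by line so only one line is held in memory.
if (($handle = fopen('file.csv', 'r')) !== false) {
    while (($line = fgets($handle)) !== false) {
        // ... process $line here ...
    }
    fclose($handle);
}
echo 'Memory in use: ' . memory_get_usage() . " bytes\n";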



Answer 4:

There seems to be an enormous difference between fgetcsv() and fgets() when it comes to memory consumption. A simple CSV with only one column exceeded my 512M memory limit after just 50,000 records with fgetcsv(), and it took 8 minutes to report that.

With fgets() it took only 3 minutes to successfully process 649,175 records, and my local server wasn't even gasping for additional air.

So my advice is to use fgets() if the number of columns in your CSV is limited. In my case fgets() returned the string of column 1 directly. For more than one column, you might use explode() into a disposable array that you unset() after each record operation, as in the sketch below. Thumbs up for answer 3, @ndkauboy.
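A sketch of that pattern, assuming the delimiter is a plain comma and the file name is a placeholder:

<?php
// Sketch: fgets() plus explode() into a throwaway array, freed after every record.
// Unlike fgetcsv(), this does not handle quoted fields that contain the delimiter.
if (($handle = fopen('file.csv', 'r')) !== false) {
    while (($line = fgets($handle)) !== false) {
        $fields = explode(',', rtrim($line, "\r\n"));
        // ... work with $fields[0], $fields[1], ... here ...
        unset($fields); // drop the array before the next record
    }
    fclose($handle);
}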



Answer 5:

Oh. Just run this script from the CLI, not via the silly web interface. Then no execution time limit will affect it.
And do not keep the parsed results around forever; write them out immediately. That way you won't be affected by the memory limit either.
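A sketch of that workflow: a small importer run as php import.php big.csv out.tsv that writes each parsed row out as soon as it is read (file names and the tab-separated output are placeholders):

<?php
// import.php - run as:  php import.php big.csv out.tsv
// CLI scripts use max_execution_time = 0 by default, so no timeout applies.
$src = fopen($argv[1], 'r');
$dst = fopen($argv[2], 'w');

while (($row = fgetcsv($src)) !== false) {
    // write (or INSERT) the parsed row immediately instead of collecting it in an array
    fwrite($dst, implode("\t", $row) . "\n");
}

fclose($src);
fclose($dst);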