Best practices for importing large CSV files

Posted 2019-01-23 02:43

Question:

My company gets a set of CSV files full of bank account info each month that I need to import into a database. Some of these files can be pretty big. For example, one is about 33MB and about 65,000 lines.

Right now I have a symfony/Doctrine app (PHP) that reads these CSV files and imports them into a database. My database has about 35 different tables, and in the process of importing, I take these rows, split them up into their constituent objects and insert them into the database. It all works beautifully, except it's slow (each row takes about a quarter second) and it uses a lot of memory.

The memory use is so bad that I have to split up my CSV files. A 20,000-line file barely makes it in. By the time it's near the end, I'm at like 95% memory usage. Importing that 65,000 line file is simply not possible.

I've found symfony to be an exceptional framework for building applications and I normally wouldn't consider using anything else, but in this case I'm willing to throw all my preconceptions out the window in the name of performance. I'm not committed to any specific language, DBMS, or anything.

Stack Overflow doesn't like subjective questions, so I'm going to try to make this as un-subjective as possible: for those of you who have not just an opinion but experience importing large CSV files, what tools/practices have you used in the past that have been successful?

For example, do you just use Django's ORM/OOP and you haven't had any problems? Or do you read the entire CSV file into memory and prepare a few humongous INSERT statements?

Again, I want not just an opinion, but something that's actually worked for you in the past.

Edit: I'm not just importing an 85-column CSV spreadsheet into one 85-column database table. I'm normalizing the data and putting it into dozens of different tables. For this reason, I can't just use LOAD DATA INFILE (I'm using MySQL) or any other DBMS's feature that just reads in CSV files.

Also, I can't use any Microsoft-specific solutions.

Answer 1:

I had this exact same problem about 2 weeks ago. I wrote some .NET to do ROW BY ROW inserts, and by my calculations with the amount of data I had, it would take around a week to do it this way.

So instead I used a string builder to create one HUGE query and sent it to my relational system all at once. It went from taking a week to taking 5 minutes. Now I don't know what relational system you are using, but with enormous queries you'll probably have to tweak your max_allowed_packet param or similar.
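
That answer used .NET, but the same idea translates directly to PHP, which the question uses. Here is a minimal sketch, assuming an invented transactions table, made-up column names, and a local PDO connection:

<?php
// Build one multi-row INSERT instead of thousands of single-row statements.
// Table and column names here are invented for illustration.
$pdo = new PDO('mysql:host=localhost;dbname=bank', 'user', 'pass');

$values = [];
if (($fp = fopen('transactions.csv', 'r')) !== false) {
    while (($row = fgetcsv($fp)) !== false) {
        // Quote every field; real code would validate/normalize first.
        $values[] = '(' . implode(',', array_map([$pdo, 'quote'], $row)) . ')';
    }
    fclose($fp);
}

// One huge statement; the server may need a larger max_allowed_packet.
$pdo->exec('INSERT INTO transactions (account, posted_at, amount) VALUES '
    . implode(',', $values));

As the answer says, a statement that big will likely require raising max_allowed_packet (or splitting the values into a handful of large chunks).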



Answer 2:

Forgive me if I'm not exactly understanding your issue correctly, but it seems like you're just trying to get a large amount of CSV data into a SQL database. Is there any reason why you want to use a web app or other code to process the CSV data into INSERT statements? I've had success importing large amounts of CSV data into SQL Server Express (free version) using SQL Server Management Studio and using BULK INSERT statements. A simple bulk insert would look like this:

BULK INSERT [Company].[Transactions]
    FROM "C:\Bank Files\TransactionLog.csv"
    WITH
    (
        FIELDTERMINATOR = '|',
        ROWTERMINATOR = '\n',
        MAXERRORS = 0,
        DATAFILETYPE = 'widechar',
        KEEPIDENTITY
    )
GO


Answer 3:

First: 33MB is not big. MySQL can easily handle data of this size.

As you noticed, row-by-row insertion is slow. Using an ORM on top of that is even slower: there's overhead for building objects, serialization, and so on. Using an ORM to do this across 35 tables is even slower. Don't do this.

You can indeed use LOAD DATA INFILE; just write a script that transforms your data into the desired format, separating it into per-table files in the process. You can then LOAD each file into the proper table. This script can be written in any language.
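
A rough sketch of such a transform script in PHP, assuming an invented source layout (three account columns followed by three transaction columns) and made-up output file names:

<?php
// Split one wide CSV into per-table CSVs that LOAD DATA INFILE can consume.
// The column layout (3 account columns, then 3 transaction columns) and the
// output file names are assumptions for illustration only.
$in       = fopen('bank_export.csv', 'r');
$accounts = fopen('accounts.csv', 'w');
$txns     = fopen('transactions.csv', 'w');

fgetcsv($in); // skip the header row

while (($row = fgetcsv($in)) !== false) {
    fputcsv($accounts, array_slice($row, 0, 3));
    fputcsv($txns, array_slice($row, 3, 3));
}

fclose($in);
fclose($accounts);
fclose($txns);

Each output file then gets its own LOAD DATA INFILE ... INTO TABLE statement. Real normalization would also have to deduplicate accounts and wire up foreign keys, which this sketch skips.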

Aside from that, bulk INSERT (column, ...) VALUES ... also works. Don't guess what your row batch size should be; time it empirically, as the optimal batch size will depend on your particular database setup (server configuration, column types, indices, etc.).
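
For example, a quick-and-dirty PHP benchmark for candidate batch sizes, assuming a throwaway import_test table with placeholder columns and a PDO connection:

<?php
// Quick-and-dirty benchmark: time a few candidate batch sizes.
// The import_test table and the dummy rows are placeholders.
$pdo  = new PDO('mysql:host=localhost;dbname=bank', 'user', 'pass');
$rows = array_fill(0, 20000, "('sample', 100.00)");

foreach ([100, 500, 1000, 5000] as $batchSize) {
    $pdo->exec('TRUNCATE TABLE import_test');
    $start = microtime(true);
    foreach (array_chunk($rows, $batchSize) as $chunk) {
        $pdo->exec('INSERT INTO import_test (label, amount) VALUES ' . implode(',', $chunk));
    }
    printf("batch size %5d: %.2fs\n", $batchSize, microtime(true) - $start);
}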

Bulk INSERT is not going to be as fast as LOAD DATA INFILE, and you'll still have to write a script to transform raw data into usable INSERT queries. For this reason, I'd probably do LOAD DATA INFILE if at all possible.



Answer 4:

FWIW the following steps caused a huge speedup of my LOAD DATA INFILE:

SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;
SET SESSION tx_isolation='READ-UNCOMMITTED';
SET sql_log_bin = 0;
#LOAD DATA LOCAL INFILE....
SET UNIQUE_CHECKS = 1;
SET FOREIGN_KEY_CHECKS = 1;
SET SESSION tx_isolation='REPEATABLE-READ';




Answer 5:

You can use MySQL's LOAD DATA INFILE statement; it allows you to read data from a text file and import it into a database table very quickly.

LOAD DATA INFILE '/opt/lampp/htdocs/sample.csv'
INTO TABLE discounts
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(title, @expired_date, discount)
SET expired_date = STR_TO_DATE(@expired_date, '%m/%d/%Y');

For more info, see http://dev.mysql.com/doc/refman/5.5/en/load-data.html and http://www.mysqltutorial.org/import-csv-file-mysql-table/



Answer 6:

If you are using SQL Server and have access to .NET, then you can write a quick application to use the SqlBulkCopy class. I've used this in previous projects to get a lot of data into SQL very quickly. The SqlBulkCopy class makes use of SQL Server's BCP, so if you're using something other than .NET, it may be worth looking into whether that option is open to you too. Not sure if you're using a DB other than SQL Server.



Answer 7:

I don't like some of the other answers :)

I used to do this at a job.

You write a program to create a big SQL script full of INSERT statements, one per line. Then you run the script. You can save the script for future reference (cheap log). Use gzip and it will shrink the size by about 90%.

You don't need any fancy tools and it really doesn't matter what database you are using.

You can do a few hundred Inserts per transaction or all of them in one transaction, it's up to you.

Python is a good language for this, but I'm sure php is fine too.

If you have performance problems some databases like Oracle have a special bulk loading program which is faster than INSERT statements.

You shouldn't run out of memory, because you should only be parsing one line at a time. You have no need to hold the whole thing in memory; don't do that!
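
A minimal PHP sketch of this workflow, with placeholder file and table names, that parses one CSV line at a time and writes a gzip-compressed script of INSERT statements:

<?php
// Stream the CSV one line at a time and write a gzipped script of INSERTs.
// File names and the target table are placeholders.
$csv = fopen('transactions.csv', 'r');
$sql = gzopen('import.sql.gz', 'w9');

gzwrite($sql, "START TRANSACTION;\n");

while (($row = fgetcsv($csv)) !== false) {
    // addslashes is only a rough stand-in for proper SQL quoting
    $values = implode(',', array_map(function ($field) {
        return "'" . addslashes($field) . "'";
    }, $row));
    gzwrite($sql, "INSERT INTO transactions VALUES ($values);\n");
}

gzwrite($sql, "COMMIT;\n");
fclose($csv);
gzclose($sql);

You can then replay it with something like gunzip -c import.sql.gz | mysql yourdb, and keep the compressed file around as the cheap log mentioned above.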



Answer 8:

I am reading a CSV file which has close to 1M records and 65 columns. For every 1,000 records processed in PHP, there is one big fat MySQL statement that goes into the database. The writing takes no time at all; it's the parsing that does. The memory used to process this uncompressed 600MB file is about 12 MB.



Answer 9:

I need to do this too from time to time (import large non-standardized CSVs where each row creates a dozen or so related DB objects), so I wrote a Python script where I can specify what goes where and how it's all related. The script then simply generates INSERT statements.

Here it is: csv2db

Disclaimer: I'm basically a noob when it comes to databases, so there might be better ways to accomplish this.



Answer 10:

You can use a generator for memory-efficient file reading. The small snippet below might help you.

#Method
public function getFileRecords($params)
{
    $fp = fopen('../' . $params['file'] . '.csv', 'r');
    //$header = fgetcsv($fp, 1000, ','); // skip header

    while (($line = fgetcsv($fp, 1000, ',')) !== FALSE) {
        // strip MySQL-style \N null markers from each field
        $line = array_map(function($str) {
            return str_replace('\N', '', $str);
        }, $line);

        yield $line;
    }

    fclose($fp);

    return;
}

#Implementation
foreach ($yourModel->getFileRecords($params) as $row) {
    // $row is a numerically indexed array of CSV fields
    $yourModel->save($row);
}


Tags: csv import