I need to insert a huge CSV file into 2 tables with a 1:n relationship within a MySQL database.
The CSV file comes weekly, is about 1 GB, and needs to be appended to the existing data. Each of the 2 tables has an auto-increment primary key.
I've tried:
- Entity Framework (takes most time of all approaches)
- Datasets (same)
- Bulk Upload (doesn't support multiple tables)
- MySqlCommand with Parameters (needs to be nested, my current approach)
- MySqlCommand with StoredProcedure including a Transaction
Any further suggestions?
Simplified, let's say this is my data structure:
public class User
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public List<string> Codes { get; set; }
}
I need to insert the data from the CSV into this database:
User (1-n) Code
+---+-----+-----+ +---+---+-----+
|PID|FName|LName| |CID|PID|Code |
+---+-----+-----+ +---+---+-----+
| 1 |Jon | Foo | | 1 | 1 | ed3 |
| 2 |Max | Foo | | 2 | 1 | wst |
| 3 |Paul | Foo | | 3 | 2 | xsd |
+---+-----+-----+ +---+---+-----+
Here is a sample line of the CSV file:
Jon;Foo;ed3,wst
A bulk load like LOAD DATA LOCAL INFILE is not possible because I have restricted write permissions.
AFAIK, insertions into a single table are sequential, while insertions into different tables can be done in parallel. Open two separate new connections to the same database and then insert in parallel, for example by using the Task Parallel Library.
However, if there are integrity constraints because of the 1:n relationship between the tables, then write your InsertInto method such that the insert into the second table happens only after you are done inserting the data into the first one.

Edit: Since you have requested it, here is a code template you can use for the case where the inserts can be performed in parallel.
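A minimal sketch of such a template, assuming the MySql.Data provider, the table/column names from the simplified example, and that the code rows already carry their parent PIDs (otherwise the FK ordering caveat above applies); the connection string is a placeholder:

// Sketch: one task and one MySqlConnection per table, so the inserts into the
// two tables run in parallel.
using System.Collections.Generic;
using System.Threading.Tasks;
using MySql.Data.MySqlClient;

static class ParallelInsertSketch
{
    const string Cs = "server=localhost;database=mydb;uid=me;pwd=secret"; // assumption

    public static void InsertInParallel(List<string[]> userRows, List<string[]> codeRows)
    {
        Parallel.Invoke(
            () => RunInserts("INSERT INTO user (FName, LName) VALUES (@p0, @p1)", userRows),
            () => RunInserts("INSERT INTO code (PID, Code) VALUES (@p0, @p1)", codeRows));
    }

    static void RunInserts(string sql, List<string[]> rows)
    {
        using (var conn = new MySqlConnection(Cs))
        {
            conn.Open();
            foreach (var row in rows)
            {
                using (var cmd = new MySqlCommand(sql, conn))
                {
                    for (int i = 0; i < row.Length; i++)
                        cmd.Parameters.AddWithValue("@p" + i, row[i]);
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}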
Given the great size of the data, the best approach (performance-wise) is to leave as much of the data processing as possible to the database, not the application.
Create a temporary table in which the data from the .csv file will be temporarily saved.

Loading the data from the .csv into this table is pretty straightforward. I would suggest the use of MySqlCommand (which is also your current approach). Also, using the same MySqlConnection object for all INSERT statements will reduce the total execution time.
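A rough sketch of this loading step, assuming a staging table imported(FirstName, LastName, Codes) like the one in the SQL sketch further down, and the semicolon-separated layout from the question:

// Sketch: read the CSV line by line and write it raw into the staging table,
// reusing a single MySqlConnection and a single parameterized MySqlCommand.
// File path, connection string and staging-table name are assumptions.
using System.IO;
using MySql.Data.MySqlClient;

static class CsvToStagingSketch
{
    public static void Load(string csvPath, string connectionString)
    {
        using (var conn = new MySqlConnection(connectionString))
        {
            conn.Open();
            using (var cmd = new MySqlCommand(
                "INSERT INTO imported (FirstName, LastName, Codes) VALUES (@fn, @ln, @codes)", conn))
            {
                cmd.Parameters.Add("@fn", MySqlDbType.VarChar);
                cmd.Parameters.Add("@ln", MySqlDbType.VarChar);
                cmd.Parameters.Add("@codes", MySqlDbType.VarChar);

                foreach (var line in File.ReadLines(csvPath))
                {
                    var parts = line.Split(';');            // e.g. "Jon;Foo;ed3,wst"
                    cmd.Parameters["@fn"].Value = parts[0];
                    cmd.Parameters["@ln"].Value = parts[1];
                    cmd.Parameters["@codes"].Value = parts.Length > 2 ? parts[2] : "";
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}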
Then, to further process the data, you can create a stored procedure that will handle it.

Assuming these two tables, user and code (taken from your simplified example), you can have a stored procedure along the lines of the sketch below.
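A sketch of what that could look like; the column sizes, the staging-table name imported and the procedure name are assumptions:

-- Staging table holding the raw CSV rows (assumption).
CREATE TABLE imported (
    FirstName VARCHAR(50),
    LastName  VARCHAR(50),
    Codes     VARCHAR(1000)
);

-- The two target tables from the simplified example.
CREATE TABLE user (
    PID   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    FName VARCHAR(50),
    LName VARCHAR(50)
);

CREATE TABLE code (
    CID  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    PID  INT,
    Code VARCHAR(50)
);

-- Stored procedure that walks the staging table and fills both target tables.
DELIMITER $$
CREATE PROCEDURE import_csv_data()
BEGIN
    DECLARE done INT DEFAULT 0;
    DECLARE v_first VARCHAR(50);
    DECLARE v_last  VARCHAR(50);
    DECLARE v_codes VARCHAR(1000);
    DECLARE v_code  VARCHAR(50);
    DECLARE v_pid   INT;
    DECLARE cur CURSOR FOR SELECT FirstName, LastName, Codes FROM imported;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

    OPEN cur;
    read_loop: LOOP
        FETCH cur INTO v_first, v_last, v_codes;
        IF done THEN
            LEAVE read_loop;
        END IF;

        -- One INSERT per row for the user table ...
        INSERT INTO user (FName, LName) VALUES (v_first, v_last);
        SET v_pid = LAST_INSERT_ID();

        -- ... then split the comma-separated codes and insert each one.
        WHILE LENGTH(v_codes) > 0 DO
            SET v_code = SUBSTRING_INDEX(v_codes, ',', 1);
            INSERT INTO code (PID, Code) VALUES (v_pid, v_code);
            IF LOCATE(',', v_codes) > 0 THEN
                SET v_codes = SUBSTRING(v_codes, LOCATE(',', v_codes) + 1);
            ELSE
                SET v_codes = '';
            END IF;
        END WHILE;
    END LOOP;
    CLOSE cur;
END$$
DELIMITER ;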
For every row in the source data, it makes an INSERT statement for the user table. Then there is a WHILE loop to split the comma-separated codes and make an INSERT statement for each one into the code table.

Regarding the use of LAST_INSERT_ID(), it is reliable on a PER CONNECTION basis (see doc here). If the MySQL connection used to run this stored procedure is not used by other transactions, the use of LAST_INSERT_ID() is safe.

Edit: Here is the OP's variant that omits the temp table imported. Instead of inserting the data from the .csv into the imported table, you call the SP directly to store the values in your database; a sketch of that variant follows below.

Note: The code to split the codes is taken from here (MySQL does not provide a split function for strings).
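A sketch of such a parameterized variant (the procedure and parameter names are assumptions); it is called once per CSV line, so the staging table is no longer needed:

DELIMITER $$
CREATE PROCEDURE import_csv_line(
    IN p_first VARCHAR(50),
    IN p_last  VARCHAR(50),
    IN p_codes VARCHAR(1000))
BEGIN
    DECLARE v_code VARCHAR(50);
    DECLARE v_pid  INT;

    INSERT INTO user (FName, LName) VALUES (p_first, p_last);
    SET v_pid = LAST_INSERT_ID();

    -- Same comma-splitting loop as above, working directly on the parameter.
    WHILE LENGTH(p_codes) > 0 DO
        SET v_code = SUBSTRING_INDEX(p_codes, ',', 1);
        INSERT INTO code (PID, Code) VALUES (v_pid, v_code);
        IF LOCATE(',', p_codes) > 0 THEN
            SET p_codes = SUBSTRING(p_codes, LOCATE(',', p_codes) + 1);
        ELSE
            SET p_codes = '';
        END IF;
    END WHILE;
END$$
DELIMITER ;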
I developed my WPF application using Entity Framework with a SQL Server database, and needed to read data from an Excel file and insert it into 2 tables that have a relationship between them. For roughly 15,000 rows in Excel it used to take around 4 hours. Then I switched to a block of 500 rows per insert, which sped it up unbelievably: it now takes a mere 3-5 seconds to import the same data.
So I would suggest you add your rows to the context 100/200/500 at a time and then call the SaveChanges method (if you really want to be using EF). There are other helpful tips as well to speed up EF performance. Please read this for your reference.
The CreateNewSheet method does the batched insert; CreateNewSpool is the same method except for the field names and table name, because it updates a child table. The idea is the same in both, roughly as in the sketch below.
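A rough sketch of that batching pattern (this is not the original CreateNewSheet code; the entity and context types are assumptions):

// Sketch: add entities to the context in blocks and save per block.
using System.Collections.Generic;
using System.Data.Entity;   // EF6

public class UserEntity
{
    public int Id { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public virtual ICollection<CodeEntity> Codes { get; set; }
}

public class CodeEntity
{
    public int Id { get; set; }
    public string Value { get; set; }
}

public class ImportContext : DbContext
{
    public DbSet<UserEntity> Users { get; set; }
}

public static class BatchedImport
{
    public static void Run(IEnumerable<UserEntity> users, int batchSize = 500)
    {
        var context = new ImportContext();
        context.Configuration.AutoDetectChangesEnabled = false;   // common EF speed-up
        int pending = 0;

        foreach (var user in users)
        {
            context.Users.Add(user);            // child Codes go in via the navigation property
            if (++pending % batchSize == 0)
            {
                context.SaveChanges();           // flush this block
                context.Dispose();               // recreate the context so the change tracker stays small
                context = new ImportContext();
                context.Configuration.AutoDetectChangesEnabled = false;
            }
        }

        context.SaveChanges();                   // save the last partial block
        context.Dispose();
    }
}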
1 - Add a column VirtualId to the User table & class.

2 - Assign numbers in a loop to the VirtualId field of each User object (use negative numbers starting at -1 to avoid collisions in the last step). For each Code c object belonging to a User u object, set c.UserId = u.VirtualId (see the sketch after this list).

3 - Bulk load the Users into the User table and the Codes into the Code table.

4 - UPDATE CODE C, USER U SET C.UserId = U.Id WHERE C.UserId = U.VirtualId

NOTE: If you have a FK constraint on Code.UserId, you can drop it and re-add it after the insert.
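A minimal sketch of step 2 (the class shapes here are illustrative assumptions); step 4 is then just the single UPDATE statement above:

// Sketch: assign negative VirtualIds to users and copy them onto their codes
// before the two bulk loads.
using System.Collections.Generic;

class UserRow { public int VirtualId; public string FirstName; public string LastName; public List<string> Codes = new List<string>(); }
class CodeRow { public int UserId;    public string Code; }

static class VirtualIdSketch
{
    public static List<CodeRow> AssignVirtualIds(List<UserRow> users)
    {
        var codeRows = new List<CodeRow>();
        int nextVirtualId = -1;                      // negative ids avoid collisions with real keys

        foreach (var u in users)
        {
            u.VirtualId = nextVirtualId--;           // -1, -2, -3, ...
            foreach (var c in u.Codes)
                codeRows.Add(new CodeRow { UserId = u.VirtualId, Code = c });
        }
        return codeRows;                             // bulk load users and codeRows, then run the UPDATE above
    }
}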
Can you break the CSV into two files?

E.g. suppose each line of your file holds the columns for table A followed by the columns for table B, so one set of A values might have multiple B entries. After you break it apart, you get one file per table, and then you bulk insert them separately.

Edit: Pseudo code. Based on the conversation, something like the sketch below:
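A sketch of that splitting step, assuming the FirstName;LastName;code1,code2 layout from the question, hypothetical output file names, and that you are allowed to supply the PID values yourself in the bulk load:

// Sketch: split the weekly CSV into two flat files, one per table, joined by a
// PID generated here.
using System.IO;

static class CsvSplitSketch
{
    public static void Split(string sourcePath)
    {
        using (var users = new StreamWriter("users.csv"))    // PID;FName;LName
        using (var codes = new StreamWriter("codes.csv"))    // PID;Code
        {
            int pid = 0;
            foreach (var line in File.ReadLines(sourcePath))
            {
                var parts = line.Split(';');                  // e.g. "Jon;Foo;ed3,wst"
                pid++;

                users.WriteLine(pid + ";" + parts[0] + ";" + parts[1]);
                foreach (var code in parts[2].Split(','))
                    codes.WriteLine(pid + ";" + code);
            }
        }
    }
}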
When you say "efficiently" are you talking memory, or time?
In terms of improving the speed of the inserts, if you can do multiple value blocks per INSERT statement, you can get a 500% improvement in speed. I did some benchmarks on this over in this question: Which is faster: multiple single INSERTs or one multiple-row INSERT?
My approach is described in the answer, but simply put, reading in up to say 50 "rows" (to be inserted) at once and bundling them into a single INSERT INTO ... (...) VALUES (...), (...), ..., (...) type statement seems to really speed things up. At least, if you're restricted from bulk loading.

Another approach, btw, if you have live data you can't drop indexes on during the upload, is to create a memory table on the MySQL server without indexes, dump the data there, and then do an INSERT INTO live SELECT * FROM mem. Though that uses more memory on the server, hence the question at the start of this answer about "what do you mean by 'efficiently'?" :)

Oh, and there's probably nothing wrong with iterating through the file and doing all the first-table inserts first, and then doing the second-table ones. Unless the data is being used live, I guess. In that case you could definitely still use the bundled approach, but the application logic to do that is a lot more complex.
UPDATE: OP requested example C# code for multivalue insert blocks.
Note: this code assumes you have a number of structures already configured: MySqlDbTypes for each table, in the same order as the field names. Oh yeah, and localcommand is a MySqlCommand created by using CreateCommand() on the local MySqlConnection object.
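A minimal sketch of the multi-value block idea, using the structures described above (all names are assumptions, not the original snippet):

// Sketch: bundle up to ~50 rows into one multi-value INSERT statement.
// tableName, fieldNames and dbTypes are assumed to be configured elsewhere, and
// localcommand is a MySqlCommand from CreateCommand() on the local connection.
using System.Collections.Generic;
using System.Text;
using MySql.Data.MySqlClient;

static class MultiValueInsertSketch
{
    public static void InsertBlock(
        MySqlCommand localcommand,
        string tableName,
        string[] fieldNames,
        MySqlDbType[] dbTypes,
        List<object[]> rows)
    {
        if (rows.Count == 0)
            return;

        var sql = new StringBuilder();
        sql.Append("INSERT INTO ").Append(tableName)
           .Append(" (").Append(string.Join(", ", fieldNames)).Append(") VALUES ");

        localcommand.Parameters.Clear();

        for (int r = 0; r < rows.Count; r++)
        {
            sql.Append(r == 0 ? "(" : ", (");
            for (int f = 0; f < fieldNames.Length; f++)
            {
                string paramName = "@" + fieldNames[f] + "_" + r;   // one parameter per value
                sql.Append(f == 0 ? paramName : ", " + paramName);
                localcommand.Parameters.Add(paramName, dbTypes[f]).Value = rows[r][f];
            }
            sql.Append(")");
        }

        localcommand.CommandText = sql.ToString();
        localcommand.ExecuteNonQuery();
    }
}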
Further note: I wrote this quite a while back when I was kind of starting. If this causes your eyes or brain to bleed, I apologise in advance :)