Fastest import of a CSV to a database table

Posted 2019-05-25 06:47

Question:

I have implemented an import feature that takes data from a CSV file in an ASP.NET application. The file size can vary from a few KB up to a maximum of 10 MB.

However, when an import runs on a file with more than about 50,000 records, it takes around 20 minutes, which is far too long. I need to import around 300,000 records within 2-3 minutes.

I know that importing into a database also depends on the physical memory of the database server. Currently I create insert scripts in bulk and execute them. I also know that SqlBulkCopy would be another option, but in my case it is not just product inserts that take place; there are also updates and deletes. A field called "FUNCTION CODE" decides whether to insert, update or delete each record.

Any suggestions as to how to go about this would be greatly appreciated.

One approach would be to use multiple threads that do the work simultaneously, but I have never implemented threading, so I am not aware of the complications it would introduce.

Thanks & Regards, Francis P.

Answer 1:

SqlBulkCopy is definitely going to be the fastest. I would approach this by inserting the data into a temporary table in the database. Once the data is in the temp table, you can use SQL to merge/insert/delete accordingly.
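A rough sketch of that flow, assuming a dbo.Product target table, a FunctionCode column carrying the 'I'/'U'/'D' values from the question, and columns invented for illustration; it is not a definitive implementation:

using System.Data;
using System.Data.SqlClient;

public static class ProductImporter
{
    // Bulk-loads the parsed CSV rows into a temp table on one connection,
    // then lets SQL Server apply the inserts/updates/deletes set-based.
    public static void Import(DataTable csvRows, string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // The temp table must be created on the same connection SqlBulkCopy will use.
            using (var create = new SqlCommand(
                "CREATE TABLE #ProductStaging (ProductId INT, Name VARCHAR(100), FunctionCode CHAR(1))", conn))
            {
                create.ExecuteNonQuery();
            }

            using (var bulkCopy = new SqlBulkCopy(conn))
            {
                // Assumes the DataTable columns line up with the staging table.
                bulkCopy.DestinationTableName = "#ProductStaging";
                bulkCopy.WriteToServer(csvRows);
            }

            // One set-based statement per action, driven by the FUNCTION CODE column.
            const string applySql = @"
                INSERT INTO dbo.Product (ProductId, Name)
                SELECT s.ProductId, s.Name FROM #ProductStaging s WHERE s.FunctionCode = 'I';

                UPDATE p SET p.Name = s.Name
                FROM dbo.Product p JOIN #ProductStaging s ON s.ProductId = p.ProductId
                WHERE s.FunctionCode = 'U';

                DELETE p
                FROM dbo.Product p JOIN #ProductStaging s ON s.ProductId = p.ProductId
                WHERE s.FunctionCode = 'D';";

            using (var apply = new SqlCommand(applySql, conn))
            {
                apply.ExecuteNonQuery();
            }
        }
    }
}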



Answer 2:

I guess you are using SQL Server...

If you are using SQL Server 2005/2008, consider using SSIS to process the file (see TechNet).

Importing a huge amount of data within the ASP.NET process is not the best thing you can do. You might instead upload the file and start a separate process that does the work for you.



Answer 3:

If this is a repeated process, the file is uploaded via ASP.NET, and you are making decisions on the data to determine insert/update/delete, then try the fast CSV reader at http://www.codeproject.com/KB/database/CsvReader.aspx. It is quite quick and economical with memory.
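A rough sketch of wiring that reader up, assuming it is the LumenWorks CsvReader from the linked article (which implements IDataReader, so it can stream straight into SqlBulkCopy); the path and staging table name are made up:

using System.IO;
using System.Data.SqlClient;
using LumenWorks.Framework.IO.Csv;   // the CodeProject "fast CSV reader"

public static class CsvStagingLoader
{
    // Streams the uploaded CSV straight into a staging table without
    // materialising every row in memory first.
    public static void Load(string csvPath, string connectionString)
    {
        using (var reader = new CsvReader(new StreamReader(csvPath), true)) // true = file has a header row
        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "dbo.ProductStaging"; // hypothetical staging table
            bulkCopy.BatchSize = 5000;
            bulkCopy.BulkCopyTimeout = 300;  // seconds
            bulkCopy.WriteToServer(reader);  // CsvReader implements IDataReader
        }
    }
}

The insert/update/delete decision on the FUNCTION CODE field can then be made in SQL once the rows are staged, as in the first answer.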



Answer 4:

You are doing all your database queries sequentially over one connection. So for every insert/update/delete you send the command over the wire, wait for the database to do its thing, and then wake up again when something comes back.

Databases are optimized for heavy parallel access. So there are 2 easy routes for a significant speedup:

Open X connections to the database (you will have to tune X, but start with 5) and either spin up 5 threads that each do a chunk of the work you were doing, or use asynchronous calls and, when a callback arrives, fire off the next query.
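A minimal sketch of the threaded variant, assuming the per-record SQL statements have already been generated (as the question describes) and pre-split into chunks; the chunk shape and degree of parallelism are illustrative:

using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

public static class ParallelWriter
{
    // Runs each pre-built chunk of SQL statements on its own connection,
    // keeping a handful of connections busy at the same time.
    public static void Execute(IEnumerable<List<string>> chunks, string connectionString)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = 5 }; // the "X" above; tune it

        Parallel.ForEach(chunks, options, chunk =>
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                foreach (var sql in chunk)
                {
                    using (var cmd = new SqlCommand(sql, conn))
                    {
                        cmd.ExecuteNonQuery();
                    }
                }
            }
        });
    }
}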



Answer 5:

I suggest using the XML functionality in SQL Server 2005/2008, which will allow you to bulk insert and bulk update. I'd take the following approach:

  • Process the entire file into an in-memory data structure.
  • Create a single XML document from this structure to pass to a stored proc (a client-side sketch of these first two steps follows this list).
  • Create a stored proc to load data from the XML document into a temporary table, then perform the inserts and updates. See below for guidance on creating the stored proc.
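A rough client-side sketch of the first two steps; the Field1/Field2 names match the sample proc below, the attribute-centric layout matches the FOR XML AUTO example at the end of this answer, and MyBulkUpdater is the proc defined further down:

using System.Data;
using System.Data.SqlClient;
using System.Linq;
using System.Xml.Linq;

public static class XmlBulkCaller
{
    // Builds the ROOT/MyRealTable XML from the in-memory rows and hands it to the proc.
    public static void Send(DataTable rows, string connectionString)
    {
        var xml = new XElement("ROOT",
            rows.AsEnumerable().Select((row, i) =>
                new XElement("MyRealTable",
                    new XAttribute("Field1", row["Field1"]),
                    new XAttribute("Field2", row["Field2"]),
                    new XAttribute("__ORDERBY", i))));

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("MyBulkUpdater", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.Add("@p_XmlData", SqlDbType.VarChar, -1).Value = xml.ToString();
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}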

There are numerous advantages to this approach:

  • The entire operation is completed in one database call, although if your dataset is really large you may want to batch it.
  • You can wrap all the database writes into a single transaction easily, and roll back if anything fails.
  • You are not using any dynamic SQL, which could have posed a security risk.
  • You can return the IDs of the inserted, updated and/or deleted records using the OUTPUT clause.

In terms of the stored proc you will need, something like the following should work:

CREATE PROCEDURE MyBulkUpdater
(
   @p_XmlData VARCHAR(MAX)
)

AS
DECLARE @hDoc INT
EXEC sp_xml_preparedocument @hDoc OUTPUT, @p_XmlData

-- Temporary table, should contain the same schema as the table you want to update
CREATE TABLE #MyTempTable
(
   -- ...
)

INSERT INTO #MyTempTable
(
        [Field1],
        [Field2]
)
SELECT
        XMLData.Field1,
        XMLData.Field2
FROM OPENXML (@hDoc, 'ROOT/MyRealTable', 1)
WITH
(
        [Field1] int,
        [Field2] varchar(50),
        [__ORDERBY] int
) AS XMLData

EXEC sp_xml_removedocument @hDoc

Now you can simply insert into, update and delete from your real table using your temporary table as required, e.g.

INSERT INTO MyRealTable (Field1, Field2)
SELECT Field1, Field2
FROM #MyTempTable
WHERE ...

UPDATE rt
SET rt.Field2 = tt.Field2
FROM MyRealTable rt
JOIN #MyTempTable tt ON tt.Field1 = rt.Field1
WHERE ...

For an example of the XML you need to pass in, you can do:

SELECT TOP 1 *, 0 AS __ORDERBY FROM MyRealTable AS MyRealTable FOR XML AUTO, ROOT('ROOT') 

For more info, see OPENXML, sp_xml_preparedocument and sp_xml_removedocument.