I'm building a system for updating large amounts of data through various CSV feeds. Normally I would just loop though each row in the feed, do a select query to check if the item already exists and insert/update an item depending if it exists or not.
I feel this method isn't very scalable and could hammer the server on larger feeds. My solution is to loop through the items as normal but store them in memory. Then for every 100 or so items do a select on those 100 items and get a list of existing items in the database that match. Then concatenate the insert/update statements together and run them into the database. This would essentially cut down on the trips to the database.
Is this a scalable enough solution and are there any example tutorials on importing large feeds into a productive environment?
Thanks
One way is load your CSV into a DataTable (or more likely a DataReader) and then batch slam in the results using SqlBulkCopy -
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx
Its pretty efficient and you can do some column mapping. Tip - when you map columns using SqlBulkCopy they are case sensitive.
Your way is the worst possible solution. In general, you should not think in terms of looping through records individually. We used to have a company built import tool that loops through records, it would take 18-20 hours to load a file with over a million records (something that wasn't a frequent occurrence when it was built but which is a many times a day occurrence now).
I see two options: First use bulk insert to load to a staging table and do whatever clean up you need to do on that table. How are you determining if the record already exists? You should be able to build a set-based update by joining to the staging table on those fields which determine the update. Often I have a added a column to my staging table for the id of the record it matches to and populated that through a query then done the update. Then you do an insert of the records which don't have a corresponding id. If you have too many records to do all at once, you may want to run in batches (which yes is a loop), but make the batches considerably larger than 1 record at a time (I usually start with 2000 and then based on the time it takes for that determine if I can do more or less in the batch).
I think 2008 also has a merge statement but I have not yet had a chance to use it. Look it up in books online.
The alternative is to use SSIS which is optimized for speed. SSIS is a complex thing though and the learning curve is steep.
Seeing that you're using SQL Server 2008, I would recommend this approach:
Check out the MSDN docs and a great blog post on how to use the MERGE command.
Basically, you create a link between your actual data table and the staging table on a common criteria (e.g. a common primary key), and then you can define what to do when
You would have a
MERGE
statement something like this:Of course, the
ON
clause can be much more involved if needed. And of course, yourWHEN
statements can also be more complex, e.g.and so forth.
MERGE
is a very powerful and very useful new command in SQL Server 2008 - use it, if you can!Another approach would be to write a .Net stored procedure on server on the server to operate on the entire file...
Only if you need more control than Kris Krause's solution though - I'm a big fan of keeping it simple (and reusable) where we can...
Do you need to be rolling your own here at all? Would it be possible to provide the data in such a way that the SQL Server can use Bulk Import to load it in and then deal with duplicates in the database once the import is complete?
When it comes to heavy lifting with a lot of data my experience tends to be that working in the database as much as possible is much quicker and less resource intensive.