First off let me say I am running on SQL Server 2005 so I don't have access to MERGE
.
I have a table with ~150k rows that I am updating daily from a text file. As rows fall out of the text file I need to delete them from the database and if they change or are new I need to update/insert accordingly.
After some testing I've found that performance wise it is exponentially faster to do a full delete and then bulk insert from the text file rather than read through the file line by line doing an update/insert. However I recently came across some posts discussing mimicking the MERGE
functionality of SQL Server 2008 using a temp table and the output of the UPDATE
statement.
I was interested in this because I am looking into how I can eliminate the time in my Delete/Bulk Insert method when the table has no rows. I still think that this method will be the fastest so I am looking for the best way to solve the empty table problem.
Thanks
I think your fastest method would be to:
- Drop all foreign keys and indexes
from your table.
- Truncate your
table.
- Bulk insert your data.
- Recreate your foreign keys and
indexes.
Is the problem that Joe's solution is not fast enough, or that you can not have any activity against the target table while your process runs? If you just need to prevent users from running queries against your target table, you should contain your process within a transaction block. This way, when your TRUNCATE TABLE executes, it will create a table lock that will be held for the duration of the transaction, like so:
begin tran;
truncate table stage_table
bulk insert stage_table
from N'C:\datafile.txt'
commit tran;
An alternative solution which would satsify your requirement for not having "down time" for the table you are updating.
It sounds like originally you were reading the file and doing an INSERT/UPDATE/DELETE 1 row at a time. A more performant approach than that, that does not involve clearing down the table is as follows:
1) bulk load the file into a new, separate table (no indexes)
2) then create the PK on it
3) Run 3 statements to update the original table from this new (temporary) table:
DELETE rows in the main table that don't exist in the new table
UPDATE rows in the main table where there is a matching row in the new table
INSERT rows into main table from the new table where they don't already exist
This will perform better than row-by-row operations and should hopefully satisfy your overall requirements
There is a way to update the table with zero downtime: keep two day's data in the table, and delete the old rows after loading the new ones!
- Add a DataDate column representing the date for which your ~150K rows are valid.
- Create a one-row, one-column table with "today's" DataDate.
- Create a view of the two tables that selects only rows matching the row in the DataDate table. Index it if you like. Readers will now refer to this view, not the table.
- Bulk insert the rows. (You'll obviously need to add the DataDate to each row.)
- Update the DataDate table. View updates Instantly!
- Delete yesterday's rows at your leisure.
SELECT
performance won't suffer; joining one row to 150,000 rows along the primary key should present no problem to any server less than 15 years old.
I have used this technique often, and have also struggled with processes that relied on sp_rename
. Production processes that modify the schema are a headache. Don't.
For raw speed, I think with ~150K rows in the table, I'd just drop the table, recreate it from scratch (without indexes) and then bulk load afresh. Once the bulk load has been done, then create the indexes.
This assumes of course that having a period of time when the table is empty/doesn't exist is acceptable which it does sound like could be the case.