I have an SP in SQL Server which runs hundreds of times a minute and needs to check incoming traffic against a database. At the moment it does the following:
INSERT INTO table
SELECT @value1,@value2 WHERE NOT EXISTS
(SELECT * FROM table WHERE value1 = @value1 AND value2 = @value2);
However, I could also go with
IF NOT EXISTS(SELECT * FROM table WHERE value1 = @value1 AND value2 = @value2)
INSERT INTO table (value1,value2) VALUES (@value1,@value2);
Which would be faster? I get the feeling there's not much difference between them, but I'm historically not very good at T-SQL... =/
UPDATE: Whoops... I meant to state that the EXISTS check uses more than one value to find whether a record exists, so a unique constraint on a single column won't work. I've edited the sample to reflect that...
Just do it, and ignore any error (this assumes a unique constraint on the value columns)...
Since this runs hundreds of times a minute, locking hints should be added to the SELECTs, and the whole thing wrapped in a transaction, to avoid a race condition. However, my proposed approach of just doing the INSERT and ignoring any duplicate-constraint error avoids the race condition as well.
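For reference, the lock-hints-plus-transaction version would look something like this (a sketch only; UPDLOCK and HOLDLOCK together hold a key-range lock, so the existence check and the insert behave atomically):

BEGIN TRANSACTION;
-- UPDLOCK avoids conversion deadlocks; HOLDLOCK (serializable) keeps the
-- key range locked until commit, so no concurrent insert can sneak in
IF NOT EXISTS (SELECT * FROM [table] WITH (UPDLOCK, HOLDLOCK)
               WHERE value1 = @value1 AND value2 = @value2)
    INSERT INTO [table] (value1, value2) VALUES (@value1, @value2);
COMMIT TRANSACTION;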
In a heavily concurrent environment, a concurrent INSERT can happen in between the IF NOT EXISTS check and the INSERT in your second query.
Your first query will place shared locks on the records it examines, which will not be lifted until the end of the query, so it will be impossible to insert a new record while the query is running.
However, you should not rely solely on this behavior. Place an additional UNIQUE constraint on the value columns. It will not only make the database more consistent, but will also create an index that makes the first query faster.
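For example, a two-column constraint along these lines (the constraint name is illustrative):

ALTER TABLE [table]
    ADD CONSTRAINT UQ_table_value1_value2 UNIQUE (value1, value2);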
If you want the values to be unique, why not just create a unique constraint on them, do an INSERT without the SELECT, and gracefully handle the constraint-violation error? That'd be faster than either of these approaches.
Also, your first approach doesn't work: by the time you get to the SELECT, you have already inserted the value, so the SELECT will obviously find what you just inserted.
If I had to guess, I would guess the second option would be faster. SQL Server would not have to do any kind of setup for the INSERT if the EXISTS check fails, whereas in the first one, it might look up some table and field names and prepare for an insert that never happens. However, I would try it in Query Analyzer and see what the plan says.
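One quick way to compare them without reading full plans is to look at the I/O and timing statistics for each variant (a sketch; run each statement with representative data):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run variant 1 here, then variant 2, and compare
-- the logical reads and CPU time reported for each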
Both variants are incorrect. You will insert duplicate (@value1, @value2) pairs, guaranteed.
The correct way to handle this is to enforce a unique constraint on two columns and to always INSERT and handle the constraint violation:
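For example (a sketch; the table definition and constraint name are illustrative):

CREATE TABLE [table] (
    value1 int NOT NULL,
    value2 int NOT NULL,
    CONSTRAINT UQ_value1_value2 UNIQUE (value1, value2));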
and to insert:
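Again a sketch; error 2627 is the duplicate-key error raised for a unique constraint, 2601 for a unique index:

BEGIN TRY
    INSERT INTO [table] (value1, value2) VALUES (@value1, @value2);
END TRY
BEGIN CATCH
    -- swallow only duplicate-key errors; re-raise anything else
    IF ERROR_NUMBER() NOT IN (2627, 2601)
    BEGIN
        DECLARE @msg nvarchar(2048);
        SELECT @msg = ERROR_MESSAGE();
        RAISERROR(@msg, 16, 1);
    END
END CATCH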
While these may seem complicated, one has to factor in a little detail called correctness. This is by far simpler when compared with a solution based on lock hints. It is also the most performant solution: it does only one seek, while all the other solutions need at least two (one to validate that the row can be inserted, one to insert it).
After adding a gazillion comments on this question and its answers, I will have my own go at answering it.
I would not expect any major difference in performance between the two solutions proposed in the original question. On one hand, as pointed out by Ray, the second approach might save you from doing some preparations for the insert; on the other hand, an RDBMS usually performs best with batch statements, as in the first solution.
KM and DVK suggest adding a UNIQUE constraint, which will make the uniqueness test implicit, but will require you to add some kind of error handling around your INSERT statement. I have a hard time spotting why this should add any additional performance, assuming that you already have an index covering the two columns. If you do not have such an index, add it, and reconsider your need for more performance.
Whether the uniqueness check is performed explicitly or implicitly should not matter, AFAIK. If anything is gained by having the check done "inside" the stomach of the DBMS, that gain might just be eaten up by the overhead associated with raising and handling errors when duplicates exist.
The bottom line: assuming an index is already in place, if you still find yourself lusting for performance, my recommendation is that you perform empirical tests on the three suggested solutions. Cook up a small program that simulates the expected input data, and blow each of the three solutions away with a few billion rows, including a plausible amount of duplicates. If you do this, be sure to post your results :-)
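Something along these lines could serve as a starting point (a sketch; dbo.TestTable, the row count, and the value ranges are all made up for illustration):

SET NOCOUNT ON;

DECLARE @i int, @rows int, @value1 int, @value2 int;
DECLARE @start datetime;
SELECT @i = 0, @rows = 100000, @start = GETDATE();  -- scale @rows up for a real test

WHILE @i < @rows
BEGIN
    -- a narrow value range forces a plausible amount of duplicates
    SET @value1 = ABS(CHECKSUM(NEWID())) % 1000;
    SET @value2 = ABS(CHECKSUM(NEWID())) % 1000;

    -- swap in each of the three solutions here in turn
    INSERT INTO dbo.TestTable (value1, value2)
    SELECT @value1, @value2
    WHERE NOT EXISTS (SELECT * FROM dbo.TestTable
                      WHERE value1 = @value1 AND value2 = @value2);

    SET @i = @i + 1;
END

SELECT DATEDIFF(ms, @start, GETDATE()) AS elapsed_ms;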