First of all, some background.
We have an order processing system, where staff enter billing data about orders in an app that stores it in a sql server 2000 database. This database isn't the real billing system: it's just a holding location so that the records can be run into a mainframe system via a nightly batch process.
This batch process is a canned third party package provided by an outside vendor. Part of what it's supposed to do is provide a report for any records that were rejected. The reject report is worked manually.
Unfortunately, it turns out the third party software doesn't catch all the errors. We have separate processes that pull back the data from the mainframe into another table in the database and load the rejected charges into yet another table.
An audit process then runs to make sure everything that was originally entered by the staff can be accounted for somewhere. This audit takes the form of an sql query we run, and it looks something like this:
SELECT *
FROM [StaffEntry] s with (nolock)
LEFT JOIN [MainFrame] m with (nolock)
ON m.ItemNumber = s.ItemNumber
AND m.Customer=s.Customer
AND m.CustomerPO = s.CustomerPO -- purchase order
AND m.CustPORev = s.CustPORev -- PO revision number
LEFT JOIN [Rejected] r with (nolock) ON r.OrderID = s.OrderID
WHERE s.EntryDate BETWEEN @StartDate AND @EndDate
AND r.OrderID IS NULL AND m.MainFrameOrderID IS NULL
That's heavily modified, of course, but I believe the important parts are represented. The problem is that this query is starting to take too long to run, and I'm trying to figure out how to speed it up.
I'm pretty sure the problem is the JOIN from the StaffEntry
table to the MainFrame
table. Since both hold data for every order since the beginning of time (2003 in this system), they tend to be a little large. The OrderID
and EntryDate
values used in the StaffEntry
table are not preserved when imported to the mainframe, which is why that join is a little more complicated. And finally, since I'm looking for records in the MainFrame
table that don't exist, after doing the JOIN we have that ugly IS NULL
in the where clause.
The StaffEntry
table is indexed by EntryDate (clustered) and separately on Customer/PO/rev. MainFrame
is indexed by customer and the mainframe charge number (clustered, this is needed for other systems) and separately by customer/PO/Rev. Rejected
is not indexed at all, but it's small and testing shows it's not the problem.
So, I'm wondering if there is another (hopefully faster) way I can express that relationship?
Before you even start looking at changing your query, you should ensure that all tables have a clustered index that makes sense for both this query and all other vital queries. Having clustered indexes on your tables i vital in sql server to ensure proper performance.
First off, you can get rid of the second LEFT JOIN.
Your WHERE was removing out any matches, anyhow... For instance, if S.OrderID was 1 and there was a R.OrderID with a value of 1, the IS NULL enforcement in the WHERE wouldn't allow it. So it'll only return records where s.OrderID IS NULL, if I'm reading it correctly...
Secondly, if you're dealing with a large amount of data, adding on a NOLOCK table hint typically won't hurt. Assuming you don't mind the possibility of a dirty-read here or there :-P Usually worth the risk, though.
Lastly, there was a part of your question which wasn't too clear for me...
Ok... But are you trying to limit it to just where those MainFrame table records don't exist? If so, you'll want that expressed in the WHERE as well, right? So something like this...
If that's what you were intending with the original statement, perhaps you can get rid of the s.OrderID IS NULL check?
In addition to what Kasperjj has suggested (which I do agree should be first), you might consider using temp tables to restrict the amount of data. Now, I know, I know that everyone says to stay away from temp tables. And i Usually do but sometimes, it is worth giving it a try because you can shrink the amount of data to join drastically with this method; this makes the overall query faster. (of course this does depend on how much you can shrink the result sets.)
My final thought is sometimes you will just need to experiment with different methods of pulling together the query. There might be too many variables for anyone here to give a answer.... On the other hand, people here are smart so I could be wrong.
Best of luck!
Regards, Frank
PS: I forgot to mention that if you wanted to try this temp table method, you'd also need to experiment with different indexes and primary keys on the temp tables. Depending on the amount of data, indexes and PKs can help.
This doesn't make sense:
if
s.OrderID IS NULL
, thenr.OrderID = s.OrderID
will never be true, so no rows from[Rejected]
will ever be included, thus as given, it is equivalent to:Are you sure that code you posted is right?
Indexing on all the tables is going to be important. If you can't do much with the indexing on the [MainFrame] columns used in the join, you can also pre-limit the rows to be searched in [MainFrame] (and [Rejected], although that already looks like it has a PK)by specifying a date range - if the window of date should be roughly similar. This can cut down on the right hand side on that join.
I would also look at the execution plan and also do a simple black box evaluation of which of your
JOIN
s is really the most expensive -m
orr
, by benchmarking the query with only one or the other. I would suspect it ism
because of the multiple columns and missing useful indexes.You could use m.EntryDate within a few days or months of your range. But if you already have indexes on Mainframe, the question is why aren't they being used, or if they are being used, why is the performance so slow.
try changing LEFT JOIN [Rejected] r with (nolock) ON r.OrderID = s.OrderID into the RIGHT MERGE JOIN: