I'm having some troubles to deal with a very big table in my database. Before to talk about the problem, let's talk about what i want to achieve.
I have two source tables :
- Source 1:
SALES_MAN (ID_SMAN, SM_LATITUDE, SM_LONGITUDE)
Source 2:
CLIENT (ID_CLIENT, CLATITUDE, CLONGITUDE)
Target:
DISTANCE (ID_SMAN, ID_CLIENT, SM_LATITUDE, SM_LONGITUDE, CLATITUDE, CLONGITUDE, DISTANCE)
The idea is to find the top N nearest SALES_MAN
for every client using a ROW_NUMBER
in the target table.
What I'm doing currently is calculating the distance between every client and every sales man :
INSERT INTO DISTANCE ([ID_SMAN], [ID_CLIENT], [DISTANCE],
[SM_LATITUDE], [SM_LONGITUDE], [CLATITUDE], [CLONGITUDE])
SELECT
[ID_SMAN], [ID_CLIENT],
geography::STGeomFromText('POINT('+IND_LATITUDE+' '+IND_LONGITUDE+')',4326).STDistance(geography::STGeomFromText('POINT('+DLR.[DLR_N_GPS_LATTITUDE]+' '+DLR.[DLR_N_GPS_LONGITUDE]+')',4326))/1000 as distance,
[SM_LATITUDE], [SM_LONGITUDE], [CLATITUDE], [CLONGITUDE]
FROM
[dbo].[SALES_MAN], [dbo].[CLIENT]
The DISTANCE
table contains approximately 1 milliards rows.
The second step to get my 5 nearest sales man per client is to run this query :
SELECT *
FROM
(SELECT
*,
ROW_NUMBER() OVER(PARTITION BY ID_CLIENT ORDER BY DISTANCE) rang
FROM DISTANCE) TAB
WHERE rang < 6
The last query is really a consuming one. So to avoid the SORT operator I tried to create an sorted non clustered index in DISTANCE and ID_CLIENT but it did not work. I also tried to include all the needed columns in the both indexes.
But when I created a clustered index on DISTANCE
and keep the nonclustered sorted index in the ID_CLIENT
the things went better.
So what a nonclustered sorting index is not working in this case?
But when I use the clustered index, I have other problem in loading data and I'm kind of forced to delete it before starting the loading process.
So what do you think? And how we can deal with this kind of tables to be able to select, insert or update data without having performance issues ?
Many thanks
Too long for a comment, but consider the following points.
Item 1) Consider adding a
Geography
field to each of your source tables. This will eliminate the redundantGEOGRAPHY::Point()
function callsSo then the calculation for distance would simply be
Item 2) Rather than generating EVERY possible combination, consider a adding a condtion to the
JOIN
. Keep in mind that every "1" of Lat or Lng is approx 69 miles, so you can reduce the search area. For exampleThis
+/- 1
could be any reasonable value ... (i.e.0.5
or even2.0
)ROW_NUMBER is a window function that requires the whole rows related with the ORDER BY 's column so its better to filter your result before ROW_NUMBER,
and you've to change the following code :
into this:
and make sure you'd added indexes on both CLIENT_ID and DISTANCE columns