Handling a very big table in SQL Server (performance)

Published 2019-08-02 23:49

Question:

I'm having some trouble dealing with a very big table in my database. Before talking about the problem, let's talk about what I want to achieve.

I have two source tables :

  • Source 1: SALES_MAN (ID_SMAN, SM_LATITUDE, SM_LONGITUDE)
  • Source 2: CLIENT (ID_CLIENT, CLATITUDE, CLONGITUDE)

  • Target: DISTANCE (ID_SMAN, ID_CLIENT, SM_LATITUDE, SM_LONGITUDE, CLATITUDE, CLONGITUDE, DISTANCE)

The idea is to find the top N nearest SALES_MAN for every client using a ROW_NUMBER in the target table.

What I'm doing currently is calculating the distance between every client and every salesman:

INSERT INTO DISTANCE ([ID_SMAN], [ID_CLIENT], [DISTANCE],
                      [SM_LATITUDE], [SM_LONGITUDE], [CLATITUDE], [CLONGITUDE])
    SELECT 
        S.[ID_SMAN], C.[ID_CLIENT],
        -- WKT points are written 'POINT(longitude latitude)'
        geography::STGeomFromText('POINT(' + C.[CLONGITUDE] + ' ' + C.[CLATITUDE] + ')', 4326)
            .STDistance(geography::STGeomFromText('POINT(' + S.[SM_LONGITUDE] + ' ' + S.[SM_LATITUDE] + ')', 4326)) / 1000 AS DISTANCE,
        S.[SM_LATITUDE], S.[SM_LONGITUDE], C.[CLATITUDE], C.[CLONGITUDE]
    FROM 
        [dbo].[SALES_MAN] S
        CROSS JOIN [dbo].[CLIENT] C

The DISTANCE table contains approximately 1 billion rows.

The second step, to get my 5 nearest salesmen per client, is to run this query:

SELECT * 
FROM  
    (SELECT 
         *, 
         ROW_NUMBER() OVER(PARTITION BY ID_CLIENT ORDER BY DISTANCE) rang  
     FROM DISTANCE) TAB
WHERE rang < 6 

The last query is really expensive. To avoid the SORT operator, I tried creating a sorted nonclustered index on DISTANCE and ID_CLIENT, but it did not help. I also tried including all the needed columns in both indexes.

But when I created a clustered index on DISTANCE and kept the sorted nonclustered index on ID_CLIENT, things went better.

So why is a nonclustered sorted index not working in this case?

But when I use the clustered index, I have another problem loading data, and I'm more or less forced to drop it before starting the loading process.
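For reference, the drop-and-reload pattern I mean looks roughly like this (the index names are just placeholders for whatever the real ones are):

```sql
-- Hypothetical index names; adjust to the real schema.
-- Drop the clustered index so the load goes into a heap,
-- and disable the nonclustered index so it is not maintained row by row.
DROP INDEX IX_DISTANCE_CL ON dbo.DISTANCE;
ALTER INDEX IX_DISTANCE_CLIENT ON dbo.DISTANCE DISABLE;

-- ... run the big INSERT ... SELECT load here ...

-- Recreate / rebuild once, after the load, instead of paying per row.
CREATE CLUSTERED INDEX IX_DISTANCE_CL ON dbo.DISTANCE (DISTANCE);
ALTER INDEX IX_DISTANCE_CLIENT ON dbo.DISTANCE REBUILD;
```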

So what do you think? How can we deal with this kind of table so that we can select, insert, or update data without performance issues?

Many thanks

Answer 1:

Too long for a comment, but consider the following points.

Item 1) Consider adding a geography column to each of your source tables. This eliminates the redundant GEOGRAPHY::Point() function calls:

Update YourTable Set GeoPoint = GEOGRAPHY::Point([Lat], [Lng], 4326)

So then the calculation for distance would simply be

  ,InMeters  = C.GeoPoint.STDistance(S.GeoPoint) 
  ,InMiles   = C.GeoPoint.STDistance(S.GeoPoint) / 1609.344
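Applied to the CLIENT table from the question, adding and populating such a column might look like this (the column name GeoPoint is just an example; note that GEOGRAPHY::Point takes latitude first):

```sql
ALTER TABLE dbo.CLIENT ADD GeoPoint GEOGRAPHY NULL;

UPDATE dbo.CLIENT
SET GeoPoint = GEOGRAPHY::Point(CLATITUDE, CLONGITUDE, 4326);
```

The same two statements would be repeated for SALES_MAN with SM_LATITUDE and SM_LONGITUDE.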


Item 2) Rather than generating EVERY possible combination, consider adding a condition to the JOIN. Keep in mind that one degree of latitude is approximately 69 miles (a degree of longitude shrinks toward the poles), so you can reduce the search area. For example:

From CLIENT C
Join SALES_MAN S
  on S.Lat between C.Lat-1 and C.Lat+1
 and S.Lng between C.Lng-1 and C.Lng+1

This +/- 1 could be any reasonable value (e.g. 0.5 or even 2.0).
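Combining the two items against the question's tables, the pre-filtered distance query might look roughly like this (it assumes both tables now have a populated GeoPoint column as in Item 1):

```sql
SELECT C.ID_CLIENT,
       S.ID_SMAN,
       C.GeoPoint.STDistance(S.GeoPoint) / 1000 AS DistanceKm
FROM dbo.CLIENT C
JOIN dbo.SALES_MAN S
  -- bounding box of roughly +/- 69 miles; widen or narrow as needed
  ON S.SM_LATITUDE  BETWEEN C.CLATITUDE  - 1 AND C.CLATITUDE  + 1
 AND S.SM_LONGITUDE BETWEEN C.CLONGITUDE - 1 AND C.CLONGITUDE + 1;
```

The cheap BETWEEN predicates prune most pairs before the expensive STDistance call runs, which can shrink the billion-row cross product dramatically.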



Answer 2:

ROW_NUMBER is a window function that has to read every row in each partition to order it by the ORDER BY column, so it is better to filter the result before applying ROW_NUMBER.

You could change the following code:

SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY ID_CLIENT ORDER BY DISTANCE) rang
    FROM DISTANCE
) TAB
WHERE rang < 6

into this:

WITH DISTANCE_CLIENT_IDS (ID_CLIENT) AS 
(
  SELECT DISTINCT ID_CLIENT 
  FROM DISTANCE
) 
SELECT Dx.* 
FROM DISTANCE_CLIENT_IDS D1
CROSS APPLY
(
   SELECT TOP (5) D2.*,
          ROW_NUMBER() OVER (ORDER BY D2.DISTANCE) AS rang
   FROM DISTANCE D2
   WHERE D2.ID_CLIENT = D1.ID_CLIENT
   ORDER BY D2.DISTANCE
) Dx

and make sure you have an index covering both the ID_CLIENT and DISTANCE columns.
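For example (the index name is illustrative), a covering nonclustered index lets each per-client TOP (5) probe run as an ordered range seek instead of a scan plus sort:

```sql
-- Key on (ID_CLIENT, DISTANCE) so rows for one client are stored
-- pre-sorted by distance; INCLUDE the remaining columns so the
-- SELECT * never has to go back to the base table.
CREATE NONCLUSTERED INDEX IX_DISTANCE_CLIENT_DIST
ON dbo.DISTANCE (ID_CLIENT, DISTANCE)
INCLUDE (ID_SMAN, SM_LATITUDE, SM_LONGITUDE, CLATITUDE, CLONGITUDE);
```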