SQL Server, ISABOUT, weighted terms

2019-06-17 08:39发布

问题:

I am trying to figure out exactly how weighted terms work in an ISABOUT query in SQL SERVER.

Here is where I currently am:

Each query returns the following rows:

QUERY 1 (weight 1): Initial ranking

SELECT * FROM CONTAINSTABLE(documentParts, title, 'ISABOUT ("e" weight (1) ) ') ORDER BY RANK DESC, [KEY]

KEY     RANK
306342  249
272619  156
221557  114

QUERY 2 (weight 0.8): Ranking increases, initial order is preserved

SELECT * FROM CONTAINSTABLE(documentParts, title, 'ISABOUT ("e" weight (0.8) ) ') ORDER BY RANK DESC, [KEY]

 KEY     RANK
 306342  321
 272619  201
 221557  146

QUERY 3 (weight 0.2): Ranking increases, initial order is preserved

SELECT * FROM CONTAINSTABLE(documentParts, title, 'ISABOUT ("e" weight (0.2) ) ') ORDER BY RANK DESC, [KEY]

 KEY    RANK
 306342 998
 272619 877
 221557 692

QUERY 4 (weight 0.17): Ranking decreases, best match is now last, inverted behavior for these terms begin at 0.17

SELECT * FROM CONTAINSTABLE(documentParts, title, 'ISABOUT ("e" weight (0.17) ) ') ORDER BY RANK DESC, [KEY]

 KEY      RANK
 272619   960
 221557   958
 306342   802

QUERY 5 (weight 0.16): Ranking increases, best match is now second

SELECT * FROM CONTAINSTABLE(documentParts, title, 'ISABOUT ("e" weight (0.17) ) ') ORDER BY RANK DESC, [KEY]

 KEY      RANK
 272619   978
 306342   935
 221557   841

QUERY 6 (weight 0.01): Ranking decreases, best match is last again

SELECT * FROM CONTAINSTABLE(documentParts, title, 'ISABOUT ("e" weight (0.01) ) ') ORDER BY RANK DESC, [KEY]

 KEY    RANK
 221557 105
 272619 77
 306342 50

Best match for weight 1 has a rank of 249 and while weight goes down to 0.2 ranking of best match increases to 998. From 0.2 to 0.17 ranking decreases and from 0.16 results are inverted (the weight values that reproduce this behavior depend on terms and maybe on columns searched...)

It seems there is a point where weight means the opposite, something like "do not include this term". Do you have any explanation of this behavior? Why ranking increases when weight decreases? Why ranking decreases after some point until results are inverted and how can you predict this point?

I use a custom "word-breaker", when user searches for something creating the following query:

CONTAINSTABLE(documentParts, title, 
      'ISABOUT (
          "wordA wordB wordC" weight (0.8), 
          "wordA*" NEAR "wordB*" NEAR "wordC*" weight (0.6), 
          "wordA*" weight (0.1), 
          "wordB*" weight (0.1), 
          "wordC*" weight (0.1), 
       ) ')

Am I to expect big ranks for 0.1 Words?
Is the following query the same as above and am I to expect some weird behavior with the 0.1 rankings?

CONTAINSTABLE(documentParts, title, '
      ISABOUT ( "wordA wordB wordC" weight (0.8) ), 
      OR ISABOUT ( "wordA*" NEAR "wordB*" NEAR "wordC*" weight (0.6) ), 
      OR ISABOUT ( "wordA*" weight (0.1) ), 
      OR ISABOUT ( "wordB*" weight (0.1) ), 
      OR ISABOUT ( "wordC*" weight (0.1) ), 
      ')

回答1:

In my experience I have had the best results where the weights add up to 1.

CONTAINSTABLE(documentParts, content, 
          'ISABOUT (
              "wordA wordB wordC" weight (0.5), 
              "wordA*" NEAR "wordB*" NEAR "wordC*" weight (0.2), 
              "wordA*" weight (0.1), 
              "wordB*" weight (0.1), 
              "wordC*" weight (0.1) 
           ) ')


回答2:

Since the clock is ticking I end up with something like this which fetches quite good results...:

SELECT [KEY], SUM([RANK]) AS [RANK] FROM (
    SELECT [KEY], ([RANK]*1)/(SUM([RANK]) OVER( PARTITION BY 1)/ CAST(COUNT([RANK]) OVER( PARTITION BY 1) AS FLOAT)) AS [RANK] 
        FROM CONTAINSTABLE(documentParts, content, 
              'ISABOUT (
                  "wordA wordB wordC" weight (0.8), 
                  "wordA*" NEAR "wordB*" NEAR "wordC*" weight (0.6), 
                  "wordA*" weight (0.4), 
                  "wordB*" weight (0.4), 
                  "wordC*" weight (0.4) 
               ) ') c
        WHERE c.RANK>0
        UNION ALL      
        SELECT [KEY], ([RANK]*2)/(SUM([RANK]) OVER( PARTITION BY 1)/ CAST(COUNT([RANK]) OVER( PARTITION BY 1) AS FLOAT)) AS [RANK] 
        FROM CONTAINSTABLE(documents, title, 
              'ISABOUT (
                  "wordA wordB wordC" weight (0.8), 
                  "wordA*" NEAR "wordB*" NEAR "wordC*" weight (0.6), 
                  "wordA*" weight (0.4), 
                  "wordB*" weight (0.4), 
                  "wordC*" weight (0.4) 
               ) ') c
         WHERE c.RANK>0
    ) t 
    GROUP BY [KEY]
ORDER BY [RANK] DESC

I will pass it to the test team and call it a day...