T-SQL - How can I make a SELECT query with multipl

2019-04-25 08:04发布

问题:

I've got a search function for news articles that looks like this (contains more than 5 search items):

SELECT TOP 5 *
FROM NewsArticles
WHERE (headline LIKE '% sustainable %'OR
headline LIKE '% sustainability %' OR
headline LIKE '% conservation %' OR
headline LIKE '% environmental % OR
headline LIKE '% environmentally %') 
OR
(body LIKE '% sustainable %'OR
body LIKE '% sustainability %' OR
body LIKE '% conservation %' OR
body LIKE '% environmental % OR
body LIKE '% environmentally %')
ORDER BY publishDate DESC 

This query is designed to pull out the top 5 news stories relating to sustainability and sits on my main sustainability homepage. However, it takes a while to run and the page is slow to load. So I'm looking up ways to speed this up. Having so many LIKE clauses seems cumbersome so I've tried something with a JOIN like this:

CREATE TABLE #SearchItem (Search varchar(255))

INSERT INTO #SearchItem VALUES
('sustainable'),
('sustainability'),
('conservation'),
('environmental'),
('environmentally')

SELECT TOP 5 *
FROM NewsArticles as n
JOIN #SearchItem as s
ON n.headline COLLATE DATABASE_DEFAULT LIKE '% ' + s.Search + ' %' OR 
n.body COLLATE DATABASE_DEFAULT LIKE '% ' + s.Search + ' %'
ORDER BY n.publishDate DESC

This seems to work very well for performance, but seems to sometimes bring back duplicate articles where one of the search words appears in both the body and the headline (which is often the case). I've tried using the word using 'SELECT DISTINCT TOP 5 *' but this gives me an error saying 'The ntext data type cannot be selected as DISTINCT because it is not comparable'. Is there away of stopping this from bringing back duplicates without doing 2 separate searches and using UNION?

回答1:

Since you get multiple hits on multiple words, you can use the selected ID's as a filter for the actual selection of the articles:

Select TOP 5 * 
from NewsArticles 
where ID in (SELECT ID
    FROM NewsArticles as n
    JOIN #SearchItem as s
    ON n.headline COLLATE DATABASE_DEFAULT LIKE '% ' + s.Search + ' %' OR 
       n.body COLLATE DATABASE_DEFAULT LIKE '% ' + s.Search + ' %'
)
ORDER BY publishDate DESC

It should still be reasonably fast (compared to the original query) and duplicate-free.

(as in Rawheiser's response, there is an assumption that an ID field actually exists :))



回答2:

If you are doing these types of searches, you should use full text search. You need to read up in BOL about how to set this up as it is complicated. However when you have a wildcard as the first character, then SQL server cannot use indexes which is why this is slow.



回答3:

If there is an news Article key you can use a query to that joins back to itself such as:

select top 5 *
from NewsArticles as na
join 
    ( 
        SELECT distinct idNo , publishDate 
        FROM NewsArticles as n
        JOIN #SearchItem as s
        ON n.headline COLLATE DATABASE_DEFAULT LIKE '% ' + s.Search + ' %' OR 
        n.body COLLATE DATABASE_DEFAULT LIKE '% ' + s.Search + ' %'
    ) as sk
        on sk.idNo = na.idNo
ORDER BY sk.publishDate DESC


回答4:

You could also try Full-Text Search. Something like

SELECT TOP 5 * FROM NewsArticles 
WHERE CONTAINS((headline,body), 'FORMSOF(INFLECTIONAL, sustainable) OR conservation OR FORMSOF(INFLECTIONAL, environmental)')

But, as HLGEM says, it will be important to read up on setting up FTS. Just creating an index for those two columns may be enough since you are searching for single words, but once you add phrases you want to start editing stop words and breakers and noise words.



回答5:

I would so something like this

I'd program an case insensitive indexing process that would run at nights, that would:

  • parse articles, insert in SEARCH_TERM any new word it finds
  • insert in SEARCH_INDEX a tuple that says that a given article has a given search term
  • mark the NEWS_ARTICLE as already indexed.
  • next run will index only articles tha haven't been idexed

The a sample query can be like this, and would be lighting fast since you will not be using LIKE:

select
    distinct n.headline
from
    search_item s
    join search_index x on (s.id = x.search_item_id)
    join news_article n on (x.news_article_id = n.id)
where
    s.term in ('sustainable','sustainability','conservation',
               'environmental','environmentally')

Later you could create synonyms for search terms and create a view that returns terms with their synonyms, should you want to make more semantic searches in the future.