Need help with optimizing a SQL query

2019-09-03 23:36发布

问题:

I desperately need help with a query that's been causing a lot of grief over the past 6 months on a high traffic website. I am a frontend developer with ability to write simple SQL queries and so I am unable to fix this issue on my own. The query is now locking mysql database frequently for lack of memory or CPU or interference from other queries on the VPS. I upgraded the hardware, but that alone didn't resolve the issue. So here's a description of what the query is trying to do:

User accesses a particular URL (say, the_source_url). Application tries to fetch related source_urls that other users (who have also accessed the_source_url) have accessed earlier, sorted by most frequently accessed to least frequently accessed. Basically, application is trying to find users with similar interest and show other pages they have accessed earlier.

This is the "complex query" that I wrote when I was young/stupid and the website had no traffic:

SELECT DISTINCT(SOURCE_URL), COUNT(SOURCE_URL) CATCOUNT 
  FROM topsources 
 WHERE SOURCE_URL <> ? 
   AND USER_ID IN (SELECT DISTINCT(USER_ID) 
                     FROM topsources WHERE SOURCE_URL = ?) 
GROUP BY SOURCE_URL ORDER BY CATCOUNT DESC

This is the table structure:

`topsources` (
  `USER_ID` varchar(255) NOT NULL,
  `DATE_AND_HOUR` varchar(255) NOT NULL,
  `UPDATED_TIME` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `ITEM_ID` int(11) NOT NULL,
  `SOURCE_URL` varchar(100) NOT NULL,
  `FEED_PAGE_URL` varchar(255) NOT NULL,
  `CATEGORY_URL` varchar(100) NOT NULL,
  `REFERRER` varchar(2048) DEFAULT NULL,
  PRIMARY KEY (`USER_ID`,`DATE_AND_HOUR`(30),`ITEM_ID`),
  KEY `USER_ID` (`USER_ID`),
  KEY `FEED_PAGE_URL` (`FEED_PAGE_URL`),
  KEY `SOURCE_URL` (`SOURCE_URL`),
  KEY `CATEGORY_URL` (`CATEGORY_URL`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Please note that I have already tried doing inner join instead of subselect but that doesn't work. i.e., the below query doesn't return the same result as that of the above query.

SELECT DISTINCT(ts.SOURCE_URL), COUNT(ts.SOURCE_URL) CATCOUNT FROM topsources ts INNER JOIN topsources tsi ON ts.USER_ID = tsi.USER_ID AND tsi.SOURCE_URL = ? WHERE ts.SOURCE_URL <> ? AND ts.CATEGORY_URL = ? GROUP BY ts.SOURCE_URL ORDER BY ts.CATCOUNT DESC

回答1:

IN subselects are not optimised in MySQL. Instead of performing the subselect, and then the outser select, MySQL performs the subselect for each row matched by the outer select. Because of this, subselects in the IN clause should be replaced with joins. Here is a faster version of the query:

SELECT DISTINCT(SOURCE_URL), COUNT(SOURCE_URL) CATCOUNT 
FROM topsources 
INNER JOIN 
(SELECT DISTINCT(USER_ID) 
                 FROM topsources WHERE SOURCE_URL = ?) as t
ON (topsources.USER_ID = t.USER_ID)
WHERE SOURCE_URL <> ? 
GROUP BY SOURCE_URL ORDER BY CATCOUNT DESC


回答2:

This is crying out for normalisation. What you really need is an url table like

id | url
1  | http://....

Then in your topsources table

id | url_id
1  | 1

Then your DISTINCT(url_id) should be significantly faster.