I've wrote this query without much thought but as a beginner I'm almost sure it could be written better.
Here it's:
SELECT filehash, filename, filesize, group_files
FROM files
INNER JOIN ( SELECT filehash group_id,
COUNT(filehash) group_files
FROM files
GROUP BY filehash) groups
ON files.filehash = groups.group_id
ORDER BY group_files DESC,
filesize DESC
Table definition:
CREATE TABLE files (fileid INTEGER PRIMARY KEY AUTOINCREMENT,
filename TEXT,
filesize INTEGER,
filehash TEXT)
Indexes definition:
CREATE INDEX files_filehash_idx
ON files(filehash)
CREATE UNIQUE INDEX files_filename_idx
ON files(filename)
CREATE INDEX files_filesize_idx
ON files(filesize)
Query EXPLAIN QUERY PLAN:
selectid order from detail
1 0 0 SCAN TABLE files USING COVERING INDEX files_filehash_idx (~1000000 rows)
0 0 1 SCAN SUBQUERY 1 AS groups (~100 rows)
0 1 0 SEARCH TABLE files USING INDEX files_filehash_idx (filehash=?) (~10 rows)
0 0 0 USE TEMP B-TREE FOR ORDER BY
Could you correct me if I'm wrong? Thank you in advance.
Nope. Looks spot on to me.
I don't think you need the index on filename for this query. There are plans where the index on file size would help, but MySQL isn't using them for this. You might be better off replacing the two separate indexes with a compound index on (filehash, filesize). Or you might not!
What do you think about this version?
It seems like that will probably run faster. Does it do what you need?