Better query strategy to sort files by file hash f

2019-06-01 09:53发布

问题:

I've wrote this query without much thought but as a beginner I'm almost sure it could be written better.

Here it's:

    SELECT filehash, filename, filesize, group_files
      FROM files
INNER JOIN (  SELECT filehash group_id,
                     COUNT(filehash) group_files
                FROM files
            GROUP BY filehash) groups
        ON files.filehash = groups.group_id
  ORDER BY group_files DESC,
           filesize DESC

Table definition:

CREATE TABLE files (fileid INTEGER PRIMARY KEY AUTOINCREMENT,
                    filename TEXT,
                    filesize INTEGER,
                    filehash TEXT)

Indexes definition:

CREATE INDEX files_filehash_idx
          ON files(filehash)
CREATE UNIQUE INDEX files_filename_idx
                 ON files(filename)
CREATE INDEX files_filesize_idx
          ON files(filesize)

Query EXPLAIN QUERY PLAN:

selectid    order   from    detail
1           0       0       SCAN TABLE files USING COVERING INDEX files_filehash_idx (~1000000 rows)
0           0       1       SCAN SUBQUERY 1 AS groups (~100 rows)
0           1       0       SEARCH TABLE files USING INDEX files_filehash_idx (filehash=?) (~10 rows)
0           0       0       USE TEMP B-TREE FOR ORDER BY

Could you correct me if I'm wrong? Thank you in advance.

回答1:

What do you think about this version?

  select filehash, group_concat(filename), filesize, count(*) as group_files
    from files
group by filehash
order by group_files desc

It seems like that will probably run faster. Does it do what you need?



回答2:

Nope. Looks spot on to me.

I don't think you need the index on filename for this query. There are plans where the index on file size would help, but MySQL isn't using them for this. You might be better off replacing the two separate indexes with a compound index on (filehash, filesize). Or you might not!