MySQL: Optimized query to find matching strings fr

Posted 2019-09-09 02:24

Question:

I have 10 sets of strings, each containing 9 strings. All strings in the first set have length 10, those in the second set have length 9, and so on, down to the 10th set, whose strings all have length 1.

The strings in each set share a common prefix of at least (length-2) characters, and that prefix gets one character shorter with each successive set. Thus, the first set has 8 characters in common, the second has 7, and so on.

Here is what a sample of the 10 sets looks like:

pu3q0k0vwn
pu3q0k0vwp
pu3q0k0vwr
pu3q0k0vwq
pu3q0k0vwm
pu3q0k0vwj
pu3q0k0vtv
pu3q0k0vty
pu3q0k0vtz

pu3q0k0vw
pu3q0k0vy
pu3q0k0vz
pu3q0k0vx
pu3q0k0vr
pu3q0k0vq
pu3q0k0vm
pu3q0k0vt
pu3q0k0vv

pu3q0k0v
pu3q0k0y
pu3q0k1n
pu3q0k1j
pu3q0k1h
pu3q0k0u
pu3q0k0s
pu3q0k0t
pu3q0k0w

pu3q0k0
pu3q0k2
pu3q0k3
pu3q0k1
pu3q07c
pu3q07b
pu3q05z
pu3q0hp
pu3q0hr

pu3q0k
pu3q0m
pu3q0t
pu3q0s
pu3q0e
pu3q07
pu3q05
pu3q0h
pu3q0j

pu3q0
pu3q2
pu3q3
pu3q1
pu3mc
pu3mb
pu3jz
pu3np
pu3nr

pu3q
pu3r
pu3x
pu3w
pu3t
pu3m
pu3j
pu3n
pu3p

pu3
pu9
pud
pu6
pu4
pu1
pu0
pu2
pu8

pu
pv
0j
0h
05
pg
pe
ps
pt

p
r
2
0
b
z
y
n
q

Requirement: I have a table PROFILES with columns SRNO (type bigint, primary key) and UNIQUESTRING (type char(10), unique key). I want to find 450 SRNOs whose UNIQUESTRINGs match the strings in those 10 sets.

First, find strings like those in the first set. If that does not yield enough results (i.e. 450), find strings like those in the second set. If there are still not enough results (450 minus the results from the first set), find strings like those in the third set. And so on.
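The fallback described above can be sketched as a client-side loop. This is only a sketch under assumptions, not the asker's code: Python's standard-library sqlite3 module stands in for a MySQL client, `collect_srnos` and `prefix_sets` are hypothetical names, the table contents are made up, and plain LIKE prefix matching is used. Because a row that matches a longer prefix also matches the shorter prefixes in later sets, each query asks for up to `wanted` rows and SRNOs that were already collected are skipped.

```python
import sqlite3


def collect_srnos(conn, prefix_sets, wanted=450):
    """Walk the sets from most to least specific until `wanted` SRNOs are found.

    `prefix_sets` is a list of lists of strings, first (longest) set first.
    """
    found = []    # SRNOs in discovery order
    seen = set()  # SRNOs already collected, to skip repeats from later sets
    for prefixes in prefix_sets:
        for prefix in prefixes:
            # Fewer than `wanted` rows can be repeats, so asking for `wanted`
            # rows is enough to fill the remaining slots when that is possible.
            rows = conn.execute(
                "SELECT srno FROM profiles WHERE uniquestring LIKE ? LIMIT ?",
                (prefix + "%", wanted),
            )
            for (srno,) in rows:
                if srno not in seen:
                    seen.add(srno)
                    found.append(srno)
                    if len(found) == wanted:
                        return found
    return found


# Tiny in-memory table standing in for PROFILES (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profiles (srno INTEGER PRIMARY KEY, uniquestring CHAR(10) UNIQUE)"
)
conn.executemany(
    "INSERT INTO profiles VALUES (?, ?)",
    [(1, "pu3q0k0vwn"), (2, "pu3q0k0vy5"), (3, "pu3q0k1n00"), (4, "zzzzzzzzzz")],
)
sets = [
    ["pu3q0k0vwn", "pu3q0k0vwp"],  # excerpt of the 10-character set
    ["pu3q0k0vw", "pu3q0k0vy"],    # excerpt of the 9-character set
    ["pu3q0k0v", "pu3q0k1n"],      # excerpt of the 8-character set
]
result = collect_srnos(conn, sets, wanted=3)
```

The loop stops issuing queries as soon as the target count is reached, so the broad single-character prefixes are only queried when the narrower sets do not supply enough rows.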

Existing solution: I've written a query something like this:

select srno from profiles
    where (   (uniquestring like 'pu3q0k0vwn%')
           or (uniquestring like 'pu3q0k0vwp%')
           -- ... one clause for each of the remaining strings above, ending with:
           or (uniquestring like 'n%')
           or (uniquestring like 'q%')
          )
    limit 450

However, after getting feedback from Rick James in this answer, I realized this query is not optimized, as it touches many more rows than it needs to. So I plan to rewrite the query like this:

(select srno from  profiles where uniquestring like 'pu3q0k0vwn%' LIMIT 450)
UNION DISTINCT
(select srno from  profiles where uniquestring like 'pu3q0k0vwp%' LIMIT 450) -- followed by one more such clause for each remaining uniquestring

I would like to know if there is a better way to do this.

Answer 1:

SELECT ...
    WHERE str   LIKE  'pu3q0k0vw%' AND       -- first part of the 10-char set
          str REGEXP '^pu3q0k0vw[nprqmj]'    -- its 6 possible last letters
    LIMIT ...
# then check for 450; if not enough, continue...
SELECT ...
    WHERE str   LIKE  'pu3q0k0vt%' AND       -- rest of the 10-char set
          str REGEXP '^pu3q0k0vt[vyz]'       -- its 3 possible last letters
    LIMIT 450
# then check for 450; if not enough, continue...
# etc.
SELECT ...
    WHERE str   LIKE  'pu3q0k0v%' AND        -- the 9-char set
          str REGEXP '^pu3q0k0v[wyzxrqmtv]'  -- the 9 possible last letters
    LIMIT ...
# check, etc.; for a total of 10 SELECTs or 450 rows, whichever comes first.

This will take 10 or more SELECTs. Each SELECT is somewhat optimized: the LIKE first picks out the rows that share a common prefix, and the REGEXP then double-checks each of those rows.
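The LIKE and REGEXP pairs above can be derived mechanically from each set. The following is a minimal sketch, assuming Python on the client side; `build_conditions` is a hypothetical helper, and it assumes the strings contain only letters and digits so that nothing needs escaping in either pattern. It groups the strings by everything except their last character and emits one pattern pair per group.

```python
def build_conditions(strings):
    """Group strings by everything but their last character.

    Returns a list of (like_pattern, regexp_pattern) pairs, one per group,
    in order of first appearance.
    """
    groups = {}  # prefix -> list of final characters (dicts keep insertion order)
    for s in strings:
        if not s.isalnum():
            # Assumption: only letters and digits, so no LIKE/REGEXP escaping.
            raise ValueError(f"unexpected character in {s!r}")
        groups.setdefault(s[:-1], []).append(s[-1])
    return [
        (prefix + "%", "^" + prefix + "[" + "".join(chars) + "]")
        for prefix, chars in groups.items()
    ]


first_set = [
    "pu3q0k0vwn", "pu3q0k0vwp", "pu3q0k0vwr", "pu3q0k0vwq", "pu3q0k0vwm",
    "pu3q0k0vwj", "pu3q0k0vtv", "pu3q0k0vty", "pu3q0k0vtz",
]
conditions = build_conditions(first_set)
for like_pattern, regexp_pattern in conditions:
    # Printed for illustration; real code would bind the patterns as parameters.
    print(f"str LIKE '{like_pattern}' AND str REGEXP '{regexp_pattern}'")
```

For the first set this yields the same two groups as the hand-written queries, one for pu3q0k0vw and one for pu3q0k0vt.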

(If you don't like the way the first set gets split into the inconsistent prefixes pu3q0k0vw and pu3q0k0vt, we can discuss things further.)

You say "prefix"; I have coded the LIKE and REGEXP to assume arbitrary text after the prefix given.

UNION is not viable, since it will (I think) gather all the rows before picking 450. Each separate SELECT will stop at its LIMIT, provided there is no DISTINCT, GROUP BY, or ORDER BY that requires gathering everything first.

REGEXP on its own is not smart enough to avoid scanning the entire table; adding the LIKE avoids the full scan (except when more than, say, 20% of the rows match the LIKE).