So I have a database of words between 3 and 20 characters long. I want to code something in PHP that finds all of the smaller words that are contained within a larger word. For example, in the word "inward" there are the words "rain", "win", "rid", etc.
At first I thought about adding a field to the Words tables (Words3 through Words20, denoting the number of letters in the words), something like "LetterCount"... for example, "rally" would be represented as 10000000000200000100000010: 1 instances of the letter A, 0 instances of the letter B, ... 2 instances of the letter L, etc. Then, go through all the words in each table (or one table if the target length of found words was specified) and compare the LetterCount of each word to the LetterCount of the source word ("inward" in the example above).
But then I started thinking that that would place too much of a load on the MySQL database as well as the PHP script, calling each and every word's LetterCount, comparing each and every digit to that of the source word, etc.
Is there an easier, perhaps more intuitive way of doing this? I'm open to using stored procedures if it will help with overhead in any way. Just some suggestions would be greatly appreciated. Thanks!
Here is a simple solution that should be pretty efficient, but will only work up to certain size of words (probably about 15-20 characters it will break down, depending on whether the letters making up the word are low-frequency letters with lower values or high-frequency letters with higher values):
e
is 2,t
= 3,a
= 5, etc. using frequency values from here or some similar source.bigint
data type column. For instance,tea
would have a value of3*2*5=30
. If a word has repeated letters, repeat the factor, so thatteat
should have a value of3*2*5*3=90
.rain
, is contained inside of another word, such asinward
, it's sufficient to check if the value forrain
divides the value forinward
. In this case,inward = 14213045
,rain = 7315
, and14213045
is divisible by7315
, so the wordrain
is inside the wordinward
.9223372036854775807
, which should be fine up to about 15-20 characters (depending on the frequencies of letters in the word). For instance, I picked up the first 20-letter word from here, which isanitinstitutionalism
, and has a value of6901041299724096525
which would just barely fit inside the bigint column. However, the 14-letter wordxylopyrography
has a value of635285791503081662905
, which is too big. You might have to handle the really large ones as special cases using an alternate method, but hopefully there's few enough of them that it would still be relatively efficient.The query would work something like the demo I've prepared here: http://www.sqlfiddle.com/#!2/9bd27/8