So I have a database of words between 3 and 20 characters long. I want to code something in PHP that finds all of the smaller words that are contained within a larger word. For example, in the word "inward" there are the words "rain", "win", "rid", etc.
At first I thought about adding a field to the Words tables (Words3 through Words20, denoting the number of letters in the words), something like "LetterCount"... for example, "rally" would be represented as 10000000000200000100000010: 1 instances of the letter A, 0 instances of the letter B, ... 2 instances of the letter L, etc. Then, go through all the words in each table (or one table if the target length of found words was specified) and compare the LetterCount of each word to the LetterCount of the source word ("inward" in the example above).
But then I started thinking that that would place too much of a load on the MySQL database as well as the PHP script, calling each and every word's LetterCount, comparing each and every digit to that of the source word, etc.
Is there an easier, perhaps more intuitive way of doing this? I'm open to using stored procedures if it will help with overhead in any way. Just some suggestions would be greatly appreciated. Thanks!
Here is a simple solution that should be pretty efficient, but will only work up to certain size of words (probably about 15-20 characters it will break down, depending on whether the letters making up the word are low-frequency letters with lower values or high-frequency letters with higher values):
- Assign each letter a prime number according to it's frequency. So
e
is 2, t
= 3, a
= 5, etc. using frequency values from here or some similar source.
- Precalculate the value of each word in your word list by multiplying the prime values for the letters in the word, and store in the table in a
bigint
data type column. For instance, tea
would have a value of 3*2*5=30
. If a word has repeated letters, repeat the factor, so that teat
should have a value of 3*2*5*3=90
.
- When checking if a word, such as
rain
, is contained inside of another word, such as inward
, it's sufficient to check if the value for rain
divides the value for inward
. In this case, inward = 14213045
, rain = 7315
, and 14213045
is divisible by 7315
, so the word rain
is inside the word inward
.
- A bigint column maxes out at
9223372036854775807
, which should be fine up to about 15-20 characters (depending on the frequencies of letters in the word). For instance, I picked up the first 20-letter word from here, which is anitinstitutionalism
, and has a value of 6901041299724096525
which would just barely fit inside the bigint column. However, the 14-letter word xylopyrography
has a value of 635285791503081662905
, which is too big. You might have to handle the really large ones as special cases using an alternate method, but hopefully there's few enough of them that it would still be relatively efficient.
The query would work something like the demo I've prepared here: http://www.sqlfiddle.com/#!2/9bd27/8