I was looking for a module, regex, or anything else that might apply to this problem.
How can I programmatically parse the string and produce known English and/or Spanish words, given that I have a dictionary table against which I can check each permutation the algorithm generates?
Given a group of characters: EBLAIDL KDIOIDSI ADHFWB
The program should return: BLADE
AID
KID
KIDS
FIDDLE
HOLA
etc....
I also want to be able to define the minimum and maximum word length, as well as the number of syllables.
The input can be any length. Only its letters matter; punctuation is ignored.
Thanks for any help
EDIT
Letters in the input string can be reused.
For example, if the input is: ABLED
then the output may contain: BALL
or BLEED
The only way I can imagine this would work would be to parse through all possible combinations of letters, and compare them against the dictionary. The fastest way to compare them against a dictionary is to turn that dictionary into a hash. That way, you can quickly look up whether the word was a valid word.
I key my dictionary by lowercasing all letters in the dictionary word and then removing any non-alpha characters, just to be on the safe side. For the value, I store the actual dictionary word. For example:
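A minimal sketch of that keying, written in Python rather than the Perl this answer used. The function name `build_lookup` and the sample words are hypothetical, chosen only for illustration:

```python
import re

def build_lookup(words):
    """Map a normalized key (lowercase, letters only) to the original spelling."""
    lookup = {}
    for word in words:
        # Lowercase the word, then strip anything that is not a-z.
        key = re.sub(r"[^a-z]", "", word.lower())
        if key:
            lookup[key] = word
    return lookup

# Hypothetical sample entries; a real run would read a dictionary file.
lookup = build_lookup(["Blade", "aid", "kid's", "fiddle"])
print(lookup["kids"])   # kid's
print(lookup["blade"])  # Blade
```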
That way, I can display the correctly spelled word.
I found Math::Combinatorics which looked pretty good, but wasn't quite working the way I hoped. You give it a list of letters, and it will return all combinations of those letters in the number of letters you specify. Thus, I thought all I had to do was convert the letters into a list of individual letters, and simply loop through all possible combinations!
No... That gives me all unordered combinations. What I then had to do was, for each combination, list all possible permutations of those letters. Blah! Ptooy! Yech!
So, the infamous loop within a loop. Actually, three loops.
* The outer loop counts through every group size, from 1 to the number of letters in the word.
* The next finds all unordered combinations of letters for each group size.
* Finally, the last one takes each unordered combination and returns a list of its permutations.
Now, I can finally take those permutations of letters and compare them against my dictionary of words. Surprisingly, the program ran much faster than I expected, considering it had to turn a 235,886-word dictionary into a hash, then loop through a triple-decker loop to find all permutations of all combinations of all possible numbers of letters. The whole program ran in less than two seconds.
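The three loops could be sketched roughly as follows. This is a Python stand-in using `itertools` in place of Math::Combinatorics, not the original program; `find_words` and the tiny `sample` dictionary are hypothetical. Note that it uses each input letter at most once, so it does not cover the letter reuse described in the question's edit, and the number of permutations grows factorially with the input length:

```python
from itertools import combinations, permutations

def find_words(letters, lookup):
    """Return dictionary words spelled by some ordering of some subset of letters."""
    letters = [c for c in letters.lower() if c.isalpha()]
    found = set()
    # Loop 1: every group size from 1 up to the number of letters.
    for size in range(1, len(letters) + 1):
        # Loop 2: every unordered combination of that size.
        for combo in combinations(letters, size):
            # Loop 3: every ordering of that combination.
            for perm in permutations(combo):
                candidate = "".join(perm)
                if candidate in lookup:
                    found.add(lookup[candidate])
    return sorted(found)

# Hypothetical lookup: normalized key -> dictionary spelling.
sample = {"blade": "blade", "bad": "bad", "lab": "lab", "ball": "ball"}
print(find_words("ABLED", sample))  # ['bad', 'blade', 'lab']
```

"ball" is not reported because the input contains only one L.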
Running this program produced:
Well, the regexp is fairly easy... Then you just need to iterate through the words in the dictionary. E.g., assuming a standard Linux system:
This quickly returns all the words in that file that are made up only of those letters.
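As a rough illustration of the idea in Python (not the original shell command), a character class built from the input letters can be matched against each dictionary word. The function name `words_from_letters` and the sample word list are assumptions; in practice the words would come from a file such as /usr/share/dict/words:

```python
import re

def words_from_letters(letters, words):
    """Keep the words built only from the given letters (letters may repeat)."""
    allowed = "".join(sorted({c for c in letters.lower() if c.isalpha()}))
    if not allowed:
        return []
    # One or more characters, all drawn from the allowed set.
    pattern = re.compile(f"[{allowed}]+", re.IGNORECASE)
    return [w for w in words if pattern.fullmatch(w)]

# Hypothetical sample list standing in for a dictionary file.
sample_words = ["blade", "ball", "bleed", "zebra", "kids"]
print(words_from_letters("ABLED", sample_words))  # ['blade', 'ball', 'bleed']
```

Because a character class allows repeats, this matches the question's edit: "ball" and "bleed" are accepted for the input ABLED.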
As you can see, though, you need a dictionary file that is worth having. In particular, /usr/share/dict/words on my Fedora system contains a bunch of words with all As which may or may not be something you want. So pick your dictionary file carefully.
For min and max length, you can quickly get that as well:
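One way to express the length limits, sketched in Python under the same assumptions as before (hypothetical function name and sample words), is to swap the `+` quantifier for a bounded `{min,max}` quantifier:

```python
import re

def words_with_length(letters, words, min_len, max_len):
    """Keep words built only from the given letters, within the length bounds."""
    allowed = "".join(sorted({c for c in letters.lower() if c.isalpha()}))
    if not allowed:
        return []
    # For letters "ABLED" and bounds 4..5 this builds the pattern [abdel]{4,5}.
    pattern = re.compile(f"[{allowed}]{{{min_len},{max_len}}}", re.IGNORECASE)
    return [w for w in words if pattern.fullmatch(w)]

# Hypothetical sample list standing in for a dictionary file.
sample_words = ["aid", "blade", "ball", "bleed", "addled"]
print(words_with_length("ABLED", sample_words, 4, 5))  # ['blade', 'ball', 'bleed']
```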
Will produce:
Breaking words into pieces and counting the syllables is very language-specific, as has been mentioned in the comments above.
You haven't specified, so I'm assuming each letter in the input can only be used once.
[You have since specified letters in the input can be used more than once, but I'm going to leave this post here in case someone finds it useful.]
The key to doing this efficiently is to sort the letters in the words.
Then it becomes clear that "drab" is in "abracadabra".
And that "abroad" isn't.
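The sorting step can be illustrated with a small Python sketch (the helper name `signature` is an assumption made for this example):

```python
def signature(word):
    """Return the word's letters, lowercased and sorted."""
    return "".join(sorted(c for c in word.lower() if c.isalpha()))

print(signature("abracadabra"))  # aaaaabbcdrr
print(signature("drab"))         # abdr
print(signature("abroad"))       # aabdor
```

Every letter of `abdr` can be picked out of `aaaaabbcdrr` in order, while `aabdor` needs an `o` that is not there.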
Let's call the sorted letters the "signature". Word "B" is in word "A" if you can remove letters from the signature of "A" to get the signature of "B". That's easy to check using a regex pattern.
Or, if we eliminate needless backtracking for efficiency, we get
Now that we know what pattern we want, it's just a matter of building it.
Example:
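A minimal Python sketch of building and applying such a pattern, reconstructed from the description above rather than taken from the original code. The names `signature` and `contains` are hypothetical. For "drab" the naive pattern would be `a.*b.*d.*r`; the version built here uses negated character classes to limit backtracking:

```python
import re

def signature(word):
    """Return the word's letters, lowercased and sorted."""
    return "".join(sorted(c for c in word.lower() if c.isalpha()))

def contains(big, small):
    """True if `small` can be spelled using each letter of `big` at most once."""
    # For "drab" this builds ^[^a]*a[^b]*b[^d]*d[^r]*r
    pattern = "^" + "".join(f"[^{c}]*{c}" for c in signature(small))
    return re.search(pattern, signature(big)) is not None

print(contains("abracadabra", "drab"))    # True
print(contains("abracadabra", "abroad"))  # False
```

Since both signatures are sorted, each letter of the smaller word only has to be found somewhere after the previous match.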
Maybe it would help if you create a separate table with the 26 letters of the alphabet. Then, you would build a query that searches that second table for any letter you defined. It is important that the query ensures that each result is unique.
So, you have a table that contains your words, and you have a many-to-many relation to another table that contains all the letters of the alphabet. You would query on this second table and make the results unique.
You could use the same approach for the number of letters and syllables. So you would make one query that joins all the information you want. Put the right indexes on the database to help performance, make use of appropriate caching and, if it comes to that, you can parallelize searches.
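One possible shape for this, sketched in Python with an in-memory SQLite database. The table names, column names, and sample words are all assumptions, and the sketch simplifies the design by storing each word's distinct letters directly in the linking table instead of keeping a separate 26-row letters table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT NOT NULL);
    CREATE TABLE word_letters (
        word_id INTEGER NOT NULL REFERENCES words(id),
        letter  TEXT NOT NULL,
        PRIMARY KEY (word_id, letter)  -- doubles as the lookup index
    );
""")

# Hypothetical sample data; each word is linked to its distinct letters.
for word in ["blade", "ball", "bleed", "kids", "zebra"]:
    cur = conn.execute("INSERT INTO words (word) VALUES (?)", (word,))
    conn.executemany(
        "INSERT OR IGNORE INTO word_letters (word_id, letter) VALUES (?, ?)",
        [(cur.lastrowid, ch) for ch in word],
    )

def query_words(letters, min_len, max_len):
    """Words within the length bounds that use no letter outside `letters`."""
    allowed = sorted({c for c in letters.lower() if c.isalpha()})
    placeholders = ",".join("?" * len(allowed))
    sql = f"""
        SELECT w.word FROM words w
        WHERE length(w.word) BETWEEN ? AND ?
          AND NOT EXISTS (
              SELECT 1 FROM word_letters wl
              WHERE wl.word_id = w.id
                AND wl.letter NOT IN ({placeholders})
          )
        ORDER BY w.word
    """
    return [row[0] for row in conn.execute(sql, [min_len, max_len, *allowed])]

print(query_words("ABLED", 4, 5))  # ['ball', 'blade', 'bleed']
```

The `NOT EXISTS` subquery returns each word at most once, which takes care of the uniqueness requirement without a `DISTINCT`. A syllable count could be handled the same way as length if it were stored as an extra column on `words`.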