I want to write an algorithm for the following problem scenario
Taking periodic table elements' names, find largest word that can be formed?
The symbols such as Na
, Ne
etc should be regarded as single elements.
This was asked in a reputed company's job interview. Can anybody please help me out solve this.
Sorry, I am very confused by these answers. Surely the obvious answer is to sort the dictionary in alphabetical order by using nested bucket sorts up to some depth, say 8 letters, this should give you order 1 retrieval of words starting with a given sequence of letters up to 8.
Then go through the period tables doing DFS like it is a game tree. At each step you add an element and then check down the list to see if there are any words starting with that combination.
This will be relatively memory intensive, as you probably need at least 6 layers of buckets (alphabetise by the first six letters), but there are typically only 1,000,000 or so words. Since virtually every branch in the game tree will terminate after four letters, youre DFS will search the whole game tree quite fast. If you can form every word in the dictionary, it still must be the case that there are at most the same number of branches as there are words, and in practice you will only be able to get a fraction of the words int he dictionary, so this approach must be efficient.
Generate all possible words - naive approach.
Based on Generate all permutations of all lengths:
You can generate every possible combination, and check against a dictionary if its an acutal word. But it is inefficient and only for if you allow no repetitions of symbols.
Check if word can be built from symbols - still not perfect.
If you allow repetitions you need to check a word against the list of symbols. I propose the following:
It iteratively check if a word prefix is a symbol, then moves forward until the end of word is reached.
Output:
It still is not perfect.
Correct approach
As Makoto says, the prefix-check should return a list of every possible match. The algorithm should make a queue out of those matches, and check all possible paths. It would be kind of like building a graph of prefixes that match the word. And if one builds whole word you are home.
I think it is still fairly easy to correct my 2nd example, but I am out of time for the actual code. I think it's a good place for start.
I think a better way is to check every word in a dictionary and see if it can be made from the names of the elements. To check every permutation of the elements would be harder to program and would be less efficient.
While I agree it is easier to produce the combinations, there are just way too many of them, and as you said tend to infinity if you don't give a limit. The production of the word with the symbols would be slightly more difficult and finicky, but I don't think it would be too difficult.
E.g. when you get a word you could search through the elements looking for elements that could make up your word, then using that set of elements try and fill in the letters from start to finish. Obviously this gets harder with elements that aren't 2 letters and words that are odd in length.
You can actually use a sort of graph. Say you have 'silicon'. You can start with the letter 'S' or 'SI' which are both in the table. From there choose 'SI' because it is closer to your solution. If 'SI' does not lead to your solution you will have to come back and see if 'S' would work.
So it works as a kind of depth first search.
Generating all words and checking if they exist is impractical. There are 118^L words of length L, a too fast growing function. 1,643,032 three symbols words !
The other way round, as suggested by Makoto, is much more efficient. Try to reconstruct every word from a dictionary. There are on the order of 250,000 English words.
Checking a word is straightforward: see if any element symbol matches the start of the word, and continue with the remaining characters.
You must try all possible matches as a match could hide another. (I mean word ABC could be matched by symbol A and then be stuck because BC is not available, but it could be that AB and C are.) So the matching function will be recursive. I don't expect an exponential explosion in this matching process.
For maximum efficiency, you will store the symbols in a trie structure http://en.wikipedia.org/wiki/Trie.
Final hint: as you are just asked to find the longest match, try the words by decreasing lengths and stop at the first match.
Here is a simple solution in Python, without any optimization. Matching is done right to left, to allow printing the sequence of symbols in post-order:
Just want to write down my thoughts, and verified with others who professional with Algorithm, because I didn't see such question in other place.
BTW, I use Java.And here is my pseudocode:
1, Use the element symbol of Periodic Table to form a Array of String: String[] PeriTable.
2, Based on the backtracking thoughts, build a helper function called Backtracking(String result, String[] PeriTable, String[] used);
3, We iterate each element in PeriTable, and check the String it formed.
However,I don't know how to build isEnglishWord(); function.
How to express a given word as a chemical compound? Here's a dynamic programming solution. The important line is "Let progress[i] be the solution to the subproblem for word[:i]". The rest follows.
https://gist.github.com/hickford/750135db633832913703
Ran this on a whole dictionary, the longest compounds were:
NoNRePReSeNTaTiONaLiSm
ThErMoPHOsPHoReSCeNCe
BrONCHoEsOPHAgOsCoPY
HYPErPHOsPHoReSCeNCe
HYPSiBrAcHYCePHAlISm