Given a word W, I want to find all words containing the letters in W from /usr/dict/words. For example, "bat" should return "bat" and "tab" (but not "table").
Here is one solution which involves sorting the input word and matching:
word=$1
sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
while read line
do
sortedLine=`echo $line | grep -o . | sort | tr -d '\n'`
if [ "$sortedWord" == "$sortedLine" ]
then
echo $line
fi
done < /usr/dict/words
Is there a better way? I'd prefer using basic commands (instead of perl/awk etc), but all solutions are welcome!
To clarify, I want to find all permutations of the original word. Addition or deletion of characters is not allowed.
You want to find words containing only a given set of characters. A regex for that would be:
So, you could do:
The '^' matches the beginning of the line; '$' is for the end of the line. This means we must have an exact match, not just a partial match (e.g. "table").
'[' and ']' are used to define a group of possible characters allowed in one character space of the input file. We use this to find words in /usr/dict/word that only contain the characters in $W.
The '*' repeats the previous character (the '[...]' rule), which says to find a word of any length, where all the characters are in $W.
Use it like this:
bla.pl < /usr/dict/words searchword
here's an awk implementation. It finds the words with those letters in "W".
output
Update: change of algorithm, using sorting
output
This utility might interest you:
...gives
bat
andtab
only.The original author seems to not be around any more, but you can find information at http://packages.qa.debian.org/a/an.html (even if you don't want to use it itself, the source might be worth a look).
So we have the following:
n = length of input word
L = lines in dictionary file
If n tends to be small and L tends to be huge, might we be better off finding all permutations of the input word and looking for those, rather than doing something (like sorting) to all L lines of the dictionary file? (Actually, since finding all permutations of a word is O(n!), and we have to run through the entire dictionary file once for each word, maybe not, but I wrote the code anyway.)
This is Perl - I know you wanted command-line operations but I don't have a way to do that in shell script that's not super-hacky:
Here's a shell solution. The best algorithm seems to be #4. It filters out all words that are of incorrect length. Then, it sums the words using a simple substitution cipher (a=1, b=2, A=27, ...). If the sums match, then it will actually do the original sort and compare. On my system, it can churn through ~235k words looking for "bat" in just under 1/2 second. I'm providing all of my solutions so you can see the different approaches.
Update: not shown, but I also tried putting the sum inside the first bin of the histogram approach I tried, but it was even slower than the histograms without. I thought it would function as a short circuit, but it didn't work.
Update2: I tried the awk solution and it runs in about 1/3 the time of my best shell solution or ~0.126s versus ~0.490s. The perl solution runs ~1.1s.