Question:
I was looking for a module, regex, or anything else that might apply to this problem.
How can I programmatically parse the string and create known English and/or Spanish words, given that I have a dictionary table against which I can check each permutation of the algorithm's randomization for a match?
Given a group of characters: EBLAIDL KDIOIDSI ADHFWB
The program should return: BLADE
AID
KID
KIDS
FIDDLE
HOLA
etc....
I also want to be able to define the minimum and maximum word length, as well as the number of syllables.
The input length doesn't matter, it must be only letters, and punctuation doesn't matter.
Thanks for any help
EDIT
Letters in the input string can be reused.
For example, if the input is: ABLED
then the output may contain: BALL
or BLEED
Answer 1:
You haven't specified, so I'm assuming each letter in the input can only be used once.
[You have since specified letters in the input can be used more than once, but I'm going to leave this post here in case someone finds it useful.]
The key to doing this efficiently is to sort the letters in the words.
abracadabra => AAAAABBCDRR
abroad      => AABDOR
drab        => ABDR
Then it becomes clear that "drab" is in "abracadabra".
abracadabra => AAAAABBCDRR
drab        => A    B  DR
And that "abroad" isn't.
abracadabra => AAAAABBCD RR
abroad      => AA   B  DOR
Let's call the sorted letters the "signature". Word "B" is in word "A" if you can remove letters from the signature of "A" to get the signature of "B". That's easy to check using a regex pattern.
sig('drab') =~ /^A?A?A?A?A?B?B?C?D?R?R?\z/
Or, if we eliminate needless backtracking for efficiency, we get
sig('drab') =~ /^A?+A?+A?+A?+A?+B?+B?+C?+D?+R?+R?+\z/
Now that we know what pattern we want, it's just a matter of building it.
use strict;
use warnings;
use feature qw( say );
sub sig { join '', sort grep /^\pL\z/, split //, uc $_[0] }
my $key = shift(@ARGV);
my $pat = sig($key);
$pat =~ s/.\K/?+/sg;
my $re = qr/^(?:$pat)\z/s;
my $shortest = 9**9**9;
my $longest = 0;
my $count = 0;
while (my $word = <>) {
    chomp($word);
    next if !length($word);  # My dictionary starts with a blank line!!
    next if sig($word) !~ /$re/;
    say $word;
    ++$count;
    $shortest = length($word) if length($word) < $shortest;
    $longest  = length($word) if length($word) > $longest;
}
say "Words: $count";
if ($count) {
    say "Shortest: $shortest";
    say "Longest: $longest";
}
Example:
$ perl script.pl EBLAIDL /usr/share/dict/words
A
Abe
Abel
Al
...
libel
lid
lie
lied
Words: 117
Shortest: 1
Longest: 6
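The signature test above amounts to multiset containment: a word fits if it needs no letter more often than the input supplies it. As a minimal sketch of that same idea in Python rather than Perl (the function name `fits` is hypothetical, and it assumes each input letter may be used only once, as this answer does):

```python
from collections import Counter


def fits(word: str, letters: str) -> bool:
    """True if `word` can be spelled using each input letter at most once."""
    need = Counter(c for c in word.upper() if c.isalpha())
    have = Counter(c for c in letters.upper() if c.isalpha())
    # Counter subtraction keeps only positive counts, so an empty result
    # means no letter is needed more often than it is available.
    return not (need - have)


print(fits("drab", "abracadabra"))    # True
print(fits("abroad", "abracadabra"))  # False: there is no O to use
```

This avoids building a regex per input, at the cost of counting the letters of every dictionary word.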
Answer 2:
Well, the regexp is fairly easy. Then you just need to iterate through the words in the dictionary. E.g., assuming a standard Linux system:
# perl -n -e 'print if (/^[EBLAIDL]+$/);' /usr/share/dict/words
This quickly returns all the words in that file that consist only of those letters.
A
AA
AAA
AAAA
AAAAAA
AAAL
AAE
AAEE
AAII
AB
...
As you can see, though, you need a dictionary file that is worth
having. In particular, /usr/share/dict/words on my Fedora system
contains a bunch of words with all As which may or may not be
something you want. So pick your dictionary file carefully.
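Because the character class lets letters repeat, this approach matches the "letters can be reused" rule from the question's edit. A rough Python sketch of the same filter, with the minimum and maximum length limits folded in, might look like this (the name `filter_words` and its parameters are illustrative assumptions, not part of the original answer):

```python
def filter_words(words, letters, min_len=1, max_len=None):
    """Yield words built only from `letters`; letters may be reused."""
    allowed = set(letters.upper())
    for word in words:
        stripped = word.strip()
        upper = stripped.upper()
        if len(upper) < min_len:
            continue
        if max_len is not None and len(upper) > max_len:
            continue
        # Set containment plays the role of the /^[...]+$/ character class.
        if upper and set(upper) <= allowed:
            yield stripped


sample = ["ball", "bleed", "hola", "a", "labeled"]
print(list(filter_words(sample, "ABLED", min_len=2, max_len=5)))
# ['ball', 'bleed']
```

Unlike the one-liner, this compares case-insensitively, so it also works with a lowercase dictionary file.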
For min and max length, you can quickly get that as well:
$min = 9999;
$max = -1;
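while(<>) {
    # Note: this pattern has no leading ^, so it matches any word that merely
    # ends in these letters; /^[EBLAIDL]+$/ would match whole words only.
    if (/[EBLAIDL]+$/) {
        print;
        chomp;
        if (length($_) > $max) {
            $max = length($_);
            $maxword = $_;
        }
        if (length($_) < $min) {
            $min = length($_);
            $minword = $_;
        }
    }
}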
print "longest: $maxword\n";
print "shortest: $minword\n";
Will produce:
ZI
ZMRI
ZWEI
longest: TANSTAAFL
shortest: A
Breaking words into pieces and counting the syllables is very language-specific, as has been mentioned in the comments above.
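To illustrate how language-specific this is, here is a deliberately crude, English-only sketch in Python that estimates syllables by counting runs of vowels. The function name and the silent-"e" rule are assumptions made for illustration; the result is wrong for many English words and is not meaningful for Spanish.

```python
import re

VOWELS = "aeiouy"


def estimate_syllables(word: str) -> int:
    """Very rough English-only estimate: count runs of consecutive vowels."""
    w = word.lower()
    count = len(re.findall(r"[aeiouy]+", w))
    # A trailing "e" is usually silent ("blade"), except in a
    # consonant + "le" ending ("fiddle"), which forms its own syllable.
    consonant_le = len(w) > 2 and w.endswith("le") and w[-3] not in VOWELS
    if count > 1 and w.endswith("e") and not consonant_le:
        count -= 1
    return max(count, 1)


for w in ("blade", "fiddle", "kid", "hola"):
    print(w, estimate_syllables(w))
```

A pronunciation dictionary would be a more reliable source of syllable counts than any spelling rule of this kind.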
Answer 3:
The only way I can imagine this would work would be to parse through all possible combinations of letters, and compare them against the dictionary. The fastest way to compare them against a dictionary is to turn that dictionary into a hash. That way, you can quickly look up whether the word was a valid word.
I key my dictionary by lower casing all letters in the dictionary word and then removing any non-alpha characters just to be on the safe side. For the value, I'll store the actual dictionary word. For example:
cant => "can't",
google => "Google",
That way, I can display the correctly spelled word.
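The keying scheme can be sketched in Python as follows. The helper name `build_lookup` is hypothetical, and the sketch assumes ASCII letters only, which is narrower than Perl's `[:alpha:]` class:

```python
import re


def build_lookup(words):
    """Map a normalized key (lowercase, letters only) to the original spelling."""
    lookup = {}
    for word in words:
        word = word.strip()
        key = re.sub(r"[^a-z]", "", word.lower())
        if key:
            # A later word with the same key overwrites an earlier one.
            lookup[key] = word
    return lookup


lookup = build_lookup(["can't", "Google", "blade"])
print(lookup["cant"])    # can't
print(lookup["google"])  # Google
```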
I found Math::Combinatorics which looked pretty good, but wasn't quite working the way I hoped. You give it a list of letters, and it will return all combinations of those letters in the number of letters you specify. Thus, I thought all I had to do was convert the letters into a list of individual letters, and simply loop through all possible combinations!
No... That gives me all unordered combinations. What I then had to do was with each combination, list all possible permutations of those letters. Blah! Ptooy! Yech!
So, the infamous looping in a loop. Actually, three loops.
* The outer loop simply counts up through the combination sizes, from 1 to the number of letters in the word.
* The next finds all unordered combinations of each of those letter groups.
* Finally, the last one takes all unordered combinations and returns a list of permutations from those combinations.
Now, I can finally take those permutations of letters and compare it against my dictionary of words. Surprisingly, the program ran much faster than I expected considering it had to turn a 235,886 word dictionary into a hash, then loop through a triple decker loop to find all permutations of all combinations of all possible number of letters. The whole program ran in less than two seconds.
#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);
use autodie;
use Data::Dumper;
use Math::Combinatorics;
use constant {
LETTERS => "EBLAIDL",
DICTIONARY => "/usr/share/dict/words",
};
#
# Create Dictionary Hash
#
open my $dict_fh, "<", DICTIONARY;
my %dictionary;
foreach my $word (<$dict_fh>) {
    chomp $word;
    # /g so that every non-alpha character is removed, not just the first
    (my $key = $word) =~ s/[^[:alpha:]]//g;
    $dictionary{lc $key} = $word;
}
#
# Now take the letters and create a Perl list of them.
#
my @letter_list = split // => LETTERS;
my %valid_word_hash;
#
# Outer Loop: This is a range from one letter combinations to the
# maximum letters combination
#
foreach my $num_of_letters (1..scalar @letter_list) {
    #
    # Now we generate a reference to a list of lists of all letter
    # combinations of $num_of_letters long. From there, we need to
    # take the Permutations of all those letters.
    #
    foreach my $letter_list_ref (combine($num_of_letters, @letter_list)) {
        my @letter_list = @{$letter_list_ref};
        # For each combination of letters $num_of_letters long,
        # we now generate a permutation of all of those letter
        # combinations.
        #
        foreach my $word_letters_ref (permute(@letter_list)) {
            my $word = join "" => @{$word_letters_ref};
            #
            # This $word is just a possible candidate for a word.
            # We now have to compare it to the words in the dictionary
            # to verify it's a word
            #
            $word = lc $word;
            if (exists $dictionary{$word}) {
                my $dictionary_word = $dictionary{$word};
                $valid_word_hash{$word} = $dictionary_word;
            }
        }
    }
}
#
# I got lazy here... Just dumping out the list of actual words.
# You need to go through this list to find your longest and
# shortest words. Number of syllables? That's trickier, you could
# see if you can divide on CVC and CVVC divides where C = consonant
# and V = vowel.
#
say join "\n", sort keys %valid_word_hash;
Running this program produced:
$ ./test.pl | column
a       al      balei   bile    del     i       lai
ab      alb     bali    bill    delia   iba     laid
abdiel  albe    ball    billa   dell    ibad    lea
abe     albi    balled  billed  della   id      lead
abed    ale     balli   blad    di      ida     leal
abel    alible  be      blade   dial    ide     led
abide   all     bea     blae    dib     idea    leda
abie    alle    bead    d       die     ideal   lei
able    allie   beal    da      dieb    idle    leila
ad      allied  bed     dab     dill    ie      lelia
ade     b       beid    dae     e       ila     li
adib    ba      bel     dail    ea      ill     liable
adiel   bad     bela    dal     ed      l       libel
ae      bade    beld    dale    el      la      lid
ai      bae     belial  dali    elb     lab     lida
aid     bail    bell    dalle   eld     label   lide
aide    bal     bella   de      eli     labile  lie
aiel    bald    bid     deal    elia    lad     lied
ail     baldie  bide    deb     ell     lade    lila
aile    bale    bield   debi    ella    ladle   lile
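For comparison, the same three-loop idea can be sketched in Python with the standard `itertools` module in place of Math::Combinatorics. The function name `find_words` and the tiny stand-in dictionary are assumptions for illustration; a real run would load the dictionary file into the lookup first.

```python
from itertools import combinations, permutations


def find_words(letters, lookup):
    """Return {key: dictionary spelling} for every arrangement found in `lookup`."""
    found = {}
    pool = letters.lower()
    # Outer loop: candidate lengths from 1 up to the number of input letters.
    for n in range(1, len(pool) + 1):
        # Middle loop: every unordered choice of n letters.
        for combo in combinations(pool, n):
            # Inner loop: every ordering of that choice.
            for perm in permutations(combo):
                candidate = "".join(perm)
                if candidate in lookup:
                    found[candidate] = lookup[candidate]
    return found


# Tiny stand-in for the real dictionary hash (normalized key -> spelling).
lookup = {"blade": "blade", "lied": "lied", "ball": "ball",
          "abe": "Abe", "hola": "hola"}
print(sorted(find_words("EBLAIDL", lookup)))
# ['abe', 'ball', 'blade', 'lied']
```

In Python the two inner loops could be collapsed into a single `permutations(pool, n)` call, which yields the same candidates. Either way, the number of candidates grows factorially with the input length, so this suits short inputs only.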
Answer 4:
Maybe it would help to create a separate table with the 26 letters of the alphabet. Then you would build a query that searches this second table for any letter you defined. It is important that the query ensures each result is unique.
So, you have a table that contains your words, with a many-to-many relation to another table that contains all the letters of the alphabet. You would query on this second table and make the results unique.
You could use the same approach for the number of letters and syllables, so that one query joins all the information you want. Put the right indexes on the database to help performance, make use of appropriate caching and, if it comes to that, parallelize the searches.
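One possible reading of this idea, sketched with Python's built-in `sqlite3` module, is shown here. The schema, table names, and helper names are all hypothetical: the link table stores each letter directly instead of referencing a separate 26-row letters table, and letters are treated as reusable.

```python
import sqlite3


def build_db(words):
    """Create an in-memory database: one row per word, one per distinct letter."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE word_letters (
            word_id INTEGER REFERENCES words(id),
            letter  TEXT,
            PRIMARY KEY (word_id, letter)
        );
    """)
    for word in words:
        cur = conn.execute("INSERT INTO words (word) VALUES (?)", (word,))
        conn.executemany(
            "INSERT INTO word_letters (word_id, letter) VALUES (?, ?)",
            [(cur.lastrowid, ch) for ch in sorted(set(word.upper()))],
        )
    return conn


def search(conn, letters, min_len=1, max_len=100):
    """Words whose letters all come from `letters`, within the length limits."""
    allowed = sorted(set(letters.upper()))
    marks = ",".join("?" * len(allowed))
    sql = f"""
        SELECT w.word FROM words AS w
        WHERE LENGTH(w.word) BETWEEN ? AND ?
          AND NOT EXISTS (
              SELECT 1 FROM word_letters AS wl
              WHERE wl.word_id = w.id AND wl.letter NOT IN ({marks})
          )
        ORDER BY w.word
    """
    return [row[0] for row in conn.execute(sql, (min_len, max_len, *allowed))]


conn = build_db(["ball", "bleed", "hola", "kid"])
print(search(conn, "ABLED", min_len=2, max_len=5))  # ['ball', 'bleed']
```

The composite primary key on `word_letters` doubles as the index the subquery needs; a syllable count could be stored as an extra column on `words` and filtered in the same query.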