How can I generate a list of words from a group of

Posted 2019-05-04 10:06

我命由我不由天

Question:

I was looking for a module, regex, or anything else that might apply to this problem.

How can I programmatically parse the string and build known English and/or Spanish words, given that I have a dictionary table against which I can check each permutation the algorithm generates for a match?

Given a group of characters: EBLAIDL KDIOIDSI ADHFWB

The program should return: BLADE AID KID KIDS FIDDLE HOLA etc....

I also want to be able to define the minimum and maximum word length, as well as the number of syllables.

The input length doesn't matter, it must be only letters, and punctuation doesn't matter.

Thanks for any help

EDIT
Letters in the input string can be reused.

For example, if the input is: ABLED then the output may contain: BALL or BLEED

Answer 1:

You haven't specified, so I'm assuming each letter in the input can only be used once.

[You have since specified letters in the input can be used more than once, but I'm going to leave this post here in case someone finds it useful.]

The key to doing this efficiently is to sort the letters in the words.

abracadabra => AAAAABBCDRR
abroad      => AABDOR
drab        => ABDR

Then it becomes clear that "drab" is in "abracadabra".

abracadabra => AAAAABBCDRR
drab        => A    B  DR

And that "abroad" isn't.

abracadabra => AAAAABBCD RR
abroad      => AA   B  DOR

Let's call the sorted letters the "signature". Word "B" is in word "A" if you can remove letters from the signature of "A" to get the signature of "B". That's easy to check using a regex pattern.

sig('drab') =~ /^A?A?A?A?A?B?B?C?D?R?R?\z/

Or, if we eliminate needless backtracking for efficiency, we get

sig('drab') =~ /^A?+A?+A?+A?+A?+B?+B?+C?+D?+R?+R?+\z/

Now that we know what pattern we want, it's just a matter of building it.

use strict;
use warnings;
use feature qw( say );

sub sig { join '', sort grep /^\pL\z/, split //, uc $_[0] }

my $key = shift(@ARGV);

my $pat = sig($key);
$pat =~ s/.\K/?+/sg;
my $re = qr/^(?:$pat)\z/s;

my $shortest = 9**9**9;
my $longest  = 0;
my $count    = 0;
while (my $word = <>) {
   chomp($word);
   next if !length($word);  # My dictionary starts with a blank line!! 
   next if sig($word) !~ /$re/;
   say $word;
   ++$count;
   $shortest = length($word) if length($word) < $shortest;
   $longest  = length($word) if length($word) > $longest;
}

say "Words:    $count";
if ($count) {
   say "Shortest: $shortest";
   say "Longest:  $longest";
}

Example:

$ perl script.pl EBLAIDL /usr/share/dict/words
A
Abe
Abel
Al
...
libel
lid
lie
lied

Words:    117
Shortest: 1
Longest:  6
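
The containment test described above can also be written without a regex, by comparing letter counts directly. The following is only a minimal sketch of that alternative; the helper names letter_counts and contained_in are made up for illustration and are not part of the script above.

```perl
use strict;
use warnings;

# Count how many times each letter occurs in a string
# (case-insensitive, non-letters ignored).
sub letter_counts {
    my ($text) = @_;
    my %count;
    $count{$_}++ for grep { /^\pL\z/ } split //, uc $text;
    return \%count;
}

# True if $word can be spelled using each letter of $pool at most once.
sub contained_in {
    my ($word, $pool) = @_;
    my $need = letter_counts($word);
    my $have = letter_counts($pool);
    for my $letter (keys %$need) {
        return 0 if $need->{$letter} > ($have->{$letter} // 0);
    }
    return 1;
}

print contained_in('drab',   'abracadabra') ? "yes\n" : "no\n";   # yes
print contained_in('abroad', 'abracadabra') ? "yes\n" : "no\n";   # no
```

Like the regex version, this assumes each input letter may be used only once.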

Answer 2:

Well, the regexp is fairly easy... Then you just need to iterate through the words in the dictionary. E.g., assuming a standard Linux system:

# perl -n -e 'print if (/^[EBLAIDL]+$/);' /usr/share/dict/words

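This will quickly return all the words in that file containing those and only those letters. Note that the pattern is case-sensitive as written, so it only matches upper-case entries; add the /i modifier to match lower-case words as well.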

A
AA
AAA
AAAA
AAAAAA
AAAL
AAE
AAEE
AAII
AB
...

As you can see, though, you need a dictionary file that is worth having. In particular, /usr/share/dict/words on my Fedora system contains a bunch of words with all As which may or may not be something you want. So pick your dictionary file carefully.
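
The character-class approach can be generalized so that the class is built from the input string rather than typed by hand, and so that it also enforces a length range. This is a rough sketch under a few assumptions: the function name reusable_letters_re and its parameters are invented here, letters may be reused (as the question's edit says), and matching is made case-insensitive.

```perl
use strict;
use warnings;

# Build a case-insensitive pattern matching words made up only of the
# letters found in $letters (each may be reused), with a length between
# $min and $max. The name and parameters are illustrative.
sub reusable_letters_re {
    my ($letters, $min, $max) = @_;
    my %seen;
    # Keep each distinct letter once; spaces and punctuation are dropped.
    my $class = join '', grep { /^\pL\z/ && !$seen{$_}++ } split //, uc $letters;
    die "no letters given\n" if !length $class;
    my $pat = sprintf '^[%s]{%d,%d}\z', $class, $min, $max;
    return qr/$pat/i;
}

my $re = reusable_letters_re('EBLAIDL KDIOIDSI ADHFWB', 3, 6);
for my $word (qw( blade Kids hola fiddle a bleeding zebra )) {
    print "$word\n" if $word =~ $re;
}
```

With the sample words above, this prints blade, Kids, hola and fiddle; "a" is too short, and "bleeding" and "zebra" contain letters that are not in the input.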

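For min and max length, you can quickly get that as well:

my $min = 9999;
my $max = -1;
my ($minword, $maxword) = ('', '');
while (<>) {
   # Note: there is no leading ^ here, so this matches any word that merely
   # ends in these letters; use /^[EBLAIDL]+$/ to require the whole word.
   if (/[EBLAIDL]+$/) {
      print;
      chomp;
      if (length($_) > $max) {
         $max = length($_);
         $maxword = $_;
      }
      if (length($_) < $min) {
         $min = length($_);
         $minword = $_;
      }
   }
}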

print "longest: $maxword\n";
print "shortest: $minword\n";

Will produce:

ZI
ZMRI
ZWEI
longest: TANSTAAFL
shortest: A

Breaking words into pieces and counting syllables is very language-specific, as has been mentioned in the comments above.
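
To give a sense of what such a language-specific rule might look like, here is a very rough sketch for English only. It counts runs of vowels and discounts a trailing silent "e". The function name estimate_syllables is made up, and the heuristic is an assumption that will be wrong for many words (and for Spanish); a real solution would need a pronunciation dictionary or a proper hyphenation library.

```perl
use strict;
use warnings;

# Crude English syllable estimate: count runs of consecutive vowels,
# then discount a silent trailing "e" (but not a consonant + "le" ending,
# as in "fiddle"). A heuristic only, not a real syllabifier.
sub estimate_syllables {
    my ($word) = @_;
    my $w = lc $word;
    $w =~ s/[^a-z]//g;
    return 0 if !length $w;

    my $count = () = $w =~ /[aeiouy]+/g;
    $count--
        if $count > 1
        && $w =~ /[^aeiouy]e\z/
        && $w !~ /[^aeiouy]le\z/;

    return $count || 1;    # every non-empty word gets at least one
}

printf "%-8s %d\n", $_, estimate_syllables($_) for qw( kid blade fiddle hola );
```

A filter on the number of syllables could then be applied to each dictionary match in the same way as the length checks.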

Answer 3:

The only way I can imagine this would work would be to parse through all possible combinations of letters, and compare them against the dictionary. The fastest way to compare them against a dictionary is to turn that dictionary into a hash. That way, you can quickly look up whether the word was a valid word.

I key my dictionary by lower casing all letters in the dictionary word and then removing any non-alpha characters just to be on the safe side. For the value, I'll store the actual dictionary word. For example:

cant =>   "can't",
google => "Google",

That way, I can display the correctly spelled word.
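
The keying scheme can be sketched on its own as follows. The helper name dictionary_key and the three sample words are only for illustration; the full program further down reads the real dictionary file instead.

```perl
use strict;
use warnings;

# Normalize a word into a lookup key: lower-case, letters only.
sub dictionary_key {
    my ($word) = @_;
    (my $key = lc $word) =~ s/[^[:alpha:]]//g;
    return $key;
}

# Map each normalized key back to the original spelling.
my %dictionary;
$dictionary{ dictionary_key($_) } = $_ for ("can't", "Google", "blade");

for my $candidate (qw( cant google xyzzy )) {
    my $key   = dictionary_key($candidate);
    my $shown = exists $dictionary{$key} ? $dictionary{$key} : '(not in dictionary)';
    print "$candidate => $shown\n";
}
```

Looking up a candidate is then a single hash access, regardless of how large the dictionary is.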

I found Math::Combinatorics which looked pretty good, but wasn't quite working the way I hoped. You give it a list of letters, and it will return all combinations of those letters in the number of letters you specify. Thus, I thought all I had to do was convert the letters into a list of individual letters, and simply loop through all possible combinations!

No... That gives me all unordered combinations. What I then had to do was with each combination, list all possible permutations of those letters. Blah! Ptooy! Yech!

So, the infamous looping in a loop. Actually, three loops:

* The outer loop simply counts through the combination sizes, from 1 to the number of letters in the word.
* The next finds all unordered combinations of letters of that size.
* Finally, the last one takes each unordered combination and returns a list of its permutations.

Now, I can finally take those permutations of letters and compare them against my dictionary of words. Surprisingly, the program ran much faster than I expected, considering it had to turn a 235,886 word dictionary into a hash, then run a triple-decker loop to find all permutations of all combinations of every possible number of letters. The whole program ran in less than two seconds.

#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);
use autodie;
use Data::Dumper;

use Math::Combinatorics;

use constant {
    LETTERS => "EBLAIDL",
    DICTIONARY => "/usr/share/dict/words",
};

#
# Create Dictionary Hash
#

open my $dict_fh, "<", DICTIONARY;
my %dictionary;
while (my $word = <$dict_fh>) {
    chomp $word;
    # Strip every non-alpha character (note the /g) to form the key
    (my $key = $word) =~ s/[^[:alpha:]]//g;
    $dictionary{lc $key} = $word;
}
close $dict_fh;

#
# Now take the letters and create a Perl list of them.
#

my @letter_list =  split  // => LETTERS;
my %valid_word_hash;

#
# Outer Loop: This is a range from one letter combinations to the
# maximum letters combination
#
foreach my $num_of_letters (1..scalar @letter_list) {

    #
    # Now we generate a reference to a list of lists of all letter
    # combinations of $num_of_letters long. From there, we need to
    # take the Permutations of all those letters.
    #
    foreach my $letter_list_ref (combine($num_of_letters, @letter_list)) {
        my @letter_list = @{$letter_list_ref};

        # For each combination of letters $num_of_letters long,
        # we now generate every permutation of the letters in
        # that combination.
        #
        foreach my $word_letters_ref (permute(@letter_list)) {
            my $word = join "" => @{$word_letters_ref};

            #
            # This $word is just a possible candidate for a word.
            # We now have to compare it to the words in the dictionary
            # to verify it's a word
            #
            $word = lc $word;
            if (exists $dictionary{$word}) {
                my $dictionary_word = $dictionary{$word};
                $valid_word_hash{$word} = $dictionary_word;
            }
        }
    }
}

#
# I got lazy here... Just dumping out the list of actual words.
# You need to go through this list to find your longest and
# shortest words. Number of syllables? That's trickier, you could
# see if you can divide on CVC and CVVC divides where C = consonant
# and V = vowel.
#
say join "\n", sort keys %valid_word_hash;

Running this program produced:

$ ./test.pl | column
a          al             balei          bile           del            i              lai
ab         alb            bali           bill           delia          iba            laid
abdiel     albe           ball           billa          dell           ibad           lea
abe        albi           balled         billed         della          id             lead
abed       ale            balli          blad           di             ida            leal
abel       alible         be             blade          dial           ide            led
abide      all            bea            blae           dib            idea           leda
abie       alle           bead           d              die            ideal          lei
able       allie          beal           da             dieb           idle           leila
ad         allied         bed            dab            dill           ie             lelia
ade        b              beid           dae            e              ila            li
adib       ba             bel            dail           ea             ill            liable
adiel      bad            bela           dal            ed             l              libel
ae         bade           beld           dale           el             la             lid
ai         bae            belial         dali           elb            lab            lida
aid        bail           bell           dalle          eld            label          lide
aide       bal            bella          de             eli            labile         lie
aiel       bald           bid            deal           elia           lad            lied
ail        baldie         bide           deb            ell            lade           lila
aile       bale           bield          debi           ella           ladle          lile

Answer 4:

Maybe it would help if you create a separate table with the 26 letters of the alphabet. Then, you would build a query that searches the second table for any letter you defined. It is important that the query ensures that each result is unique.

So, you have a table that contains your words, and you have a many-to-many relation to another table that contains all the letters of the alphabet. You would query on this second table and make the results unique.

You could use the same approach for the number of letters and syllables. So you would make one query that joins all the information you want. Put the right indexes on the database to help performance, make use of appropriate caching and, if it comes to that, parallelize the searches.