How do I determine if a random string sounds like

2019-01-17 20:48发布

I have an algorithm that generates strings based on a list of input words. How do I separate only the strings that sounds like English words? ie. discard RDLO while keeping LORD.

EDIT: To clarify, they do not need to be actual words in the dictionary. They just need to sound like English. For example KEAL would be accepted.

13条回答
一纸荒年 Trace。
2楼-- · 2019-01-17 20:51

I would probably evaluate each word using a SOUNDEX algorithm against a database of english words. If you're doing this on a SQL-server it should be pretty easy to setup a database containing a list of most english words (using a freely available dictionary), and MSSQL server has SOUNDEX implemented as an available search-algorithm.

Obviously you can implement this yourself if you want, in any language - but it might be quite a task.

This way you'd get an evaluation of how much each word sounds like an existing english word, if any, and you could setup some limits for how low you'd want to accept results. You'd probably want to consider how to combine results for multiple words, and you would probably tweak the acceptance-limits based on testing.

查看更多
不美不萌又怎样
3楼-- · 2019-01-17 20:51

I'd suggest looking at the phi test and index of coincidence. http://www.threaded.com/cryptography2.htm

查看更多
姐就是有狂的资本
4楼-- · 2019-01-17 20:53

I'd be tempted to run the soundex algorithm over a dictionary of English words and cache the results, then soundex your candidate string and match against the cache.

Depending on performance requirements, you could work out a distance algorithm for soundex codes and accept strings within a certain tolerance.

Soundex is very easy to implement - see Wikipedia for a description of the algorithm.

An example implementation of what you want to do would be:

def soundex(name, len=4):
    digits = '01230120022455012623010202'
    sndx = ''
    fc = ''

    for c in name.upper():
        if c.isalpha():
            if not fc: fc = c
            d = digits[ord(c)-ord('A')]
            if not sndx or (d != sndx[-1]):
                sndx += d

    sndx = fc + sndx[1:]
    sndx = sndx.replace('0','')
    return (sndx + (len * '0'))[:len]

real_words = load_english_dictionary()
soundex_cache = [ soundex(word) for word in real_words ]

if soundex(candidate) in soundex_cache:
    print "keep"
else:
    print "discard"

Obviously you'll need to provide an implementation of read_english_dictionary.

EDIT: Your example of "KEAL" will be fine, since it has the same soundex code (K400) as "KEEL". You may need to log rejected words and manually verify them if you want to get an idea of failure rate.

查看更多
贼婆χ
5楼-- · 2019-01-17 20:55

I'd suggest a few simple rules and standard pairs and triplets would be good.

For example, english sounding words tend to follow the pattern of vowel-consonant-vowel, apart from some dipthongs and standard consonant pairs (e.g. th, ie and ei, oo, tr). With a system like that you should strip out almost all words that don't sound like they could be english. You'd find on closer inspection that you will probably strip out a lot of words that do sound like english as well, but you can then start adding rules that allow for a wider range of words and 'train' your algorithm manually.

You won't remove all false negatives (e.g. I don't think you could manage to come up with a rule to include 'rythm' without explicitly coding in that rythm is a word) but it will provide a method of filtering.

I'm also assuming that you want strings that could be english words (they sound reasonable when pronounced) rather than strings that are definitely words with an english meaning.

查看更多
beautiful°
6楼-- · 2019-01-17 20:56

The easy way with Bayesian filters (Python example from http://sebsauvage.net/python/snyppets/#bayesian)

from reverend.thomas import Bayes
guesser = Bayes()
guesser.train('french','La souris est rentrée dans son trou.')
guesser.train('english','my tailor is rich.')
guesser.train('french','Je ne sais pas si je viendrai demain.')
guesser.train('english','I do not plan to update my website soon.')

>>> print guesser.guess('Jumping out of cliffs it not a good idea.')
[('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]

>>> print guesser.guess('Demain il fera très probablement chaud.')
[('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]
查看更多
▲ chillily
7楼-- · 2019-01-17 21:00

You can build a markov-chain of a huge english text.

Afterwards you can feed words into the markov chain and check how high the probability is that the word is english.

See here: http://en.wikipedia.org/wiki/Markov_chain

At the bottom of the page you can see the markov text generator. What you want is exactly the reverse of it.

In a nutshell: The markov-chain stores for each character the probabilities of which next character will follow. You can extend this idea to two or three characters if you have enough memory.

查看更多
登录 后发表回答