Fast way to break a joined string of words into individual words

Published 2020-02-29 10:52

Say I had this string:

hellohowareyou

Is there a fast way to separate this into individual words, so the end result is hello how are you? I can think of several ways, but they would be EXTREMELY slow (first I would need to check each letter against a dictionary, see which letters compose a word, and there would probably be multiple combinations; then I would need to decide the most likely combination, etc.)

3 Answers
家丑人穷心不美
Reply #2 · 2020-02-29 11:10

Here's some code that does a recursive brute-force search. It puts the word list into a set, so the lookups are quite fast: the examples below run in less than 1 second on my old 2 GHz machine with 2GB of RAM. However, splitting longer sequences than the examples I've used will certainly take longer, mostly because there are so many possible combinations. To weed out the meaningless results you will either need to do that manually, or use software that can do natural language processing.

#!/usr/bin/env python3

''' Separate words

    Use dictionary lookups to recursively split a string into separate words

    See http://stackoverflow.com/q/41241216/4014959

    Written by PM 2Ring 2016.12.21
'''

# Sowpods wordlist from http://www.3zsoftware.com/download/

fname = 'scrabble_wordlist_sowpods.txt'
# set('AI') seeds the set with the one-letter words 'A' and 'I',
# which the SOWPODS list doesn't contain.
allwords = set('AI')
with open(fname) as f:
    for w in f:
        allwords.add(w.strip())

def parse(data, result=None):
    if result is None:
        result = []
    if data in allwords:
        result.append(data)
        yield result[::-1]
    else:
        for i in range(1, len(data)):
            first, last = data[:i], data[i:]
            if last in allwords:
                yield from parse(first, result + [last])

# Test

data = (
    'HELLOHOWAREYOU',
    'THISEXAMPLEWORKSWELL',
    'ISTHEREAFASTWAY',
    'ONE',
    'TWOWORDS',
)

for s in data:
    print(s)
    for u in parse(s):
        print(u)
    print('')    

output

HELLOHOWAREYOU
['HELL', 'OHO', 'WARE', 'YOU']
['HELLO', 'HO', 'WARE', 'YOU']
['HELLO', 'HOW', 'ARE', 'YOU']
['HELL', 'OH', 'OW', 'ARE', 'YOU']
['HELLO', 'HOW', 'A', 'RE', 'YOU']
['HELL', 'OH', 'OW', 'A', 'RE', 'YOU']

THISEXAMPLEWORKSWELL
['THIS', 'EXAMPLE', 'WORK', 'SWELL']
['THIS', 'EX', 'AMPLE', 'WORK', 'SWELL']
['THIS', 'EXAMPLE', 'WORKS', 'WELL']
['THIS', 'EX', 'AMPLE', 'WORKS', 'WELL']

ISTHEREAFASTWAY
['I', 'ST', 'HER', 'EA', 'FAS', 'TWAY']
['IS', 'THERE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HERE', 'A', 'FAS', 'TWAY']
['IS', 'THE', 'RE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HE', 'RE', 'A', 'FAS', 'TWAY']
['I', 'ST', 'HER', 'EA', 'FAST', 'WAY']
['IS', 'THERE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HERE', 'A', 'FAST', 'WAY']
['IS', 'THE', 'RE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HE', 'RE', 'A', 'FAST', 'WAY']
['I', 'ST', 'HER', 'EA', 'FA', 'ST', 'WAY']
['IS', 'THERE', 'A', 'FA', 'ST', 'WAY']
['I', 'ST', 'HERE', 'A', 'FA', 'ST', 'WAY']
['IS', 'THE', 'RE', 'A', 'FA', 'ST', 'WAY']
['I', 'ST', 'HE', 'RE', 'A', 'FA', 'ST', 'WAY']

ONE
['ONE']

TWOWORDS
['TWO', 'WORDS']

This code was written for Python 3, but you can make it run on Python 2 by changing

yield from parse(first, result + [last])

to

for seq in parse(first, result + [last]):
    yield seq

BTW, we can sort the output lists by length, i.e., the number of words in each list. This tends to put the more sensible results near the top.

for s in data:
    print(s)
    for u in sorted(parse(s), key=len):
        print(u)
    print('')
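Going a step further than sorting by length (this is my own illustrative sketch, not part of the original answer), candidate splits can be scored by word frequency so that splits made of common words rank first. The tiny `freq` table below is a hypothetical stand-in for real corpus counts:

```python
import math

# Hypothetical mini frequency table; a real ranker would use corpus counts.
freq = {'HELLO': 500, 'HOW': 800, 'ARE': 900, 'YOU': 1000,
        'HELL': 120, 'OHO': 2, 'WARE': 30, 'HO': 15,
        'OH': 200, 'OW': 10, 'A': 950, 'RE': 5}

def score(words):
    # Sum of log-frequencies; unknown words get a small pseudo-count
    # so a single rare word doesn't zero out the whole split.
    return sum(math.log(freq.get(w, 0.5)) for w in words)

# Three of the splits produced for 'HELLOHOWAREYOU' above:
splits = [
    ['HELL', 'OHO', 'WARE', 'YOU'],
    ['HELLO', 'HOW', 'ARE', 'YOU'],
    ['HELL', 'OH', 'OW', 'ARE', 'YOU'],
]
best = max(splits, key=score)
print(best)   # -> ['HELLO', 'HOW', 'ARE', 'YOU']
```

This is essentially a crude unigram language model; it breaks down on splits like "THE RAPISTS" vs "THERAPISTS" where word-pair context matters.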
仙女界的扛把子
Reply #3 · 2020-02-29 11:21

It's a "hard" problem, in that you need to employ heuristics beyond a simple dictionary. You can turn a dictionary into a trie, to allow it to be searched efficiently letter by letter from the presumed start of a word, but the harder part is deciding what to do when you run into a string of letters that's not in the dictionary: things like "ABS" (a plastic), "invac" (a bank employee's possible shorthand for "investment account"), or "ncie" (a typo for "nice").
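A minimal trie sketch (my illustration, not code from this answer) that supports the letter-by-letter walk described above, yielding every dictionary word that starts at a given position:

```python
def build_trie(words):
    """Build a trie (prefix tree): each node maps a letter to a child
    node, and the sentinel key '$' marks the end of a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def prefixes(trie, text):
    """Return every dictionary word that text starts with,
    walking the trie one letter at a time."""
    node, out = trie, []
    for i, ch in enumerate(text):
        node = node.get(ch)
        if node is None:          # no dictionary word continues this way
            break
        if '$' in node:           # a complete word ends here
            out.append(text[:i + 1])
    return out

trie = build_trie(['THE', 'THERE', 'HE', 'HER', 'HERE'])
print(prefixes(trie, 'THEREAFTER'))   # -> ['THE', 'THERE']
```

Compared with a plain set, the trie lets the search stop as soon as no word can continue, instead of testing every possible prefix length.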

Oh, and there are also intrinsic ambiguities where a missing space makes a big difference to what follows. Consider "therapists": you need to be a human (or almost) to analyze the surrounding context and work out whether a space is needed after "the".

Juvenile、少年°
Reply #4 · 2020-02-29 11:26

Thoughts:

Taking the sentence:

IsthereafastwaytoseparatethisintoindividualwordssotheendresultishellohowareyouIcanthinkofseveralwaysbuttheywouldbeEXTREMELYslowfirstIneedtoidentifyeachletteragainstadictionaryseewhichletterscomposeawords...

A person would be able to split this up quite well into a sentence that makes sense. Therefore, a machine should be able to do the same.

Taking:

isthereafastwaytoseparate

"Is the reafastwaytoseparate..." should be "Is there a fast way to separate". Note that no matter how many letters are taken after "reafast...", they will never make a word.

Therefore, a possible approach would be to run through the sentence, taking the shortest possible word at each step, until the letters that follow can no longer form a word. As an approximation, look ahead about 15 letters before lengthening the current word.

In rare cases, you may need to backtrack to the previous word or, in extremely rare cases, two or three words back. Additionally, 15 letters could be too few for longer words.

Finally, if there are proper nouns or words from other languages, they will not be in the dictionary. Therefore, when no word can be found, the next position can become a new starting point, and the unknown fragment can be flagged or ignored. In a learning model, such a fragment could then be added to the corpus of words.
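The shortest-word-first idea with backtracking could be sketched like this (my interpretation of the approach described above, using a toy dictionary; it returns the first split found rather than all of them):

```python
def split_shortest_first(text, words):
    """Greedily take the shortest dictionary word at each position,
    backtracking to a longer word when the remainder can't be split."""
    if not text:
        return []                           # nothing left: success
    for i in range(1, len(text) + 1):       # try shortest candidates first
        head, tail = text[:i], text[i:]
        if head in words:
            rest = split_shortest_first(tail, words)
            if rest is not None:            # the remainder split cleanly
                return [head] + rest
    return None                             # dead end: caller backtracks

words = {'IS', 'I', 'THERE', 'THE', 'A', 'FAST', 'WAY'}
print(split_shortest_first('ISTHEREAFASTWAY', words))
# -> ['IS', 'THERE', 'A', 'FAST', 'WAY']
```

Note how 'I' is tried before 'IS' but rejected, because no word in the toy dictionary starts the leftover 'STHEREA...'; the distinction between `[]` (success on empty input) and `None` (dead end) is what drives the backtracking.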

Splitting the words into parts of speech (verbs, nouns, etc.) could speed the process up, since an adjective is usually followed by a noun, for example. But this may not be worth the effort, as another adjective could follow. In any case, all words in the corpus must be tested against, since this approach isn't built to check grammar.
