Justadistraction: tokenizing English without whitespace

Published 2019-04-27 20:43

Question:

I wondered how you would go about tokenizing strings in English (or other western languages) if whitespaces were removed?

The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'.

In the novel, the Sheep Man is translated as saying things like:

"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."

So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary.

What would be your strategy for building a parser for this? Common combinations of letters, syllable counts, conditional grammars, look-ahead/behind regexps etc.?

Specifically, python-wise, how would you structure a (forgiving) translation flow? Not asking for a completed answer, just more how your thought process would go about breaking the problem down.

I ask this in a frivolous manner, but I think it's a question that might get some interesting (nlp/crypto/frequency/social) answers. Thanks!

Answer 1:

I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times). I'd go letter by letter matching whole words. It works well, but there are numerous ambiguities. (asshit can be ass hit or as shit). To resolve those ambiguities would require much more sophisticated grammar analysis.
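A minimal sketch of that idea, going letter by letter and checking each accumulated piece against a word set (the hard-coded set below is just a stand-in for a real dictionary loaded into a hashtable):

english_words = {"as", "ass", "shit", "hit", "like", "we", "said"}   # stand-in dictionary

def letter_by_letter(text):
    words, piece = [], ""
    for ch in text:
        piece += ch
        if piece in english_words:   # O(1) hashtable lookup
            words.append(piece)
            piece = ""
    return words

print(letter_by_letter("asshit"))   # ['as', 'shit'] -- could just as well be 'ass' + 'hit'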



Answer 2:

First of all, I think you need a dictionary of English words -- you could try some methods that rely solely on some statistical analysis, but I think a dictionary has better chances of good results.

Once you have the words, you have two possible approaches:

You could categorize the words into grammar categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches -- I'm not familiar with techniques that would allow you to loosen the grammar rules in case of no match, but I'm sure there must be some.
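If you go the grammar route, a toy sketch might look like the following, assuming NLTK is available; the grammar and lexicon here are made up purely for illustration, not a real English grammar.

import nltk

# Miniature made-up grammar; a real one would need far more categories and rules.
grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> PRP
  VP  -> MD VP | VB
  PRP -> 'we' | 'you'
  MD  -> 'can' | 'gotta'
  VB  -> 'work' | 'try'
""")
parser = nltk.ChartParser(grammar)

# A candidate segmentation is plausible if it yields at least one parse tree.
trees = list(parser.parse("you gotta work".split()))
print(len(trees) > 0)   # True: 'you gotta work' parses under this toy grammar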

On the other hand, you could just take some large corpus of English text and compute the relative probabilities of certain words being next to each other -- getting a list of pairs and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words.
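A rough sketch of that statistical idea, scoring candidate segmentations by how often adjacent word pairs co-occur; the inline "corpus" below is a tiny made-up example, and in practice you would count over a large text collection.

from collections import Counter

corpus = "we said we will do what we can we want to do it alone".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def score(words):
    # Sum of add-one-smoothed relative frequencies of each adjacent word pair.
    return sum((bigrams[(a, b)] + 1) / (unigrams[a] + len(unigrams))
               for a, b in zip(words, words[1:]))

candidates = [["to", "what", "we", "can"], ["tow", "hat", "we", "can"]]
print(max(candidates, key=score))   # the first segmentation scores higher on this toy corpus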

I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weight attached to its rules. It's a rich field for experimenting.



Answer 3:

I don't know if this is of much help to you, but you might be able to make use of this spelling corrector in some way.



Answer 4:

This is just some quick code I wrote that I think would work fairly well to extract words from a snippet like the one you gave. It's not fully thought out, but I think something along these lines would work if you can't find a pre-packaged solution.

textstring = ("\"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant,\" "
              "said the Sheep Man. \"Butwecan'tdoit-alone. Yougottaworktoo.\"")

indiv_characters = list(textstring)  # splits the string into individual characters

# Stand-in for a real English dictionary (e.g. a word list loaded into a set
# so membership tests are O(1)); only a handful of words are shown here.
english_dict = {"like", "we", "said", "do", "what", "can", "try", "to", "reconnect",
                "you", "want", "but", "it", "alone", "gotta", "work", "too"}

teststring = ''
sequential_indiv_word_list = []

for cur_char in indiv_characters:
    teststring = teststring + cur_char
    # test teststring against the English dictionary: True/False if it exists as an entry
    in_english_dict = teststring.lower() in english_dict
    if in_english_dict:
        sequential_indiv_word_list.append(teststring)
        teststring = ''

# at the end, just assemble a sentence from the pieces of sequential_indiv_word_list
# by putting a space between each word
print(' '.join(sequential_indiv_word_list))

There are a few more issues to work out. For example, if the accumulated characters never match a dictionary word, this approach would fail, since it would just keep adding characters indefinitely. However, since your demo string still contains some spaces, you could have it recognize those as well and automatically start over at each of them.

Also, you need to account for punctuation, with conditionals along the lines of:

if cur_char == ',' or cur_char == '.':
    # do some action to start a new "word" automatically
    pass