Finding dictionary words

I have a lot of compound strings that are a combination of two or three English words.

    e.g. "Spicejet" is a combination of the words "spice" and "jet"

I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.

What would be the most efficient by which I can separate individual English words from such compound strings.

标签： algorithm data-structures text-processing

10条回答

劳资没心，怎么记你

2楼-- · 2019-01-17 02:23

I'm not sure how much time or frequency you have to do this (is it a one-time operation? daily? weekly?) but you're obviously going to want a quick, weighted dictionary lookup.

You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.

I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.

You'll have to build the Tries yourself from a good dictionary source, and weight the nodes on full words to provide yourself a good quality mechanism for reference.

Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple Trie lookups, for example looking up 'Spic' and then 'ejet' and then finding that both results have a low score, abandon into 'Spice' and 'Jet', where both Tries would yield a good combined result between the two.

Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.

Sounds like a fun problem, good luck!

0人赞添加讨论(0) 举报

三岁会撩人

3楼-- · 2019-01-17 02:25

So, given a word, is it a compound word, composed of two other English words? You could have some sort of lookup table for all such compound words, but if you just examine the candidates and try to match against English words, you will get false positives.

Edit: looks as if I am going to have to go to provide some examples. Words I was thinking of include:

accustomednesses != accustomed + nesses
adulthoods != adult + hoods
agreeabilities != agree + abilities
willingest != will + ingest
windlasses != wind + lasses
withstanding != with + standing
yourselves != yours + elves
zoomorphic != zoom + orphic
ambassadorships != ambassador + ships
allotropes != allot + ropes

Here is some python code to try out to make the point. Get yourself a dictionary on disk and have a go:

from __future__ import with_statement

def opendict(dictionary=r"g:\words\words(3).txt"):
    with open(dictionary, "r") as f:
        return set(line.strip() for line in f)

if __name__ == '__main__':
    s = opendict()
    for word in sorted(s):
        if len(word) >= 10:
            for i in range(4, len(word)-4):
                left, right = word[:i], word[i:]
                if (left in s) and (right in s):
                    if right not in ('nesses', ):
                        print word, left, right

0人赞添加讨论(0) 举报

三岁会撩人

4楼-- · 2019-01-17 02:28

Word existence could be done with a trie, or more simply with a set (i.e. a hash table). Given a suitable function, you could do:

# python-ish pseudocode
def splitword(word):
    # word is a character array indexed from 0..n-1

    for i from 1 to n-1:
        head = word[:i]  # first i characters
        tail = word[i:]  # everything else

        if is_word(head):
            if i == n-1:
                return [head]   # this was the only valid word; return it as a 1-element list
            else:
                rest = splitword(tail)
                if rest != []:   # check whether we successfully split the tail into words
                    return [head] + rest

    return []  # No successful split found, and 'word' is not a word.

Basically, just try the different break points to see if we can make words. The recursion means it will backtrack until a successful split is found.

Of course, this may not find the splits you want. You could modify this to return all possible splits (instead of merely the first found), then do some kind of weighted sum, perhaps, to prefer common words over uncommon words.

0人赞添加讨论(0) 举报

放荡不羁爱自由

5楼-- · 2019-01-17 02:33

This can be a very difficult problem and there is no simple general solution (there may be heuristics that work for small subsets).

We face exactly this problem in chemistry where names are composed by concatenation of morphemes. An example is:

ethylmethylketone

where the morphemes are:

ethyl methyl and ketone

We tackle this through automata and maximum entropy and the code is available on Sourceforge

http://www.sf.net/projects/oscar3-chem

but be warned that it will take some work.

We sometimes encounter ambiguity and are still finding a good way of reporting it.

To distinguish between penIsland and penisLand would require domain-specific heuristics. The likely interpretation will depend on the corpus being used - no linguistic problem is independent from the domain or domains being analysed.

As another example the string

weeknight

can be parsed as

wee knight

week night

Both are "right" in that they obey the form "adj-noun" or "noun-noun". Both make "sense" and which is chosen will depend on the domain of usage. In a fantasy game the first is more probable and in commerce the latter. If you have problems of this sort then it will be useful to have a corpus of agreed usage which has been annotated by experts (technically a "Gold Standard" in Natural Language Processing).

0人赞添加讨论(0) 举报

男人必须洒脱

6楼-- · 2019-01-17 02:34

It sounds to me like you want to store you dictionary in a Trie or a DAWG data structure.

A Trie already stores words as compound words. So "spicejet" would be stored as "spicejet" where the * denotes the end of a word. All you'd have to do is look up the compound word in the dictionary and keep track of how many end-of-word terminators you hit. From there you would then have to try each substring (in this example, we don't yet know if "jet" is a word, so we'd have to look that up).

0人赞添加讨论(0) 举报

smile是对你的礼貌

7楼-- · 2019-01-17 02:39

A similar question was asked recently: Word-separating algorithm. If you wanted to limit the number of splits, you would keep track of the number of splits in each of the tuples (so instead of a pair, a triple).

0人赞添加讨论(0) 举报

1 2 下一页

Finding dictionary words

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间