NLTK regexp tokenizer not playing nice with decimals

Posted 2019-04-10 10:05

Question:

I'm trying to write a text normalizer, and one of the basic cases that needs to be handled is turning something like 3.14 to three point one four or three point fourteen.

I'm currently using the pattern \$?\d+(\.\d+)?%? with nltk.regexp_tokenize, which I believe should handle numbers as well as currency and percentages. However, at the moment, something like $23.50 is handled perfectly (it parses to ['$23.50']), but 3.14 is parsing to ['3', '14'] - the decimal point is being dropped.

I've tried adding a separate pattern, \d+.\d+, to my regexp, but that didn't help (and shouldn't my current pattern match that already?)

Edit 2: I also just discovered that the % part doesn't seem to be working correctly either - 20% returns just ['20']. I feel like there must be something wrong with my regexp, but I've tested it in Pythex and it seems fine?

Edit: Here is my code.

import nltk

line = "3.14 is pi."  # one of the test strings below

pattern = r'''(?x)    # set flag to allow verbose regexps
            ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
            | \w+([-']\w+)*        # words w/ optional internal hyphens/apostrophe
            | \$?\d+(\.\d+)?%?  # numbers, incl. currency and percentages
            | [+/\-@&*]         # special characters with meanings
            '''
words = nltk.regexp_tokenize(line, pattern)
words = [w.lower() for w in words]
print(words)

Here are some of my test strings:

32188
2598473
26 letters from A to Z
3.14 is pi.                         <-- ['3', '14', 'is', 'pi']
My weight is about 68 kg, +/- 10 grams.
Good muffins cost $3.88 in New York <-- ['good', 'muffins', 'cost', '$3.88', 'in', 'new', 'york']

Answer 1:

The culprit is:

\w+([-']\w+)*

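\w+ also matches digits, and since that alternative has nothing that can match a ., it consumes only the 3 in 3.14. Alternatives are tried left to right and the first one that matches wins, so the number pattern never gets a chance at 3.14 (it only works for $23.50 because \w cannot match the leading $). Move the options around so that \$?\d+(\.\d+)?%? comes before the part above, and the number format is attempted first: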

(?x)([A-Z]\.)+|\$?\d+(\.\d+)?%?|\w+([-']\w+)*|[+/\-@&*]

regex101 demo

Or in expanded form:

pattern = r'''(?x)               # set flag to allow verbose regexps
              ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
              | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
              | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
              | [+/\-@&*]        # special characters with meanings
            '''
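To see the effect of the ordering without NLTK installed, here is a minimal sketch using only Python's re module. It is a stand-in for nltk.regexp_tokenize, not the same call: the names WORD_FIRST, NUMBER_FIRST and tokenize are made up for this illustration, and the groups are written as non-capturing (?:...) so that re.findall returns whole matches rather than group contents.

```python
import re

# The asker's order: the word alternative sits before the number alternative.
WORD_FIRST = re.compile(r"""(?x)
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \w+(?:[-']\w+)*         # words w/ optional internal hyphens/apostrophe
    | \$?\d+(?:\.\d+)?%?      # numbers, incl. currency and percentages
    | [+/\-@&*]               # special characters with meanings
""")

# The reordered pattern: numbers are attempted before words.
NUMBER_FIRST = re.compile(r"""(?x)
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?      # numbers, incl. currency and percentages
    | \w+(?:[-']\w+)*         # words w/ optional internal hyphens/apostrophe
    | [+/\-@&*]               # special characters with meanings
""")


def tokenize(regex, text):
    """Return all non-overlapping matches of regex in text, lowercased."""
    return [token.lower() for token in regex.findall(text)]


print(tokenize(WORD_FIRST, "3.14 is pi."))    # ['3', '14', 'is', 'pi']
print(tokenize(NUMBER_FIRST, "3.14 is pi."))  # ['3.14', 'is', 'pi']
print(tokenize(WORD_FIRST, "20%"))            # ['20']
print(tokenize(NUMBER_FIRST, "20%"))          # ['20%']
```

One side effect to be aware of: with numbers first, a token that starts with digits and continues with letters, such as 3rd, is split into 3 and rd, because the number alternative now wins at the digit.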


Answer 2:

Try this regex:

\b\$?\d+(\.\d+)?%?\b

I surrounded the original number pattern with word-boundary assertions (\b).
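A quick check of this pattern on its own, again with plain re rather than NLTK, and with a non-capturing group so that re.findall returns whole matches. The name BOUNDED is just for this sketch. It suggests the boundaries keep the decimal together but interact badly with the optional $ and %: both are non-word characters, so \b cannot match between a space and $, or between % and a following space.

```python
import re

# The pattern from this answer, tested in isolation (not inside the full
# tokenizer alternation).
BOUNDED = re.compile(r"\b\$?\d+(?:\.\d+)?%?\b")

print(BOUNDED.findall("3.14 is pi."))    # ['3.14']
print(BOUNDED.findall("cost $3.88 in"))  # ['3.88']  (the $ is not captured)
print(BOUNDED.findall("20% off"))        # ['20']    (the % is not captured)
```

So if the currency sign and percent sign need to stay attached to the number, the reordering from Answer 1 looks like the safer fix.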