Python NLP Text Tokenization based on custom regex

Published 2020-05-07 19:31

Question:

I am processing a large amount of text for custom Named Entity Recognition (NER) using spaCy. For text pre-processing I am using NLTK for tokenization, etc.

I am able to process one of my custom entities, which is based on simple strings. But the other custom entity is a combination of a number and certain text ('20 BBLs', for example). The word_tokenize method from nltk.tokenize splits '20' and 'BBLs' into two separate tokens. What I want is to treat them (the number and the 'BBLs' string) as one token.

I am able to extract all the occurrences of this using regex:

re.findall(r'\d+\s+BBLs?', Text)
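One way to go from "extracting matches" to "tokenizing with the match kept whole" is to build the tokenizer itself around that regex, trying the quantity-plus-unit pattern before the generic word pattern. This is a minimal sketch using only the standard library's re module; the fallback alternatives (`\w+` for words, `[^\w\s]` for punctuation) are my assumptions, not from the original post:

```python
import re

# Alternation order matters: the quantity+unit pattern is tried first,
# so '20 BBLs' is consumed as a single token before '\w+' can split it.
TOKEN_RE = re.compile(r'\d+\s+BBLs?|\w+|[^\w\s]')

def tokenize(text):
    """Return a list of tokens, keeping '<number> BBL(s)' spans intact."""
    return TOKEN_RE.findall(text)

print(tokenize("Shipped 20 BBLs of crude oil."))
# -> ['Shipped', '20 BBLs', 'of', 'crude', 'oil', '.']
```

If you prefer to stay inside NLTK, the same pattern can be passed to nltk.tokenize.RegexpTokenizer, which tokenizes by the same findall logic.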

Note: I am doing this because spaCy's standard English NER model mistakenly recognizes these spans as 'MONEY' or 'CARDINAL' named entities. So I want to re-train a custom model, and I need to feed it this pattern (the number plus the 'BBLs' string) as one token that indicates my custom entity.
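For the retraining step itself, spaCy's NER training examples are usually expressed as character offsets into the raw text rather than as pre-merged tokens, so the regex matches can be turned into annotations directly. A minimal sketch, assuming a hypothetical label name "VOLUME" and the `(text, {"entities": [...]})` tuple shape used by spaCy's training examples:

```python
import re

# Hypothetical label for the custom entity; pick whatever fits your scheme.
LABEL = "VOLUME"
PATTERN = re.compile(r'\d+\s+BBLs?')

def make_training_example(text):
    """Build a spaCy-style training tuple:
    (text, {"entities": [(start_char, end_char, label), ...]})."""
    entities = [(m.start(), m.end(), LABEL) for m in PATTERN.finditer(text)]
    return (text, {"entities": entities})

print(make_training_example("Delivered 20 BBLs to the site."))
# -> ('Delivered 20 BBLs to the site.', {'entities': [(10, 17, 'VOLUME')]})
```

Because the offsets cover the whole '20 BBLs' span, the trained model treats it as one entity regardless of how the underlying tokenizer splits it, which sidesteps the MONEY/CARDINAL confusion from the stock model.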