How to speed up NE recognition with Stanford NER

Posted 2019-02-07 04:59

Question:

First I tokenize the file content into sentences and then call Stanford NER on each sentence, but this process is really slow. I know it would be faster if I called it on the whole file content at once, but I'm calling it per sentence because I want to index each sentence before and after NE recognition.

from nltk import sent_tokenize, word_tokenize
from nltk.tag.stanford import NERTagger  # older NLTK; newer versions call this StanfordNERTagger

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
for filename in filelist:
    with open(filename) as f:
        filecontent = f.read()
    sentences = sent_tokenize(filecontent)  # break file content into sentences
    for j, sent in enumerate(sentences):
        words = word_tokenize(sent)  # tokenize sentence into words
        ne_tags = st.tag(words)  # get tagged NEs from Stanford NER

This is probably because st.tag() is called once per sentence, but is there any way to make it run faster?

EDIT

The reason I want to tag sentences separately is that I want to write the sentences to a file (a kind of sentence index), so that given an NE-tagged sentence at a later stage I can get back the unprocessed sentence (I'm also lemmatizing here).

File format:

(sent_number, orig_sentence, NE_and_lemmatized_sentence)

Answer 1:

StanfordNERTagger has a tag_sents() function that tags a whole batch of sentences in a single call to the Java process; see https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L68

>>> st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(filecontent)]
>>> st.tag_sents(tokenized_sents)


Answer 2:

You can use a Stanford NER server. The speed will be much faster, because the JVM and the classifier are loaded only once instead of on every call.

Install sner:

pip install sner

Run the NER server:

cd your_stanford_ner_dir
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz

from sner import Ner

tagger = Ner(host='localhost', port=9199)
test_string = "Alice went to the Museum of Natural History."
print(tagger.get_entities(test_string))

The output is:

[('Alice', 'PERSON'),
 ('went', 'O'),
 ('to', 'O'),
 ('the', 'O'),
 ('Museum', 'ORGANIZATION'),
 ('of', 'ORGANIZATION'),
 ('Natural', 'ORGANIZATION'),
 ('History', 'ORGANIZATION'),
 ('.', 'O')]
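Since the server keeps the model in memory, per-sentence calls stay cheap, so the sentence-indexing loop from the question can be kept almost unchanged. A minimal sketch, assuming the server above is running and filelist/sent_tokenize as in the question:

from nltk import sent_tokenize
from sner import Ner

tagger = Ner(host='localhost', port=9199)

for filename in filelist:
    with open(filename) as f:
        sentences = sent_tokenize(f.read())
    for j, sent in enumerate(sentences):
        # no JVM start-up cost per call, just a socket round trip
        ne_tags = tagger.get_entities(sent)
        print(j, sent, ne_tags)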

For more details, see https://github.com/caihaoyu/sner

Answer 3:

First download Stanford CoreNLP 3.5.2 from here: http://nlp.stanford.edu/software/corenlp.shtml

Let's say you put the download at /Users/username/stanford-corenlp-full-2015-04-20

This Python code will run the pipeline:

import os

stanford_distribution_dir = "/Users/username/stanford-corenlp-full-2015-04-20"
list_of_sentences_path = "/Users/username/list_of_sentences.txt"
# run tokenize/ssplit/pos/lemma/ner over every file in the list; -ssplit.eolonly
# treats each input line as exactly one sentence
stanford_command = "cd %s ; java -Xmx2g -cp \"*\" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ssplit.eolonly -filelist %s -outputFormat json" % (stanford_distribution_dir, list_of_sentences_path)
os.system(stanford_command)

Here is some sample Python code for loading one of the resulting .json files:

import json

with open("sample_file.txt.json") as f:
    sample_json = json.load(f)

At this point sample_json will be a nice dictionary with all the sentences from the file in it.

for sentence in sample_json["sentences"]:
    tokens = []
    ner_tags = []
    for token in sentence["tokens"]:
        tokens.append(token["word"])
        ner_tags.append(token["ner"])
    print(tokens, ner_tags)

list_of_sentences.txt should be your list of files with sentences, one path per line, something like:

input_file_1.txt
input_file_2.txt
...
input_file_100.txt

So input_file.txt (which should have one sentence per line, matching -ssplit.eolonly) will generate input_file.txt.json once the Java command has run, and that .json file will contain the NER tags. You can then load the .json for each input file and easily get (sentence, NER tag sequence) pairs. You can experiment with "text" as an alternative -outputFormat if you like it better, but "json" creates a file you can read back with json.load(...) into a dictionary for accessing the sentences and annotations.
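If your inputs are whole documents rather than one-sentence-per-line files, a small preparation step is needed first. A minimal sketch, assuming filelist from the question; the .sents suffix is just an illustrative naming scheme:

from nltk import sent_tokenize

sentence_files = []
for filename in filelist:
    with open(filename) as f:
        sentences = sent_tokenize(f.read())
    out_path = filename + '.sents'  # hypothetical naming scheme
    with open(out_path, 'w') as out:
        out.write('\n'.join(sentences))  # one sentence per line, for -ssplit.eolonly
    sentence_files.append(out_path)

with open('/Users/username/list_of_sentences.txt', 'w') as out:
    out.write('\n'.join(sentence_files))  # the list CoreNLP reads via -filelist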

This way you'll only load the pipeline once for all the files.
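To recover the (sent_number, orig_sentence, NE_and_lemmatized_sentence) format from the question's EDIT, you can walk the .json output. A sketch, assuming the lemma and ner annotators ran as in the command above, so each token carries "word", "lemma" and "ner" keys:

import json

def indexed_rows(json_path):
    with open(json_path) as f:
        doc = json.load(f)
    for j, sentence in enumerate(doc["sentences"]):
        # whitespace-joined approximation of the original sentence
        orig = ' '.join(tok["word"] for tok in sentence["tokens"])
        lemma_ner = [(tok["lemma"], tok["ner"]) for tok in sentence["tokens"]]
        yield j, orig, lemma_ner

for row in indexed_rows("input_file_1.txt.json"):  # or whatever CoreNLP produced
    print(row)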