First I tokenize the file content into sentences and then call Stanford NER on each sentence, but this process is really slow. I know it would be faster to call it on the whole file content at once, but I'm calling it per sentence because I want to index each sentence before and after NE recognition.
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag.stanford import NERTagger

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
               'stanford-ner/stanford-ner.jar')

for filename in filelist:
    filecontent = open(filename).read()
    sentences = sent_tokenize(filecontent)  # break file content into sentences
    for j, sent in enumerate(sentences):
        words = word_tokenize(sent)  # tokenize sentence into words
        ne_tags = st.tag(words)  # get tagged NEs from Stanford NER
This is probably due to calling st.tag() for each sentence. Is there any way to make it run faster?
EDIT
The reason I want to tag sentences separately is that I want to write the sentences to a file (like sentence indexing) so that, given the NE-tagged sentence at a later stage, I can get back the unprocessed sentence (I'm also doing lemmatizing here).
file format:
(sent_number, orig_sentence, NE_and_lemmatized_sentence)
First download Stanford CoreNLP 3.5.2 from here: http://nlp.stanford.edu/software/corenlp.shtml
Let's say you put the download at /User/username/stanford-corenlp-full-2015-04-20
This Python code will run the pipeline:
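A sketch of that driver code, which shells out to the CoreNLP Java pipeline (the heap size is an assumption; the annotators, -filelist, and "json" output format are the ones this answer relies on):

```python
import subprocess

# Assumed install location from the step above
corenlp_dir = "/User/username/stanford-corenlp-full-2015-04-20"

cmd = [
    "java", "-mx4g",                      # heap size is an assumption; raise it for big files
    "-cp", corenlp_dir + "/*",
    "edu.stanford.nlp.pipeline.StanfordCoreNLP",
    "-annotators", "tokenize,ssplit,pos,lemma,ner",
    "-ssplit.eolonly", "true",            # each input file has one sentence per line
    "-filelist", "list_of_sentences.txt",
    "-outputFormat", "json",
]
print(" ".join(cmd))
# Uncomment once Java and CoreNLP are in place:
# subprocess.call(cmd)
```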
Here is some sample Python code for loading in a .json file for reference:
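A sketch of that loader. The tiny dictionary written first is a stand-in for the real input_file.txt.json the pipeline produces; the "sentences"/"tokens" layout with "word", "lemma", and "ner" keys matches CoreNLP's json output, though the real file carries more fields:

```python
import json

# Stand-in for a file the pipeline wrote (real output has more fields)
stand_in = {
    "sentences": [
        {"tokens": [
            {"word": "Rami", "lemma": "Rami", "ner": "PERSON"},
            {"word": "studies", "lemma": "study", "ner": "O"},
        ]}
    ]
}
with open("input_file.txt.json", "w") as f:
    json.dump(stand_in, f)

# The loading step itself:
with open("input_file.txt.json") as f:
    sample_json = json.loads(f.read())

# Walk sentences -> tokens to get (word, ner tag) pairs
pairs = [(t["word"], t["ner"])
         for s in sample_json["sentences"] for t in s["tokens"]]
print(pairs)
```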
At this point sample_json will be a nice dictionary with all the sentences from the file in it.
list_of_sentences.txt should be your list of files with sentences, something like:
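For example (these file names are hypothetical; one path per line):

```
input_file.txt
another_input_file.txt
```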
So input_file.txt (which should have one sentence per line) will generate input_file.txt.json once the Java command is run, and that .json file will have the NER tags. You can just load the .json for each input file and easily get (sentence, ner tag sequence) pairs. You can experiment with "text" as an alternative output format if you like it better, but "json" creates a .json file you can load with json.loads(...), which gives you a dictionary for accessing the sentences and annotations.
This way you'll only load the pipeline once for all the files.
From StanfordNERTagger, there is the tag_sents() function.
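A sketch of the batched call. The stub below stands in for the Java-backed StanfordNERTagger (and .split() for word_tokenize) so the snippet runs without the jars; in practice you would construct the real tagger with the model and jar paths from the question:

```python
class StubTagger:
    # Stands in for nltk.tag.StanfordNERTagger; the real tag_sents()
    # starts the JVM once and tags all the sentences in that single run.
    def tag_sents(self, sentences):
        return [[(tok, "O") for tok in sent] for sent in sentences]

st = StubTagger()

sentences = ["Rami Eid is a student .", "He lives in NY ."]
tokenized = [s.split() for s in sentences]   # word_tokenize(s) in practice
ne_tags = st.tag_sents(tokenized)            # ONE call instead of one per sentence

# ne_tags[j] lines up with sentences[j], so the (sent_number,
# orig_sentence, tagged_sentence) indexing from the question still works.
print(ne_tags[1])
```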
See https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L68

Alternatively, you can use the Stanford NER server. The speed will be much faster.
Install sner:
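For example, from pip:

```
pip install sner
```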
Run the NER server from your Stanford NER directory:
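A sketch of the server command, based on the NERServer class that ships with Stanford NER (your_stanford_ner_dir is a placeholder, port 9199 is arbitrary, and the classifier path is the same model used in the question):

```
cd your_stanford_ner_dir
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer \
    -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz
```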
A client then connects once and sends text over the socket, so the JVM is started only once; the result of each get_entities() call is a list of (token, tag) tuples.
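A client sketch using the sner package's Ner class (host and port are assumptions matching the server command above; this needs the server running, so it is illustrative):

```
from sner import Ner

tagger = Ner(host='localhost', port=9199)
print(tagger.get_entities('Rami Eid is studying at Stony Brook University in NY'))
```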
For more detail, see https://github.com/caihaoyu/sner