NER model to recognize Indian names

2019-04-13 16:46发布

问题:

I am planning to use Named Entity Recognition (NER) technique to identify person names (most of which are Indian names) from a given text. I have already explored the CRF-based NER model from Stanford NLP, however it is not quite accurate in recognizing Indian names. Hence I decided to create my own custom NER model via supervised training. I have a fair idea of how to create own NER model using the Stanford NER CRF, but creating a large training corpus with manual annotation is something I would like to avoid, as it is a humongous effort for an individual and secondly obtaining diverse people names from different states of India is also a challenge. Could anybody suggest any automation/programmatic way to prepare a labelled training corpus with at least 100k Indian names?
I have already looked into Facebook and LinkedIn API, but did not find a way to extract 100k number of user's full name from a given location (e.g. India).

回答1:

I ended up doing the following to create NER model to identify Indian names. This may be useful for anybody looking for creating a custom NER model to recognize non-English person names, since most of the publicly available NER models such as the ones from Stanford NLP were trained with English names and hence are more accurate in identifying English (British/American) names.

  1. Find an Indian celebrity with Twitter account and having a huge number of followers in Twitter (for my case, I chose Sachin Tendulkar).
  2. Create a program in the language of your choice to call the Twitter REST API (GET followers/list) to get the names of all the followers of the celebrity and save to a file. We can safely assume most of the followers would be Indians. Note that there is an API Rate Limit in place (30 requests per 15 minute window), so the program should be built in to handle that. For our case, we developed the program as a Windows Service which runs every 15 minutes.
  3. Since some Twitter users' names may not be valid person names, it is advisable to add some rule-based logic (like RegEx) to filter seemingly real names and add only those to the file.
  4. Once the file with real names is generated, create another program to create the training data file containing these names labelled/annotated as PERSON as well as non-entity names annotated as OTHER. If you are using Stanford NER CRF Classifier, the program should generate a training (TSV) file having two columns - one containing the word (token) and the second column mentioning the label.
  5. Once the training corpus is generated programmatically, you can follow the below link to create your custom NER model to recognize Indian names: http://nlp.stanford.edu/software/crf-faq.shtml#a


回答2:

This website has done this for us!It provides with the solution for these problems: Challenges in Indian Language NER Indian languages belong to several language families, the major ones being the Indo-European languages, Indo-Aryan and the Dravidian languages. The challenges in NER arise due to several factors. Some of the main factors are listed below Morphologically rich - identification of root is difficult, require use of morphological analysers No Capitalization feature - In English, capitalization is one of the main features, whereas that is not there in Indian languages Ambiguity - ambiguity between common and proper nouns. Eg: common words such as "Roja" meaning Rose flower is a name of a person Spell variations - In the web data is that we find different people spell the same entity differently - for example : In Tamil person name -Roja is spelt as "rosa", "roja". The whole corpus is provided.

Named Entity Recognition for Indian Languages and English

Best of luck for getting passwords for the zip files!

cheers!



回答3:

A proposition: you could try to exploite the India version of Wikipedia for training or to create automatically gazetteer.

I don't know if it is the efficient/quick solution but a lot of research exploits Wikipedia and his semi-structured content (for example, each page is annotated with several categories).

You can have a look at these articles to find an interesting idea for you: https://scholar.google.fr/scholar?q=named+entity+recognition+using+wikipedia&btnG=&hl=fr&as_sdt=0%2C5