I am planning to use Named Entity Recognition (NER) to identify person names (most of which are Indian names) in a given text. I have already explored the CRF-based NER model from Stanford NLP, but it is not very accurate at recognizing Indian names, so I decided to create my own custom NER model via supervised training. I have a fair idea of how to train my own model with the Stanford NER CRF, but I would like to avoid building a large training corpus by manual annotation: it is a huge effort for one person, and obtaining diverse names from the different states of India is a challenge in itself. Could anybody suggest an automated/programmatic way to prepare a labelled training corpus with at least 100k Indian names?
I have already looked into the Facebook and LinkedIn APIs, but did not find a way to extract the full names of 100k users from a given location (e.g. India).
A suggestion: you could try to exploit the Indian version of Wikipedia for training, or to build a gazetteer automatically.
I don't know whether it is the most efficient or quickest solution, but a lot of research exploits Wikipedia and its semi-structured content (for example, each page is annotated with several categories).
You can have a look at these articles to find an idea that suits you: https://scholar.google.fr/scholar?q=named+entity+recognition+using+wikipedia&btnG=&hl=fr&as_sdt=0%2C5
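To make the gazetteer idea concrete, here is a minimal sketch of how a list of names could be turned into a labelled corpus automatically. It is not taken from any of the linked articles: the `GAZETTEER` contents, the naive `tokenize` function and the helper names are all hypothetical, and it assumes you want tab-separated `token<TAB>label` lines with a blank line between sentences, which is the column layout the Stanford CRF trainer can read when its properties file maps `word=0,answer=1`.

```python
import re

# Hypothetical gazetteer; in practice this would be harvested from
# Wikipedia category pages or some other list of names.
GAZETTEER = {"Sachin Tendulkar", "Lata Mangeshkar", "Roja"}


def tokenize(text):
    # Naive tokenizer (an assumption): runs of word characters,
    # or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def build_index(names):
    # Turn each name into a tuple of tokens, longest names first so that
    # "Sachin Tendulkar" wins over a shorter overlapping entry.
    token_tuples = {tuple(tokenize(n)) for n in names}
    token_tuples.discard(())  # ignore entries that tokenize to nothing
    return sorted(token_tuples, key=len, reverse=True)


def label_sentence(sentence, name_index):
    # Greedy left-to-right exact matching against the gazetteer.
    tokens = tokenize(sentence)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for name in name_index:
            n = len(name)
            if tuple(tokens[i:i + n]) == name:
                labels[i:i + n] = ["PERSON"] * n
                i += n
                break
        else:
            i += 1
    return list(zip(tokens, labels))


def to_tsv(sentences, names):
    # One "token<TAB>label" line per token, blank line between sentences.
    index = build_index(names)
    blocks = []
    for sentence in sentences:
        rows = label_sentence(sentence, index)
        blocks.append("\n".join(f"{tok}\t{lab}" for tok, lab in rows))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(to_tsv(["Sachin Tendulkar met Roja."], GAZETTEER))
```

Exact matching like this will mislabel names that are also common words and will miss spelling variants, so the output is better treated as noisy "silver" training data to be spot-checked than as a gold corpus.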
This website has done this for us! It provides a solution for these problems:
Challenges in Indian Language NER
Indian languages belong to several language families, the major ones being Indo-Aryan (a branch of Indo-European) and Dravidian. The challenges in NER arise from several factors. Some of the main ones are listed below:
- Morphologically rich: identifying the root is difficult and requires morphological analysers.
- No capitalization feature: in English, capitalization is one of the main features, whereas Indian languages do not have it.
- Ambiguity: common and proper nouns are ambiguous. E.g. a common word such as "Roja", meaning the rose flower, is also a person's name.
- Spelling variations: in web data, different people spell the same entity differently. For example, the Tamil person name Roja is spelt as "rosa" or "roja".
The whole corpus is provided.
Named Entity Recognition for Indian Languages and English
Best of luck getting the passwords for the zip files!
Cheers!
I ended up doing the following to create an NER model that identifies Indian names. This may be useful for anybody looking to create a custom NER model to recognize non-English person names, since most publicly available NER models, such as the ones from Stanford NLP, were trained on English names and are therefore more accurate at identifying English (British/American) names.