How to get started on Information Extraction?

Could you recommend a training path for getting started with, and becoming very good at, Information Extraction? I started reading about it for one of my hobby projects and soon realized that I would have to be good at math (algebra, statistics, probability). I have read some introductory books on different math topics (and it's so much fun). I'm looking for some guidance. Please help.

Update: To answer one of the comments, I am more interested in text information extraction.

Tags: math machine-learning nlp information-extraction

8 answers

[This account has been banned]

#2 · 2019-03-11 10:28

The Wikipedia Information Extraction article is a quick introduction.

At a more academic level, you might want to skim a paper like Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.


我命由我不由天

#3 · 2019-03-11 10:29

This is a little off topic, but you might want to read Programming Collective Intelligence from O'Reilly. It deals indirectly with text information extraction, and it doesn't assume much of a math background.


走好不送

#4 · 2019-03-11 10:35

I would recommend the excellent book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It covers a broad area of issues which form a great and up-to-date (2008) basis for Information Extraction and is available online in full text (under the given link).


[This account has been banned]

#5 · 2019-03-11 10:42

To answer one of the comments, I am more interested in text information extraction.

Depending on the nature of your project, natural language processing and computational linguistics can both come in handy: they provide tools to measure and extract features from textual information, and to apply training, scoring, or classification. Good introductory books include O'Reilly's Programming Collective Intelligence (the chapters on "Searching and Ranking", "Document Filtering", and maybe decision trees).

Suggested projects that use this knowledge: POS (part-of-speech) tagging and named entity recognition (the ability to recognize names, places, and dates in plain text). You can use Wikipedia as a training corpus, since most of the target information is already extracted in infoboxes; this might give you a limited amount of measurement feedback.

The other big hammer in IE is search, a field not to be underestimated. Again, the O'Reilly book provides an introduction to basic ranking; once you have a large corpus of indexed text, you can do real IE tasks with it. Check out Peter Norvig's Theorizing from Data as a starting point and a very good motivator; maybe you could reimplement some of the results as a learning exercise.
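To make "indexed text" and "basic ranking" a bit more concrete, here is a minimal sketch of an inverted index with a simple tf-idf style score. The function names, the scoring formula, and the toy corpus are my own assumptions for illustration, not taken from the book.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, n_docs, query):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # The +1 keeps terms that occur in every document from scoring zero.
        idf = math.log(n_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy corpus, for illustration only.
docs = {
    "d1": "Paris is the capital of France",
    "d2": "Berlin is the capital of Germany",
    "d3": "France and Germany share a border",
}
index = build_index(docs)
ranking = search(index, len(docs), "capital of France")
print(ranking)
```

Real search engines add stemming, stop words, and better scoring, but the index-then-rank shape stays the same.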

As a forewarning, I think I'm obligated to tell you that information extraction is hard. The first 80% of any given task is usually trivial; however, the difficulty of each additional percentage point usually grows exponentially, in both development and research time. It's also quite underdocumented: most of the high-quality information is currently in obscure white papers (Google Scholar is your friend). Do check them out once you've burned your hands a couple of times. Most importantly, do not let these obstacles throw you off; there are certainly big opportunities to make progress in this area.


Rolldiameter

#6 · 2019-03-11 10:42

You don't need to be good at math to do IE. You just need to understand how the algorithm works, experiment on the cases where you need the best results, and work out the scale at which you need to reach your target accuracy. You are basically working with algorithms, programming, and aspects of CS/AI/machine learning theory, not writing a PhD paper on a new machine learning algorithm where you have to convince someone from mathematical principles why the algorithm works, so I totally disagree with that notion. There is a difference between practice and theory: as we all know, mathematicians tend to be more focused on theory than on the practicality of algorithms for producing workable business solutions.

You would, however, need to do some background reading, both books on NLP and journal papers, to find out what people learned from their results. IE is a very context-specific domain, so you first need to define the context in which you are trying to extract information. How would you define this information? What is your structured model, supposing you are extracting from semi-structured and unstructured data sets?

You would then also want to weigh whether to approach IE with a standard hand-crafted (human) approach, which involves things like regular expressions and pattern matching, or with statistical machine learning approaches like Markov chains. You can even look at hybrid approaches.
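As a rough illustration of the hand-crafted route (regular expressions and pattern matching), here is a minimal sketch that pulls ISO-style dates and email addresses out of plain text. The patterns, the entity labels, and the sample sentence are assumptions made up for this example; real text needs much broader patterns.

```python
import re

# Hypothetical patterns for illustration; they cover only simple cases.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract_entities(text):
    """Return (label, value) pairs in the order they appear in the text."""
    found = []
    for match in DATE_RE.finditer(text):
        found.append((match.start(), "DATE", match.group(0)))
    for match in EMAIL_RE.finditer(text):
        found.append((match.start(), "EMAIL", match.group(0)))
    # Sort by start offset, then drop the offset from the result.
    return [(label, value) for _, label, value in sorted(found)]


sample = (
    "Contact alice@example.com before 2019-03-11, "
    "or bob@example.org after 2019-04-01."
)
entities = extract_entities(sample)
for label, value in entities:
    print(label, value)
```

A statistical approach would replace the hand-written patterns with a model trained on labeled examples; a hybrid could use patterns like these to generate features or candidate spans for that model.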

A standard process model you can follow to do your extraction is to adapt a data/text mining approach:

pre-processing - define and standardize the data to extract from various or specific sources, and cleanse it
segmentation/classification/clustering/association - your black box, where most of your extraction work will be done
post-processing - cleanse your data back into the form in which you want to store it or represent it as information
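The steps above can be sketched as three small functions chained together. This is only a toy under stated assumptions: the "capital of" pattern, the record layout, and the sample input are hypothetical, and the middle step would normally be a classifier or tagger rather than one regular expression.

```python
import re


def preprocess(raw):
    """Pre-processing: strip markup-like tags and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


# Hypothetical pattern: "<City> is the capital of <Country>".
CAPITAL_RE = re.compile(r"\b([A-Z][a-z]+) is the capital of ([A-Z][a-z]+)\b")


def extract(text):
    """The black-box step: here just a single hand-written pattern."""
    return CAPITAL_RE.findall(text)


def postprocess(pairs):
    """Post-processing: deduplicate and shape matches into records."""
    seen = set()
    records = []
    for city, country in pairs:
        if (city, country) not in seen:
            seen.add((city, country))
            records.append(
                {"relation": "capital_of", "subject": city, "object": country}
            )
    return records


raw = (
    "<p>Paris is the   capital of France.</p>\n"
    "<p>Paris is the capital of France. Berlin is the capital of Germany.</p>"
)
records = postprocess(extract(preprocess(raw)))
for record in records:
    print(record)
```

Keeping the three stages separate makes it easy to swap the middle one for a statistical model later without touching the cleaning or storage code.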

Also, you need to understand the difference between data and information, because you can reuse your discovered information as a source of data to build more information maps/trees/graphs. It is all very contextualized.

standard steps for: input->process->output

If you are using Java/C++, there are loads of frameworks and libraries available to work with. Perl would be an excellent language for your NLP extraction work if you want to do a lot of standard text extraction.

You may want to represent your data as XML or even as RDF graphs (Semantic Web); for your defined contextual model you can build up relationship and association graphs that will most likely change as you make more and more extraction requests. Deploy it as a RESTful service, since you want to treat it as a resource for documents. You can even link it to taxonomized data sets and faceted search, say using Solr.
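For the XML option, a minimal sketch using only Python's standard library might look like this. The element names ("extractions", "relation", "subject", "object") and the record layout are assumptions invented for the example, not a standard schema; RDF would need a dedicated library and a proper vocabulary.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records):
    """Serialize extracted relations into a small XML document."""
    root = ET.Element("extractions")
    for rec in records:
        node = ET.SubElement(root, "relation", {"type": rec["relation"]})
        ET.SubElement(node, "subject").text = rec["subject"]
        ET.SubElement(node, "object").text = rec["object"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical extraction result, for illustration only.
extracted = [
    {"relation": "capital_of", "subject": "Paris", "object": "France"},
]
xml_text = records_to_xml(extracted)
print(xml_text)
```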

Good sources to read are:

Handbook of Computational Linguistics and Natural Language Processing
Foundations of Statistical Natural Language Processing
Information Extraction Applications in Prospect
An Introduction to Language Processing with Perl and Prolog
Speech and Language Processing (Jurafsky)
Text Mining Application Programming
The Text Mining Handbook
Taming Text
Algorithms of the Intelligent Web
Building Search Applications
IEEE Journal

Make sure you do a thorough evaluation before deploying such applications/algorithms into production, as they can recursively increase your data storage requirements. You could use AWS/Hadoop for clustering and Mahout for large-scale classification, amongst others. Store your datasets in MongoDB, or as unstructured dumps in Jackrabbit, etc. Try experimenting with prototypes first. There are various archives you can base your training on, such as the Reuters corpus, TIPSTER, TREC, etc. You can even check out AlchemyAPI, GATE, UIMA, OpenNLP, etc.

Building extractions from standard text is easier than from, say, a web document, so representation at the pre-processing step becomes even more crucial: it defines what exactly you are trying to extract from a standardized document representation.

Standard measures include precision, recall, and the F1 measure, amongst others.
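For reference, here is a minimal sketch of how these three measures are computed when you compare a set of extracted items against a hand-labeled gold set. The function name and the toy data are assumptions for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Compare extracted items against a gold standard (both as sets)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: what fraction of the extracted items are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of the gold items were found.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold labels and system output, for illustration only.
gold = {
    ("Paris", "France"),
    ("Berlin", "Germany"),
    ("Rome", "Italy"),
    ("Madrid", "Spain"),
}
predicted = {("Paris", "France"), ("Berlin", "Germany"), ("Rome", "France")}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Here two of the three extracted pairs are correct (precision 2/3) and two of the four gold pairs were found (recall 1/2), which gives an F1 of 4/7.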


The star"

#7 · 2019-03-11 10:51

I would suggest you take a look at the Natural Language Toolkit (nltk) and the NLTK Book. Both are available for free and are great learning tools.
