Parsing Meaning from Text

Posted 2019-02-01 02:25

I realize this is a broad topic, but I'm looking for a good primer on parsing meaning from text, ideally in Python. As an example of what I'm looking to do, if a user makes a blog post like:

"Manny Ramirez makes his return for the Dodgers today against the Houston Astros",

what's a lightweight, easy way of getting the nouns out of a sentence? To start, I think I'd limit it to proper nouns, but I wouldn't want to be limited to just that (and I don't want to rely on a simple regex that assumes anything Title Capped is a proper noun).

To make this question even worse, what are the things I'm not asking that I should be? Do I need a corpus of existing words to get started? What lexical analysis stuff do I need to know to make this work? I did come across one other question on the topic and I'm digging through those resources now.

7 answers
放荡不羁爱自由
#2 · 2019-02-01 02:56

Regular expressions can help in some scenarios. For a detailed example, see the post "What’s the Most Mentioned Scanner on CNET Forum", which used a regular expression to find every scanner mentioned in CNET forum posts.

The post used the following regular expression:

(?i)((?:\w+\s\w+\s(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)\s(\w+\s){0,1}(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner))|(?:(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner)\s(\w+\s){1,2}(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)))

It matches either of the following patterns:

  • two words, then a model number (or “all-in-one”), then optionally one more word, then a scanner keyword such as “scanner”
  • a scanner keyword such as “scanner”, then one or two words, then a model number (or “all-in-one”)

The text extracted from the posts looked like this:

  1. discontinued HP C9900A photo scanner
  2. scanning his old x-rays
  3. new Epson V700 scanner
  4. HP ScanJet 4850 scanner
  5. Epson Perfection 3170 scanner

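This regular expression approach worked to a degree, but not perfectly: item 2, “scanning his old x-rays”, is a false positive, because “x-rays” happens to fit the model-number part of the pattern.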
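
To try the idea in Python, here is a minimal sketch that applies the same pattern with the standard `re` module. The names `MODEL`, `KEYWORD`, `PATTERN`, `extract_mentions`, and the sample sentences are my own, not from the post. I split the regex into named pieces and made the groups non-capturing for readability; the matching logic is meant to be the same as in the expression above.

```python
import re

# Model-number shape: digits+letter, letters+digit (or hyphen), or a bare digit,
# followed by more letters/digits/hyphens; "all-in-one" also counts.
MODEL = r"(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)"

# Scanner keywords, in the same order as in the original expression.
KEYWORD = (
    r"(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|"
    r"document scanner|printer scanner|portable scanner|handheld scanner|"
    r"printer\/scanner)"
)

# Two words, a model number, an optional extra word, then a keyword.
BEFORE = r"\w+\s\w+\s" + MODEL + r"\s(?:\w+\s)?" + KEYWORD
# A keyword, one or two words, then a model number.
AFTER = KEYWORD + r"\s(?:\w+\s){1,2}" + MODEL

PATTERN = re.compile("(?:" + BEFORE + ")|(?:" + AFTER + ")", re.IGNORECASE)


def extract_mentions(text):
    """Return every substring of `text` that matches the scanner pattern."""
    return [m.group(0) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    # Hypothetical sample text, written to resemble the extracted phrases.
    sample = (
        "I just bought a new Epson V700 scanner. "
        "My HP ScanJet 4850 scanner still works. "
        "He was scanning his old x-rays."
    )
    for mention in extract_mentions(sample):
        print(mention)
```

A pattern like this is tied to one product category and one phrasing style, so it is best treated as a quick first pass rather than a general way to pull nouns out of text.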
