Some NLP stuff to do with grammar, tagging, stemmi

Posted 2019-03-11 13:00

Question:

Background (TL;DR: provided for the sake of completeness)

Seeking advice on an optimal solution to an odd requirement. I'm a (literature) student in my fourth year of college with only my own guidance in programming. I'm competent enough with Python that I won't have trouble implementing solutions I find (most of the time) and developing upon them, but because of my newbness, I'm seeking advice on the best ways I might tackle this peculiar problem.

Already using NLTK, but differently from the examples in the NLTK book. I'm already utilizing a lot of stuff from NLTK, particularly WordNet, so that material is not foreign to me. I've read most of the NLTK book. I'd know better how to proceed if I were trying to analyze existing texts or if the target texts were prose-like -- but my application is focused on poetry, particularly on building poetic texts on-the-fly, based on unforeseeable inputs from users.

I'm working with fragmentary, atomic language. My application moves word-by-word: each round, several users put in words (one word per user). My program seeks to unify or combine these input words to produce a single output word. I've developed the word-selection algorithm already -- it utilizes various features of WordNet to come up with its single-word result. The result is in the form of a WordNet synset -- an uninflected word (stripped of plurality and tense). The result gets appended to the text of the "poem" (after some whitespace). The addition of the resulting word influences the users' choices of what word to throw into the pot next, and that's how this game/program moves along, adding one machine-morphed word to the poem at a time.

The problem: How to inflect the result in a grammatically sensible way? Without any kind of grammatical processing, the results are just a list of dictionary-searchable words, without agreement between words. First step is for my application to stem/pluralize/conjugate/inflect root-words according to context. (The "root words" I'm speaking of are synsets from WordNet and/or their human-readable equivalents.) Imagining that there were already some grammatically sensible text in the poem to start off with, my application needs to inflect a new result-word to agree with the existing sequence. It's fine if this is only working on like a 3-word window or something, but I'm looking for advice on an optimal order of operations. I'm hoping that somebody can give me some pointers (I expect it to be difficult to implement, but I want to make sure I'm starting off with the right ideas).

Example scenario (less context more question)

Let's assume we already have a chunk of a poem, to which users are adding new inputs. The new results need to be inflected in a grammatically sensible way.

The river bears no empty bottles, sandwich papers,   
Silk handkerchiefs, cardboard boxes, cigarette ends  
Or other testimony of summer nights. The nymphs

Let's say my algorithm has taken a batch of inputs from users, and now needs to print 1 of the 4 possible next words/synsets (informally represented): ['departure', 'to have', 'blue', 'quick']. It seems to me that 'blue' should be discarded; 'The nymphs blue' seems grammatically odd/unlikely. From there it could use either of these verbs.

If it picks 'to have' the result could be sensibly inflected as 'had', 'have', 'having', 'will have', 'would have', etc. (but not 'has'). (The resulting line would be something like 'The nymphs have' and the sensibly-inflected result will provide better context for future results ...)

I'd like for 'departure' to be a valid possibility in this case; while 'The nymphs departure' doesn't make sense (it's not "nymphs'"), 'The nymphs departed' (or other verb conjugations) would.

Seemingly 'The nymphs quick' wouldn't make sense, but something like 'The nymphs quickly [...]' or 'The nymphs quicken' could, so 'quick' is also a possibility for sensible inflection.
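
To make the scenario above concrete, here is a minimal sketch of the filtering it describes. Everything in it is an assumption made for illustration: the `FORMS` table is hand-written (a real system would need a morphological lexicon or generator), the tags are Penn Treebank-style labels, and `ALLOWED_AFTER` is a hypothetical rule for what may follow a plural noun subject such as 'nymphs'.

```python
# Toy data, hand-written for this example only.
# Each candidate maps to (surface form, Penn Treebank-style tag) pairs.
FORMS = {
    "departure": [("departure", "NN"), ("departed", "VBD"), ("depart", "VBP")],
    "have": [("have", "VBP"), ("has", "VBZ"), ("had", "VBD"), ("having", "VBG")],
    "blue": [("blue", "JJ")],
    "quick": [("quick", "JJ"), ("quickly", "RB"), ("quicken", "VBP")],
}

# Hypothetical rule: tags considered sensible right after a plural noun (NNS).
ALLOWED_AFTER = {"NNS": {"VBP", "VBD", "VBG", "RB", "MD"}}


def sensible_forms(candidate, previous_tag):
    """Return the surface forms of `candidate` whose tag may follow `previous_tag`."""
    allowed = ALLOWED_AFTER.get(previous_tag, set())
    return [form for form, tag in FORMS.get(candidate, []) if tag in allowed]


for candidate in ["departure", "have", "blue", "quick"]:
    print(candidate, sensible_forms(candidate, "NNS"))
```

With this toy table, 'blue' has no surviving form and would be discarded, 'has' is filtered out because it carries the third-person-singular tag, and 'departure' and 'quick' survive only through their derived forms.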

Breaking down the tasks

  1. Tag part of speech, plurality, tense, etc. -- of original inputs. Taking note of this could help to select from the several possibilities (e.g. choosing between had/have/having could be more directed than random if a user had inputted 'having' rather than some other tense). I've heard the Stanford POS tagger is good, and NLTK provides an interface to it. I am not sure how to handle tense detection here.
  2. Consider context in order to rule out grammatically peculiar possibilities. Consider the last couple words and their part-of-speech tags (and tense?), as well as sentence boundaries if any, and from that, drop things that wouldn't make sense. After 'The nymphs' we don't want another article (or determiner, as far as I can tell), nor an adjective, but an adverb or verb could work. Comparison of the current stuff with sequences in tagged corpora (and/or Markov chains?) -- or consultation of grammar-checking functions -- could provide a solution for this.
  3. Select a word from the remaining possibilities (those that could be inflected sensibly). This isn't something I need an answer for -- I've got my methods for this. Let's say it's randomly selected.
  4. Transform the selected word as needed. If the information from #1 can be folded in (for example, perhaps the "pluralize" flag was set to True), do so. If there are several possibilities (e.g. picked word is a verb, but a few tenses are possible), select randomly. Regardless, I'm going to need to morph the word before inserting it into the "poem".
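
Steps 1 and 4 above could be tied together roughly as follows. This is only a sketch under loud assumptions: `guess_verb_tag` is a crude, hypothetical suffix heuristic standing in for a trained POS tagger (it will mislabel words like 'ring' or 'bed'), and `choose_form` assumes the forms it receives have already been filtered for context in step 2.

```python
import random


def guess_verb_tag(word):
    """Crude, hypothetical stand-in for a POS tagger: guess a tag from the suffix."""
    if word.endswith("ing"):
        return "VBG"
    if word.endswith("ed"):
        return "VBD"
    return None


def choose_form(forms_by_tag, user_inputs, rng=random):
    """Pick a surface form, preferring an inflection the users themselves typed.

    forms_by_tag maps a tag to one surface form of the selected word,
    e.g. {"VBP": "have", "VBD": "had", "VBG": "having"}.
    """
    for word in user_inputs:
        tag = guess_verb_tag(word)
        if tag in forms_by_tag:
            return forms_by_tag[tag]
    # No usable hint from the inputs: fall back to a random choice.
    return rng.choice(sorted(forms_by_tag.values()))


forms = {"VBP": "have", "VBD": "had", "VBG": "having"}
print(choose_form(forms, ["river", "having"]))
```

Here the user's 'having' steers the choice toward the '-ing' form; when no input carries a recognizable inflection, the function falls back to the random selection described in step 4.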

I'm looking for advice on the soundness of this routine, as well as suggestions for steps to add. Ways to break down these steps further would also be helpful. Finally I'm looking for suggestions on what tool might best accomplish each task.

I've tried to be as concise as possible, while providing enough information. Please don't hesitate to ask me for clarification! I'll appreciate any information I get, and I'll accept the clearest / most illuminating answer :) Thanks!

回答1:

I think that the comment above suggesting an n-gram language model fits your requirements better than parsing and tagging. Parsers and taggers (unless modified) will suffer from the lack of right context of the target word (i.e., you don't have the rest of the sentence available at query time). On the other hand, language models handle the past (left context) efficiently, especially for windows of up to 5 words. The problem with n-grams is that they don't model long-distance dependencies (anything more than n words away).
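
As a rough illustration of how such a model uses only the left context, here is a tiny bigram model in plain Python with add-one (Laplace) smoothing. It is a sketch, not NLTK's implementation: the three-sentence corpus is made up for the example, and a useful model would need far more text and probably a longer window.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count bigrams and their left contexts; '<s>' marks the sentence start."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens)
        contexts.update(tokens[:-1])          # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def probability(word, previous, contexts, bigrams, vocab):
    """Add-one smoothed estimate of P(word | previous)."""
    return (bigrams[(previous, word)] + 1) / (contexts[previous] + len(vocab))


# Made-up toy corpus, for illustration only.
corpus = [
    "the nymphs have departed",
    "the nymphs are departed",
    "the river has no bottles",
]
contexts, bigrams, vocab = train_bigrams(corpus)
p_have = probability("have", "nymphs", contexts, bigrams, vocab)
p_has = probability("has", "nymphs", contexts, bigrams, vocab)
print(p_have > p_has)
```

Because 'nymphs have' occurs in the toy corpus and 'nymphs has' does not, the model prefers 'have', without ever seeing what comes to the right of the target word.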

NLTK has a language model: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.model.ngram-pysrc.html . A tag lexicon may help you smooth the model more.

The steps as I see them:

  1. Get a set of words from the users.
  2. Create a larger set of all possible inflections of the words.
  3. Ask the model which inflected word is most probable.
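
The three steps might be sketched like this. Both tables are assumptions invented for the example: `INFLECTIONS` stands in for whatever morphological generator you end up using, and `BIGRAM_COUNTS` stands in for a trained language model.

```python
from collections import Counter

# Hypothetical inflection table; a real system would generate these forms.
INFLECTIONS = {
    "have": ["have", "has", "had", "having"],
    "depart": ["depart", "departs", "departed", "departing"],
}

# Made-up bigram counts standing in for a trained language model.
BIGRAM_COUNTS = Counter({
    ("nymphs", "have"): 3,
    ("nymphs", "departed"): 2,
    ("nymphs", "had"): 1,
})


def expand(words):
    """Step 2: collect every known inflection of every user word."""
    return [form for word in words for form in INFLECTIONS.get(word, [word])]


def most_probable(previous, candidates):
    """Step 3: pick the candidate seen most often after `previous`."""
    return max(candidates, key=lambda form: BIGRAM_COUNTS[(previous, form)])


user_words = ["have", "depart"]              # step 1: words from the users
candidates = expand(user_words)
print(most_probable("nymphs", candidates))
```

With raw counts, unseen forms all tie at zero, so in practice you would score candidates with a smoothed probability rather than a bare count.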