Using the POS Tagger of Stanford NLP .NET, I'm trying to extract a detailed list of part-of-speech tags per sentence.
For example: "Have a look over there. Look at the car!"
Have/VB a/DT look/NN over/IN there/RB ./. Look/VB at/IN the/DT car/NN !/.
I need:
- POS Text: "Have"
- POS tag: "VB"
- Position in the original text
I managed to achieve this by accessing the private fields of the result via reflection.
I know it's ugly, inefficient and very bad practice, but it's the only way I have found so far. Hence my question: is there any built-in way to access this information?
    using (var streamReader = new StringReader(rawText))
    {
        var tokenizedSentences = MaxentTagger.tokenizeText(streamReader).toArray();
        foreach (ArrayList tokenizedSentence in tokenizedSentences)
        {
            var taggedSentence = _posTagger.tagSentence(tokenizedSentence).toArray();
            for (int index = 0; index < taggedSentence.Length; index++)
            {
                var partOfSpeech = ((StringLabel) (taggedSentence[index]));
                var posText = partOfSpeech.value();
                // Workaround: read the non-public fields via reflection.
                var posTag = ReflectionHelper.GetInstanceField(typeof (TaggedWord), partOfSpeech, "tag") as string;
                var posBeginPosition = (int)ReflectionHelper.GetInstanceField(typeof (StringLabel), partOfSpeech, "beginPosition");
                var posEndPosition = (int)ReflectionHelper.GetInstanceField(typeof (StringLabel), partOfSpeech, "endPosition");
                // process the pos
            }
        }
    }
ReflectionHelper:
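    using System;
    using System.Reflection;

    public static class ReflectionHelper
    {
        // Returns the value of the named field on instance, or null if it is not found.
        // The declaring type is passed explicitly because GetField on a derived type
        // does not return private fields declared on a base class.
        public static object GetInstanceField(Type type, object instance, string fieldName)
        {
            const BindingFlags bindFlags = BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Static;
            object result = null;
            var field = type.GetField(fieldName, bindFlags);
            if (field != null)
            {
                result = field.GetValue(instance);
            }
            return result;
        }
    }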
The solution is pretty easy: cast each element of the tagged sentence (taggedSentence[index]) to TaggedWord instead of StringLabel. You can then read the values through the public getters beginPosition(), endPosition(), tag() and value(), so no reflection is needed.
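As a rough sketch of that cast, the loop from the question might look like the following. It assumes the IKVM-generated namespaces used by Stanford NLP .NET. The class name PosExtractor, the method name PrintTags and the modelPath parameter are made up for illustration, and the tagger model file has to be supplied separately.

```csharp
using System;
using java.io;
using java.util;
using edu.stanford.nlp.ling;
using edu.stanford.nlp.tagger.maxent;

public static class PosExtractor   // hypothetical helper class, not part of the library
{
    // Prints the text, tag and character offsets of every token in rawText.
    // modelPath is an assumed path to a tagger model file.
    public static void PrintTags(string modelPath, string rawText)
    {
        var tagger = new MaxentTagger(modelPath);
        var sentences = MaxentTagger.tokenizeText(new StringReader(rawText)).toArray();

        foreach (List sentence in sentences)
        {
            // The explicit TaggedWord type in foreach performs the cast per element.
            foreach (TaggedWord word in tagger.tagSentence(sentence).toArray())
            {
                Console.WriteLine("{0}/{1} begin={2} end={3}",
                    word.value(), word.tag(), word.beginPosition(), word.endPosition());
            }
        }
    }
}
```

Compared with the reflection approach, this relies only on public members, so it should keep working if the library renames its private fields.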