Reading the NLTK documentation, I found that it is possible to extract tuples with str2tuple(). For instance, assume I have the following sentence (in reality it is a much larger file):
sent = "pero pero CC " \
"tan tan RG " \
"antigua antiguo AQ0FS0 " \
"que que CS " \
"según según SPS00 " \
"mi mi DP1CSS " \
"madre madre NCFS000"
I would like to extract a list of tuples, e.g.:
> ([antigua, AQ0FS0],[madre, NCFS000])
These are the feminine adjective tag (AQ0FS0) and the feminine noun tag (NCFS000). Is this possible with str2tuple(), or would a regular expression be a better approach?
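From what I understand of the docs, str2tuple() expects each token in word/TAG form (the separator defaults to /), so on its own it behaves like this:

import nltk as nl
# str2tuple turns a 'word/TAG' token into a (word, tag) tuple
nl.tag.str2tuple('madre/NCFS000')
# ('madre', 'NCFS000')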
This is what I have tried:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import nltk as nl
sent = "pero pero CC " \
"tan tan RG " \
"antigua antiguo AQ0FS0 " \
"que que CS " \
"según según SPS00 " \
"mi mi DP1CSS " \
"madre madre NCFS000"
[nl.tag.str2tuple(t) for t in sent.split()]
I think what you have is a verticalized text file, a.k.a. .vrt; see the CWB corpus encoding documentation.
I guess the first column is the surface form of the word, the second is some sort of lemma, and the third is the part-of-speech tag.
First, take a look at the csv module; I find this tutorial helpful: http://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/
Let's say you have a tab-delimited file as such:
pero pero CC
tan tan RG
antigua antiguo AQ0FS0
que que CS
según según SPS00
mi mi DP1CSS
madre madre NCFS000
To read the file (some people call this "parsing" the file):
import csv
with open('test.txt', 'r') as fin:
    reader = csv.reader(fin, delimiter='\t')
    for line in reader:
        word, lemma, pos = line
        print word, lemma, pos
To get the (word, pos) tuple structure for the sentence, try:
import csv
sentences = []
with open('test.txt', 'r') as fin:
    reader = csv.reader(fin, delimiter='\t')
    for line in reader:
        word, lemma, pos = line
        sentences.append((word, pos))
print sentences
[out]:
[('pero', 'CC'), ('tan', 'RG'), ('antigua', 'AQ0FS0'), ('que', 'CS'), ('seg\xc3\xban', 'SPS00'), ('mi', 'DP1CSS'), ('madre', 'NCFS000')]
Since you're presumably interested in using your corpus with the NLTK: assuming your file is stored in this format, you should read it in, parse it (using str2tuple or other, simpler methods) and load it with TaggedCorpusReader. Then you can use all the standard NLTK corpus functions with it. You basically have two types of tags, part of speech and (presumably) word lemma. If this is what you're after, I can add more specific information to this answer.
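For example, here is a minimal sketch (the file names are my assumptions, adjust them to your setup): convert the three-column file into the word/TAG format that TaggedCorpusReader reads by default, then point the reader at it:

from nltk.corpus.reader import TaggedCorpusReader

# Convert the three-column file (word lemma tag) into word/TAG tokens,
# the default format TaggedCorpusReader expects; 'test.txt' and
# 'tagged.txt' are assumed file names.
with open('test.txt') as fin, open('tagged.txt', 'w') as fout:
    tokens = []
    for line in fin:
        parts = line.split()
        if len(parts) == 3:
            word, lemma, pos = parts
            tokens.append('%s/%s' % (word, pos))
    fout.write(' '.join(tokens) + '\n')

# Load it as a tagged corpus; the standard corpus methods then work on it.
reader = TaggedCorpusReader('.', ['tagged.txt'])
print reader.tagged_words()
print reader.tagged_sents()

Note that this keeps only the part-of-speech column; the lemma column would need a different representation or a custom corpus reader.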
Assuming your string actually includes a newline after each triple, the easy way to parse it into a list of tuples is like this:
sent = """pero pero CC
tan tan RG
antigua antiguo AQ0FS0
que que CS
según según SPS00
mi mi DP1CSS
madre madre NCFS000"""
tuples = [ line.split() for line in sent.splitlines() ]
A detail: split() actually returns a list, not a tuple. If you need to use them as dictionary keys, replace line.split() with tuple(line.split()).
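If you then only want the entries from the question (the feminine adjective and the feminine noun), a small follow-up sketch that filters on the tag column:

# Keep only the triples tagged as feminine adjective or feminine noun,
# dropping the lemma column.
wanted = [(word, pos) for word, lemma, pos in tuples
          if pos in ('AQ0FS0', 'NCFS000')]
print wanted  # [('antigua', 'AQ0FS0'), ('madre', 'NCFS000')]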