Reading the NLTK documentation, I found that it is possible to extract tuples with str2tuple(). For instance, assume I have the following sentence (in reality it is a much larger file):
sent = "pero pero CC " \
"tan tan RG " \
"antigua antiguo AQ0FS0 " \
"que que CS " \
"según según SPS00 " \
"mi mi DP1CSS " \
"madre madre NCFS000"
I would like to extract a list of tuples, e.g.:
> ([antigua, AQ0FS0],[madre, NCFS000])
These are the feminine adjective tag (AQ0FS0) and the feminine noun tag (NCFS000). Is this possible with str2tuple(), or would a regular expression be a better approach?
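From what I understand of the docs, str2tuple() expects each token in word/TAG form (the separator defaults to /), so on its own it behaves like this:

import nltk as nl
# str2tuple turns a 'word/TAG' token into a (word, tag) tuple
nl.tag.str2tuple('madre/NCFS000')
# ('madre', 'NCFS000')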
This is what I have tried:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import nltk as nl
sent = "pero pero CC " \
"tan tan RG " \
"antigua antiguo AQ0FS0 " \
"que que CS " \
"según según SPS00 " \
"mi mi DP1CSS " \
"madre madre NCFS000"
[nl.tag.str2tuple(t) for t in sent.split()]
I think what you have is a verticalized text file, a.k.a. .vrt; see the CWB corpus encoding documentation.
I guess the first column is the surface form of the word, the second is some sort of lemma, and the third is the part-of-speech tag.
First, take a look at the csv module; I find this tutorial helpful: http://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/
Let's say you have a tab-delimited file as such:
pero pero CC
tan tan RG
antigua antiguo AQ0FS0
que que CS
según según SPS00
mi mi DP1CSS
madre madre NCFS000
To read the file (some people call this "parsing" the file):
import csv
with open('test.txt', 'r') as fin:
    reader = csv.reader(fin, delimiter='\t')
    for line in reader:
        word, lemma, pos = line
        print word, lemma, pos
To get the (word, pos) tuple structure for the sentence, try:
import csv
sentences = []
with open('test.txt', 'r') as fin:
    reader = csv.reader(fin, delimiter='\t')
    for line in reader:
        word, lemma, pos = line
        sentences.append((word, pos))
print sentences
[out]:
[('pero', 'CC'), ('tan', 'RG'), ('antigua', 'AQ0FS0'), ('que', 'CS'), ('seg\xc3\xban', 'SPS00'), ('mi', 'DP1CSS'), ('madre', 'NCFS000')]
Since you're presumably interested in using your corpus with the NLTK: assuming your file is stored in this format, you should read it in, parse it (using str2tuple or other, simpler methods) and load it with TaggedCorpusReader. Then you can use all the standard NLTK corpus functions with it. You basically have two types of tags, part of speech and (presumably) word lemma. If this is what you're after, I can add more specific information to this answer.
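For example, here is a minimal sketch (the file names are my assumptions, adjust them to your setup): convert the three-column file into the word/TAG format that TaggedCorpusReader reads by default, then point the reader at it:

from nltk.corpus.reader import TaggedCorpusReader

# Convert the three-column file (word lemma tag) into word/TAG tokens,
# the default format TaggedCorpusReader expects; 'test.txt' and
# 'tagged.txt' are assumed file names.
with open('test.txt') as fin, open('tagged.txt', 'w') as fout:
    tokens = []
    for line in fin:
        parts = line.split()
        if len(parts) == 3:
            word, lemma, pos = parts
            tokens.append('%s/%s' % (word, pos))
    fout.write(' '.join(tokens) + '\n')

# Load it as a tagged corpus; the standard corpus methods then work on it.
reader = TaggedCorpusReader('.', ['tagged.txt'])
print reader.tagged_words()
print reader.tagged_sents()

Note that this keeps only the part-of-speech column; the lemma column would need a different representation or a custom corpus reader.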
Assuming your string actually includes a newline after each triple, the easy way to parse it into a list of tuples is like this:
sent = """pero pero CC
tan tan RG
antigua antiguo AQ0FS0
que que CS
según según SPS00
mi mi DP1CSS
madre madre NCFS000"""
tuples = [ line.split() for line in sent.splitlines() ]
A detail: split() actually returns a list, not a tuple. If you need to use them as dictionary keys, replace line.split() with tuple(line.split()).
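If you then only want the entries from the question (the feminine adjective and the feminine noun), a small follow-up sketch that filters on the tag column:

# Keep only the triples tagged as feminine adjective or feminine noun,
# dropping the lemma column.
wanted = [(word, pos) for word, lemma, pos in tuples
          if pos in ('AQ0FS0', 'NCFS000')]
print wanted  # [('antigua', 'AQ0FS0'), ('madre', 'NCFS000')]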