here is sample of the text file I am working with:
<Opera>
Tristan/NNP
and/CC
Isolde/NNP
and/CC
the/DT
fatalistic/NN
horns/VBZ
The/DT
passionate/JJ
violins/NN
And/CC
ominous/JJ
clarinet/NN
;/:
The capital letters after the forward slashes are weird tags. I want to be able to search the file for something like "NNP,CC,NNP"
and have the program return for this segment "Tristan and Isolde"
, the three words in a row that match those three tags in a row.
The problem I am having is I want the search string to be user inputed so it will always be different.
I can read the file and find one match but I do not know how to count backwards from that point to print the first word or how to find whether the next tag matches.
It appears your source text was possibly produced by Natural Language Toolkit (nltk).
Using nltk, you could tokenize the text, split the token into (word, part_of_speech) tuples, and iterate through ngrams to find those that match the pattern:
import nltk
pattern = 'NNP,CC,NNP'
pattern = [pat.strip() for pat in pattern.split(',')]
text = '''Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN horns/VBZ
The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN ;/:'''
tagged_token = [nltk.tag.str2tuple(word) for word in nltk.word_tokenize(text)]
for ngram in nltk.ingrams(tagged_token,len(pattern)):
if all(gram[1] == pat for gram,pat in zip(ngram,pattern)):
print(' '.join(word for word, pos in ngram))
yields
Tristan and Isolde
Related link:
- Categorizing and Tagging Words (chapter 5 of the NLTK book)
Build a regular expression dynamically from a list of tags you want to search:
text = ("Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN horns/VBZ "
"The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN")
tags = ["NNP", "CC", "NNP"]
tags_pattern = r"\b" + r"\s+".join(r"(\w+)/{0}".format(tag) for tag in tags) + r"\b"
# gives you r"\b(\w+)/NNP\s+(\w+)/CC\s+(\w+)/NNP\b"
from re import findall
print(findall(tags_pattern, text))
>>> import re
>>> s = "Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN horns/VBZ The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN ;/:"
>>> re.findall("(\w+)/NNP (\w+)/CC (\w+)/NNP", s)
[('Tristan', 'and', 'Isolde')]
Similarly, you can do what you need.
EDIT: More generalized.
>>> import re
>>> pattern = 'NNP,CC,NNP'
>>> pattern = pattern.split(",")
>>> p = ""
>>> for i in pattern:
... p = p + r"(\w+)/"+i+ r"\n"
>>> f = open("yourfile", "r")
>>> s = f.read()
>>> f.close()
>>> found = re.findall(p, s, re.MULTILINE)
>>> found #Saved in found
[('Tristan', 'and', 'Isolde')]
>>> found_str = " ".join(found[0]) #Converted to string
>>> f = open("written.txt", "w")
>>> f.write(found_str)
>>> f.close()