I reckoned that often the answer to my title is to go and read the documentation, and I did go through the NLTK book, but it doesn't give the answer. I'm kind of new to Python.
I have a bunch of .txt files, and I want to be able to use the corpus functions that NLTK provides for the corpora in nltk_data.
I've tried PlaintextCorpusReader, but I couldn't get further than:
>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = './'
>>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
>>> newcorpus.words()
How do I segment the sentences in newcorpus using punkt? I tried the punkt functions, but they couldn't read the PlaintextCorpusReader class.
Can you also show me how I can write the segmented data into text files?
I think the PlaintextCorpusReader already segments the input with a punkt tokenizer, at least if your input language is English (see PlaintextCorpusReader's constructor). You can pass the reader a word and a sentence tokenizer, but for the latter the default already is nltk.data.LazyLoader('tokenizers/punkt/english.pickle'). For a single string, a tokenizer would be used as follows (explained here; see section 5 for the punkt tokenizer).
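A minimal sketch (the sample sentence is just an illustration; the pickle path is the standard punkt English model):

>>> import nltk.data
>>> # Load the default English punkt sentence tokenizer.
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> text = "Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries. And sometimes sentences can start with non-capitalized words."
>>> tokenizer.tokenize(text)
['Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries.', 'And sometimes sentences can start with non-capitalized words.']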
After some years of figuring out how it works, here's an updated tutorial on how to create an NLTK corpus from a directory of textfiles.
The main idea is to make use of the nltk.corpus.reader package. If you have a directory of textfiles in English, it's best to use the PlaintextCorpusReader.
If you have a directory that looks like this:
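For instance (the directory and file names here are just placeholders):

    newcorpus/
        file1.txt
        file2.txt
        ...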
Simply use these lines of code and you can get a corpus:
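A minimal sketch, assuming the newcorpus/ directory from above:

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Directory containing the plaintext files (placeholder name).
corpusdir = 'newcorpus/'

# The second argument is a regex over fileids; '.*' matches every file,
# so each .txt in the directory becomes one document of the corpus.
newcorpus = PlaintextCorpusReader(corpusdir, '.*')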
NOTE: the PlaintextCorpusReader will use the default nltk.tokenize.sent_tokenize() and nltk.tokenize.word_tokenize() to split your texts into sentences and words. These functions are built for English, so they may NOT work for all languages.

Here's the full code with the creation of test textfiles, how to create a corpus with NLTK, and how to access the corpus at different levels:
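This is a self-contained sketch; the directory name, file names, and sample texts are all made up for the demonstration:

import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Create two toy texts, each to be written into its own textfile.
txt1 = "This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."
txt2 = "Are you a foo bar? Yes I am. Possibly, everyone is.\n"
corpus_texts = [txt1, txt2]

# Make a new directory for the corpus (placeholder name).
corpusdir = 'newcorpus/'
os.makedirs(corpusdir, exist_ok=True)

# Write each text to the directory as 1.txt, 2.txt, ...
for i, text in enumerate(corpus_texts, start=1):
    with open(os.path.join(corpusdir, str(i) + '.txt'), 'w') as fout:
        fout.write(text)

# Create the corpus reader from:
# (1) the root directory of the new corpus,
# (2) a regex matching the fileids -- '.*' takes every file in the directory.
newcorpus = PlaintextCorpusReader(corpusdir, '.*')

# Access the corpus at different levels.
print(newcorpus.fileids())        # the fileids, e.g. ['1.txt', '2.txt']
print(newcorpus.raw())            # the raw text of the whole corpus
print(newcorpus.raw('1.txt'))     # the raw text of a single file
print(newcorpus.words())          # word tokens of the whole corpus
print(newcorpus.words('1.txt'))   # word tokens of a single file
print(newcorpus.sents())          # sentences as lists of word tokens
print(newcorpus.paras())          # paragraphs as lists of sentences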
Finally, to read a directory of texts and create an NLTK corpus in another language, you must first ensure that you have Python-callable word tokenization and sentence tokenization modules that take a string as input and produce a list of strings as output:
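For English, the stock NLTK functions already have that shape; a quick check (the sample string is just illustrative):

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> txt = "This is a foo bar sentence. And this is the first txtfile in the corpus."
>>> sent_tokenize(txt)
['This is a foo bar sentence.', 'And this is the first txtfile in the corpus.']
>>> word_tokenize(sent_tokenize(txt)[0])
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']

Once you have equivalents for your language, wrap them in objects exposing a tokenize(string) method and pass them to PlaintextCorpusReader through its word_tokenizer and sent_tokenizer parameters.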