Issues tokenizing text

Posted 2019-08-17 08:49

Question:

I've started analysing text and eventually ran into the need to download corpora, using PyCharm 2019 as my IDE. I'm not really sure what the traceback wants me to do, since I already enabled the corpora through PyCharm's own library/package interface. Why does an error stating that the corpora are not available to the code keep reappearing?

I imported TextBlob and ran lines like the ones below:

from textblob import TextBlob

TextBlob(train['tweet'][1]).words  # 'train' is the tweet DataFrame loaded earlier in the script

print("\nPRINT TOKENIZATION")  # my own marker so I can tell which output this step produces
print(TextBlob(train['tweet'][1]).words)
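
For context, train is a pandas DataFrame of tweets loaded earlier in the script, roughly like this (the file name here is only a placeholder, not the real one):

import pandas as pd

# Placeholder file name; the real script loads its own training data.
train = pd.read_csv('train_tweets.csv')  # must contain a 'tweet' column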

….

I also tried installing the data via nltk, with no luck: I get an error when downloading 'brown.tei'.
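
This is roughly what I ran to trigger that download (a sketch from memory, run in the project's venv console):

import nltk

# Opens the Tkinter-based NLTK downloader window; picking the Brown corpus
# there is what produced the AssertionError in the traceback below.
nltk.download()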

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
Exception in Tkinter callback
Traceback (most recent call last):
  File "C:\Users\jcst\AppData\Local\Programs\Python\Python37-32\lib\tkinter\__init__.py", line 1705, in __call__
    return self.func(*args)
  File "C:\Users\jcst\PycharmProjects\TextMining\venv\lib\site-packages\nltk\downloader.py", line 1796, in _download
    return self._download_threaded(*e)
  File "C:\Users\jcst\PycharmProjects\TextMining\venv\lib\site-packages\nltk\downloader.py", line 2082, in _download_threaded
    assert self._download_msg_queue == []
AssertionError

Traceback (most recent call last):
  File "C:\Users\jcst\PycharmProjects\TextMining\venv\lib\site-packages\textblob\decorators.py", line 35, in decorated
    return func(*args, **kwargs)
  File "C:\Users\jcst\PycharmProjects\TextMining\venv\lib\site-packages\textblob\tokenizers.py", line 57, in tokenize
    return nltk.tokenize.sent_tokenize(text)
  File "C:\Users\jcst\PycharmProjects\TextMining\venv\lib\site-packages\nltk\tokenize\__init__.py", line 104, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "C:\Users\jcst\PycharmProjects\TextMining\venv\lib\site-packages\nltk\data.py", line 870, in load
    opened_resource = _open(resource_url)


  File "C:\Users\jcst\PycharmProjects\TextMining\venv\lib\site-packages\nltk\data.py", line 995, in _open
    return find(path, path + ['']).open()
  File "C:\Users\jcst\PycharmProjects\TextMining\venv\lib\site-packages\nltk\data.py", line 701, in find
    raise LookupError(resource_not_found)
LookupError:
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

    import nltk
    nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\Users\jcst/nltk_data'
    - 'C:\Users\jcst\PycharmProjects\TextMining\venv\nltk_data'
    - 'C:\Users\jcst\PycharmProjects\TextMining\venv\share\nltk_data'
    - 'C:\Users\jcst\PycharmProjects\TextMining\venv\lib\nltk_data'
    - 'C:\Users\jcst\AppData\Roaming\nltk_data'
    - 'C:\nltk_data'
    - 'D:\nltk_data'
    - 'E:\nltk_data'
    - ''


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/jcst/PycharmProjects/TextMining/ModuleImportAndTrainFileIntro.py", line 151, in <module>
    TextBlob(train['tweet'][1]).words
  File "C:\Users\jcst\PycharmProjects\TextMining\venv\lib\site-packages\textblob\decorators.py", line 24, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "C:\Users\jcst\PycharmProjects\TextMining\venv\lib\site-packages\textblob\blob.py", line 649, in words
    return WordList(word_tokenize(self.raw, include_punc=False))
  File "C:\Users\jcst\PycharmProjects\TextMining\venv\lib\site-packages\textblob\tokenizers.py", line 73, in word_tokenize
    for sentence in sent_tokenize(text))
  File "C:\Users\jcst\PycharmProjects\TextMining\venv\lib\site-packages\textblob\base.py", line 64, in itokenize
    return (t for t in self.tokenize(text, *args, **kwargs))
  File "C:\Users\jcst\PycharmProjects\TextMining\venv\lib\site-packages\textblob\decorators.py", line 38, in decorated
    raise MissingCorpusError()
textblob.exceptions.MissingCorpusError:
Looks like you are missing some required data for this feature.

To download the necessary data, simply run

python -m textblob.download_corpora

or use the NLTK downloader to download the missing data: http://nltk.org/data.html If this doesn't fix the problem, file an issue at https://github.com/sloria/TextBlob/issues.
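
Based on that message, this is what I plan to try next (a sketch; the download_dir below is just one of the folders listed under "Searched in:" above, not something I have verified yet):

import nltk

# Fetch the missing resources directly, without the Tkinter GUI, into a
# folder that is already on NLTK's search path for this venv.
target_dir = r'C:\Users\jcst\PycharmProjects\TextMining\venv\nltk_data'
nltk.download('punkt', download_dir=target_dir)  # sentence tokenizer TextBlob asks for
nltk.download('brown', download_dir=target_dir)  # corpus the GUI download choked on

Alternatively, the python -m textblob.download_corpora command from the message above should pull the same data into the default user location.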

Tags: text mining