I'm trying to classify my own text data into two categories, using the following code:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Load the text data
categories = [
'CLASS_1',
'CLASS_2',
]
text_train_subset = load_files('train',
categories=categories)
text_test_subset = load_files('test',
categories=categories)
# Turn the text documents into vectors of word frequencies
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(text_train_subset)
y_train = text_train_subset.target
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
classifier.score(X_train, y_train) * 100))
# Evaluate the classifier on the testing set
X_test = vectorizer.transform(text_test_subset.data)
y_test = text_test_subset.target
print("Testing score: {0:.1f}%".format(
classifier.score(X_test, y_test) * 100))
Following the documentation, I set up this directory structure for the code above:
data_folder/
    train_folder/
        CLASS_1.txt  CLASS_2.txt
    test_folder/
        test.txt
Then I get this error:
% (size, n_samples))
ValueError: Found array with dim 0. Expected 5
I also tried fit_transform, but I still get the same error. How can I solve this dimension problem?
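The first problem is that your directory structure is wrong. load_files expects one subfolder per category, with each document in its own file inside that subfolder (the file names themselves do not matter). You need it to be like
train/
    CLASS_1/
        file_1.txt
        file_2.txt
    CLASS_2/
        file_3.txt
        file_4.txt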
Both the train and test sets need to be in this directory structure. Alternatively, you can keep all the data in one directory and use train_test_split to split it in two.
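The single-directory alternative could look roughly like the sketch below. The toy documents, the file names and the temporary directory are assumptions made up for illustration; only the one-folder-per-class layout and the load_files / train_test_split calls matter.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split

# Hypothetical toy documents: one folder per class, one file per document.
docs = {
    "CLASS_1": ["the cat sat on the mat", "cats and dogs",
                "a dog barked", "my cat purrs"],
    "CLASS_2": ["stocks fell today", "the market rallied",
                "bond yields rose", "shares dropped"],
}

# Write everything into a single container directory (a temp dir here).
root = Path(tempfile.mkdtemp())
for label, texts in docs.items():
    (root / label).mkdir()
    for i, text in enumerate(texts):
        (root / label / f"doc_{i}.txt").write_text(text, encoding="utf-8")

# Folder names become the class labels; encoding= makes .data a list of str.
dataset = load_files(str(root), categories=["CLASS_1", "CLASS_2"],
                     encoding="utf-8")

# Split the raw documents and labels into train and test parts.
docs_train, docs_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target,
    test_size=0.25, random_state=0, stratify=dataset.target,
)
print(len(docs_train), "training documents,", len(docs_test), "test documents")
```

The vectorizer would then be fitted on docs_train and applied to docs_test, exactly as with two separate directories.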
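Secondly,
    X_train = vectorizer.fit_transform(text_train_subset)
needs to be
    X_train = vectorizer.fit_transform(text_train_subset.data)
because load_files returns a Bunch object, and the documents themselves are in its data attribute (you already do this correctly for the test set).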
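To see why passing the Bunch itself goes wrong, here is a small sketch. It builds a Bunch by hand with made-up documents (an assumption for illustration, standing in for what load_files returns). A Bunch is a dict subclass, so iterating over it yields its keys rather than the documents.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils import Bunch

# Hand-built stand-in for the object load_files returns (toy data).
bunch = Bunch(
    data=["spam spam eggs", "eggs and ham", "ham sandwich"],
    target=[0, 1, 1],
)

# Wrong: the vectorizer iterates over the Bunch's keys ("data", "target"),
# so it produces one row per key instead of one row per document.
X_wrong = CountVectorizer().fit_transform(bunch)

# Right: pass the list of documents.
X_right = CountVectorizer().fit_transform(bunch.data)

print("rows when passing the Bunch:", X_wrong.shape[0])
print("rows when passing .data:   ", X_right.shape[0])
print("number of labels:          ", len(bunch.target))
```

Only the second matrix has as many rows as there are labels, which is what the classifier's fit needs.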
Here is a complete and working example:
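A minimal end-to-end sketch along these lines, assuming a temporary directory filled with made-up documents stands in for the real train and test folders (the texts and file names are hypothetical):

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

categories = ["CLASS_1", "CLASS_2"]

# Hypothetical toy corpus: split -> class -> documents.
corpus = {
    "train": {
        "CLASS_1": ["the cat chased the mouse", "my dog likes the cat",
                    "the dog barked at the cat"],
        "CLASS_2": ["the stock market fell", "investors sold stock today",
                    "the market rallied on earnings"],
    },
    "test": {
        "CLASS_1": ["the cat and the dog"],
        "CLASS_2": ["stock market earnings"],
    },
}

# Create <root>/<split>/<class>/<document>.txt, the layout load_files expects.
root = Path(tempfile.mkdtemp())
for split, classes in corpus.items():
    for label, texts in classes.items():
        folder = root / split / label
        folder.mkdir(parents=True)
        for i, text in enumerate(texts):
            (folder / f"doc_{i}.txt").write_text(text, encoding="utf-8")

# Load the text data
text_train_subset = load_files(str(root / "train"), categories=categories,
                               encoding="utf-8")
text_test_subset = load_files(str(root / "test"), categories=categories,
                              encoding="utf-8")

# Turn the text documents into vectors of word counts (note the .data)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(text_train_subset.data)
y_train = text_train_subset.target
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))

# Evaluate the classifier on the testing set
X_test = vectorizer.transform(text_test_subset.data)
y_test = text_test_subset.target
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))
```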
The directory structure of sample-data/web is