I'm trying to classify my own text data into two categories with the following code:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Load the text data
categories = [
    'CLASS_1',
    'CLASS_2',
]
text_train_subset = load_files('train',
                               categories=categories)
text_test_subset = load_files('test',
                              categories=categories)
# Turn the text documents into vectors of word frequencies
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(text_train_subset)
y_train = text_train_subset.target
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))
# Evaluate the classifier on the testing set
X_test = vectorizer.transform(text_test_subset.data)
y_test = text_test_subset.target
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))
Following the documentation, I set up this directory structure for the code above:
data_folder/
    train_folder/
        CLASS_1.txt CLASS_2.txt
    test_folder/
        test.txt
Then I get this error:
% (size, n_samples))
ValueError: Found array with dim 0. Expected 5
I also tried fit_transform but got the same error. How can I solve this dimension problem?
The first problem is that your directory structure is wrong. load_files expects one subfolder per category, with each document as a separate file inside it:
container_folder/
    CLASS_1_folder/
        file_1.txt, file_2.txt ...
    CLASS_2_folder/
        file_1.txt, file_2.txt, ....
You need to have both the train and test set in this directory structure. Alternatively, you can have all data in one directory and use train_test_split to split it in two.
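As a rough sketch of the single-directory alternative, the snippet below builds the expected layout in a temporary folder, loads it with load_files, and splits it with train_test_split. The toy documents, the 25% test size, and the random_state are assumptions made up for illustration; only the folder names CLASS_1 and CLASS_2 come from the question.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: category name -> documents (made up for illustration).
corpus = {
    "CLASS_1": ["the cat sat on the mat", "a cat chased the mouse",
                "cats and kittens purr", "my cat likes fish"],
    "CLASS_2": ["the stock market fell", "investors sold their shares",
                "the bank raised interest rates", "shares and bonds rallied"],
}

with tempfile.TemporaryDirectory() as container:
    # Build container/CLASS_x/file_i.txt, one file per document.
    for category, docs in corpus.items():
        folder = Path(container) / category
        folder.mkdir()
        for i, doc in enumerate(docs, start=1):
            (folder / f"file_{i}.txt").write_text(doc, encoding="utf-8")

    # Each subfolder name becomes a class label; file contents are read here.
    dataset = load_files(container, encoding="utf-8")

print(dataset.target_names)
print(len(dataset.data))

# Split the documents and labels together; stratify keeps both classes
# represented in the training and the test part.
docs_train, docs_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target,
    test_size=0.25, stratify=dataset.target, random_state=0)
```

The split is done on the raw documents, before vectorizing, so the vectorizer can then be fitted on docs_train only and applied to docs_test with transform.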
Secondly, load_files returns a Bunch object whose documents are stored in its .data attribute, so
X_train = vectorizer.fit_transform(text_train_subset)
needs to be
X_train = vectorizer.fit_transform(text_train_subset.data) # added .data
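To see why the missing .data matters, here is a small sketch using a hand-built Bunch as a stand-in for what load_files returns (the documents and labels are invented). A Bunch is a dict subclass, so iterating over it yields its key names, and the vectorizer ends up treating those key names as the documents. That would plausibly explain the "Expected 5" in the error, if the Bunch in the question had five keys.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils import Bunch

# Toy stand-in for the object returned by load_files (contents are made up).
bunch = Bunch(
    data=["first document text", "second document text", "third one"],
    target=[0, 1, 0],
)

# Iterating over a Bunch gives its keys, not the documents.
print(list(bunch))

# Passing the Bunch itself vectorizes the key names: one row per key.
X_wrong = CountVectorizer().fit_transform(bunch)

# Passing .data vectorizes the documents: one row per document.
X_right = CountVectorizer().fit_transform(bunch.data)

print(X_wrong.shape[0])
print(X_right.shape[0])
```

Only the second matrix has one row per label in target, which is what the classifier's fit method needs.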
Here is a complete and working example:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
text_train_subset = load_files('sample-data/web')
text_test_subset = text_train_subset # load your actual test data here
# Turn the text documents into vectors of word frequencies
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(text_train_subset.data)
y_train = text_train_subset.target
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))
# Evaluate the classifier on the testing set
X_test = vectorizer.transform(text_test_subset.data)
y_test = text_test_subset.target
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))
The directory structure of sample-data/web is:
sample-data/web
├── de
│   ├── apollo8.txt
│   ├── fiv.txt
│   └── habichtsadler.txt
└── en
    ├── elizabeth_needham.txt
    ├── equipartition_theorem.txt
    ├── sunderland_echo.txt
    └── thespis.txt