I'm doing a multiclass text classification in Scikit-Learn. The dataset is being trained using the Multinomial Naive Bayes classifier having hundreds of labels. Here's an extract from the Scikit Learn script for fitting the MNB model
from __future__ import print_function
# Read **`file.csv`** into a pandas DataFrame
import pandas as pd
path = 'data/file.csv'
merged = pd.read_csv(path, error_bad_lines=False, low_memory=False)
# define X and y using the original DataFrame
X = merged.text
y = merged.grid
# split X and y into training and testing sets;
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# create document-term matrices using CountVectorizer
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
# import and instantiate MultinomialNB
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
# fit a Multinomial Naive Bayes model
nb.fit(X_train_dtm, y_train)
# make class predictions
y_pred_class = nb.predict(X_test_dtm)
# generate classification report
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred_class))
And a simplified output of the metrics.classification_report on command line screen looks like this:
precision recall f1-score support
12 0.84 0.48 0.61 2843
13 0.00 0.00 0.00 69
15 1.00 0.19 0.32 232
16 0.75 0.02 0.05 965
33 1.00 0.04 0.07 155
4 0.59 0.34 0.43 5600
41 0.63 0.49 0.55 6218
42 0.00 0.00 0.00 102
49 0.00 0.00 0.00 11
5 0.90 0.06 0.12 2010
50 0.00 0.00 0.00 5
51 0.96 0.07 0.13 1267
58 1.00 0.01 0.02 180
59 0.37 0.80 0.51 8127
7 0.91 0.05 0.10 579
8 0.50 0.56 0.53 7555
avg/total 0.59 0.48 0.45 35919
I was wondering if there was any way to get the report output into a standard csv file with regular column headers
When I send the command line output into a csv file or try to copy/paste the screen output into a spreadsheet - Openoffice Calc or Excel, It lumps the results in one column. Looking like this:
Help appreciated. Thanks!
As mentioned in one of the posts in here,
precision_recall_fscore_support
is analogous toclassification_report
.Then it suffices to use python library
pandas
to easily format the data in a columnar format, similar to whatclassification_report
does. Here is an example:With
classification_report
You'll get something like:Output:
Then with our custom funtion
pandas_classification_report
:Output:
Then just save it to csv format (refer to here for other separator formating like sep=';'):
I open
my_csv_file.csv
with LibreOffice Calc (although you could use any tabular/spreadsheet editor like excel):I have modified @kindjacket's answer. Try this:
You can just export that df to csv using pandas
The way I have always solved output problems is like what I've mentioned in my previous comment, I've converted my output to a DataFrame. Not only is it incredibly easy to send to files (see here), but Pandas is really easy to manipulate the data structure. The other way I have solved this is writing the output line-by-line using CSV and specifically using
writerow
.If you manage to get the output into a dataframe it would be
or if using CSV it would be something like the example they provide in the CSV link.
Along with example input-output, here's the other function metrics_report_to_df(). Implementing precision_recall_fscore_support from Sklearn metrics should do:
Another option is to calculate the underlying data and compose the report on your own. All the statistics you will get by
We can get the actual values from the precision_recall_fscore_support function and then put them into data frames. the below code will give the same result, but now in pandas df :).