Preprocessing csv files to use with tflearn

Published 2019-07-02 00:28


My question is about preprocessing CSV files before feeding them into a neural network.

I want to build a deep neural network for the famous iris dataset using tflearn in python 3.


I'm using tflearn to load the csv file. However, the classes column of my data set has words such as iris-setosa, iris-versicolor, iris-virginica.

Neural networks work only with numbers, so I have to find a way to convert the classes from words to numbers. Since it is a very small dataset, I can do this manually in Excel or a text editor, and I did assign numbers to the classes by hand.

But I can't possibly do that for every dataset I work with, so I tried using pandas to perform one-hot encoding.

preprocess_data = pd.read_csv(r"F:\Gautam\.....\Dataset\iris_data.csv")
preprocess_data = pd.get_dummies(preprocess_data)

But now, I can't use this piece of code:

data, labels = load_csv('filepath', categorical_labels=True, n_classes=3)

'filepath' must be the path to the CSV file as a string; it can't be a variable holding a DataFrame such as preprocess_data.
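One way around this is to skip load_csv and build the two arrays straight from the DataFrame. The following is only a minimal sketch: the in-memory CSV text is a hypothetical stand-in for iris_data.csv (rows copied from the sample in this question), and it assumes the class column is named Class.

```python
import io

import pandas as pd

# Hypothetical stand-in for iris_data.csv, using the sample rows from the question.
csv_text = """Sepal Length,Sepal Width,Petal Length,Petal Width,Class
5.5,2.5,4.0,1.3,iris-versicolor
6.0,3.4,4.5,1.6,iris-versicolor
5.4,3.4,1.5,0.4,iris-setosa
6.9,3.1,4.9,1.5,iris-versicolor
6.4,2.7,5.3,1.9,iris-virginica
"""
df = pd.read_csv(io.StringIO(csv_text))

# Features: every column except the class column, as a float array.
data = df.drop(columns=["Class"]).to_numpy(dtype="float32")

# Labels: one-hot encode only the class column.
# get_dummies orders the new columns by sorted class name.
one_hot = pd.get_dummies(df["Class"])
labels = one_hot.to_numpy(dtype="float32")

print(data.shape)    # (5, 4)
print(labels.shape)  # (5, 3)
```

Arrays shaped like these should be usable as the data and labels arguments of model.fit in place of what load_csv returns, though that has not been verified against every tflearn version.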

Original Dataset:

     Sepal Length  Sepal Width  Petal Length  Petal Width  Class
89            5.5          2.5           4.0          1.3  iris-versicolor
85            6.0          3.4           4.5          1.6  iris-versicolor
31            5.4          3.4           1.5          0.4  iris-setosa
52            6.9          3.1           4.9          1.5  iris-versicolor
111           6.4          2.7           5.3          1.9  iris-virginica

Manually modified dataset:

     Sepal Length  Sepal Width  Petal Length  Petal Width  Class
89            5.5          2.5           4.0          1.3      1
85            6.0          3.4           4.5          1.6      1
31            5.4          3.4           1.5          0.4      0
52            6.9          3.1           4.9          1.5      1
111           6.4          2.7           5.3          1.9      2

Here's my code, which runs perfectly, but only because I modified the dataset manually.

import numpy as np
import pandas as pd
import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression
from tflearn.data_utils import load_csv

data_source = r'F:\Gautam\.....\Dataset\iris_data.csv'

data, labels = load_csv(data_source, categorical_labels=True, n_classes=3)

network = input_data(shape=[None, 4], name='InputLayer')

network = fully_connected(network, 9, activation='sigmoid', name='Hidden_Layer_1')

network = fully_connected(network, 3, activation='softmax', name='Output_Layer')

network = regression(network, batch_size=1, optimizer='sgd', learning_rate=0.2)

model = tflearn.DNN(network)

model.fit(data, labels, show_metric=True, run_id='iris_dataset', validation_set=0.1, n_epoch=2000)

I want to know if there's any other built-in function in tflearn (or in any other module, for that matter) that I can use to modify the value of my classes from words to numbers. I don't think manually modifying the datasets would be productive.

I'm a beginner in tflearn and neural networks also. Any help would be appreciated. Thanks.


Use LabelEncoder from the sklearn library:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.read_csv('iris_data.csv', header=None)
df.columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']

print(df.head(5))
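The encoding step itself might look like the sketch below. The small DataFrame is a hypothetical sample standing in for the one read from the CSV, and enc is the same name used for the encoder later in this answer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample; in practice df comes from pd.read_csv.
df = pd.DataFrame({
    "Class": ["iris-versicolor", "iris-versicolor", "iris-setosa",
              "iris-versicolor", "iris-virginica"],
})

enc = LabelEncoder()
# fit_transform learns the sorted set of class names and replaces each name
# with its position in that sorted set.
df["Class"] = enc.fit_transform(df["Class"])

print(df["Class"].tolist())  # [1, 1, 0, 1, 2]
```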

If you want one-hot encoding, first label-encode the column, then apply OneHotEncoder:

print(df.head(5))
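A minimal sketch of the two-step version, again on a hypothetical sample column. The variable names int_labels and one_hot are illustrative only. Recent scikit-learn releases also let OneHotEncoder take the string column directly, so the first step may be optional depending on your version.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample of the class column.
classes = pd.Series(["iris-versicolor", "iris-versicolor", "iris-setosa",
                     "iris-versicolor", "iris-virginica"])

# Step 1: words -> integers 0..n_classes-1 (alphabetical order).
enc = LabelEncoder()
int_labels = enc.fit_transform(classes)

# Step 2: integers -> one-hot rows. OneHotEncoder expects a 2-D input,
# hence the reshape; toarray() converts the default sparse result to dense.
onehot = OneHotEncoder()
one_hot = onehot.fit_transform(int_labels.reshape(-1, 1)).toarray()

print(one_hot.shape)  # (5, 3)
```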

These encoders first sort the words in alphabetical order then assign them labels. If you want to see which label is assigned to which class, do:

for k in list(enc.classes_):
    print('name ::{}, label ::{}'.format(k, enc.transform([k])))

If you want to save this dataframe as a csv file, do:
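A minimal sketch of the save step using pandas' to_csv. The output file name and the temporary directory are hypothetical placeholders, and the two-row DataFrame stands in for the encoded data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the already-encoded DataFrame.
df = pd.DataFrame({"Sepal Length": [5.5, 5.4], "Class": [1, 0]})

# Hypothetical output location; replace with a real path.
out_path = os.path.join(tempfile.gettempdir(), "iris_data_encoded.csv")

# index=False keeps the row index out of the file, so a reader such as
# load_csv does not see it as an extra first column.
df.to_csv(out_path, index=False)
```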



The simplest solution is to map the column with a dict of all possible values:

df['Class'] = df['Class'].map({'iris-versicolor': 1, 'iris-setosa': 0, 'iris-virginica': 2})
print (df)
     Sepal Length  Sepal Width  Petal Length  Petal Width  Class
89            5.5          2.5           4.0          1.3      1
85            6.0          3.4           4.5          1.6      1
31            5.4          3.4           1.5          0.4      0
52            6.9          3.1           4.9          1.5      1
111           6.4          2.7           5.3          1.9      2

If you want to generate the dictionary from all unique values instead:

d = {v:k for k, v in enumerate(df['Class'].unique())}
print (d)
{'iris-versicolor': 0, 'iris-virginica': 2, 'iris-setosa': 1}

df['Class'] = df['Class'].map(d)
print (df)
     Sepal Length  Sepal Width  Petal Length  Petal Width  Class
89            5.5          2.5           4.0          1.3      0
85            6.0          3.4           4.5          1.6      0
31            5.4          3.4           1.5          0.4      1
52            6.9          3.1           4.9          1.5      0
111           6.4          2.7           5.3          1.9      2
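Note that unique() returns values in order of first appearance, so the numbers assigned this way depend on how the rows happen to be ordered. If you want the same mapping no matter how the file is shuffled, one option is to sort the unique values first. This is a sketch on a hypothetical sample column:

```python
import pandas as pd

# Hypothetical sample of the class column.
df = pd.DataFrame({
    "Class": ["iris-versicolor", "iris-versicolor", "iris-setosa",
              "iris-versicolor", "iris-virginica"],
})

# Sorting the unique names makes the mapping independent of row order.
d = {name: idx for idx, name in enumerate(sorted(df["Class"].unique()))}
print(d)  # {'iris-setosa': 0, 'iris-versicolor': 1, 'iris-virginica': 2}

df["Class"] = df["Class"].map(d)
print(df["Class"].tolist())  # [1, 1, 0, 1, 2]
```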