I have just started learning machine learning. While practicing one of the tasks I am getting a ValueError, even though I followed the same steps as the instructor. Please help.
dff
  Country     Name
0     AUS      Sri
1     USA  Vignesh
2     IND    Pechi
3     USA      Raj
First I performed label encoding:
X=dff.values
label_encoder=LabelEncoder()
X[:,0]=label_encoder.fit_transform(X[:,0])
out:
X
array([[0, 'Sri'],
       [2, 'Vignesh'],
       [1, 'Pechi'],
       [2, 'Raj']], dtype=object)
Then I performed one-hot encoding on the same X:
onehotencoder=OneHotEncoder(categorical_features=[0])
X=onehotencoder.fit_transform(X).toarray()
I am getting the below error:
ValueError Traceback (most recent call last)
<ipython-input-472-be8c3472db63> in <module>()
----> 1 X=onehotencoder.fit_transform(X).toarray()
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in fit_transform(self, X, y)
1900 """
1901 return _transform_selected(X, self._fit_transform,
-> 1902 self.categorical_features, copy=True)
1903
1904 def _transform(self, X):
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in _transform_selected(X, transform, selected, copy)
1695 X : array or sparse matrix, shape=(n_samples, n_features_new)
1696 """
-> 1697 X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
1698
1699 if isinstance(selected, six.string_types) and selected == "all":
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
380 force_all_finite)
381 else:
--> 382 array = np.array(array, dtype=dtype, order=order, copy=copy)
383
384 if ensure_2d:
ValueError: could not convert string to float: 'Raj'
Please edit my question if anything is wrong. Thanks in advance!
The implementation below should work well. Note that the input to OneHotEncoder's fit_transform must not be a rank-1 array, and that its output is sparse, so we have used toarray() to expand it.
You can now go directly to OneHotEncoder without using LabelEncoder first, and as we move toward version 0.22 many might want to do things this way to avoid warnings and potential errors (see DOCS and EXAMPLES).
Example code 1 where ALL columns are encoded and where the categories are explicitly specified:
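A minimal sketch of what example 1 could look like, assuming scikit-learn 0.20 or later (where OneHotEncoder accepts string columns and a `categories` argument). The column names "Country" and "Name" are taken from the question's printout; the variable names are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same data as in the question; column names assumed from its printout.
dff = pd.DataFrame({
    "Country": ["AUS", "USA", "IND", "USA"],
    "Name": ["Sri", "Vignesh", "Pechi", "Raj"],
})
X = dff.to_numpy(dtype=object)  # 2-D array of strings, shape (4, 2)

# One list of categories per input column; the order of each list
# fixes the order of the output columns for that feature.
categories = [
    ["AUS", "IND", "USA"],
    ["Pechi", "Raj", "Sri", "Vignesh"],
]

encoder = OneHotEncoder(categories=categories)
# fit_transform returns a sparse matrix by default; toarray() densifies it.
encoded = encoder.fit_transform(X).toarray()
print(encoded)
# [[1. 0. 0. 0. 0. 1. 0.]
#  [0. 0. 1. 0. 0. 0. 1.]
#  [0. 1. 0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 1. 0. 0.]]
```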
Output for code example 1:
Example code 2 showing the 'auto' option for specification of categories:
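A sketch of the 'auto' variant, again assuming scikit-learn 0.20 or later. With `categories="auto"` the encoder infers the categories from the data and stores them, sorted, in `categories_`; the data frame and variable names are the same illustrative ones taken from the question.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

dff = pd.DataFrame({
    "Country": ["AUS", "USA", "IND", "USA"],
    "Name": ["Sri", "Vignesh", "Pechi", "Raj"],
})
X = dff.to_numpy(dtype=object)

# Let the encoder discover the categories of every column by itself.
encoder = OneHotEncoder(categories="auto")
encoded = encoder.fit_transform(X).toarray()

# The learned categories, one array per input column, in sorted order.
learned = [list(cats) for cats in encoder.categories_]
print(learned)
# [['AUS', 'IND', 'USA'], ['Pechi', 'Raj', 'Sri', 'Vignesh']]
print(encoded.shape)
# (4, 7)
```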
The first 3 columns encode the country names, the last four the personal names.
Output for code example 2 (same as for 1):
Example code 3 where only the first column is one hot encoded:
Now, here's the unique part. What if you only need to one-hot encode a specific column of your data?
(Note: I've left the last column as strings for easier illustration. In reality it makes more sense to do this when the last column is already numerical.)
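One way to encode only the first column is scikit-learn's ColumnTransformer (available from 0.20); this is an assumption about the approach, not necessarily the code the answer used. The step name "country_onehot" is a made-up label, and `sparse_threshold=0` is set here to force a dense result, because the passed-through string column cannot be stored in a sparse matrix.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

dff = pd.DataFrame({
    "Country": ["AUS", "USA", "IND", "USA"],
    "Name": ["Sri", "Vignesh", "Pechi", "Raj"],
})
X = dff.to_numpy(dtype=object)

ct = ColumnTransformer(
    # (step name, transformer, columns it applies to)
    [("country_onehot", OneHotEncoder(), [0])],
    remainder="passthrough",  # keep the untouched columns as they are
    sparse_threshold=0,       # always return a dense array
)
result = ct.fit_transform(X)

# Three one-hot columns (AUS, IND, USA) followed by the original name.
print(list(result[0]))
# [1.0, 0.0, 0.0, 'Sri']
```

The encoded columns come first and the passed-through columns are appended after them, so the column order of the result differs from the input.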
Output for code example 3:
An alternative if you do want to encode multiple categorical features is to use a Pipeline with a FeatureUnion and a couple custom Transformers.
First you need two transformers: one for selecting a single column and one for making LabelEncoder usable in a Pipeline (its fit_transform method takes only a single argument, but it needs to accept an optional y as well to work in a Pipeline).
Next, create a Pipeline (or just a FeatureUnion) with 2 branches, one for each of the categorical columns. Within each branch, select 1 column, encode the labels and then one-hot encode.
Finally run your full dataframe through the Pipeline - it will one hot encode each column separately and concatenate at the end.
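The three steps above can be sketched as follows. The class names `SingleColumnSelector` and `PipelineLabelEncoder` and the helper `one_hot_branch` are hypothetical, written only for this illustration; the column names are assumed from the question's printout.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


class SingleColumnSelector(TransformerMixin, BaseEstimator):
    """Pick one DataFrame column by name (hypothetical helper)."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.column].to_numpy()


class PipelineLabelEncoder(TransformerMixin, BaseEstimator):
    """LabelEncoder wrapper whose fit accepts the optional y a Pipeline passes."""

    def fit(self, X, y=None):
        self.encoder_ = LabelEncoder().fit(X)
        return self

    def transform(self, X):
        # Reshape to a 2-D column, since OneHotEncoder rejects 1-D input.
        return self.encoder_.transform(X).reshape(-1, 1)


def one_hot_branch(column):
    """Select a column, label encode it, then one-hot encode it."""
    return Pipeline([
        ("select", SingleColumnSelector(column)),
        ("label", PipelineLabelEncoder()),
        ("onehot", OneHotEncoder()),
    ])


dff = pd.DataFrame({
    "Country": ["AUS", "USA", "IND", "USA"],
    "Name": ["Sri", "Vignesh", "Pechi", "Raj"],
})

# One branch per categorical column; the results are concatenated column-wise.
union = FeatureUnion([
    ("country", one_hot_branch("Country")),
    ("name", one_hot_branch("Name")),
])
encoded = union.fit_transform(dff).toarray()
print(encoded)
# [[1. 0. 0. 0. 0. 1. 0.]
#  [0. 0. 1. 0. 0. 0. 1.]
#  [0. 1. 0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 1. 0. 0.]]
```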
The result has 7 columns: the first 3 encode the countries and the last 4 the names.