I am new to SageMaker and am not sure how to classify text input in AWS SageMaker.
Suppose I have a DataFrame with two fields, 'Ticket' and 'Category', both of which are text. I want to split it into training and test sets and upload it to SageMaker to train a model.
X_train, X_test, y_train, y_test = model_selection.train_test_split(fewRecords['Ticket'],fewRecords['Category'])
Next, I want to perform TF-IDF feature extraction to convert the text to numeric values, so I run the following:
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(fewRecords['Category'])
xtrain_tfidf = tfidf_vect.transform(X_train)
xvalid_tfidf = tfidf_vect.transform(X_test)
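For reference, here is a minimal, self-contained sketch of this step. The small DataFrame is a made-up stand-in for fewRecords (the ticket texts and categories are invented for illustration). It also assumes the vocabulary is meant to be learned from the ticket text of the training split rather than from the category labels, which is a guess about the intent and not something confirmed above. The last two lines show that transform returns a SciPy sparse matrix, not a NumPy array.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for fewRecords; every row is invented for illustration.
fewRecords = pd.DataFrame({
    "Ticket": [
        "cannot log in to my account",
        "password reset link is broken",
        "invoice shows the wrong amount",
        "charged twice for one order",
        "login page keeps timing out",
        "refund has not arrived yet",
    ],
    "Category": ["access", "access", "billing", "billing", "access", "billing"],
})

# Hold out 2 of the 6 rows; random_state only makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    fewRecords["Ticket"], fewRecords["Category"], test_size=2, random_state=0
)

tfidf_vect = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}", max_features=5000)
# Assumption: the vocabulary is learned from the training tickets only.
tfidf_vect.fit(X_train)

xtrain_tfidf = tfidf_vect.transform(X_train)
xvalid_tfidf = tfidf_vect.transform(X_test)

# Both results are SciPy sparse matrices with one row per ticket.
print(xtrain_tfidf.shape, xvalid_tfidf.shape, xtrain_tfidf.dtype)
print(sparse.issparse(xtrain_tfidf))
```

Both matrices share the same number of columns because they come from the same fitted vectorizer.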
When I then try to prepare the data for upload to SageMaker so that I can carry out the next steps, like this:
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
buf.seek(0)
I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-8055e6cdbf34> in <module>()
1 buf = io.BytesIO()
----> 2 smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
3 buf.seek(0)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
98 raise ValueError("Label shape {} not compatible with array shape {}".format(
99 labels.shape, array.shape))
--> 100 resolved_label_type = _resolve_type(labels.dtype)
101 resolved_type = _resolve_type(array.dtype)
102
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in _resolve_type(dtype)
205 elif dtype == np.dtype('float32'):
206 return 'Float32'
--> 207 raise ValueError('Unsupported dtype {} on array'.format(dtype))
ValueError: Unsupported dtype object on array
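The traceback fails on labels.dtype: y_train holds category strings, which pandas stores with dtype object, and _resolve_type rejects that. Below is a minimal sketch of one possible workaround, assuming (from the 'Float32' branch visible in the traceback) that the writer wants numeric labels alongside a dense NumPy array. The label values and the random feature matrix are invented for illustration, and the SageMaker call is left as a comment because it has not been run here.

```python
import numpy as np

# Hypothetical labels standing in for y_train; strings end up as dtype object.
y_train = np.array(["billing", "access", "billing", "access"], dtype=object)
print(y_train.dtype)  # object

# Map each distinct category to an integer code, then cast to a numeric dtype.
classes, codes = np.unique(y_train.astype(str), return_inverse=True)
labels = codes.astype("float32")

# Stand-in for the TF-IDF features. With the real sparse matrix,
# xtrain_tfidf.toarray().astype("float32") would give a dense array instead.
features = np.random.rand(4, 3).astype("float32")

# The writer's shape check needs one label per feature row.
assert labels.ndim == 1 and labels.shape[0] == features.shape[0]

# Mapping to keep for turning predicted codes back into category names.
code_to_category = dict(enumerate(classes))
print(code_to_category)

# Not executed here; same call as in the question, with numeric inputs:
# smac.write_numpy_to_dense_tensor(buf, features, labels)
```

Calling toarray() on a large TF-IDF matrix can use a lot of memory. The same sagemaker.amazon.common module appears to provide write_spmatrix_to_sparse_tensor for sparse input, which may be a better fit, though that is untested here.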
Apart from this exception, I am not sure whether this is the right approach, since TfidfVectorizer converts the series to a matrix.
The code predicts fine on my local machine, but I am not sure how to do the same on SageMaker. All the examples I have found are too long and are not aimed at someone who has only got as far as scikit-learn.