AWS Sagemaker | how to train text data | For ticke

2019-06-10 01:41发布

问题:

I am new to Sagemaker and not sure how to classify the text input in AWS sagemaker,

Suppose I have a Dataframe having two fields like 'Ticket' and 'Category', Both are text input, Now I want to split it test and training set and upload in Sagemaker training model.

X_train, X_test, y_train, y_test = model_selection.train_test_split(fewRecords['Ticket'],fewRecords['Category'])

Now as I want to perform TD-IDF feature extraction and then convert it to numeric value, so performing this operation

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(fewRecords['Category'])
xtrain_tfidf =  tfidf_vect.transform(X_train)
xvalid_tfidf =  tfidf_vect.transform(X_test)

When I want to upload the model in Sagemaker so I can perform next operation like

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
buf.seek(0)

I am getting this error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-8055e6cdbf34> in <module>()
      1 buf = io.BytesIO()
----> 2 smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
      3 buf.seek(0)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
     98             raise ValueError("Label shape {} not compatible with array shape {}".format(
     99                              labels.shape, array.shape))
--> 100         resolved_label_type = _resolve_type(labels.dtype)
    101     resolved_type = _resolve_type(array.dtype)
    102 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in _resolve_type(dtype)
    205     elif dtype == np.dtype('float32'):
    206         return 'Float32'
--> 207     raise ValueError('Unsupported dtype {} on array'.format(dtype))

ValueError: Unsupported dtype object on array

Other than this exception, I am not clear if this is right way as TfidfVectorizer convert the series to Matrix.

The code is predicting fine on my local machine but not sure how to do the same on Sagemaker, All the example mentioned there are too lengthy and not for the person who still reached to SciKit Learn

回答1:

The output of TfidfVectorizer is a scipy sparse matrix, not a simple numpy array.

So either use a different function like:

write_spmatrix_to_sparse_tensor

"""Writes a scipy sparse matrix to a sparse tensor"""

See this issue for more details.

OR first convert the output of TfidfVectorizer to a dense numpy array and then use your above code

xtrain_tfidf =  tfidf_vect.transform(X_train).toarray()   
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
...
...