GridSearch for doc2vec model built using gensim

Posted 2019-07-31 04:29

I am trying to find the best hyperparameters for my trained doc2vec gensim model, which takes a document as input and creates its document embedding. My training data consists of text documents, but it doesn't have any labels, i.e. I just have 'X' but not 'y'.

I found some questions here related to what I am trying to do, but all of the proposed solutions are for supervised models; none are for an unsupervised model like mine.

Here is the code where I am training my doc2vec model:

from typing import List

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import to_unicode


def train_doc2vec(
    self,
    X: List[List[str]],
    epochs: int = 10,
    learning_rate: float = 0.0002) -> Doc2Vec:

    # Wrap each tokenised document in a TaggedDocument, using its index as the tag.
    tagged_documents = list()

    for idx, w in enumerate(X):
        td = TaggedDocument(to_unicode(str.encode(' '.join(w))).split(), [str(idx)])
        tagged_documents.append(td)

    model = Doc2Vec(**self.params_doc2vec)
    model.build_vocab(tagged_documents)

    for epoch in range(epochs):
        model.train(tagged_documents,
                    total_examples=model.corpus_count,
                    epochs=model.epochs)
        # decrease the learning rate
        model.alpha -= learning_rate
        # fix the learning rate, no decay
        model.min_alpha = model.alpha

    return model

I need suggestions on how to proceed and find the best hyperparameters for my trained model using GridSearch, or suggestions about some other technique. Help is much appreciated.

1 Answer
Explosion°爆炸
Answered 2019-07-31 05:21

Setting aside the correctness of the code, I will try to answer your question on how to tune the hyper-parameters. You start by defining the sets of hyper-parameter values that make up your grid. For each set of hyper-parameters

Hset1 = (par1Value1, par2Value1, ..., parNValue1)

you train your model on the training set and use an independent validation set to measure the accuracy (or whatever metric you wish to use). You store this value (e.g. A_Hset1). When you have done this for all possible sets of hyper-parameters, you end up with a set of measures

(A_Hset1, A_Hset2, A_Hset3, ..., A_HsetK).

Each of those measures tells you how good your model is for the corresponding set of hyper-parameters, so your optimal set of hyper-parameters is

HsetOptimal = HsetX | A_HsetX = max(A_Hset1, A_Hset2, ..., A_HsetK)

To have a fair comparison, you should always train the model on the same training data and always use the same validation set.
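For example, you can create the train/validation split once, before the grid loop, so that every set of hyper-parameters sees identical data. A minimal sketch using scikit-learn's train_test_split, assuming X is the tokenised corpus from the question and that the 80/20 ratio and random seed are arbitrary choices:

```python
from sklearn.model_selection import train_test_split

# Split once, outside the grid loop, so every hyper-parameter set is
# trained on X_train and evaluated on X_val (ratio and seed are arbitrary).
X_train, X_val = train_test_split(X, test_size=0.2, random_state=42)
```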

I'm not an advanced Python user, so you can probably find better suggestions elsewhere, but what I would do is create a list of dictionaries, where each dictionary contains one set of hyper-parameters that you want to test:

grid_search = [{"par1": "val1", "par2": "val1", "par3": "val1", ..., "res": ""},
               {"par1": "val2", "par2": "val1", "par3": "val1", ..., "res": ""},
               {"par1": "val3", "par2": "val1", "par3": "val1", ..., "res": ""},
               ...,
               {"par1": "valn", "par2": "valn", "par3": "valn", ..., "res": ""}]

That way you can store your result in the "res" field of the corresponding dictionary and track the performance of each set of parameters:

for params in grid_search:
    # insert here your training and accuracy evaluation using the
    # hyper-parameters in params

    params["res"] = the_accuracy_for_the_hyperparameters_in_params

I hope it helps.
