RAKE with GENSIM

Posted 2020-04-21 08:13

Question:

I am trying to calculate similarity. First, I used the RAKE library to extract keywords from the crawled job postings. Then I put the keywords of each job into a separate array and combined all of those arrays into documentArray.

documentArray = [['Anger command,Assertiveness,Approachability,Adaptability,Authenticity,Aggressiveness,Analytical thinking,Molecular Biology,Molecular Biology,Molecular Biology,molecular biology,molecular biology,Master,English,Molecular Biology,,Islamabad,Islamabad District,Islamabad Capital Territory,Pakistan,,Rawalpindi,Rawalpindi,Punjab,Pakistan'], ['competitive compensation,assay design,positive attitude,regular basis,motivate others,meetings related,improve state,travel on,phd degree,meeting abstracts,benefits package,daily basis,scientific papers,application notes']]


queryStr = "In Vitro,Biochemistry,PCR,Western Blotting,Neuroscience,Molecular Biology,Cell biology,Immunohistochemistry,Microscopy,Animal Models,Presentations,Immunoprecipitation,Cell biology,Master's Degree,Bachelor's Degree,,,,,"
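Note on tokenization: Gensim's corpora.Dictionary and doc2bow work on lists of string tokens, and doc2bow silently drops tokens that are not in the dictionary. How the comma-separated keyword strings are split therefore matters. The following is only a minimal sketch in plain Python, assuming each keyword phrase should stay whole; split_keywords is a hypothetical helper and sample_query is a shortened, illustrative string, neither is part of RAKE or Gensim.

```python
# Hypothetical helper (not part of RAKE or Gensim): split a comma-separated
# keyword string into lowercase phrase tokens, dropping empty entries.
def split_keywords(keyword_string):
    return [kw.strip().lower() for kw in keyword_string.split(",") if kw.strip()]


# Shortened, illustrative query string (assumption, not the full queryStr).
sample_query = "In Vitro,Biochemistry,PCR,Molecular Biology,,,"

# Whitespace splitting breaks phrases apart and glues neighbours together.
whitespace_tokens = sample_query.lower().split()

# Comma splitting keeps each keyword phrase intact.
comma_tokens = split_keywords(sample_query)

print(whitespace_tokens)
print(comma_tokens)
```

With whitespace splitting, a token such as 'vitro,biochemistry,pcr,molecular' is unlikely to match any dictionary entry, which may leave the query's bag-of-words vector empty.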

Then I wrote the following Gensim code:

from gensim import corpora, models, similarities


class Gensim:

    def __init__(self):
        print("Init")

    def calculateGensimSimilarity(self, texts, query):
        # Build the vocabulary and bag-of-words corpus from the documents.
        dictionary = corpora.Dictionary(texts)
        corpus = [dictionary.doc2bow(text) for text in texts]
        # Train two-topic LSI and LDA models on the same corpus.
        lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
        lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
        # Index the documents in each topic space for cosine-similarity queries.
        index_lsi = similarities.MatrixSimilarity(lsi[corpus])
        index_lda = similarities.MatrixSimilarity(lda[corpus])
        # Note: split() with no argument splits the query on whitespace, not commas.
        vec_bow = dictionary.doc2bow(query.lower().split())
        vec_lsi = lsi[vec_bow]
        vec_lda = lda[vec_bow]
        print("LSI Model")
        sims_lsi = index_lsi[vec_lsi]
        print(sims_lsi)
        print("LDA Model")
        sims_lda = index_lda[vec_lda]
        print(sims_lda)

It prints an LSA score of 0 and an LDA score of 90%+ match. Kindly let me know where I am going wrong and how I can modify the code to calculate the correct cosine similarity.

LSA Score: [ 0. 0.]
LDA Score: [ 0.94234258 0.9477495 ]
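As a sanity check for scores like these, cosine similarity can be computed directly on keyword counts, without any topic model. The sketch below is a plain bag-of-keywords cosine, not LSI or LDA, and it assumes keywords are compared as whole lowercase phrases. The names keyword_counts and cosine_similarity are hypothetical, and the two input strings are shortened, illustrative versions of the data in the question.

```python
import math
from collections import Counter


def keyword_counts(keyword_string):
    # Split on commas, normalise case, and skip empty entries.
    return Counter(
        kw.strip().lower() for kw in keyword_string.split(",") if kw.strip()
    )


def cosine_similarity(counts_a, counts_b):
    # Dot product over shared keywords divided by the product of vector norms.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[k] * counts_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty keyword list has no direction, so report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Shortened, illustrative inputs (assumption, not the full data above).
job = keyword_counts("Molecular Biology,molecular biology,English,Islamabad")
query = keyword_counts("PCR,Molecular Biology,Cell biology,,")

score = cosine_similarity(job, query)
print(round(score, 4))
```

A score from this baseline only reflects exact keyword overlap, so it is useful mainly for checking whether the tokens of the query and the documents match at all before trusting the LSI or LDA numbers.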