I'm struggling again to improve the execution time of this piece of code. Since the calculations are really time-consuming, I think the best solution is to parallelize the code.
I was first working with maps, as explained in this question, but then I tried a simpler approach, thinking I could find a better solution. I haven't come up with anything yet, and since it's a different problem I decided to post it as a new question.
I am working on a Windows platform, using Python 3.4.
Here's the code:
similarity_matrix = [[0 for x in range(word_count)] for x in range(word_count)]
for i in range(0, word_count):
    for j in range(0, word_count):
        if i > j:
            similarity = calculate_similarity(t_matrix[i], t_matrix[j])
            similarity_matrix[i][j] = similarity
            similarity_matrix[j][i] = similarity
This is the calculate_similarity function:
def calculate_similarity(array_word1, array_word2):
    denominator = sum([array_word1[i] + array_word2[i] for i in range(word_count)])
    if denominator == 0:
        return 0
    numerator = sum([2 * min(array_word1[i], array_word2[i]) for i in range(word_count)])
    return numerator / denominator
And the explanation for the code:

- word_count is the total number of unique words stored in a list
- t_matrix is a matrix containing a value for each pair of words
- the output should be similarity_matrix, whose dimension is word_count x word_count, also containing a similarity value for each pair of words
- it's ok to keep both matrices in memory
- after these computations I can easily find the most similar word for each word (or the top three similar words, as the task may require)
- calculate_similarity takes two float lists, each for a separate word (each is a row in the t_matrix)
I work with a list of 13k words, and if I calculated correctly, the execution time on my system would be a few days. So, anything that will do the job in one day would be wonderful!
Maybe only parallelizing the calculation of numerator and denominator in calculate_similarity would make a significant improvement.
You are using too many list comprehensions for that amount of data. I would strongly recommend the numpy module. If that is an option, you can vectorize the whole computation.
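A minimal sketch of what a vectorized version could look like, assuming t_matrix can be converted to a 2-D float array; the row-at-a-time loop is my own choice to keep each temporary down to a single word_count x word_count array:

import numpy as np

t = np.asarray(t_matrix, dtype=float)      # shape: (word_count, word_count)
n = t.shape[0]

row_sums = t.sum(axis=1)                   # one row sum per word
similarity_matrix = np.zeros((n, n))

for i in range(n):
    # 2 * sum of element-wise minima of row i against every row, vectorized over j
    numerators = 2 * np.minimum(t[i], t).sum(axis=1)
    denominators = row_sums[i] + row_sums
    nonzero = denominators != 0            # mirrors the `return 0` branch of the original
    similarity_matrix[i, nonzero] = numerators[nonzero] / denominators[nonzero]

Each pass of the loop allocates one word_count x word_count temporary (roughly 1.3 GB of float64 for 13k words), which stays within the "both matrices in memory" budget mentioned above.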
Here's an alternative implementation of the same general algorithm as in Matt's answer, just using multiprocessing.Pool instead of concurrent.futures.ProcessPoolExecutor. It may be more efficient than his code, since the values of the input (t_matrix) are only serialized once and passed to the initializer function in each worker process.
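A minimal sketch of that approach; the helper names (init_worker, compute_row) and the small placeholder t_matrix are illustrative, not the answerer's actual code:

import multiprocessing

_shared = {}

def init_worker(t_matrix):
    # Runs once in each worker process, so the big input is serialized only once.
    _shared['t_matrix'] = t_matrix

def calculate_similarity(array_word1, array_word2):
    # Same formula as in the question, written with zip instead of indexing.
    denominator = sum(a + b for a, b in zip(array_word1, array_word2))
    if denominator == 0:
        return 0
    numerator = sum(2 * min(a, b) for a, b in zip(array_word1, array_word2))
    return numerator / denominator

def compute_row(i):
    # Computes the lower-triangle entries of row i.
    t_matrix = _shared['t_matrix']
    return i, [calculate_similarity(t_matrix[i], t_matrix[j]) for j in range(i)]

if __name__ == '__main__':
    word_count = 4                     # placeholder; the real data has ~13k words
    t_matrix = [[float((i + j) % 3) for j in range(word_count)]
                for i in range(word_count)]

    similarity_matrix = [[0] * word_count for _ in range(word_count)]
    with multiprocessing.Pool(initializer=init_worker, initargs=(t_matrix,)) as pool:
        for i, row in pool.imap_unordered(compute_row, range(word_count)):
            for j, similarity in enumerate(row):
                similarity_matrix[i][j] = similarity
                similarity_matrix[j][i] = similarity

For the real data, passing a chunksize to imap_unordered (a few hundred rows, say) should reduce the scheduling overhead per task.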