I'm struggling again to improve the execution time of this piece of code. Since the calculations are really time-consuming, I think the best solution is to parallelize the code.
I was first working with maps, as explained in this question, but then I tried a simpler approach, thinking I could find a better solution. I haven't come up with anything yet, and since it's a different problem I decided to post it as a new question.
I am working on a Windows platform, using Python 3.4.
Here's the code:
similarity_matrix = [[0 for x in range(word_count)] for x in range(word_count)]
for i in range(0, word_count):
    for j in range(0, word_count):
        if i > j:
            similarity = calculate_similarity(t_matrix[i], t_matrix[j])
            similarity_matrix[i][j] = similarity
            similarity_matrix[j][i] = similarity
This is the calculate_similarity function:
def calculate_similarity(array_word1, array_word2):
    denominator = sum([array_word1[i] + array_word2[i] for i in range(word_count)])
    if denominator == 0:
        return 0
    numerator = sum([2 * min(array_word1[i], array_word2[i]) for i in range(word_count)])
    return numerator / denominator
And the explanation for the code:

- word_count is the total number of unique words stored in a list
- t_matrix is a matrix containing a value for each pair of words
- the output should be similarity_matrix, whose dimension is word_count x word_count, also containing a similarity value for each pair of words
- it's ok to keep both matrices in memory
- after these computations I can easily find the most similar word for each word (or the top three similar words, as the task may require)
- calculate_similarity takes two float lists, each for a separate word (each is a row in the t_matrix)
I work with a list of 13k words, and if I calculated correctly, the execution time on my system would be a few days. So, anything that will do the job in one day would be wonderful!
Maybe only parallelizing the calculation of numerator and denominator in calculate_similarity would make a significant improvement.
You are using too many list comprehensions for that amount of data. I would strongly recommend the numpy module. If that is an option, you can vectorize the whole computation.
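A minimal sketch of what a vectorized version could look like, assuming t_matrix can be converted to a 2-D float array; the row-at-a-time loop is my own choice to keep each temporary down to a single word_count x word_count array:

import numpy as np

t = np.asarray(t_matrix, dtype=float)      # shape: (word_count, word_count)
n = t.shape[0]

row_sums = t.sum(axis=1)                   # one row sum per word
similarity_matrix = np.zeros((n, n))

for i in range(n):
    # 2 * sum of element-wise minima of row i against every row, vectorized over j
    numerators = 2 * np.minimum(t[i], t).sum(axis=1)
    denominators = row_sums[i] + row_sums
    nonzero = denominators != 0            # mirrors the `return 0` branch of the original
    similarity_matrix[i, nonzero] = numerators[nonzero] / denominators[nonzero]

Each pass of the loop allocates one word_count x word_count temporary (roughly 1.3 GB of float64 for 13k words), which stays within the "both matrices in memory" budget mentioned above.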
Here's an alternative implementation of the same general algorithm as in Matt's answer, just using multiprocessing.Pool instead of concurrent.futures.ProcessPoolExecutor. It may be more efficient than his code, since the values of the input (t_matrix) are only serialized once and passed to the initializer function in each worker process.
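A minimal sketch of that approach; the helper names (init_worker, compute_row) and the small placeholder t_matrix are illustrative, not the answerer's actual code:

import multiprocessing

_shared = {}

def init_worker(t_matrix):
    # Runs once in each worker process, so the big input is serialized only once.
    _shared['t_matrix'] = t_matrix

def calculate_similarity(array_word1, array_word2):
    # Same formula as in the question, written with zip instead of indexing.
    denominator = sum(a + b for a, b in zip(array_word1, array_word2))
    if denominator == 0:
        return 0
    numerator = sum(2 * min(a, b) for a, b in zip(array_word1, array_word2))
    return numerator / denominator

def compute_row(i):
    # Computes the lower-triangle entries of row i.
    t_matrix = _shared['t_matrix']
    return i, [calculate_similarity(t_matrix[i], t_matrix[j]) for j in range(i)]

if __name__ == '__main__':
    word_count = 4                     # placeholder; the real data has ~13k words
    t_matrix = [[float((i + j) % 3) for j in range(word_count)]
                for i in range(word_count)]

    similarity_matrix = [[0] * word_count for _ in range(word_count)]
    with multiprocessing.Pool(initializer=init_worker, initargs=(t_matrix,)) as pool:
        for i, row in pool.imap_unordered(compute_row, range(word_count)):
            for j, similarity in enumerate(row):
                similarity_matrix[i][j] = similarity
                similarity_matrix[j][i] = similarity

For the real data, passing a chunksize to imap_unordered (a few hundred rows, say) should reduce the scheduling overhead per task.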