Compare two different size matrices to make one la

2019-07-18 15:12发布

I have two matrices, that I need to use to create a larger matrix. Each matrix is simply a tab-delimited text file that is read. Each matrix has 48 cols with identical identifiers per matrix, with different numbers of rows. The first matrix is 108887x48, and the second is 55482x48. The entries at each position, for each matrix, can be a 0 or 1, so binary. The final output should have the first matrix row ids as the rows, and the second matrix row ids as the cols, so the final matrix is 55482x10887.

What needs to happen here is that for each pos in each row in the first matrix, for every row in the second matrix, if the pos (col) for each matrix is 1, then the final matrix count will go up 1. The highest value any pos in the final matrix can be is 48, and there is expected to be 0's left over.

Example:

mat1
     A B C D
1id1 0 1 0 1
1id2 1 1 0 0
1id3 1 1 1 1
1id4 0 0 1 0

mat2
     A B C D
2id1 1 1 0 0
2id2 0 1 1 0 
2id3 1 1 1 1 
2id4 1 0 1 0

final
     2id1 2id2 2id3 2id4
1id1   1    1    2    0
1id2   2    1    2    1
1id3   2    2    4    2
1id4   0    1    1    1

I have code to do this, however it is painfully slow, which is where I'm mostly asking for help. I've tried to speed up the algorithm as best as possible. It's been running for 24hrs, and is only about 25% of the way through. I have let it run through before and the final output file is 20GB. I'm not experienced with databases, and could implement it here, if osomeone could assist me on how to do so given a snippet of the code below.

#!/usr/bin/env python

import sys

mat1in = sys.argv[1]
mat2in = sys.argv[2]

print '\n######################################################################################'
print 'Generating matrix by counts from smaller matrices.'
print '########################################################################################\n'

with open(mat1in, 'r') as f:
        cols = [''] + next(f).strip().split('\t')               # First line of matrix is composed of 48 cols
        mat1 = [line.strip().split('\t') for line in f]         # Each line in matrix = 'ID': 0 or 1 per col id

with open(mat2in, 'r') as f:
        next(f)                                                 # Skip first row, col IDs are taken from mat1
        mat2 = [line.strip().split('\t') for line in f]         # Each line in matrix = 'ID': 0 or 1 per col id

out = open('final_matrix.txt', 'w')                             # Output file

#matrix = []
header = []                                                     # Final matrix header
header.append('')                                               # Add blank as first char in large matrix header
for i in mat2:
        header.append(i[0])                                     # Composed of all mat2 row ids
#matrix.append(header)

print >> out, '\t'.join(header)                                 # First print header to output file

print '\nTotal mat1 rows: ' + str(len(mat1))                    # Get total mat1 rows
print 'Total mat2 rows: ' + str(len(mat2)), '\n'                # Get total mat2 rows
print 'Progress: '                                              # Progress updated as each mat1 id is read

length = len(header)                                            # Length of header, i.e. total number of mat2 ids
totmat1 = len(mat1)                                             # Length of rows (-header), i.e. total number of mat1 ids

total = 0                                                       # Running total - for progress meter
for h in mat1:                                                  # Loop through all mat1 ids - each row in the HC matrix
        row = []                                                # Empty list for new row for large matrix
        row.append(h[0])                                        # Append mat1 id, as first item in each row
        for i in xrange(length-1):                              # For length of large matrix header (add 0 to each row) - header -1 for first '\t'
                row.extend('0')
        for n in xrange(1,49):                                  # Loop through each col id
                for k in mat2:                                  # For every row in mat2
                        if int(h[n]) == 1 and int(k[n]) == 1:   # If the pos (count for that particular col id) is 1 from mat1 and mat2 matrix;
                                pos = header.index(k[0])        # Get the position of the mat2 id
                                row[pos] = str(int(row[pos]) + 1)       # Add 1 to current position in row - [i][j] = [mat1_id][mat2_id]
        print >> out, '\t'.join(row)                            # When row is completed (All columns are compared from both mat1 and mat2 matrices; print final row to large matrix
        total += 1                                              # Update running total
        sys.stdout.write('\r\t' + str(total) + '/' + str(tvh))  # Print progress to screen
        sys.stdout.flush()

print '\n######################################################################################'
print 'Matrix complete.'
print '########################################################################################\n'

Here's what profiling the first 30 iterations for the ids in mat1:

######################################################################################
Generating matrix by counts from smaller matrices.
########################################################################################


Total mat1 rows: 108887
Total mat2 rows: 55482

Progress:
        30/108887^C         2140074 function calls in 101.234 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   70.176   70.176  101.234  101.234 build_matrix.py:3(<module>)
        4    0.000    0.000    0.000    0.000 {len}
    55514    0.006    0.000    0.006    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1719942    1.056    0.000    1.056    0.000 {method 'extend' of 'list' objects}
       30    0.000    0.000    0.000    0.000 {method 'flush' of 'file' objects}
    35776   29.332    0.001   29.332    0.001 {method 'index' of 'list' objects}
       31    0.037    0.001    0.037    0.001 {method 'join' of 'str' objects}
   164370    0.589    0.000    0.589    0.000 {method 'split' of 'str' objects}
   164370    0.033    0.000    0.033    0.000 {method 'strip' of 'str' objects}
       30    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
        2    0.000    0.000    0.000    0.000 {next}
        3    0.004    0.001    0.004    0.001 {open}

I've also timed each iteration, which takes about 2.5-3s per mat1 id, and if I'm correct would take about 90hrs to complete the whole thing. This is about what it took to run the entire script all the way through.

I've edited some of the main bits, to remove making the rows by append and xrange to making the list in one step by multipling '0' by the lengthof the headers. Also I made a dictionary of the mat2 ids with the index as values to, thinking dict lookup would be quicker than index..

headdict = {}
for k,v in enumerate(header):
        headdict[v] = k

total = 0                                                       # Running total - for progress meter
for h in mat1:                                                  # Loop through all mat1 ids - each row in the HC matrix
        timestart = time.clock()
        row = [h[0]] + ['0']*(length-1)                 # Empty list for new row for large matrix
        #row.append(h[0])                                       # Append mat1 id, as first item in each row
        #for i in xrange(length-1):                             # For length of large matrix header (add 0 to each row) - header -1 for first '\t'
        #       row.append('0')
        for n in xrange(1,49):                                  # Loop through each col id
                for k in mat2:                                  # For every row in mat2
                        if int(h[n]) == 1 and int(k[n]) == 1:   # If the pos (count for that particular col id) is 1 from mat1 and mat2 matrix;
                                pos = headdict[k[0]] #header.index(k[0])        # Get the position of the mat2 id
                                row[pos] = str(int(row[pos]) + 1)       # Add 1 to current position in row - [i][j] = [mat1_id][mat2_id]
        print >> out, '\t'.join(row)                            # When row is completed (All columns are compared from both mat1 and mat2 matrices; print final row to large matrix
        total += 1                                              # Update running total
        sys.stdout.write('\r\t' + str(total) + '/' + str(totmat1))  # Print progress to screen
        #sys.stdout.flush()
        timeend = time.clock()
        print timestart - timeend

2条回答
地球回转人心会变
2楼-- · 2019-07-18 15:49

This is just a matrix multiplication. You want to multiply the first matrix by the transpose of the second. For efficient matrix operations, get NumPy.

If you read your two input matrices into NumPy arrays of dtype numpy.int8, then the computation is simply

m1.dot(m2.T)

It'll take a couple minutes, tops.

查看更多
Luminary・发光体
3楼-- · 2019-07-18 15:57

I don't quite understand what this code does (the single letter variable names doesn't help).

My suggestion: Try to reduce how many operations you do in the innermost loops. For instance, do you need to recalculate pos in the inner level?

pos = header.index(k[0])

If it's possible to reorder the nested loops k, h and n, you might be able to cut down the costly list.index, which is a O(n) operation.

查看更多
登录 后发表回答