Fast Algorithms for Finding Pairwise Euclidean Dis

2019-01-08 00:17发布

I know matlab has a built in pdist function that will calculate pairwise distances. However, my matrix is so large that its 60000 by 300 and matlab runs out of memory.

This question is a follow up on Matlab euclidean pairwise square distance function.

Is there any workaround for this computational inefficiency. I tried manually coding the pairwise distance calculations and it usually takes a full day to run (sometimes 6 to 7 hours).

Any help is greatly appreciated!

3条回答
Ridiculous、
2楼-- · 2019-01-08 00:32

On my system the following is the fastest (Even faster than the C code pdistc by @horchler):

function [ mD ] = CalcDistMtx ( mX )    
  vSsqX = sum(mX .^ 2);
  mD = sqrt(bsxfun(@plus, vSsqX.', vSsqX) - (2 * (mX.' * mX)));       
end

You'll need a very well tuned C code to beat this, I think.

Update
Since MATLAB R2016b MATLAB supports implicit broadcasting without the use of bsxfun().

Hence the code can be written:

function [ mD ] = CalcDistMtx ( mX )    
  vSsqX = sum(mX .^ 2, 1);
  mD = sqrt(vSsqX.'+ vSsqX - (2 * (mX.' * mX)));       
end

A generalization is given in my Calculate Distance Matrix project.

P. S.
Using MATLAB's pdist for comparison: squareform(pdist(mX.')) is equivalent to CalcDistMtx(mX).
Namely the input should be transposed.

查看更多
贪生不怕死
3楼-- · 2019-01-08 00:46

Computers are not infinitely large, or infinitely fast. People think that they have a lot of memory, a fast CPU, so they just create larger and larger problems, and then eventually wonder why their problem runs slowly. The fact is, this is NOT computational inefficiency. It is JUST an overloaded CPU.

As Oli points out in a comment, there are something like 2e9 values to compute, even assuming you only compute the upper or lower half of the distance matrix. (6e4^2/2 is approximately 2e9.) This will require roughly 16 gigabytes of RAM to store, assuming that only ONE copy of the array is created in memory. If your code is sloppy, you might easily double or triple that. As soon as you go into virtual memory, things get much slower.

Wanting a big problem to run fast is not enough. To really help you, we need to know how much RAM is available. Is this a virtual memory issue? Are you using 64 bit MATLAB, on a CPU that can handle all the needed RAM?

查看更多
The star\"
4楼-- · 2019-01-08 00:50

Well, I couldn't resist playing around. I created a Matlab mex C file called pdistc that implements pairwise Euclidean distance for single and double precision. On my machine using Matlab R2012b and R2015a it's 20–25% faster than pdist(and the underlying pdistmex helper function) for large inputs (e.g., 60,000-by-300).

As has been pointed out, this problem is fundamentally bounded by memory and you're asking for a lot of it. My mex C code uses minimal memory beyond that needed for the output. In comparing its memory usage to that of pdist, it looks like the two are virtually the same. In other words, pdist is not using lots of extra memory. Your memory problem is likely in the memory used up before calling pdist (can you use clear to remove any large arrays?) or simply because you're trying to solve a big computational problem on tiny hardware.

So, my pdistc function likely won't be able to save you memory overall, but you may be able to use another feature I built in. You can calculate chunks of your overall pairwise distance vector. Something like this:

m = 6e3;
n = 3e2;
X = rand(m,n);
sz = m*(m-1)/2;

for i = 1:m:sz-m
    D = pdistc(X', i, i+m); % mex C function, X is transposed relative to pdist
    ...                     % Process chunk of pairwise distances
end

This is considerably slower (10 times or so) and this part of my C code is not optimized well, but it will allow much less memory use – assuming that you don't need the entire array at one time. Note that you could do the same thing much more efficiently with pdist (or pdistc) by creating a loop where you passed in subsets of X directly, rather than all of it.

If you have a 64-bit Intel Mac, you won't need to compile as I've included the .mexmaci64 binary, but otherwise you'll need to figure out how to compile the code for your machine. I can't help you with that. It's possible that you may not be able to get it to compile or that there will be compatibility issues that you'll need to solve by editing the code yourself. It's also possible that there are bugs and the code will crash Matlab. Also, note that you may get slightly different outputs relative to pdist with differences between the two in the range of machine epsilon (eps). pdist may or may not do fancy things to avoid overflows for large inputs and other numeric issues, but be aware that my code does not.

Additionally, I created a simple pure Matlab implementation. It is massively slower than the mex code, but still faster than a naïve implementation or the code found in pdist.

All of the files can be found here. The ZIP archive includes all of the files. It's BSD licensed. Feel free to optimize (I tried BLAS calls and OpenMP in the C code to no avail – maybe some pointer magic or GPU/OpenCL could further speed it up). I hope that it can be helpful to you or someone else.

查看更多
登录 后发表回答