I know MATLAB has a built-in pdist function that calculates pairwise distances. However, my matrix is large (60000 by 300) and MATLAB runs out of memory.
This question is a follow-up to Matlab euclidean pairwise square distance function.
Is there any workaround for this computational inefficiency? I tried manually coding the pairwise distance calculations and it usually takes a full day to run (sometimes 6 to 7 hours).
Any help is greatly appreciated!
On my system the following is the fastest (even faster than the C code pdistc by @horchler):
You'll need very well-tuned C code to beat this, I think.
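As an illustration only, here is a minimal sketch of one common vectorized approach, built on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a'*b and bsxfun(). It is not necessarily the exact code this answer refers to. The variable names and the tiny test matrix are assumptions, and points are assumed to be stored as columns, in line with the transposition note in the P.S.

```matlab
% Sketch: squared Euclidean distances between the columns of mX.
% mX is a small illustrative input: three 2-D points stored as columns.
mX  = [0 3 0; 0 4 1];

vSq = sum(mX .^ 2, 1);                                % 1-by-n squared norms of the columns
mSq = bsxfun(@plus, vSq.', vSq) - 2 * (mX.' * mX);    % n-by-n squared distances
mSq = max(mSq, 0);                                    % clamp tiny negatives caused by round-off

mD  = sqrt(mSq);                                      % Euclidean distances, if those are wanted
```

Most of the work here is a single matrix product, which MATLAB hands to its optimized linear algebra routines. The sketch still builds the full n-by-n result, so on its own it does not reduce the memory needed for a 60000-point input.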
Update
Since R2016b, MATLAB supports implicit broadcasting without the use of bsxfun(). Hence the code can be written:
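A minimal sketch of what the implicit-broadcasting form might look like, assuming the same squared-norm expansion and column-wise points. The variable names and the test matrix are illustrative, not taken from the answer's own code.

```matlab
% Sketch: same computation, relying on implicit expansion (MATLAB R2016b or later).
mX  = [0 3 0; 0 4 1];                        % illustrative input: three 2-D points as columns

vSq = sum(mX .^ 2, 1);                       % 1-by-n squared norms of the columns
mSq = (vSq.' + vSq) - 2 * (mX.' * mX);       % n-by-1 plus 1-by-n expands to n-by-n
mSq = max(mSq, 0);                           % clamp tiny negatives caused by round-off
```

The column vector vSq.' and the row vector vSq are expanded to a common n-by-n size automatically, which is what bsxfun(@plus, ...) did explicitly in older releases.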
A generalization is given in my Calculate Distance Matrix project.
P.S.
Using MATLAB's pdist for comparison: squareform(pdist(mX.')) is equivalent to CalcDistMtx(mX). Namely, the input should be transposed.
Computers are not infinitely large, or infinitely fast. People think that they have a lot of memory, a fast CPU, so they just create larger and larger problems, and then eventually wonder why their problem runs slowly. The fact is, this is NOT computational inefficiency. It is JUST an overloaded CPU.
As Oli points out in a comment, there are something like 2e9 values to compute, even assuming you only compute the upper or lower half of the distance matrix. (6e4^2/2 is approximately 2e9.) This will require roughly 16 gigabytes of RAM to store, assuming that only ONE copy of the array is created in memory. If your code is sloppy, you might easily double or triple that. As soon as you go into virtual memory, things get much slower.
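The estimate above can be checked with a few lines of arithmetic. This is a sketch that assumes 8 bytes per double-precision value and counts the n*(n-1)/2 unique pairs; the variable names are illustrative.

```matlab
% Back-of-the-envelope memory estimate for a 60000-point distance computation.
n             = 6e4;                    % number of observations
numPairs      = n * (n - 1) / 2;        % unique pairs: 1,799,970,000 (about 2e9)
bytesPerValue = 8;                      % double precision

bytesHalf = numPairs * bytesPerValue;   % condensed (pdist-style) vector
bytesFull = n^2 * bytesPerValue;        % full square matrix (squareform-style)

fprintf('Condensed vector: %.1f GB\n', bytesHalf / 1e9);   % prints 14.4 GB
fprintf('Full matrix:      %.1f GB\n', bytesFull / 1e9);   % prints 28.8 GB
```

The exact figure for the half matrix is about 14.4 GB, in line with the rough 16 GB quoted above, and it doubles as soon as the full square matrix is formed.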
Wanting a big problem to run fast is not enough. To really help you, we need to know how much RAM is available. Is this a virtual memory issue? Are you using 64-bit MATLAB, on a CPU that can handle all the needed RAM?
Well, I couldn't resist playing around. I created a Matlab mex C file called pdistc that implements pairwise Euclidean distance for single and double precision. On my machine using Matlab R2012b and R2015a it's 20–25% faster than pdist (and the underlying pdistmex helper function) for large inputs (e.g., 60,000-by-300).
As has been pointed out, this problem is fundamentally bounded by memory and you're asking for a lot of it. My mex C code uses minimal memory beyond that needed for the output. In comparing its memory usage to that of pdist, it looks like the two are virtually the same. In other words, pdist is not using lots of extra memory. Your memory problem is likely in the memory used up before calling pdist (can you use clear to remove any large arrays?) or simply because you're trying to solve a big computational problem on tiny hardware.
So, my pdistc function likely won't be able to save you memory overall, but you may be able to use another feature I built in. You can calculate chunks of your overall pairwise distance vector. Something like this:
This is considerably slower (10 times or so) and this part of my C code is not optimized well, but it will allow much less memory use – assuming that you don't need the entire array at one time. Note that you could do the same thing much more efficiently with pdist (or pdistc) by creating a loop where you passed in subsets of X directly, rather than all of it.
If you have a 64-bit Intel Mac, you won't need to compile as I've included the .mexmaci64 binary, but otherwise you'll need to figure out how to compile the code for your machine. I can't help you with that. It's possible that you may not be able to get it to compile or that there will be compatibility issues that you'll need to solve by editing the code yourself. It's also possible that there are bugs and the code will crash Matlab. Also, note that you may get slightly different outputs relative to pdist with differences between the two in the range of machine epsilon (eps). pdist may or may not do fancy things to avoid overflows for large inputs and other numeric issues, but be aware that my code does not.
Additionally, I created a simple pure Matlab implementation. It is massively slower than the mex code, but still faster than a naïve implementation or the code found in pdist.
All of the files can be found here. The ZIP archive includes all of the files. It's BSD licensed. Feel free to optimize (I tried BLAS calls and OpenMP in the C code to no avail – maybe some pointer magic or GPU/OpenCL could further speed it up). I hope that it can be helpful to you or someone else.
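To illustrate the idea of looping over subsets of X so that the whole distance array never has to exist at once, here is a minimal toolbox-free sketch. It does not use pdistc or its chunking feature. The block size, the variable names, and the choice of reducing each block to a nearest-neighbour distance are all assumptions made for illustration, and rows are treated as observations, following the pdist convention.

```matlab
% Sketch: process pairwise distances block by block instead of all at once.
X         = [0 0; 3 4; 0 1; 6 8];     % small stand-in for the 60000-by-300 matrix
n         = size(X, 1);
blockSize = 2;                        % assumption: pick this to fit the available RAM

vSq     = sum(X .^ 2, 2);             % n-by-1 squared norms of the rows
nearest = inf(n, 1);                  % example reduction: distance to the nearest other point

for iStart = 1:blockSize:n
    idx = iStart:min(iStart + blockSize - 1, n);

    % Squared distances from the rows in this block to all rows (numel(idx)-by-n).
    mSq = bsxfun(@plus, vSq(idx), vSq.') - 2 * (X(idx, :) * X.');
    mSq = max(mSq, 0);                % clamp tiny negatives caused by round-off

    % Exclude each point's distance to itself before taking the minimum.
    mSq(sub2ind(size(mSq), 1:numel(idx), idx)) = inf;

    nearest(idx) = sqrt(min(mSq, [], 2));
end
```

Only one block of distances is held in memory at a time. With 60000 observations and a block size of 1000, for example, each block would be a 1000-by-60000 double array of about 480 MB. If the distances themselves are needed later, each block could be written to disk inside the loop rather than reduced.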