I have many points inside a square. I want to partition the square in many small rectangles and check how many points fall in each rectangle, i.e. I want to compute the joint probability distribution of the points. I am reporting a couple of common sense approaches, using loops and not very efficient:
% Data
N = 1e5; % number of points
xy = rand(N, 2); % coordinates of points
xy(randi(2*N, 100, 1)) = 0; % add some points on one side
xy(randi(2*N, 100, 1)) = 1; % add some points on the other side
xy(randi(N, 100, 1), :) = 0; % add some points on one corner
xy(randi(N, 100, 1), :) = 1; % add some points on one corner
inds= unique(randi(N, 100, 1)); xy(inds, :) = repmat([0 1], numel(inds), 1); % add some points on one corner
inds= unique(randi(N, 100, 1)); xy(inds, :) = repmat([1 0], numel(inds), 1); % add some points on one corner
% Intervals for rectangles
K1 = ceil(sqrt(N/5)); % number of intervals along x
K2 = K1; % number of intervals along y
int_x = [0:(1 / K1):1, 1+eps]; % intervals along x
int_y = [0:(1 / K2):1, 1+eps]; % intervals along y
% First approach
tic
count_cells = zeros(K1 + 1, K2 + 1);
for k1 = 1:K1+1
inds1 = (xy(:, 1) >= int_x(k1)) & (xy(:, 1) < int_x(k1 + 1));
for k2 = 1:K2+1
inds2 = (xy(:, 2) >= int_y(k2)) & (xy(:, 2) < int_y(k2 + 1));
count_cells(k1, k2) = sum(inds1 .* inds2);
end
end
toc
% Elapsed time is 46.090677 seconds.
% Second approach
tic
count_again = zeros(K1 + 2, K2 + 2);
for k1 = 1:K1+1
inds1 = (xy(:, 1) >= int_x(k1));
for k2 = 1:K2+1
inds2 = (xy(:, 2) >= int_y(k2));
count_again(k1, k2) = sum(inds1 .* inds2);
end
end
count_again_fix = diff(diff(count_again')');
toc
% Elapsed time is 22.903767 seconds.
% Check: the two solutions are equivalent
all(count_cells(:) == count_again_fix(:))
How can I do it more efficiently in terms of time, memory, and possibly avoiding loops?
EDIT --> I have just found this as well, it's the best solution found so far:
tic
count_cells_hist = hist3(xy, 'Edges', {int_x int_y});
count_cells_hist(end, :) = []; count_cells_hist(:, end) = [];
toc
all(count_cells(:) == count_cells_hist(:))
% Elapsed time is 0.245298 seconds.
but it requires the Statistics Toolbox.
EDIT --> Testing solution suggested by chappjc
tic
xcomps = single(bsxfun(@ge,xy(:,1),int_x));
ycomps = single(bsxfun(@ge,xy(:,2),int_y));
count_again = xcomps.' * ycomps; %' 143x143 = 143x1e5 * 1e5x143
count_again_fix = diff(diff(count_again')');
toc
% Elapsed time is 0.737546 seconds.
all(count_cells(:) == count_again_fix(:))
chappjc's answer and using
hist3
are all good, but since I happened to want to have something like this some time ago and for some reason didn't findhist3
I wrote it myself, and I thought I'd post it here as a bonus. It usessparse
to do the actual counting and returns the result as a sparse matrix, so it may be useful for dealing with a multimodal distribution where different modes are far apart – or for someone who doesn't have the Statistics Toolbox.Application to francesco's data:
Called with output parameters the function just returns the result, without it makes a color plot.
Here's the function:
function [H, xs, ys] = hist2d(x, y, n, ax)
Improving on code in question
Your loops (and the nested dot product) can be eliminated with
bsxfun
and matrix multiplication as follows:The multiplication step accomplishes the AND and summation done in
sum(inds1 .* inds2)
, but without looping over the density matrix. EDIT: If you usesingle
instead ofdouble
, execution time is nearly halved, but be sure to convert your answer todouble
or whatever is required for the rest of the code. On my computer this takes around 0.5 sec.Note: With
rot90(count_again/size(xy,1),2)
you have a CDF, and inrot90(count_again_fix/size(xy,1),2)
you have a PDF.Using accumarray
Another approach is to use
accumarray
to make the joint histogram after we bin the data.Starting with
int_x
,int_y
,K1
,xy
, etc.:On my computer, this takes around 0.01 sec.
The output is the same as A.Donda's converted from sparse to full (
full(H)
). Although, as he A.Donda pointed out, it is correct to have the dimensions beK1
xK1
, rather than the size ofcount_again_fix
in the OPs code that wasK1+1
xK1+1
.To get the CDF, I believe you can simply apply
cumsum
to each axis of the PDF.I have written a simple mex function which works very well when N is large. Of course it's cheating but still ...
The function is
It can be saved in a file "joint_dist_points_2D.c", then compiled:
And check it out: