I have a vector and a cell array of strings (with repeats) of the same size. The cell array defines the groups, and I want to find the min/max value in the vector for each group.
For example:
value = randperm(5) %# just an example, non-unique in general
value =
4 1 2 3 5
group = {'a','b','a','c','b'};
[grnum, grname] = grp2idx(group);
I use the accumarray function for this:
grvalue = accumarray(grnum,value,[],@max);
So I have a new cell array with the unique group names (grname) and a new vector (grvalue).
grname =
'a'
'b'
'c'
grvalue =
4
5
3
But I also need the index, in the original vector, of each value that ended up in the new vector:
gridx = 1 5 4
Any ideas? It's not necessary to use accumarray, but I'm looking for a fast vectorized solution.
The best vectorized answer I can see is:
gridx = arrayfun(@(grix)find((grnum(:)==grix) & (value(:)==grvalue(grix)),1),unique(grnum));
but I cannot call this a "fast" vectorized solution. arrayfun is really useful, but generally no faster than a loop.
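As a quick check, the one-liner can be run on the five-element example from the question. In this sketch grnum and grvalue are typed in by hand (copied from the question's output), so grp2idx is not needed:

```matlab
%# Example data from the question; grnum and grvalue are written out by hand
value   = [4 1 2 3 5];
grnum   = [1; 2; 1; 3; 2];      %# 'a','b','a','c','b' -> 1,2,1,3,2
grvalue = [4; 5; 3];            %# per-group maxima

%# For each group, find the first position that is in the group and equals its max
gridx = arrayfun(@(grix) find((grnum(:)==grix) & (value(:)==grvalue(grix)), 1), unique(grnum));
disp(gridx.')                   %# displays 1 5 4
```

Because find(...,1) stops at the first match, a maximum that occurs more than once in a group resolves to its earliest position.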
However, the fastest answer is not always vectorized. If I re-implement the code as you wrote it, but with a larger data set:
nValues = 1000000;
value = floor(rand(nValues,1)*100000);
group = num2cell(char(floor(rand(nValues,1)*4)+'a'));
tic;
[grnum, grname] = grp2idx(group);
grvalue = accumarray(grnum,value,[],@max);
toc;
My computer gives me a tic/toc time of 0.886 seconds. (Note: all tic/toc times are from the second run of a function defined in a file, to avoid one-time pcode generation.)
Adding the "vectorized" (really arrayfun) one-line gridx computation brings the tic/toc time to 0.975 seconds. Not bad; additional investigation shows that most of the time is consumed in the grp2idx call.
If we reimplement this as a non-vectorized, simple loop, including the gridx computation, like this:
tic
[grnum, grname] = grp2idx(group);
grvalue = -inf*ones(size(grname));
gridx = zeros(size(grname));
for ixValue = 1:length(value)
    tmpGrIdx = grnum(ixValue);
    %# Keep the running max, and where it was found, for this group
    if value(ixValue) > grvalue(tmpGrIdx)
        grvalue(tmpGrIdx) = value(ixValue);
        gridx(tmpGrIdx) = ixValue;
    end
end
toc
the tic/toc time is about 0.847 seconds, slightly faster than the original code.
Taking this a bit further, most of the time appears to be lost in the cell-array memory access. For example:
tic; groupValues = double(cell2mat(group')); toc %Requires 0.754 seconds
tic; dummy = (cell2mat(group')); toc %Requires 0.718 seconds
If you initially define your group names as a numeric array (for example, I'll use groupValues as I defined them above), the times decrease quite a bit, even using the same code:
groupValues = double(cell2mat(group')); %I'm assuming this is precomputed
tic
[grnum, grname] = grp2idx(groupValues);
grname = num2cell(char(str2double(grname))); %Recapturing your original names
grvalue = -inf*ones(size(grname));
gridx = zeros(size(grname));
for ixValue = 1:length(value)
    tmpGrIdx = grnum(ixValue);
    %# Keep the running max, and where it was found, for this group
    if value(ixValue) > grvalue(tmpGrIdx)
        grvalue(tmpGrIdx) = value(ixValue);
        gridx(tmpGrIdx) = ixValue;
    end
end
toc
This produces a tic/toc time of 0.16 seconds.
When faced with a similar problem*, I came up with this solution:
Define the following function (in a .m file):
function i=argmax(x)
[~,i]=max(x);
end
then you can find the max locations as
gridx = accumarray(grnum,(1:numel(grnum))',[],@(i)i(argmax(value(i))) ); %# pass positions, not group numbers, as the values
and the max values as
grvalue = value(gridx);
(*if I understand your problem correctly)
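For comparison, here is a minimal sketch of a sort-based alternative that needs neither arrayfun nor a helper function. It is a different technique from the accumarray approach above, not a restatement of it. It assumes grnum holds one group index per element with no NaN entries; the names order and isLast are introduced here purely for illustration.

```matlab
%# Example data from the question; grnum is written out by hand
value = [4 1 2 3 5];
grnum = [1; 2; 1; 3; 2];

%# Sort rows by group, then by value within each group (both ascending)
[~, order] = sortrows([grnum(:), value(:)]);

%# The last row of each group in sorted order holds that group's maximum
sortedGr = grnum(order);
isLast   = [diff(sortedGr) ~= 0; true];

gridx   = order(isLast);        %# positions of the maxima in the original vector
grvalue = value(gridx);         %# the maxima themselves
grvalue = grvalue(:);           %# force a column, matching gridx
```

For the example data this gives gridx = [1; 5; 4] and grvalue = [4; 5; 3]. If a group's maximum occurs more than once, one of its positions is returned. Replacing the isLast line with [true; diff(sortedGr) ~= 0] picks the first row of each group instead, which gives the per-group minimum.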