Let's assume I have the following 9 x 5 matrix:
myArray = [
54.7 8.1 81.7 55.0 22.5
29.6 92.9 79.4 62.2 17.0
74.4 77.5 64.4 58.7 22.7
18.8 48.6 37.8 20.7 43.5
68.6 43.5 81.1 30.1 31.1
18.3 44.6 53.2 47.0 92.3
36.8 30.6 35.0 23.0 43.0
62.5 50.8 93.9 84.4 18.4
78.0 51.0 87.5 19.4 90.4
];
I have 11 "subsets" of this matrix and I need to run a function (let's say max
) on each of these subsets. The subsets can be identified with the following matirx of logicals (identified column-wise, not row-wise):
myLogicals = logical([
0 1 0 1 1
1 1 0 1 1
1 1 0 0 0
0 1 0 1 1
1 0 1 1 1
1 1 1 1 0
0 1 1 0 1
1 1 0 0 1
1 1 0 0 1
]);
or via linear indexing:
starts = [2 5 8 10 15 23 28 31 37 40 43]; % index start of each subset
ends = [3 6 9 13 18 25 29 33 38 41 45]; % index end of each subset
such that the first subset is 2:3, the second is 5:6, and so on.
I can find the max of each subset and store it in a vector as follows:
finalAnswers = NaN(11,1);
for n=1:length(starts) % i.e. 1 through the number of subsets
finalAnswers(n) = max(myArray(starts(n):ends(n)));
end
After the loop runs, finalAnswers contains the maximum value of each of the data subsets:
74.4 68.6 78.0 92.9 51.0 81.1 62.2 47.0 22.5 43.5 90.4
Is it possible to obtain the same result without the use of a for loop? In other words, can this code be vectorized? Would such an approach be more efficient than the current one?
EDIT: I did some testing of the proposed solutions. The data I used was a 1,510 x 2,185 matrix with 10,103 subsets that varied in length from 2 to 916 with a standard deviation of subset length of 101.92.
I wrapped each solution in tic;for k=1:1000 [code here] end; toc;
and here are the results:
- for loop approach --- Elapsed time is 16.237400 seconds.
- Shai's approach --- Elapsed time is 153.707076 seconds.
- Dan's approach --- Elapsed time is 44.774121 seconds.
- Divakar's approach #2 --- Elapsed time is 127.621515 seconds.
Notes:
- I also tried benchmarking Dan's approach by wrapping the k=1:1000 for loop around just the accumarray line (since the rest could theoretically be run just once). In this case the time was 28.29 seconds.
- Benchmarking Shai's approach, while leaving the lb = ... line out of the k loop, gave a time of 113.48 seconds.
- When I ran Divakar's code, I got "Non-singleton dimensions of the two input arrays must match each other." errors for the bsxfun lines. I "fixed" this by using conjugate transposition (the apostrophe operator ') on trade_starts(1:starts_extent) and intv(1:starts_extent) in the lines of code calling bsxfun. I'm not sure why this error was occurring...
I'm not sure if my benchmarking setup is correct, but it appears that the for loop actually runs the fastest in this case.
One approach is to use accumarray. Unfortunately, in order to do that we first need to "label" your logical matrix. Here is a convoluted way of doing that if you don't have the Image Processing Toolbox:
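The original code block is not reproduced here; a minimal sketch of one such labelling step, assuming the runs are taken column-wise (variable names are my own):

% label each column-wise run of true values with 1, 2, 3, ...
v      = double(myLogicals(:).');               % 1-by-numel(myLogicals) mask
labels = cumsum([v(1), diff(v) == 1]) .* v;     % 0 outside runs, 1..11 inside runs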
So that just does what Shai's bwlabeln implementation does (but this will be 1-by-numel(myLogicals) in shape, as opposed to size(myLogicals) in shape).

Now you can use accumarray:
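A sketch of that call, built on the labels from the snippet above (not necessarily the answer's exact line):

subs         = labels(myLogicals(:));            % label of every selected element
finalAnswers = accumarray(subs(:), myArray(myLogicals), [], @max);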
or else it may be faster to try
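The alternative is also not reproduced here; one possibility (my own sketch, not necessarily what the answer proposed) is to build the subscripts and linear indices straight from starts and ends with repelem (R2015a or newer), skipping the labelling pass entirely:

lens = ends - starts + 1;
subs = repelem((1:numel(starts)).', lens(:));                    % subset id per element
offs = (1:sum(lens)).' - repelem(cumsum([0, lens(1:end-1)]).', lens(:));
idx  = repelem(starts(:), lens(:)) + offs - 1;                   % linear index per element
finalAnswers = accumarray(subs, myArray(idx), [], @max);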
This is fully vectorized, but is it worth it? You'll have to do speed tests against your loop solution to know.
Efficiency
Measure both the vectorised and the for-loop code samples on your respective platform (be it a <localhost> or a Cloud-based one) to see the difference:
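The answer's measurement code is not shown here; a minimal sketch, assuming the question's myArray, myLogicals, starts and ends are in the workspace, that times the for-loop version against one vectorised (label + accumarray) version:

tic                                              % for-loop version
finalAnswers = NaN(numel(starts), 1);
for n = 1:numel(starts)
    finalAnswers(n) = max(myArray(starts(n):ends(n)));
end
t_loop = toc;

tic                                              % vectorised version
v          = double(myLogicals(:).');
labels     = cumsum([v(1), diff(v) == 1]) .* v;  % label column-wise runs
subs       = labels(myLogicals(:));
vecAnswers = accumarray(subs(:), myArray(myLogicals), [], @max);
t_vec = toc;

fprintf('for-loop: %.6f s | vectorised: %.6f s\n', t_loop, t_vec);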
For a better interpretation of the test results, test on much larger sizes than just a few tens of rows/columns.
EDIT: An erroneous piece of code has been removed, thanks to Dan for the notice. Taking more care to emphasize quantitative validation, which may show that vectorised code can be faster but need not be so in all circumstances, is no excuse for faulty code, for sure.
Output - quantitatively comparative data:
While recommended, it is IMHO not fair to assume that memory allocation and similar overheads should be excluded from the in-vivo testing. Test re-runs typically show improvements from VM-page hits and other caching artifacts, while the raw first "virgin" run is what typically appears in real code deployment (excl. external iterators, for sure). So consider the results with care and re-test in your real environment (which is sometimes run as a Virtual Machine inside a bigger system -- that also makes it necessary to take VM-swap mechanics into account once huge matrices start to hurt real-life memory-access patterns).
On other projects I am used to using [usec] granularity for the real-time test timing, but then more care must be taken about the test-execution conditions and the O/S background. So nothing but testing gives relevant answers to your specific code/deployment situation; however, be methodical and compare data that are comparable in principle.
Alarik's code:
Dan's code:
Use bwlabeln with a vertical connectivity; now you have a label 1..11 for each region.
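The original call is not reproduced here; a minimal sketch, assuming the Image Processing Toolbox and a 3-by-3 neighbourhood with ones only in its centre column (i.e. vertical-only connectivity):

lb = bwlabeln(myLogicals, [0 1 0; 0 1 0; 0 1 0]);   % label vertical runs 1..11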
To get the max value you can use regionprops:
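A sketch of such a call, passing the label matrix together with the data as an intensity image (property name 'MaxIntensity'):

st           = regionprops(lb, myArray, 'MaxIntensity');
finalAnswers = [st.MaxIntensity].';                 % one max per labelled region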
You can use regionprops to get some other properties of each subset, but it is not too general. If you wish to apply a more general function to each region, e.g., median, you can use accumarray:
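A sketch of the more general form (again not the exact code), applying any reducing function per label, here median:

finalAnswers = accumarray(lb(myLogicals), myArray(myLogicals), [], @median);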
Ideas behind vectorization and optimization
One of the approaches that one can employ to vectorize this problem is to convert the subsets into regular-shaped blocks and then find the max of the elements of those blocks in one go. Converting to regular-shaped blocks has one issue here: the subsets are unequal in length. To avoid this issue, one can create a 2D matrix of indices starting from each of the starts elements and extending until the maximum of the subset lengths. The good thing about this is that it allows vectorization, but at the cost of more memory, the amount of which depends on how scattered the subset lengths are.

Another issue with this vectorization technique is that it could potentially create out-of-limits indices for the final subsets. To avoid this, one can think of two possible ways:

- Use a bigger input array, extended such that the maximum of the subset lengths plus the starts indices still lies within the confines of the extended array.
- Use the original input array for the starts as long as we stay within the limits of the original input array, and then use the original loop code for the rest of the subsets. We can call it mixed programming, just for the sake of a short title. This saves us the memory needed to create the extended array discussed in the other approach.

These two ways/approaches are listed next.
Approach #1: Vectorized technique
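The original implementation is not included here; a sketch of the vectorized idea as described above (an extended, -Inf-padded copy of the array, a starts-by-maxlen index block, and a columnwise max), with helper names of my own:

lens   = ends - starts + 1;                      % subset lengths
maxlen = max(lens);

% index block: column n holds starts(n), starts(n)+1, ..., starts(n)+maxlen-1
idx   = bsxfun(@plus, starts, (0:maxlen-1).');
valid = bsxfun(@le, (1:maxlen).', lens);         % true where the index is inside its subset

% extend the array so the largest index stays in bounds; pad with -Inf
pad    = max(0, max(idx(:)) - numel(myArray));
extArr = [myArray(:); -Inf(pad, 1)];

vals         = extArr(idx);
vals(~valid) = -Inf;                             % out-of-subset entries can never win the max
finalAnswers = max(vals, [], 1).';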
Approach #2: Mixed programming
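Likewise a sketch only (not the original code): vectorize the subsets whose full index block stays inside the original array, and fall back to the plain loop for the remaining subsets near the end; starts_extent marks the split, assuming the starts are sorted:

lens   = ends - starts + 1;
maxlen = max(lens);

% last subset whose maxlen-block still fits inside the original array
starts_extent = find(starts + maxlen - 1 <= numel(myArray), 1, 'last');
if isempty(starts_extent), starts_extent = 0; end

idx   = bsxfun(@plus, starts(1:starts_extent), (0:maxlen-1).');
valid = bsxfun(@le, (1:maxlen).', lens(1:starts_extent));

vals         = myArray(idx);
vals(~valid) = -Inf;

finalAnswers = NaN(numel(starts), 1);
finalAnswers(1:starts_extent) = max(vals, [], 1);

% loop code for whatever is left near the end of the array
for n = starts_extent+1:numel(starts)
    finalAnswers(n) = max(myArray(starts(n):ends(n)));
end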
Benchmarking
In this section we will compare the two approaches and the original loop code against each other for performance. Let's set up the codes before starting the actual benchmarking -
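The original setup code is not shown; a possible sketch (assumed sizes matching Case 1) that builds a large random array and random non-overlapping subsets:

num_rows = 1510;  num_cols = 2185;               % input size for Cases 1 and 2
num_subs = 10000;                                % number of subsets (Case 1)

myArray = rand(num_rows, num_cols);

% sorted random cut points, paired up into non-overlapping [starts, ends]
cuts   = sort(randperm(numel(myArray), 2*num_subs));
starts = cuts(1:2:end);
ends   = cuts(2:2:end);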
Now, the placebo codes that get us the runtimes -
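Again only a sketch of the wrapper, not the original code; each candidate is wrapped in the same tic / for k = 1:num_runs / toc shell and the per-run time is reported:

num_runs = 100;                                  % tuned per case, see the note below

tic
for k = 1:num_runs
    finalAnswers = NaN(numel(starts), 1);
    for n = 1:numel(starts)
        finalAnswers(n) = max(myArray(starts(n):ends(n)));
    end
end
fprintf('Original loop code : %.6f sec per run\n', toc/num_runs);

% ...and the same tic/for/toc shell wrapped around Approach #1 and Approach #2.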
Benchmark Results
In your comments, you mentioned using a 1510 x 2185 matrix, so let's do two case runs with that size and subsets of size 10000 and 2000.

Case 1 [Input - 1510 x 2185 matrix, Subsets - 10000]
Case 2 [Input - 1510 x 2185 matrix, Subsets - 2000]
Case 3 [Bigger Input - 3000 x 3000 matrix, Subsets - 20000]
Note that the number of runs num_runs was varied to keep the runtime of the fastest approach close to 1 sec.

Conclusions
So, I guess the mixed programming (approach #2) is the way to go! As future work, one could incorporate the standard deviation into the scattered-ness criteria, if the performance suffers because of the scattered-ness, and offload the work for the most scattered subsets (in terms of their lengths) onto the loop code.