Preallocation of cell array in matlab

2020-03-21 05:54发布

问题:

This is more a question to understand a behavior rather than a specific problem.

Mathworks states that numerical are stored continuous which makes preallocation important. This is not the case for cell arrays.

Are they something similar than vector or array of pointers in C++?

This would mean that prealocation is not so important since a pointer is half the size of a double (according to whos - but there surely is overhead somewhere to store the datatype of the mxArray).

Running this code:

clear all
n = 1e6;

tic
A = [];
for i=1:n
    A(end + 1) = 1;
end
fprintf('Numerical without preallocation %f s\n',toc)

clear A

tic
A = zeros(1,n);
for i=1:n
    A(i) = 1;
end
fprintf('Numerical with preallocation %f s\n',toc)

clear A
tic
A = cell(0);
for i=1:n
    A{end + 1} = 1;
end
fprintf('Cell without preallocation %f s\n',toc)

tic
A = cell(1,n);
for i=1:n
    A{i} = 1;
end
fprintf('Cell with preallocation %f s\n',toc)

returns: Numerical without preallocation 0.429240 s Numerical with preallocation 0.025236 s Cell without preallocation 4.960297 s Cell with preallocation 0.554257 s

There is no surprise for the numerical values. But the did surprise me since only the container of the pointers and not the data itself would need reallocation. Which should (since the pointer is smaller than a double) lead to difference of <.2s. Where does this overhead come from?

A related question would be, if I would like to make a data container for heterogeneous data in Matlab (preallocation is not possible since the final size is not known in the beginning). I think handle classes are not good since the also have huge overhead.

already looking forward to learn something

magu_

Edit: I tried out the linked list proposed by Eitan T but I think the overhead from matlab is still rather big. I tried something with an double array as data (rand(200000,1)).

I made a little plot to illustrate:

code for the graph: (I used the dlnode class from the matlab hompage as stated in the answering post)

D = rand(200000,1);

s = linspace(10,20000,50);
nC = zeros(50,1);
nL = zeros(50,1);

for i = 1:50
a = cell(0);

tic
for ii = 1:s(i)
    a{end + 1} = D;
end
nC(i) = toc;

a = list([]);

tic
for ii = 1:s(i)
    a.insertAfter(list(D));
end
nL(i) = toc;

end

figure
plot(s,nC,'r',s,nL,'g')
xlabel('#iter')
ylabel('time (s)')
legend({'cell' 'list'})

Don't get me wrong I love the idea of linked list, since there are rather flexible, but I think the overhead might be to big.

回答1:

Are cell arrays something similar to a vector or an array of pointers in C++?

Cell arrays allow storing data of different types and sizes indeed, but each cell also adds a constant overhead of 112 bytes (see this other answer of mine). This is far more than an 8-byte double, and this is non-negligible, especially when dealing with large cell arrays as in your example.

It is reasonable to assume that a cell array is implemented as a continuous array of pointers, each pointing to the actual content of the cell.

This means that you can modify the content of each cell individually without actually resizing the cell array container itself. However, this also means that adding new cells to the cell array requires dynamic storage allocation and this is why preallocating memory for a cell array improves performance.

A related question would be, if I would like to make a data container for heterogeneous data in Matlab (preallocation is not possible since the final size is not known in the beginning)

Not knowing the final size may indeed be a problem, but you could always preallocate a cell array with the maximum supported size necessary (if there is one), and remove the empty cells in the end. I also suggest that you look into implementing linked lists in MATLAB.