MATLAB matfile increases in size when overwriting

Posted 2019-01-20 04:01

Question:

Due to large data size and frequent automatic saves I decided to change the method of saving from the standard save() function to partial saving using a matfile object:

https://www.mathworks.com/help/matlab/ref/matfile.html

I made this change because save() rewrites everything even if only a minor change was made to the structure, which greatly slows the program. However, I noticed that the time to save with matfile increased linearly with each call, and after some debugging I found that the cause was the file size increasing every time, even when data is overwritten with the same data. Here is an example:

% Save MAT file with string variable and cell variable
  stringvar = 'hello'
  cellvar = {'world'}
  save('test.mat', 'stringvar', 'cellvar', '-v7.3')
  m = matfile('test.mat', 'Writable', true);
% Get number of bytes of MAT file
  f = dir('test.mat'); f.bytes
% Output: 3928 - initial size
% Overwrite stringvar with same data.
  m.stringvar = 'hello';
  f = dir('test.mat'); f.bytes
% Output: 3928 - same as before
% Overwrite cellvar with same data.
  m.cellvar = {'world'};
  f = dir('test.mat'); f.bytes
% Output: 4544 - size increased

I don't understand why the number of bytes increases when the data is the same. It adds a very noticeable delay that grows with each save, which defeats the purpose of partial saving. Any idea what's going on here? Help on this would be greatly appreciated!

Answer 1:

This is due to the way that cell arrays and more complex datatypes are stored (and updated) within 7.3 (HDF5) MAT-files. Since a cell array can contain mixed data types, MATLAB stores the cell array variable in the root (/) HDF5 group as a series of references. These references point into the /#refs# group, which holds one dataset per cell with that cell's actual data.

Whenever you overwrite the cell array value, new datasets representing the cell array element data are appended to the /#refs# HDF5 group, and the references in the / group are updated to point to this new data. The old (and now unused) datasets in /#refs# are not removed. This is the designed behavior of HDF5 files, since removing data from a file would require shifting all file contents after the deleted region to "close the gap", and this would incur a (potentially huge) performance penalty**.

We can use h5disp to look at the contents of the file that MATLAB is creating to illustrate this. Below I'll use an abbreviated output of h5disp so it's more legible:

stringvar = 'hello';
cellvar = {'world'};
save('test.mat', 'stringvar', 'cellvar', '-v7.3')

h5disp('test.mat')
% HDF5 test.mat
%    Group '/'
%        Dataset 'cellvar'                  <--- YOUR CELL ARRAY
%            Size:  1x1                     <--- HERE IS ITS SIZE
%            Datatype:   H5T_REFERENCE      <--- THE ACTUAL DATA LIVES IN /#REFS#
%            Attributes:
%                'MATLAB_class':  'cell'
%        Dataset 'stringvar'                <--- YOUR STRING
%            Size:  1x5                     <--- HAS 5 CHARACTERS
%            Datatype:   H5T_STD_U16LE (uint16)
%            Attributes:
%                'MATLAB_class':  'char'
%                'MATLAB_int_decode':  2
%        Group '/#refs#'                    <--- WHERE THE DATA FOR THE CELL ARRAY LIVES
%            Attributes:
%                'H5PATH':  '/#refs#'
%            Dataset 'a'
%                Size:  2
%                Datatype:   H5T_STD_U64LE (uint64)
%                Attributes:
%                    'MATLAB_empty':  1
%                    'MATLAB_class':  'canonical empty'
%            Dataset 'b'                    <--- THE CELL ARRAY DATA
%                Size:  1x5                 <--- CONTAINS A 5-CHAR STRING
%                Datatype:   H5T_STD_U16LE (uint16)
%                Attributes:
%                    'MATLAB_class':  'char'
%                    'MATLAB_int_decode':  2
%                    'H5PATH':  '/#refs#/b'

%% Now we want to replace the string with a 6-character string
m = matfile('test.mat', 'Writable', true);   % open the file for partial writes
m.stringvar = 'hellos';
h5disp('test.mat')
% HDF5 test.mat
%    Group '/'
%        Dataset 'cellvar'                      <--- THIS REMAINS UNCHANGED
%            Size:  1x1
%            Datatype:   H5T_REFERENCE
%            Attributes:
%                'MATLAB_class':  'cell'
%        Dataset 'stringvar'
%            Size:  1x6                         <--- JUST INCREASED THE LENGTH OF THIS TO 6
%            Datatype:   H5T_STD_U16LE (uint16)
%            Attributes:
%                'MATLAB_class':  'char'
%                'MATLAB_int_decode':  2
%        Group '/#refs#'
%            Attributes:
%                'H5PATH':  '/#refs#'
%            Dataset 'a'                        <--- NONE OF THIS HAS CHANGED
%                Size:  2
%                Datatype:   H5T_STD_U64LE (uint64)
%                Attributes:
%                    'MATLAB_empty':  1
%                    'MATLAB_class':  'canonical empty'
%            Dataset 'b'
%                Size:  1x5
%                Datatype:   H5T_STD_U16LE (uint16)
%                Attributes:
%                    'MATLAB_class':  'char'
%                    'MATLAB_int_decode':  2
%                    'H5PATH':  '/#refs#/b'

%% Now change the cell (and replace with a 6-character string)
m.cellvar = {'worlds'};
h5disp('test.mat')
% HDF5 test.mat
%    Group '/'
%        Dataset 'cellvar'                  <--- HERE IS YOUR CELL ARRAY AGAIN
%            Size:  1x1
%            Datatype:   H5T_REFERENCE      <--- STILL A REFERENCE
%            Attributes:
%                'MATLAB_class':  'cell'
%        Dataset 'stringvar'                <--- STRING VARIABLE UNCHANGED
%            Size:  1x6
%            Datatype:   H5T_STD_U16LE (uint16)
%            Attributes:
%                'MATLAB_class':  'char'
%                'MATLAB_int_decode':  2
%        Group '/#refs#'
%            Attributes:
%                'H5PATH':  '/#refs#'
%            Dataset 'a'                            <--- THE OLD DATA IS STILL HERE
%                Size:  2
%                Datatype:   H5T_STD_U64LE (uint64)
%                Attributes:
%                    'MATLAB_empty':  1
%                    'MATLAB_class':  'canonical empty'
%            Dataset 'b'                            <--- THE OLD DATA IS STILL HERE
%                Size:  1x5
%                Datatype:   H5T_STD_U16LE (uint16)
%                Attributes:
%                    'MATLAB_class':  'char'
%                    'MATLAB_int_decode':  2
%                    'H5PATH':  '/#refs#/b'
%            Dataset 'c'                            <--- THE NEW DATA IS ALSO HERE
%                Size:  2
%                Datatype:   H5T_STD_U64LE (uint64)
%                Attributes:
%                    'MATLAB_empty':  1
%                    'MATLAB_class':  'canonical empty'
%            Dataset 'd'                            <--- THE NEW DATA IS ALSO HERE
%                Size:  1x6                         <--- NOW WITH 6 CHARACTERS
%                Datatype:   H5T_STD_U16LE (uint16)
%                Attributes:
%                    'MATLAB_class':  'char'
%                    'MATLAB_int_decode':  2
%                    'H5PATH':  '/#refs#/d'

It is this increasing size of the #refs# group that is resulting in your file size increase. Since #refs# contains the actual data, all data within cell array elements that you are replacing will be duplicated each time that you save the file.
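The growth described above can also be measured directly instead of read off the h5disp listing. The following is a minimal sketch, assuming that h5info accepts '/#refs#' as a location and that the matfile handle does not block concurrent reads (the h5disp calls above suggest it does not). The scratch file name and the variables nWrites, refCount and fileBytes are hypothetical names chosen for illustration.

```matlab
% Sketch: track the number of datasets in /#refs# and the file size
% while the same cell value is written repeatedly.
fname = [tempname '.mat'];              % hypothetical scratch file
cellvar = {'world'};
save(fname, 'cellvar', '-v7.3');
m = matfile(fname, 'Writable', true);

nWrites   = 5;                          % number of identical overwrites
refCount  = zeros(1, nWrites + 1);      % datasets in /#refs# after each step
fileBytes = zeros(1, nWrites + 1);      % file size after each step

info = h5info(fname, '/#refs#');
refCount(1) = numel(info.Datasets);
d = dir(fname);
fileBytes(1) = d.bytes;

for k = 1:nWrites
    m.cellvar = {'world'};              % same data every time
    info = h5info(fname, '/#refs#');
    refCount(k + 1) = numel(info.Datasets);
    d = dir(fname);
    fileBytes(k + 1) = d.bytes;
end

% First row: dataset counts, second row: file sizes in bytes
disp([refCount; fileBytes]);
```

If the explanation above holds, both rows should grow with every write even though the stored value never changes.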

As for why MathWorks opted to use HDF5 for 7.3 MAT-files despite this seemingly big limitation, it seems that the motivation for 7.3 files was to make it easier to access data within the files, not to optimize file size.

One possible workaround is to use the 7.0 format, which is not HDF5-based and whose file size does not grow when cell array variables are modified. The main downsides of 7.0 compared with 7.3 are that you can't modify just part of a variable in a 7.0 file and that each variable is limited to 2 GB. An added benefit is that for complex data, 7.0 .mat files are typically faster to read and write than 7.3 HDF5 files.

% Helper function to tell us the size
printsize = @(filename)disp(getfield(dir(filename), 'bytes'));

stringvar = 'hello';
cellvar = {'world'};

% Save as 7.0 version
save('test.mat', 'stringvar', 'cellvar', '-v7')
printsize('test.mat')
%   256

m = matfile('test.mat', 'Writable', true);

m.stringvar = 'hello';
printsize('test.mat')
%   256

m.cellvar = {'world'};
printsize('test.mat')
%   256

If you still want to use 7.3 files, it may be worth copying the cell array into a temporary variable, modifying that within your functions, and writing it back to the file only rarely to prevent unnecessary writes.

tmp = m.cellvar;

% Make many modifications
tmp{1} = 'hello';
tmp{2} = 'world';
tmp{1} = 'Just kidding!';

% Write once after all changes have been made
m.cellvar = tmp;
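For automatic saves, the same idea can be turned into a simple write-back policy. This is only a sketch under stated assumptions: the threshold flushEvery, the counters pending and nFlushes, and the scratch file are hypothetical and not part of the original program.

```matlab
% Sketch: buffer changes in memory and write the cell array back
% only every flushEvery modifications.
fname = [tempname '.mat'];              % hypothetical scratch file
cellvar = {'hello', 'world'};
save(fname, 'cellvar', '-v7.3');
m = matfile(fname, 'Writable', true);

tmp        = m.cellvar;                 % in-memory working copy
flushEvery = 100;                       % hypothetical threshold
pending    = 0;                         % unsaved modifications
nFlushes   = 0;                         % writes actually performed

for k = 1:250
    tmp{1} = sprintf('value %d', k);    % cheap in-memory change
    pending = pending + 1;
    if pending >= flushEvery
        m.cellvar = tmp;                % one write for many changes
        pending = 0;
        nFlushes = nFlushes + 1;
    end
end

% Final flush so the last changes are not lost
if pending > 0
    m.cellvar = tmp;
    nFlushes = nFlushes + 1;
end
```

With these numbers the file is written three times instead of 250, so /#refs# grows far more slowly. The trade-off is that up to flushEvery - 1 modifications can be lost if the program stops before the next flush.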

** Normally you could use h5repack to reclaim the unused space in the file; however, MATLAB doesn't actually delete the data within /#refs# so h5repack has no effect. From what I gather, you'd have to delete the data yourself and then use h5repack to free up the unused space.

fid = H5F.open('test.mat', 'H5F_ACC_RDWR', 'H5P_DEFAULT');

% I've hard-coded these names just as an example
H5L.delete(fid, '/#refs#/a', 'H5P_DEFAULT')
H5L.delete(fid, '/#refs#/b', 'H5P_DEFAULT')
H5F.close(fid);

system('h5repack test.mat test.repacked.mat');
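If the data fits in memory, another option that avoids editing /#refs# by hand is to load the grown file occasionally and save it again as a fresh file, since save writes a new file containing only the current values. This is a sketch rather than a tested recipe: the file names src and dst are hypothetical, and the expectation that the rewritten file is no larger than the grown one follows from the explanation above rather than from documented behavior.

```matlab
% Sketch: compact a grown v7.3 MAT-file by rewriting it from scratch.
src = [tempname '.mat'];                % hypothetical grown file
dst = [tempname '.mat'];                % hypothetical compacted copy

cellvar = {'world'};
save(src, 'cellvar', '-v7.3');
m = matfile(src, 'Writable', true);
for k = 1:20
    m.cellvar = {'world'};              % each write leaves orphans in /#refs#
end
clear m                                 % release the matfile handle

s = load(src);                          % reads only the live variables
save(dst, '-struct', 's', '-v7.3');     % fresh file without the orphans

sizeBefore = dir(src);
sizeAfter  = dir(dst);
fprintf('grown: %d bytes, rewritten: %d bytes\n', ...
    sizeBefore.bytes, sizeAfter.bytes);
```

This costs a full read and write of the file, so it only makes sense as an occasional maintenance step, not on every save.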