Referencing and assigning a subset of a MATLAB dataset appears to be extremely inefficient, and the time taken seems to scale like rows^2.
Example:
alldata is a large dataset of mixed data - say 150,000 rows by 25 columns (integer, boolean and string).
The format for the dataset is:
'format', '%s%u%u%u%u%u%s%s%s%s%s%s%s%u%u%u%u%s%u%s%s%u%s%s%s%s%u%s%u%s%s%s%u%s'
I then convert two of the integer columns to boolean.
The following subset assignment:
somedata = alldata(1:m,:)
takes >7 sec for m = 10,000 and ridiculous amounts of time for larger values of m. Plotting time against m shows an m^2-type relationship, which is strange given that copying alldata is nearly instantaneous, as is using functions like sortrows and find. In fact, reading the original .csv data file is faster than the above assignment for large values of m.
Using the profiler, it appears that the dataset's subsref function contains a very slow line that performs string comparisons to determine unique values within the dataset. Is this related to how the dataset type is stored (i.e. a reference table)? The dataset includes a large number of unique string values.
Are there any solutions for extracting a subset of a dataset in MATLAB, such as preallocation (how?), or copying the dataset and deleting rows rather than assigning an extract/subset?
I am using a dual-core machine with 1.5 GB of RAM, but Task Manager reports less than 1 GB of RAM in use.
I have previously worked with MATLAB's dataset arrays for large data, and unfortunately it's true that they suffer from performance issues. One thing I found that helps speed things up is to clear the observation names (ObsNames) property.
Try the following fix:
%# I assume you have a 'dataset' object
ds = dataset(...);
%# clear the observation names property (it is simply a label for each record)
ds.Properties.ObsNames = [];
Amro suggested clearing the observation names:
ds.Properties.ObsNames = [];
This gives a massive performance benefit: the time for the subset assignment changes from quadratic in the number of rows (linear when plotted against rows^2) to linear in the number of rows, at the minor cost of losing the ObsNames.
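If the row labels are still needed afterwards, one possible workaround is to keep them in a separate cell array and reattach them to the (much smaller) subset. The following is only a sketch under assumptions: the toy dataset, its variable names `id` and `label`, and the helper variable `savedNames` are made up for illustration, and the cost of the reattachment step has not been measured here.

```matlab
% Hypothetical toy dataset standing in for the real data (assumption)
ds = dataset({(1:5)', 'id'}, {{'a';'b';'c';'d';'e'}, 'label'}, ...
             'ObsNames', {'r1';'r2';'r3';'r4';'r5'});
m = 3;                                      % number of rows to extract

savedNames = ds.Properties.ObsNames;        % keep the labels aside
ds.Properties.ObsNames = [];                % clear them before subsetting
sub = ds(1:m, :);                           % subset without observation names
sub.Properties.ObsNames = savedNames(1:m);  % optional: reattach the labels
```

Reattaching the names only touches m labels rather than the full set, but whether it is worthwhile for a given data size would need to be timed.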
Copying a dataset is near instantaneous, so copying the whole dataset and then deleting the unneeded rows also gives a massive performance improvement, with no loss of ObsNames. It is a slightly less optimal solution, though: it is about 2x slower than subsetting with ObsNames dropped. Dropping ObsNames as well improves the copy-and-delete timings by only about 2%.
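The copy-and-delete approach can be sketched as follows. The toy dataset and its variable names `id` and `flag` are assumptions used only to keep the example self-contained. Note the `m+1` in the deletion range, which keeps exactly the first m rows.

```matlab
% Hypothetical toy dataset standing in for 'alldata' (assumption)
alldata = dataset({(1:6)', 'id'}, {logical([1 0 1 0 1 0])', 'flag'});
m = 4;                          % number of rows to keep

somedata = alldata;             % copy the full dataset (cheap)
somedata(m+1:end, :) = [];      % delete every row after row m
```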
Supporting data
I used a small script to assign a subset of the rows of a 150,000 x 25 mixed string/integer/boolean dataset, which generated the following time measurements (seconds).
The memory heap size made no significant difference in performance and was left at 128 MB.
"subsref" means the standard subset assignment, somedata = alldata(1:m,:), was used; "Delete" means the dataset was copied and the unneeded rows were then deleted.
Rows      subsref   subsref, ObsNames=[]   Delete   Delete, ObsNames=[]
8000      4.19      2.06                   4.81     4.72
32000     57.61     2.49                   5.26     5.21
72000     390.72    3.21                   6.09     6.03
128000    ?(*)      4.21                   7.25     7.19
(*) I gave up on evaluating this value. Based on linear extrapolation against rows^2 I would guess 2000 sec, or half an hour.
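For anyone who wants to repeat this kind of extrapolation, a straight-line fit of time against rows^2 could look like the sketch below. It uses only the three measured values from the table. The choice of polyfit and the scaling of rows to thousands are assumptions made here for illustration, and a different fitting method can give a noticeably different estimate from the rough guess above.

```matlab
rowsK = [8 32 72];                 % measured row counts, in thousands
t     = [4.19 57.61 390.72];       % measured times in seconds (from the table)

% Fit t = p(1)*rowsK^2 + p(2); scaling to thousands keeps the fit well conditioned
p = polyfit(rowsK.^2, t, 1);

tEst = polyval(p, 128^2);          % extrapolate to 128,000 rows
fprintf('Estimated time for 128000 rows: %.0f s\n', tEst);
```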
Script
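clear
load('data');                          % load the 'alldata' dataset
% alldata.Properties.ObsNames = [];    % uncomment to drop ObsNames
x = (1:4).^2 .* 8000;                  % row counts: 8000, 32000, 72000, 128000
t = zeros(size(x));                    % preallocate the timing results
for h = 1:length(x)
    tStart = tic;
    somedata = alldata(1:x(h),:);      % subset via standard indexing
    % Alternative: copy everything, then delete the unneeded rows
    % somedata = alldata;
    % somedata(x(h)+1:end,:) = [];     % keep rows 1..x(h) only
    t(h) = toc(tStart);
    clear somedata
    disp([x(h), t(h)]);
end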