I have a bunch of times-series each described by two components, a timestamp vector (in seconds), and a vector of values measured. The time vector is non-uniform (i.e. sampled at non-regular intervals)
I am trying to compute the mean/SD of each 1-minutes interval of values (take X minute interval, compute its mean, take the next interval, ...).
My current implementation uses loops. This is a sample of what I have so far:
t = (100:999)' + rand(900,1); %' non-uniform time
x = 5*rand(900,1) + 10; % x(i) is the value at time t(i)
interval = 1; % 1-min interval
tt = ( floor(t(1)):interval*60:ceil(t(end)) )'; %' stopping points of each interval
N = length(tt)-1;
mu = zeros(N,1);
sd = zeros(N,1);
for i=1:N
indices = ( tt(i) <= t & t < tt(i+1) ); % find t between tt(i) and tt(i+1)
mu(i) = mean( x(indices) );
sd(i) = std( x(indices) );
end
I am wondering if there a faster vectorized solution. This is important because I have a large number of time-series to process each much longer than the sample shown above..
Any help is welcome.
Thank you all for the feedback.
I corrected the way t
is generated to be always monotonically increasing (sorted), this was not really an issue..
Also, I may not have stated this clearly but my intention was to have a solution for any interval length in minutes (1-min was just an example)
Here's a way that uses binary search. It is 6-10x faster for 9900 elements and about 64x times faster for 99900 elements. It was hard to get reliable times using only 900 elements so I'm not sure which is faster at that size. It uses almost no extra memory if you consider making tx directly from the generated data. Other than that it just has four extra float variables (prevind, first, mid, and last).
It uses all of the variables that you had originally. I hope that it suits your needs. It is faster because it takes O(log N) to find the indices with binary search, but O(N) to find them the way you were doing it.
You could try and create a cell array and apply mean and std via cellfun. It's ~10% slower than your solution for 900 entries, but ~10x faster for 90000 entries.
Note: my solution does not give the exact same results as yours, since you skip a few time values at the end (1:60:90 is [1,61]), and since the start of the interval is not exactly the same.
The same answer as above but with the parametric interval (
window_size
). Issue with the vector lengths solved as well.Disclaimer: I worked this out on paper, but haven't yet had the opportunity to check it "in silico"...
You may be able to avoid loops or using cell arrays by doing some tricky cumulative sums, indexing, and calculating the means and standard deviations yourself. Here's some code that I believe will work, although I am unsure how it stacks up speed-wise to the other solutions:
The above computes the standard deviation using the simplification of the formula found on this Wikipedia page.
You can compute
indices
all at once using bsxfun:This is faster than looping but requires storing them all at once (time vs space tradeoff)..
The only logical solution seems to be...
Ok. I find it funny that to me there is only one logical solution, but many others find other solutions. Regardless, the solution does seem simple. Given the vectors x and t, and a set of equally spaced break points tt,
(Note that I sorted t above.)
I would do this in three fully vectorized lines of code. First, if the breaks were arbitrary and potentially unequal in spacing, I would use histc to determine which intervals the data series falls in. Given they are uniform, just do this:
Again, if the elements of t were not known to be sorted, I would have used min(t) instead of t(1). Having done that, use accumarray to reduce the results into a mean and standard deviation.