I have a large set (>2000 points) of time series data that I'd like to display using d3 in the browser. D3 is working great for displaying a subset of the data (~100 points) to the user, but I also want a "context" view (like this) to show the entire data set and allow users to select a subregion to view in detail.
However, performance is abysmal when trying to display that many points in d3. I feel like a good solution would be to select a sample of the data and then use some kind of interpolation (spline, polynomial, etc.; that part I know how to do) to draw a curve that is reasonably similar to the actual data.
However, it's not clear to me how I ought to go about selecting the subset. The data (shown below) has rather flat regions where few samples would suffice for a decent interpolation, and other regions where the absolute derivative is quite high and more frequent sampling is needed.
To further complicate matters, the data has gaps (where the sensor generating it was failing or out of range), and I'd like to keep these gaps in the chart rather than interpolating through them. Detecting the gaps is fairly simple, though, and clipping them out after drawing the interpolated curve over the entire data set seems like a reasonable solution.
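For the clipping, I imagine keeping the gap points in the array with a null y value and letting d3's line generator break the path there via its defined() accessor (a sketch, assuming d3 v4+ and scales named xScale / yScale):

// Break the path wherever a point is missing (y == null), leaving
// visible gaps instead of interpolating through them.
var line = d3.line()
    .defined(function (d) { return d.y != null; })
    .x(function (d) { return xScale(d.x); })
    .y(function (d) { return yScale(d.y); });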
I'm doing this in JavaScript, but a solution in any language or a mathematical answer to the problem would do.
You could use the d3fc-sample module, which provides a number of different algorithms for sampling data. Here's what the API looks like:
// Create the sampler
var sampler = fc_sample.largestTriangleThreeBucket();
// Configure the x / y value accessors
sampler.x(function (d) { return d.x; })
.y(function (d) { return d.y; });
// Configure the size of the buckets used to downsample the data.
sampler.bucketSize(10);
// Run the sampler
var sampledData = sampler(data);
You can see an example of it running on the website:
http://d3fc.github.io/d3fc-sample/
The Largest-Triangle-Three-Buckets (LTTB) algorithm works quite well on data that is 'patchy'. It doesn't vary the bucket size, but it does ensure that peaks / troughs are included, which results in a good representation of the original data.
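For intuition, here's a minimal sketch of the LTTB idea itself (illustrative only; in practice you'd use the d3fc-sample implementation above). It keeps the first and last points, splits the rest into equal buckets, and from each bucket keeps the point forming the largest triangle with the previously kept point and the average of the next bucket:

// Minimal LTTB sketch: `data` is an array of {x, y}, `threshold` is the
// number of points to keep.
function lttb(data, threshold) {
  if (threshold >= data.length || threshold < 3) return data.slice();
  var sampled = [data[0]];                       // always keep the first point
  var bucketSize = (data.length - 2) / (threshold - 2);
  var a = 0;                                     // index of last kept point
  for (var i = 0; i < threshold - 2; i++) {
    // Average of the *next* bucket.
    var nextStart = Math.floor((i + 1) * bucketSize) + 1;
    var nextEnd = Math.min(Math.floor((i + 2) * bucketSize) + 1, data.length);
    var avgX = 0, avgY = 0;
    for (var j = nextStart; j < nextEnd; j++) { avgX += data[j].x; avgY += data[j].y; }
    avgX /= (nextEnd - nextStart);
    avgY /= (nextEnd - nextStart);
    // Point in the current bucket forming the largest triangle with
    // the last kept point and that average.
    var start = Math.floor(i * bucketSize) + 1;
    var end = Math.floor((i + 1) * bucketSize) + 1;
    var maxArea = -1, maxIndex = start;
    for (var k = start; k < end; k++) {
      var area = Math.abs(
        (data[a].x - avgX) * (data[k].y - data[a].y) -
        (data[a].x - data[k].x) * (avgY - data[a].y)
      ) / 2;
      if (area > maxArea) { maxArea = area; maxIndex = k; }
    }
    sampled.push(data[maxIndex]);
    a = maxIndex;
  }
  sampled.push(data[data.length - 1]);           // always keep the last point
  return sampled;
}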
I know this doesn't answer your question entirely, but this library might help you simplify your line during rendering. I'm not sure whether it handles data gaps, though.
http://mourner.github.io/simplify-js/
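Its API is a single call that takes an array of {x, y} points and a tolerance, roughly like this (a sketch; pixelPoints is assumed to be your data after applying the x/y scales, since the tolerance is in coordinate units):

// Simplify.js combines radial-distance and Douglas-Peucker simplification;
// the tolerance is in the same units as the point coordinates (here, pixels).
var simplified = simplify(pixelPoints, 2, true); // tolerance = 2, high-quality mode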
My advice is to average (not subsample) over longer or shorter time intervals and plot those average values as horizontal bars. I think that's very comprehensible to the user -- if you try something fancier, you might give up the ability to explain exactly what's going on. I'm assuming you can let the user choose to zoom in or out so as to show more or less detail.
You might be able to get the database engine to compute averages over intervals for you, so that's a potential speed-up too.
As to the time intervals to pick, you could try either (1) fixed intervals such as 1 second, 15 seconds, 1 minute, 15 minutes, 1 hour, 1 day, and so on, which might be easier for the user to understand, or (2) an interval chosen to give a fixed number of units across the whole time range, e.g. if you decide to display 7 hours of data in 100 units, then each unit = 252 seconds.
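Here's a sketch of option (2), assuming points shaped like {x, y} with numeric x in seconds; the function name and shapes are just illustrative:

// Average y over a fixed number of equal-width buckets across the x range.
function averageIntoBuckets(data, bucketCount) {
  var x0 = data[0].x;
  var x1 = data[data.length - 1].x;
  var width = (x1 - x0) / bucketCount;        // e.g. 7 hours / 100 = 252 s
  var sums = new Array(bucketCount).fill(0);
  var counts = new Array(bucketCount).fill(0);
  data.forEach(function (d) {
    var i = Math.min(Math.floor((d.x - x0) / width), bucketCount - 1);
    sums[i] += d.y;
    counts[i] += 1;
  });
  return sums.map(function (sum, i) {
    return {
      x: x0 + (i + 0.5) * width,              // bucket midpoint
      y: counts[i] ? sum / counts[i] : null   // empty bucket -> gap in chart
    };
  });
}

Empty buckets come back with y = null, so the sensor gaps you mentioned stay blank instead of being averaged over.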