We are at the beginning of an F# project involving real-time and historical analysis of streaming data. The data is contained in a C# object (see below) and is sent as part of a standard .NET event. In real time, the number of events we receive varies greatly, from less than 1/sec to upwards of around 800 events per second per instrument, so the stream can be very bursty. A typical day might accumulate 5 million rows/elements per instrument.
A generic version of the C# event's data structure looks like this:
```csharp
public enum MyType { type0 = 0, type1 = 1 }

public class dataObj
{
    public int myInt = 0;
    public double myDouble;
    public string myString;
    public DateTime myDataTime;
    public MyType type;
    public object myObj = null;
}
```
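For readers following along in F#, the same shape could be mirrored as a record. This is only a sketch, with names taken from the C# class above; in practice you would consume the C# type directly across the assembly boundary.

```fsharp
open System

// Sketch of an F# mirror of the C# event payload; field names follow the C# class.
type MyType =
    | Type0 = 0
    | Type1 = 1

type DataObj =
    { MyInt      : int
      MyDouble   : float
      MyString   : string
      MyDataTime : DateTime
      Type       : MyType
      MyObj      : obj }

// Hypothetical sample value for illustration.
let sample =
    { MyInt = 1; MyDouble = 2.5; MyString = "tick"
      MyDataTime = DateTime(2013, 1, 1); Type = MyType.Type0; MyObj = null }
```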
We plan to use this data structure in F# in two ways:
- Historical analysis using supervised and unsupervised machine learning (CRFs, clustering models, etc.)
- Real-time classification of data streams using the above models
The data structure needs to be able to grow as we add more events. This rules out `Array<'T>`, because it does not allow resizing, though arrays could still be used for the historical analysis. The structure also needs fast access to recent data, and ideally the ability to jump to data x points back. This rules out F#'s `List<'T>`, because of its linear lookup time and because it offers no random access to elements, only forward traversal from the head.
According to this post, `Set<T>` may be a good choice...
EDIT: Yin Zhu's response gave me some additional clarity into exactly what I was asking. I have edited the remainder of the post to reflect this. Also, the previous version of this question was muddied by the introduction of requirements for historical analysis. I have omitted them.
Here is a breakdown of the steps of the real-time process:
- A real-time event is received.
- This event is placed in a data structure. This is the data structure that we are trying to determine. Should it be a `Set<T>`, or some other structure?
- A subset of the elements is either extracted or iterated over for feature generation. This would be either the last n rows/elements of the data structure (e.g. the last 1,000 or 10,000 events) or all the elements in the last x secs/mins (e.g. all the events in the last 10 min). Ideally, we want a structure that lets us do this efficiently; in particular, random access to the nth element without iterating through all the other elements is valuable.
- Features for the model are generated and sent to a model for evaluation.
- We may prune the data structure of older data to improve performance.
So the question is: what is the best data structure for storing the real-time streaming events that we will use to generate features?
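The two window queries described in the steps above can be sketched as follows. This is only a sketch, assuming events arrive in ascending timestamp order, and using a plain `ResizeArray` as a stand-in for whatever structure is eventually chosen:

```fsharp
open System

// Stand-in event buffer: (timestamp, payload), assumed to arrive in ascending
// timestamp order. ResizeArray is only a placeholder for the final structure.
let events = ResizeArray<DateTime * float>()

// Last n events: O(1) random access into the tail of the buffer.
let lastN n (buf: ResizeArray<_>) =
    let start = max 0 (buf.Count - n)
    buf.GetRange(start, buf.Count - start)

// All events since a cutoff (e.g. the last 10 minutes): binary search on time,
// valid only because timestamps are sorted.
let since (cutoff: DateTime) (buf: ResizeArray<DateTime * float>) =
    let mutable lo = 0
    let mutable hi = buf.Count
    while lo < hi do
        let mid = (lo + hi) / 2
        if fst buf.[mid] < cutoff then lo <- mid + 1 else hi <- mid
    buf.GetRange(lo, buf.Count - lo)
```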
Suppose your `dataObj` contains a unique ID field; then any set data structure would be fine for the job. Immutable data structures are primarily used for functional-style code or for persistence. If you don't need either of those, you can use `HashSet<T>` or `SortedSet<T>` from the .NET collection library.

Some stream-specific optimization may be useful, e.g. keeping a fixed-size `Queue<T>` for the most recent data objects in the stream and storing older objects in a heavier-weight set. I would suggest benchmarking before switching to such a hybrid data structure.

Edit: After reading your requirements more carefully, I found that what you want is a queue with user-accessible indexing or a backward enumerator. Under this data structure, your feature-extraction operations (e.g. average/sum) cost O(n). If you want to do some of the operations in O(log n), you can use more advanced data structures, e.g. interval trees or skip lists. However, you will have to implement these data structures yourself, as you need to store meta information in the tree nodes that is hidden behind the collection API.
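The hybrid mentioned above (a fixed-size queue for recent items, backed by a heavier-weight set for older ones) might be sketched like this. The capacity, the `HashSet` choice, and the type names are illustrative, not from the answer:

```fsharp
open System.Collections.Generic

// Fixed-capacity queue of recent events; items that age out spill into a set.
// Capacity and the choice of HashSet for the older tier are illustrative.
type RecentBuffer<'T>(capacity: int) =
    let recent = Queue<'T>()
    let older  = HashSet<'T>()
    member _.Add(item: 'T) =
        recent.Enqueue item
        if recent.Count > capacity then
            older.Add(recent.Dequeue()) |> ignore
    member _.Recent = recent :> seq<'T>
    member _.OlderCount = older.Count
```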
Difficult to say without more information.
If your data are coming in with timestamps in ascending order (i.e. they are never out of order) then you can just use some kind of queue or extensible array.
If your data can come in out of order and you need them reordered then you want a priority queue or indexed collection instead.
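One minimal sketch of the reordering idea uses a `SortedSet` keyed on (timestamp, sequence number), where the sequence number keeps two events with identical timestamps distinct; the key choice is illustrative, and a real priority queue would serve equally well:

```fsharp
open System
open System.Collections.Generic

// Buffer that keeps out-of-order arrivals sorted by (timestamp, sequence no.).
let buffer = SortedSet<DateTime * int>()

// Events arriving out of order (second tuple element is an arrival sequence no.):
buffer.Add((DateTime(2013, 1, 1, 9, 0, 2), 1)) |> ignore
buffer.Add((DateTime(2013, 1, 1, 9, 0, 0), 2)) |> ignore
buffer.Add((DateTime(2013, 1, 1, 9, 0, 1), 3)) |> ignore

// Enumeration yields events in timestamp order regardless of arrival order.
let ordered = buffer |> Seq.map snd |> Seq.toList   // [2; 3; 1]
```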
Those are extremely tame performance requirements for insertion rate.
If you only ever want elements near the beginning, why do you want random access? Do you really want random access by index, or do you actually want random access by some other key, like time?
From what you've said, I would suggest using an ordinary F# `Map` keyed on index, maintained by a `MailboxProcessor` that can append a new event and retrieve an object that allows all events to be indexed, i.e. wrap the `Map` in an object that provides its own `Item` property and implementation of `IEnumerable<_>`. On my machine that simple solution takes 50 lines of code and can handle around 500,000 events per second.

In his answer, Jack Fox suggests using either the FSharpx.Collections `Vector<'T>` or the Solid `Vector<'T>` by Greg Rosenbaum (https://github.com/GregRos/Solid). I thought I might give back a bit to the community by providing instructions on how to get up and running with each of them.

Using the FSharpx.Collections.Vector<'T>
The process is pretty straightforward:

- Reference the dll in your script with `#r "FSharpx.Core.dll"`. You may need to use a full path.

Usage:
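The usage snippet from the original post was not preserved; the following is a guess at its shape, assuming the `Vector` module of the era exposed `empty`, `conj`, and `nth` (verify against your FSharpx version):

```fsharp
#r "FSharpx.Core.dll"  // may need a full path
open FSharpx.Collections

// Persistent vector: conj appends and returns a new vector that shares
// structure with the old one, so older snapshots remain valid.
let v =
    Vector.empty
    |> Vector.conj "event0"
    |> Vector.conj "event1"
    |> Vector.conj "event2"

// Indexed look-up is O(log32 n).
let second = Vector.nth 1 v
```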
Using the Solid.Vector<'T>

Getting set up to use the Solid `Vector<'T>` is a bit more involved, but the Solid version has a lot more handy functionality and, as Jack pointed out, a number of performance benefits. It also has a lot of useful documentation. Reference both `Solid.dll` and `Solid.FSharp.dll`, which is found in the `\Solid\SolidFS\obj\Debug\` folder. You will only need one open statement: `open Solid`.

Here is some code showing usage in an F# script file:
You should consider FSharpx.Collections.Vector. `Vector<'T>` will give you array-like features, including indexed O(log32(n)) look-up and update, which is within spitting distance of O(1), as well as adding new elements to the end of your sequence. There is another implementation of Vector usable from F# at Solid Vector. It is very well documented, and some functions perform up to 4x faster at large scale (element count > 10K). Both implementations perform very well up to, and possibly beyond, 1M elements.