I want to process stock level-2 data in pandas. Suppose, for simplicity, that each row contains four kinds of data:
- millis: timestamp, int64
- last_price: the last trade price, float64
- ask_queue: the volume of ask side, a fixed size (200) array of int32
- bid_queue: the volume of bid side, a fixed size (200) array of int32
Which can be easily defined as a structured dtype in numpy:
dtype = np.dtype([
    ('millis', 'int64'),
    ('last_price', 'float64'),
    ('ask_queue', ('int32', 200)),
    ('bid_queue', ('int32', 200))
])
In that way, I can access ask_queue and bid_queue like:
In [17]: data = np.random.randint(0, 100, 1616 * 5).view(dtype)
# compute the average of ask_queue levels 5:10
In [18]: data['ask_queue'][:, 5:10].mean(axis=1)
Out[18]:
array([33.2, 51. , 54.6, 53.4, 15. , 37.8, 29.6, 58.6, 32.2, 51.6, 34.4,
43.2, 58.4, 26.8, 54. , 59.4, 58.8, 38.8, 35.2, 71.2])
My question is: how can I define a DataFrame that includes this data?
There are two solutions here:
A. Set ask_queue and bid_queue as two columns whose values are arrays, as follows:
In [5]: df = pd.DataFrame(data.tolist(), columns=data.dtype.names)
In [6]: df.dtypes
Out[6]:
millis int64
last_price float64
ask_queue object
bid_queue object
dtype: object
However, there are at least two problems with this solution:
- ask_queue and bid_queue lose the dtype of a 2D array and all its convenient methods;
- Performance suffers, since each column becomes an array of objects rather than a 2D array.
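The loss of 2D semantics can be demonstrated with a small, hypothetical sketch (the array contents here are illustrative only, not the question's data):

```python
import numpy as np
import pandas as pd

# solution A stores each 200-element queue as one Python object per cell
queues = [np.arange(200, dtype=np.int32) for _ in range(1000)]
df = pd.DataFrame({'ask_queue': queues})

# the column dtype degrades to 'object' ...
assert df['ask_queue'].dtype == object

# ... so 2D slicing like data['ask_queue'][:, 5:10] is gone; a true
# 2D array must first be rebuilt, copying every row
mat = np.vstack(df['ask_queue'].to_numpy())
print(mat[:, 5:10].mean(axis=1)[:3])   # -> [7. 7. 7.]
```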
B. Flatten ask_queue and bid_queue into 2 * 200 columns:
In [8]: ntype = np.dtype([('millis', 'int64'), ('last_price', 'float64')] +
...: [(f'{name}{i}', 'int32') for name in ['ask', 'bid'] for i in range(200)])
In [9]: df = pd.DataFrame.from_records(data.view(ntype))
In [10]: df.dtypes
Out[10]:
millis int64
last_price float64
ask0 int32
ask1 int32
ask2 int32
ask3 int32
ask4 int32
ask5 int32
...
It's better than solution A, but the 2 * 200 columns look redundant.
Is there any solution that can take advantage of the structured dtype in numpy?
I wonder if ExtensionArray or ExtensionDtype can solve this.
Pandas has been designed to handle and process two-dimensional data (the kind you would put in a spreadsheet). Because "ask_queue" and "bid_queue" are not one-dimensional series but two-dimensional arrays, you cannot (easily) push them into a Pandas dataframe.
In such cases, you have to use other libraries such as xarray: http://xarray.pydata.org/
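For instance, a minimal sketch of that idea (the variable and dimension names are my own choice, not from the question) could keep the queues as genuinely two-dimensional variables sharing a "time" dimension:

```python
import numpy as np
import xarray as xr

n, depth = 20, 200  # illustrative sizes matching the question's fixed depth

ds = xr.Dataset(
    {
        "last_price": ("time", np.random.rand(n)),
        "ask_queue": (("time", "level"),
                      np.random.randint(0, 100, (n, depth)).astype(np.int32)),
        "bid_queue": (("time", "level"),
                      np.random.randint(0, 100, (n, depth)).astype(np.int32)),
    },
    coords={"time": np.arange(n, dtype=np.int64)},  # e.g. the millis stamps
)

# numpy-style reductions stay available, e.g. mean of ask levels 5..9 per row
avg = ds["ask_queue"].isel(level=slice(5, 10)).mean(dim="level")
print(avg.shape)  # (20,)
```

Here the scalar columns and the 2D queues live in one container, and the int32 dtype of the queues is preserved instead of degrading to object cells.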
Working with L2-DoM data has two-fold complications, compared to just the ToB ( Top-of-the-Book ) price-feed data:
a) the native feed is fast ( very fast / FIX Protocol or other private data-feeds deliver records with hundreds or thousands ( more during fundamental events on majors ) of L2-DoM changes per millisecond ), so both processing and storage must be performance-oriented;
b) any kind of offline analysis has to successfully manipulate and efficiently process large data-sets, due to the nature of item a).
Storage preferences : SOLVED

Given pandas.DataFrame was set as the preferred storage type, let's respect that, even though the syntax and performance preferences may take adverse impacts. Going the other way is possible, yet it may introduce unknown re-factoring / re-engineering costs, that the O/P's operational environment need not or is not willing to bear.

Having said this, pandas feature limitations have to be put into the design considerations, and all the other steps will have to live with them, unless this preference gets revised in some future time.

numpy-alike syntax : SOLVED

This request is sound and clear, as numpy tools are fast and smartly crafted for high-performance number-crunching. Given the set storage preference, we will implement a pair of numpy tricks so as to fit into a pandas 2D-DataFrame at reasonable costs in both the .STORE and .RETRIEVE directions:

Performance preferences : TESTED
Net add-on costs of the proposed solution for both the .STORE and .RETRIEVE directions were tested to take:

- A one-time cost in the .STORE direction of no less than 70 [us] and no more than ~ 160 [us] per cell for the given scales of L2_DoM arrays ( avg: 78 [ms], StDev: 9-11 [ms] )
- A repeating cost in the .RETRIEVE direction of no less than 46 [us] and no more than ~ 123 [us] per cell for the given scales of L2_DoM arrays ( avg: 50 [us], StDev: 9.5 [us] )

Even higher performance is to be expected from using better architecture-aligned int64 datatypes ( yes, at a cost of doubled storage, yet the costs of computation will decide whether this move has a performance edge ) and from a chance to use memoryview-based manipulations, which can shave the add-on latency down to about 22 [us].

Tests were run under py3.5.6, numpy v1.15.2, using:
Platform CPU, cache hierarchy and RAM details:
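The code blocks referenced in this answer were not included. A minimal sketch of one plausible pair of .STORE / .RETRIEVE tricks, assuming each queue is serialized to raw bytes per cell and rebuilt with np.frombuffer (the helper names are mine, not from the answer, and no timing claims are made for this sketch), might look like:

```python
import numpy as np
import pandas as pd

QUEUE_DEPTH = 200  # the fixed L2-DoM depth from the question

def store_queue(arr2d):
    """.STORE direction: pack each int32 row into one bytes cell."""
    return [row.tobytes() for row in np.ascontiguousarray(arr2d, dtype=np.int32)]

def retrieve_queue(col):
    """.RETRIEVE direction: rebuild a true 2D int32 array from bytes cells."""
    return np.frombuffer(b"".join(col), dtype=np.int32).reshape(-1, QUEUE_DEPTH)

ask = np.random.randint(0, 100, (20, QUEUE_DEPTH)).astype(np.int32)
df = pd.DataFrame({"millis": np.arange(20, dtype=np.int64),
                   "ask_queue": store_queue(ask)})

back = retrieve_queue(df["ask_queue"])
assert (back == ask).all()               # lossless round-trip
print(back[:, 5:10].mean(axis=1).shape)  # (20,)
```

The per-cell byte packing keeps the DataFrame at four columns, while the retrieval side restores full numpy 2D slicing; the actual costs would of course depend on the serialization chosen.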