My architecture:
- 1 EventHub with 8 Partitions & 2 TPUs
- 1 Streaming Analytics Job
- 6 Windows based on the same input (from 1mn to 6mn)
Sample Data:
{side: 'BUY', ticker: 'MSFT', qty: 1, price: 123, tradeTimestamp: 10000000000}
{side: 'SELL', ticker: 'MSFT', qty: 1, price: 124, tradeTimestamp:1000000000}
The EventHub PartitionKey
is ticker
I would like to emit every second, the following data:
(Total quantity bought / Total quantity sold) in the last minute, last 2mn, last 3mn and more
What I tried:
WITH TradesWindow AS (
SELECT
windowEnd = System.Timestamp,
ticker,
side,
totalQty = SUM(qty)
FROM [Trades-Stream] TIMESTAMP BY tradeTimestamp PARTITION BY PartitionId
GROUP BY ticker, side, PartitionId, HoppingWindow(second, 60, 1)
),
TradesRatio1MN AS (
SELECT
ticker = b.ticker,
buySellRatio = b.totalQty / s.totalQty
FROM TradesWindow b /* SHOULD I PARTITION HERE TOO ? */
JOIN TradesWindow s /* SHOULD I PARTITION HERE TOO ? */
ON s.ticker = b.ticker AND s.side = 'SELL'
AND DATEDIFF(second, b, s) BETWEEN 0 AND 1
WHERE b.side = 'BUY'
)
/* .... More windows.... */
/* FINAL OUTPUT: Joining all the windows */
SELECT
buySellRatio1MN = bs1.buySellRatio,
buySellRatio2MN = bs2.buySellRatio
/* more windows */
INTO [output]
FROM buySellRatio1MN bs1 /* SHOULD I PARTITION HERE TOO ? */
JOIN buySellRatio2MN bs2 /* SHOULD I PARTITION HERE TOO ? */
ON bs2.ticker = bs1.ticker
AND DATEDIFF(second, bs1, bs2) BETWEEN 0 AND 1
Issues:
- This requires 6 EventHub Consumer groups (each one can only have 5 readers), why ? I don't have 5x6 SELECT statements on the input, why then ?
- The output doesn't seem consistent (I don't know if my JOINs are correct).
- Sometimes the job doesn't output at all (maybe some partitioning problem ? see the comments in the code about partitioning)
Briefly, is there a better way to achieve this ? I couldn't find anything in the doc and examples about having multiple windows and joining them then joining the results of the previous joins from only 1 input.
For the first question, this depend of the internal implementation of the scale out logic. See details here.
For the output of the join, I don't see the whole query but if you join a query with a 1 minute window with a query with a 2 minute window with a 1s time "buffer" you will only an output every 2 minutes. UNION operator will be better for this.
From your sample and your goal, I think there is a much easier way to write this query using UDA (User Defined Aggregate).
For this I will define a UDA function called "ratio" first:
Then I can simply use this SQL query for a 60 seconds window: