
Weighted sum of a column vector and a derived bit

Posted 2020-05-06 09:13

Question:

We have a table of bid prices and sizes from two buyers. A bid price p with size s means that the buyer is willing to buy s units of the product at price p. Besides a few other columns (such as a timestamp and a validity flag), the table contains these four columns:

  • bid prices offered by the two buyers, pA and pB.
  • bid sizes, sA and sB.

Our job is to add a new best-size column (bS) to the table that holds the size at the best price. If the two buyers offer the same price, then bS equals sA + sB; otherwise, bS is the bid size of the buyer that offers the higher price.

An example table (ignoring columns that are neither prices nor sizes) with the desired output is below.

A simple solution to the problem:

SELECT *,
  CASE
    WHEN pA = pB THEN sA + sB
    WHEN pA > pB THEN sA
    ELSE sB
  END AS bS
FROM t
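To see the three branches of the CASE expression at work, here is a minimal sketch on three made-up rows. The inline table t and its values are assumptions for illustration only; they are not the example table from the question.

```sql
#standardSQL
-- Hypothetical rows covering the three cases: equal prices, A higher, B higher.
WITH t AS (
  SELECT 10 AS pA, 10 AS pB, 3 AS sA, 4 AS sB UNION ALL
  SELECT 11, 10, 3, 4 UNION ALL
  SELECT 9, 10, 3, 4
)
SELECT *,
  CASE
    WHEN pA = pB THEN sA + sB  -- tie: both sizes count
    WHEN pA > pB THEN sA       -- buyer A offers the best price
    ELSE sB                    -- buyer B offers the best price
  END AS bS
FROM t
-- bS is 7 for the tie, 3 when A is higher, 4 when B is higher
-- (row order is not guaranteed without ORDER BY)
```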

Now let us generalize the problem to four buyers. A standard SQL solution is

WITH t_ext AS (
SELECT *, GREATEST(pA, pB, pC, pD) as bP
FROM `t` 
)
SELECT *, (sA * CAST(pA = bP AS INT64) + 
           sB * CAST(pB = bP AS INT64) + 
           sC * CAST(pC = bP AS INT64) +
           sD * CAST(pD = bP AS INT64)) 
AS bS FROM t_ext
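Each CAST turns a comparison into a 0/1 bit, so the sum keeps only the sizes that sit at the best price. The sketch below makes the bits visible for a single row; the inline values are illustrative, and the bitA..bitD column names are introduced here only for display.

```sql
#standardSQL
WITH t AS (
  -- One illustrative row: prices 1, 4, 2, 4 and sizes 1, 6, 1, 5.
  SELECT 1 AS pA, 4 AS pB, 2 AS pC, 4 AS pD,
         1 AS sA, 6 AS sB, 1 AS sC, 5 AS sD
), t_ext AS (
  SELECT *, GREATEST(pA, pB, pC, pD) AS bP FROM t
)
SELECT
  bP,                                -- 4
  CAST(pA = bP AS INT64) AS bitA,    -- 0
  CAST(pB = bP AS INT64) AS bitB,    -- 1
  CAST(pC = bP AS INT64) AS bitC,    -- 0
  CAST(pD = bP AS INT64) AS bitD,    -- 1
  sA * CAST(pA = bP AS INT64) +
  sB * CAST(pB = bP AS INT64) +
  sC * CAST(pC = bP AS INT64) +
  sD * CAST(pD = bP AS INT64) AS bS  -- 6 + 5 = 11
FROM t_ext
```

One thing to keep in mind: in BigQuery, GREATEST returns NULL if any argument is NULL, so a row with a missing price would get a NULL bS with this approach.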

Question:

Is there a simplified query that

  • uses function SUM instead of adding four items manually
  • avoids repeated casting?

Note that we cannot identify the price and size columns by indices but only by name. Otherwise, we could use the solution proposed in "Weighted sum of a column vector and a derived bit vector".

By the way, I wrote a blog post about this problem that focuses on solutions in Python and Q, and I am wondering what the best solution in standard SQL looks like.

Answer 1:

Below is for BigQuery Standard SQL

Regarding "we cannot identify the price and size columns by indices but only by name":

#standardSQL
WITH t_ext AS (
  SELECT * EXCEPT(arr), 
    ARRAY(SELECT CAST(val AS INT64) FROM UNNEST(arr) val WITH OFFSET WHERE OFFSET < ARRAY_LENGTH(arr) / 2) AS prices,
    ARRAY(SELECT CAST(val AS INT64) FROM UNNEST(arr) val WITH OFFSET WHERE OFFSET >= ARRAY_LENGTH(arr) / 2) AS sizes,
    (SELECT MAX(CAST(val AS INT64)) FROM UNNEST(arr) val WITH OFFSET WHERE OFFSET < ARRAY_LENGTH(arr) / 2) AS bestPrice
  FROM (
    SELECT *, REGEXP_EXTRACT_ALL(TO_JSON_STRING(T), r'(?:"(?:pA|pB|pC|pD|sA|sB|sC|sD)"):(\d+)') AS arr
    FROM `project.dataset.table` t
  )
)
SELECT * EXCEPT(prices, sizes), 
  (SELECT SUM(size)
    FROM UNNEST(prices) price WITH OFFSET
    JOIN UNNEST(sizes) size WITH OFFSET
    USING(OFFSET) 
    WHERE price = bestPrice
  ) AS bS
FROM t_ext

As you can see, the only thing you need to supply is the list of price and size column names, as in the example below:

pA|pB|pC|pD|sA|sB|sC|sD    
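To make the extraction step easier to follow, the sketch below runs only the TO_JSON_STRING and REGEXP_EXTRACT_ALL part on a single dummy row. The json_row column name is introduced here for illustration.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'a' id, 1 pA, 2 pB, 3 pC, 4 pD, 'x' extra_col1, 1 sA, 1 sB, 1 sC, 5 sD
)
SELECT
  -- The whole row serialized as one JSON object.
  TO_JSON_STRING(t) AS json_row,
  -- Values of the listed columns, in the order they appear in the row.
  REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'(?:"(?:pA|pB|pC|pD|sA|sB|sC|sD)"):(\d+)') AS arr
FROM `project.dataset.table` t
-- json_row: {"id":"a","pA":1,"pB":2,"pC":3,"pD":4,"extra_col1":"x","sA":1,"sB":1,"sC":1,"sD":5}
-- arr:      ["1","2","3","4","1","1","1","5"]
```

The query then treats the first half of arr as prices and the second half as sizes. As far as I can tell, this relies on all price columns preceding all size columns in the table, in the same buyer order. The pattern `\d+` also matches only non-negative integers, and a NULL value (serialized as null) would be skipped and shift the positions.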

Applied to the dummy data below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'a' id, 1 pA, 2 pB, 3 pC, 4 pD, 'x' extra_col1, 1 sA, 1 sB, 1 sC, 5 sD UNION ALL
  SELECT 'b', 1, 4, 2, 4, 'y', 1, 6, 1, 5 UNION ALL
  SELECT 'c', 5, 4, 2, 1, 'z', 7, 1, 1, 1
), t_ext AS (
  SELECT * EXCEPT(arr), 
    ARRAY(SELECT CAST(val AS INT64) FROM UNNEST(arr) val WITH OFFSET WHERE OFFSET < ARRAY_LENGTH(arr) / 2) AS prices,
    ARRAY(SELECT CAST(val AS INT64) FROM UNNEST(arr) val WITH OFFSET WHERE OFFSET >= ARRAY_LENGTH(arr) / 2) AS sizes,
    (SELECT MAX(CAST(val AS INT64)) FROM UNNEST(arr) val WITH OFFSET WHERE OFFSET < ARRAY_LENGTH(arr) / 2) AS bestPrice
  FROM (
    SELECT *, REGEXP_EXTRACT_ALL(TO_JSON_STRING(T), r'(?:"(?:pA|pB|pC|pD|sA|sB|sC|sD)"):(\d+)') AS arr
    FROM `project.dataset.table` t
  )
)
SELECT * EXCEPT(prices, sizes), 
  (SELECT SUM(size)
    FROM UNNEST(prices) price WITH OFFSET
    JOIN UNNEST(sizes) size WITH OFFSET
    USING(OFFSET) 
    WHERE price = bestPrice
  ) AS bS
FROM t_ext

the result is

Row id  pA  pB  pC  pD  extra_col1  sA  sB  sC  sD  bestPrice   bS   
1   a   1   2   3   4   x           1   1   1   5   4           5    
2   b   1   4   2   4   y           1   6   1   5   4           11   
3   c   5   4   2   1   z           7   1   1   1   5           7      

Hope this is what you are looking for.
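For comparison, here is one alternative sketch that is not part of the answer above. It pairs each price with its size in an inline array of STRUCTs and lets SUM pick the sizes at the best price. The field names p and s are assumptions chosen for this example. Each column is still referenced by name, but no CAST is needed and the result does not depend on column order.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'a' id, 1 pA, 2 pB, 3 pC, 4 pD, 'x' extra_col1, 1 sA, 1 sB, 1 sC, 5 sD UNION ALL
  SELECT 'b', 1, 4, 2, 4, 'y', 1, 6, 1, 5 UNION ALL
  SELECT 'c', 5, 4, 2, 1, 'z', 7, 1, 1, 1
)
SELECT *,
  (SELECT SUM(s)
    -- One (price, size) pair per buyer, built from the named columns.
    FROM UNNEST([
      STRUCT(pA AS p, sA AS s),
      STRUCT(pB AS p, sB AS s),
      STRUCT(pC AS p, sC AS s),
      STRUCT(pD AS p, sD AS s)
    ])
    -- Keep only the pairs at the best price.
    WHERE p = GREATEST(pA, pB, pC, pD)
  ) AS bS
FROM `project.dataset.table`
-- bS: a -> 5, b -> 11, c -> 7 (same as the result above)
```

Adding a fifth buyer means adding one STRUCT and one argument to GREATEST, which is more typing than extending the regex list but keeps the query free of JSON parsing.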