BigQuery - Partitioning data according to some has

2019-05-24 08:54发布

I have a table in BigQuery. I have a certain string column which represents a unique id (uid). I want to filter only a sample of this table, by taking only a portion of the uids (let's say 1/100). So my idea is to sample the data by doing something like this:

if(ABS(HASH(uid)) % 100 == 0) ...

The problem is this will actually filter in 1/100 ratio only if the distribution of the hash values is uniform. So, in order to check that, I would like to generate the following table:

(n goes from 0 to 99)

0    <number of rows in which uid % 100 == 0>
1    <number of rows in which uid % 100 == 1>
2    <number of rows in which uid % 100 == 2>
3    <number of rows in which uid % 100 == 3>

.. etc.

If I see the numbers in each row are of the same magnitude, then my assumption is correct.

Any idea how to create such a query, or alternatively do the sampling another way?

标签： hash google-bigquery

2条回答

放荡不羁爱自由

2楼-- · 2019-05-24 09:37

As an alternative to HASH(), you can try RAND() - it doesn't depend on the ids being uniformly distributed.

For example, this would give you 10 roughly equally sized partitions:

SELECT word, INTEGER(10*RAND()) part
FROM [publicdata:samples.shakespeare]

Verification:

SELECT part, COUNT(*) FROM (
    SELECT word, INTEGER(10*RAND()) part
    FROM [publicdata:samples.shakespeare]
)
GROUP BY part
ORDER BY part

Each group ends up with about 16465 elements.

0人赞添加讨论(0) 举报

▲ chillily

3楼-- · 2019-05-24 09:44

Something like

Select ABS(HASH(uid)) % 100 as cluster , count(*) as cnt 
From yourtable 
Group each by cluster

the UID is of different cases (upper, lower) and types you can use some string manipulation within the hash. something like:

 Select ABS(HASH(upper(string(uid)))) % 100 as cluster , count(*) as cnt 
From yourtable 
Group each by cluster

0人赞添加讨论(0) 举报

BigQuery - Partitioning data according to some has

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间