BigQuery - Partitioning data according to some has

2019-05-24 09:21发布

问题:

I have a table in BigQuery. I have a certain string column which represents a unique id (uid). I want to filter only a sample of this table, by taking only a portion of the uids (let's say 1/100). So my idea is to sample the data by doing something like this:

if(ABS(HASH(uid)) % 100 == 0) ...

The problem is this will actually filter in 1/100 ratio only if the distribution of the hash values is uniform. So, in order to check that, I would like to generate the following table:

(n goes from 0 to 99)

0    <number of rows in which uid % 100 == 0>
1    <number of rows in which uid % 100 == 1>
2    <number of rows in which uid % 100 == 2>
3    <number of rows in which uid % 100 == 3>

.. etc.

If I see the numbers in each row are of the same magnitude, then my assumption is correct.

Any idea how to create such a query, or alternatively do the sampling another way?

回答1:

Something like

Select ABS(HASH(uid)) % 100 as cluster , count(*) as cnt 
From yourtable 
Group each by cluster 

the UID is of different cases (upper, lower) and types you can use some string manipulation within the hash. something like:

 Select ABS(HASH(upper(string(uid)))) % 100 as cluster , count(*) as cnt 
From yourtable 
Group each by cluster 


回答2:

As an alternative to HASH(), you can try RAND() - it doesn't depend on the ids being uniformly distributed.

For example, this would give you 10 roughly equally sized partitions:

SELECT word, INTEGER(10*RAND()) part
FROM [publicdata:samples.shakespeare]

Verification:

SELECT part, COUNT(*) FROM (
    SELECT word, INTEGER(10*RAND()) part
    FROM [publicdata:samples.shakespeare]
)
GROUP BY part
ORDER BY part

Each group ends up with about 16465 elements.