I have a table in BigQuery. I have a certain string column which represents a unique id (uid). I want to filter only a sample of this table, by taking only a portion of the uids (let's say 1/100). So my idea is to sample the data by doing something like this:
if(ABS(HASH(uid)) % 100 == 0) ...
The problem is this will actually filter in 1/100 ratio only if the distribution of the hash values is uniform. So, in order to check that, I would like to generate the following table:
(n goes from 0 to 99)
0 <number of rows in which uid % 100 == 0>
1 <number of rows in which uid % 100 == 1>
2 <number of rows in which uid % 100 == 2>
3 <number of rows in which uid % 100 == 3>
.. etc.
If I see the numbers in each row are of the same magnitude, then my assumption is correct.
Any idea how to create such a query, or alternatively do the sampling another way?
As an alternative to HASH(), you can try RAND() - it doesn't depend on the ids being uniformly distributed.
For example, this would give you 10 roughly equally sized partitions:
Verification:
Each group ends up with about 16465 elements.
Something like
the UID is of different cases (upper, lower) and types you can use some string manipulation within the hash. something like: