I just discovered that the RAND() function, while undocumented, works in BigQuery. I was able to generate a (seemingly) random sample of 10 words from the Shakespeare dataset using:

SELECT word FROM
(SELECT rand() as random,word FROM [publicdata:samples.shakespeare] ORDER BY random)
LIMIT 10

My question is: Are there any disadvantages to using this approach instead of the HASH() method defined in the "Advanced examples" section of the reference manual? https://developers.google.com/bigquery/query-reference

Tags: google-cloud-platform google-bigquery

3 Answers

萌系小妹纸

#2 · 2019-01-16 18:36

Great to know RAND() is available!

In my case I needed a predefined sample size. Rather than looking up the total number of rows and dividing the sample size by it, I use the following query:

SELECT word, rand(5) as rand
FROM [publicdata:samples.shakespeare]
order by rand
#Sample size needed = 10
limit 10

In summary, I use ORDER BY + LIMIT to randomize the rows and then extract a fixed number of samples.


家丑人穷心不美

#3 · 2019-01-16 18:37

One additional tip to make it even simpler: you can order by the function itself, i.e.:

select x from y order by rand() limit 100

=> Sample of 100


Deceive 欺骗

#4 · 2019-01-16 18:47

For stratified sampling, check https://stackoverflow.com/a/52901452/132438

Good job finding it :). I requested the function recently, but it hasn't made it into the documentation yet.

I would say the advantage of RAND() is that the results will vary, while HASH() will keep giving you the same results for the same values (not guaranteed over time, but you get the idea).

If you want the variability that RAND() brings while still getting consistent results between runs, you can seed it with an integer, as in RAND(3).
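For comparison, here is a rough sketch of the hash-based style of sampling in standard SQL. It is not the example from the reference manual: it assumes FARM_FINGERPRINT() as a stand-in for the legacy HASH() function, and the 1-in-1000 rate is an arbitrary choice for illustration.

```sql
#standardSQL
-- Keep a word when its fingerprint is divisible by 1000 (roughly 1 in 1000 distinct words).
-- The fingerprint depends only on the value of word, so every run returns the same rows.
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE MOD(FARM_FINGERPRINT(word), 1000) = 0
```

Because the decision is made per value rather than per row, every row that shares a given word is either in the sample or out of it together. As far as I know, RAND() in standard SQL does not accept a seed, so hashing is the usual way to get a repeatable sample there.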

Notice, though, that the example you pasted does a full sort of the random values. For sufficiently big inputs this approach won't scale.

A scalable approach, to get around 10 random rows:

SELECT word
FROM [publicdata:samples.shakespeare]
WHERE RAND() < 10/164656

(where 10 is the approximate number of results I want to get, and 164656 the number of rows that table has)
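Note the "approximate": each row is kept independently with probability 10/164656, so the number of rows returned follows a binomial distribution with mean 10 and a standard deviation of about 3. A quick way to see this, sketched here against the same table, is to count the sampled rows and run the query a few times:

```sql
-- Legacy SQL: count how many rows the random filter lets through.
-- The result changes from run to run; it should hover around 10.
SELECT COUNT(*) AS sampled_rows
FROM [publicdata:samples.shakespeare]
WHERE RAND() < 10/164656
```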

standardSQL update:

#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 10/164656

or even:

#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 10/(SELECT COUNT(*) FROM `publicdata.samples.shakespeare`)
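If you need exactly 10 rows rather than about 10, one possible sketch is to combine both ideas: filter down to a small random candidate set first, then sort only those candidates. The oversampling factor (about 100 candidates for 10 results) is an assumption picked for illustration, not a tuned value.

```sql
#standardSQL
-- Step 1: the WHERE clause keeps roughly 100 random candidate rows.
-- Step 2: ORDER BY RAND() shuffles only those candidates, and LIMIT takes 10 of them.
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 100/164656
ORDER BY RAND()
LIMIT 10
```

The sort now touches around 100 rows instead of the whole table. In the unlikely case that fewer than 10 candidates pass the filter, the query returns fewer than 10 rows, so a larger factor gives more headroom.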

