How to sample for each group in hive?

Posted 2019-07-18 14:39

I have a large table in Hive with more than 1.5 billion rows. One of the columns is category_id, which has about 20 distinct values. I want to sample the table so that I end up with 1 million rows for each category.

I checked out "Random sample table with Hive, but including matching rows" and "Hive: Creating smaller table from big table", and I figured out how to get a random sample from the entire table, but I still can't figure out how to get a sample for each category_id.
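For reference, a whole-table random sample of the kind mentioned above is often written along these lines. This is only a sketch: the table name big_table, the filter fraction and the row limit are hypothetical, and it is not necessarily the exact query from the linked questions.

```sql
-- Hypothetical table name: big_table.
-- The WHERE clause cheaply discards most rows before the shuffle;
-- DISTRIBUTE BY / SORT BY rand() then randomizes the survivors
-- before LIMIT keeps the requested number of rows.
SELECT *
FROM big_table
WHERE rand() <= 0.001
DISTRIBUTE BY rand()
SORT BY rand()
LIMIT 1000000;
```

This draws from the table as a whole, so the number of rows per category_id simply follows each category's share of the data.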

Tags: hadoop hive hiveql

1 answer

骚年 ilove

#2 · 2019-07-18 15:35

As I understand it, you want to split your table into a sample per category. You might want to look at Hive bucketing or dynamic partitions to distribute your records across multiple folders/files.
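If the goal is a fixed number of random rows per category_id rather than a physical split into files, one common pattern is to rank the rows randomly within each category using a window function and keep the first 1 million of each. The following is a minimal sketch, assuming Hive 0.11 or later (for windowing functions); big_table and the alias names are hypothetical.

```sql
-- Hypothetical table name: big_table. Only category_id comes from the question.
-- row_number() assigns 1, 2, 3, ... within each category_id in random order,
-- so keeping rn <= 1000000 yields up to 1 million random rows per category.
SELECT *
FROM (
  SELECT t.*,
         row_number() OVER (PARTITION BY category_id ORDER BY rand()) AS rn
  FROM big_table t
) ranked
WHERE rn <= 1000000;
```

The result carries an extra rn column, which can be dropped by listing the wanted columns explicitly. Because every row of a category is sorted on one reducer, this can be slow with only about 20 categories; adding a filter such as WHERE rand() <= some_fraction in the inner query reduces the work, at the risk of leaving smaller categories with fewer than 1 million rows.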
