Hive: Creating smaller table from big table

2019-04-11 19:34发布

I currently have a Hive table that has 1.5 billion rows. I would like to create a smaller table (using the same table schema) with about 1 million rows from the original table. Ideally, the new rows would be randomly sampled from the original table, but getting the top 1M or bottom 1M of the original table would be ok, too. How would I do this?

标签： hadoop hive

3条回答

放荡不羁爱自由

2楼-- · 2019-04-11 19:53

As climbage suggested earlier, you could probably best use Hive's built-in sampling methods.

INSERT OVERWRITE TABLE my_table_sample 
SELECT * FROM my_table 
TABLESAMPLE (1m ROWS) t;

This syntax was introduced in Hive 0.11. If you are running an older version of Hive, you'll be confined to using the PERCENT syntax like so.

INSERT OVERWRITE TABLE my_table_sample 
SELECT * FROM my_table 
TABLESAMPLE (1 PERCENT) t;

You can change the percentage to match you specific sample size requirements.

0人赞添加讨论(0) 举报

孤傲高冷的网名

3楼-- · 2019-04-11 19:54

This query will pull out top 1M rows and overwrite them in a new table.

CREATE TABLE new_table_name AS 
  SELECT col1, col2, col3, ....
  FROM original_table 
  WHERE (if you want to put any condition) limit 100000;

0人赞添加讨论(0) 举报

小情绪 Triste *

4楼-- · 2019-04-11 20:09

You can define a new table with the same schema as your original table.

Then use INSERT OVERWRITE TABLE <tablename> <select statement>

The SELECT statement will need to query your original table, use LIMIT to only get 1M results.

0人赞添加讨论(0) 举报

Hive: Creating smaller table from big table

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间