I currently have a Hive table that has 1.5 billion rows. I would like to create a smaller table (using the same table schema) with about 1 million rows from the original table. Ideally, the new rows would be randomly sampled from the original table, but getting the top 1M or bottom 1M of the original table would be ok, too. How would I do this?
As climbage suggested earlier, your best option is probably Hive's built-in sampling methods.
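A minimal sketch of row-count sampling with TABLESAMPLE follows. The table names `original_table` and `sample_table` are hypothetical, and the sketch assumes `sample_table` already exists with the same schema and that neither table is partitioned. As far as I know, Hive applies the row count to each input split, so the total may come out larger than 1M; the trailing LIMIT is there to cap it.

```sql
-- Hypothetical names: original_table is the 1.5-billion-row source,
-- sample_table is an existing, unpartitioned table with the same schema.
INSERT OVERWRITE TABLE sample_table
SELECT *
FROM original_table TABLESAMPLE (1000000 ROWS) t
LIMIT 1000000;  -- cap the result in case the row count is applied per split
```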
The row-count form of this syntax, TABLESAMPLE (n ROWS), was introduced in Hive 0.11. If you are running an older version of Hive, you'll be confined to the PERCENT syntax. You can change the percentage to match your specific sample size requirements.
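As a rough sketch of the PERCENT form: 1M out of 1.5 billion rows is about 0.067 percent, so sampling 0.1 percent and capping the result with LIMIT should land near the target. The table names are hypothetical, and `sample_table` is assumed to exist already with the same schema. Note that block sampling works on data size at block granularity rather than on row counts, so the sampled amount is approximate.

```sql
-- Hypothetical names: original_table (source), sample_table (existing target).
-- 1,000,000 / 1,500,000,000 is roughly 0.067%, so 0.1 PERCENT slightly oversamples.
INSERT OVERWRITE TABLE sample_table
SELECT *
FROM original_table TABLESAMPLE (0.1 PERCENT) t
LIMIT 1000000;  -- trim the oversample down to at most 1M rows
```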
This query pulls the top 1M rows and writes them into a new table, overwriting any existing contents.
You can define a new table with the same schema as your original table. Then use
INSERT OVERWRITE TABLE <tablename> <select statement>
The SELECT statement needs to query your original table and use LIMIT to return only 1M rows.
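Put together, those steps might look like the following sketch. The names `original_table` and `sample_table` are hypothetical, and the source table is assumed to be unpartitioned; a partitioned table would also need a partition specification on the INSERT.

```sql
-- Step 1: create an empty table with the same schema as the source.
CREATE TABLE sample_table LIKE original_table;

-- Step 2: fill it with 1M rows from the source.
-- LIMIT without ORDER BY returns an arbitrary 1M rows, not a random sample.
INSERT OVERWRITE TABLE sample_table
SELECT *
FROM original_table
LIMIT 1000000;
```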