I know the hashing principle behind HashMap in Java, so I wanted to know how hashing works in Hive when we bucket data into various buckets.
I recently had to dig into some Hive source code to figure this out for myself. Here's what I found:
For an integer field, the hash is just the integer value itself. For a string, Hive uses a scheme very close to Java's String.hashCode(). When bucketing on multiple columns, the per-column hashes are combined in a way that closely mirrors Java's List.hashCode().
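To illustrate, here is a minimal Java sketch of that logic as I understood it from the source. It is an approximation, not Hive's actual implementation (the real code lives in Hive's ObjectInspector utilities, and the seed or per-type details may differ slightly):

```java
public class HiveHashSketch {

    // Integer columns: the bucketing hash is just the value itself.
    static int hashOfInt(int value) {
        return value;
    }

    // String columns: the same 31-based polynomial as java.lang.String.hashCode().
    static int hashOfString(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Multiple bucketing columns: field hashes folded together with the same
    // 31-multiplier scheme java.util.List.hashCode() uses.
    static int hashOfFields(int... fieldHashes) {
        int h = 0;
        for (int fh : fieldHashes) {
            h = 31 * h + fh;
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hashOfInt(42));         // 42: an int hashes to itself
        System.out.println(hashOfString("hive"));  // same result as "hive".hashCode()
        System.out.println(hashOfFields(hashOfInt(42), hashOfString("hive")));
    }
}
```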
Bucketing is used along with partitioning to give the data a more finely decomposed structure for later analysis. Because a large number of partitions results in a large number of HDFS files, which can hurt NameNode performance, we resort to bucketing. The way bucketing actually works is:

bucketNumber = hashFunction(bucketingColumn) mod numOfBuckets

numOfBuckets is chosen when you create the table, and the output of the hash function depends on the type of the column chosen. To accurately set the number of reducers while bucketing and land the data in the right buckets, use "hive.enforce.bucketing = true". Please refer to this for more information.
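To make the mod step concrete, here is a small Java sketch of routing a row to a bucket. The getBucketNumber helper and the 4-bucket table are my own illustration; the sign-bit mask reflects my reading of how Hive keeps negative hash codes from producing a negative bucket index:

```java
public class BucketRouting {

    // Map a (possibly negative) hash code to a bucket in [0, numBuckets).
    // Masking with Integer.MAX_VALUE clears the sign bit, so the modulo
    // always yields a valid, non-negative bucket index.
    static int getBucketNumber(int hashCode, int numBuckets) {
        return (hashCode & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        // e.g. CREATE TABLE t (...) CLUSTERED BY (user_id) INTO 4 BUCKETS;
        int numBuckets = 4;

        // An int column hashes to its own value, so user_id 7 lands in bucket 7 % 4 = 3.
        System.out.println(getBucketNumber(7, numBuckets));

        // A string column hashes like String.hashCode() before the mod is applied.
        System.out.println(getBucketNumber("alice".hashCode(), numBuckets));
    }
}
```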