Skewed tables in Hive

Posted 2019-03-28 13:06

I am learning Hive and came across skewed tables. Please help me understand them.

What are skewed tables in Hive?

How do we create skewed tables?

How do they affect performance?

2 answers
做自己的国王
Reply #2 · 2019-03-28 13:33

In a skewed table, a separate partition is created for each column value that accounts for a large share of the records, and the rest of the data goes into another partition. Compared with creating one partition per value, this reduces the number of partitions, mappers and intermediate files. For example, if 90 out of 100 patients have high BP and the other 10 have fever, cold, cancer, etc., one partition is created for the 90 high-BP patients and one for the remaining 10. I hope this answers your question.

Fickle 薄情
Reply #3 · 2019-03-28 13:49

What are skewed tables in Hive?

A skewed table is a special type of table in which the values that appear very often (heavy skew) are split out into separate files, and the rest of the values go to another file.

How do we create skewed tables?

create table <T> (schema) skewed by (keys) on ('value1', 'value2') [STORED as DIRECTORIES];

Example:

create table T (c1 string, c2 string) skewed by (c1) on ('x1');

How does it affect performance?

When you specify the skewed values, Hive splits them out into separate files automatically. It then takes this into account during queries, so it can skip (or include) whole files where possible, which improves performance.
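As a hedged illustration of how the skew metadata can be put to use, the sketch below declares a skewed table and joins it with compile-time skew-join optimization switched on. The tables `orders` and `customers`, their columns, and the value 'c_big' are hypothetical. `hive.optimize.skewjoin.compiletime` is a Hive setting that plans the skewed keys recorded in the table metadata separately; whether it actually helps depends on your Hive version and data.

```sql
-- Hypothetical tables, for illustration only.
CREATE TABLE customers (customer_id STRING, name STRING);

-- 'c_big' stands in for a customer id that dominates the orders data.
CREATE TABLE orders (customer_id STRING, amount DOUBLE)
SKEWED BY (customer_id) ON ('c_big');

-- Ask the optimizer to build a separate plan for the skewed keys (off by default).
SET hive.optimize.skewjoin.compiletime=true;

SELECT o.customer_id, c.name, SUM(o.amount) AS total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY o.customer_id, c.name;
```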

EDIT:

x1 is the value on which column c1 is skewed. You can list several such values, and you can skew on several columns. For example,

create table T (c1 string, c2 string) skewed by (c1) on ('x1', 'x2', 'x3');
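When the skew spans more than one column, the Hive DDL takes the skewed values as tuples, with one entry per column. A minimal sketch, using a hypothetical table `T2` and made-up values:

```sql
-- Each tuple is one skewed combination of (c1, c2); names and values are illustrative.
CREATE TABLE T2 (c1 STRING, c2 INT, c3 STRING)
SKEWED BY (c1, c2) ON (('x1', 1), ('x2', 2));
```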

The advantage of this setup is that the values that appear more frequently than others get split out into separate files (or separate directories if the STORED AS DIRECTORIES clause is used). The execution engine then uses this information during query execution to make processing more efficient.
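The directory variant could look like the sketch below. The table `page_views` and its values are hypothetical. The ALTER TABLE forms are part of Hive's DDL, but as far as I know a change in skew only applies to data written after the statement, so treat this as a starting point and not a tested recipe.

```sql
-- Hypothetical table: with STORED AS DIRECTORIES, each listed value
-- is written to its own subdirectory (list bucketing).
CREATE TABLE page_views (page STRING, user_id STRING)
SKEWED BY (page) ON ('home', 'login')
STORED AS DIRECTORIES;

-- A filter on a skewed value is the case where Hive may read
-- only the matching directory instead of the whole table.
SELECT COUNT(*) FROM page_views WHERE page = 'home';

-- Skew can also be declared on, or removed from, an existing table.
ALTER TABLE page_views SKEWED BY (page) ON ('home', 'login', 'search') STORED AS DIRECTORIES;
ALTER TABLE page_views NOT SKEWED;
```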
