I am learning Hive and came across skewed tables. Please help me understand them.
What are skewed tables in Hive?
How do we create skewed tables?
How does it affect performance?
In skewed tables, a partition is created for the column value that has many records, and the rest of the data goes to another partition. Hence the number of partitions, mappers, and intermediate files is reduced. For example, out of 100 patients, 90 have high BP and the other 10 have fever, cold, cancer, etc. One partition is created for the 90 patients and one for the other 10. I hope this answers your question.
What are skewed tables in Hive?
A skewed table is a special type of table in which the values that appear very often (heavy skew) are split out into separate files, and the rest of the values go to another file.
How do we create skewed tables?
Example:
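A minimal sketch of such a statement, assuming a hypothetical table `T` with two string columns `c1` and `c2`, where `c1` is heavily skewed on the value `'x1'` (the table name, column types, and value are illustrative only):

```sql
-- Hypothetical table T: column c1 is declared as skewed on the value 'x1'.
-- Rows with c1 = 'x1' are the "heavy" rows Hive is told about up front.
CREATE TABLE T (c1 STRING, c2 STRING)
SKEWED BY (c1) ON ('x1');
```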
How does it affect performance?
When you specify the skewed values, Hive splits them out into separate files automatically and takes this into account during queries, so it can skip (or include) whole files where possible, which improves performance.
EDIT:
x1 is the value on which column c1 is skewed. You can specify multiple such values, and you can do so for multiple columns.
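As a sketch of the multi-column case, assume a hypothetical table `T_multi` in which the combinations `('x1', 'y1')` and `('x2', 'y2')` of `c1` and `c2` are the skewed ones. The names and values are made up for illustration. Each inner tuple supplies one value per skewed column, in the same order as the `SKEWED BY` list:

```sql
-- Hypothetical table T_multi: skewed on two columns at once.
-- Each tuple in the ON clause is one skewed (c1, c2) value combination.
CREATE TABLE T_multi (c1 STRING, c2 STRING, c3 STRING)
SKEWED BY (c1, c2) ON (('x1', 'y1'), ('x2', 'y2'));
```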
The advantage of this setup is that values that appear more frequently than others are split out into separate files (or separate directories if the STORED AS DIRECTORIES clause is used). The execution engine uses this information during query execution to make processing more efficient.
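To illustrate the directory variant, here is a hedged sketch using a hypothetical table `T_dirs`. With `STORED AS DIRECTORIES`, the skewed value gets its own subdirectory rather than only its own file. Whether a given query actually skips the other directories depends on the Hive version and configuration, so treat the query as an example of the kind of filter that can benefit, not as a guarantee:

```sql
-- Hypothetical table T_dirs: the skewed value 'x1' is stored in its own directory.
CREATE TABLE T_dirs (c1 STRING, c2 STRING)
SKEWED BY (c1) ON ('x1')
STORED AS DIRECTORIES;

-- A filter on the skewed column is the kind of query that can use the skew information.
SELECT c2 FROM T_dirs WHERE c1 = 'x1';
```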