Should the datatype of the --split-by column in a Sqoop import always be a numeric datatype (integer, bigint, numeric)? Can't it be a string?
No, it must be numeric, because according to the specs: "By default sqoop will use query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits." The alternative is to use --boundary-query, which also requires numeric columns. Otherwise, the Sqoop job will fail. If you don't have such a column in your table, the only workaround is to use a single mapper: "-m 1".
Yes, you can split on any non-numeric datatype, but this is not recommended.
Why?
To split the data, Sqoop runs a query for the MIN and MAX of the --split-by column, then divides that range according to your number of mappers.
Now take an integer --split-by column as an example. The table has an id column with values 1 to 100, and you are using 4 mappers (-m 4 in your sqoop command). Sqoop gets the MIN and MAX values using:
SELECT MIN(id), MAX(id) FROM <table>
OUTPUT:
1,100
Splitting on an integer is easy. You will make 4 parts: 1-25, 26-50, 51-75 and 76-100.
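The four-way division of the 1 to 100 range can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Sqoop's actual splitter code; the function name integer_splits is hypothetical, and the boundaries Sqoop itself generates may differ slightly.

```python
def integer_splits(lo, hi, num_mappers):
    """Divide the inclusive range [lo, hi] into contiguous parts, one per mapper.

    Simplified illustration only; this is not Sqoop's implementation.
    """
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    splits = []
    start = lo
    for i in range(num_mappers):
        # Spread any remainder over the first few parts.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more mappers than values
        end = start + size - 1
        splits.append((start, end))
        start = end + 1
    return splits


# Each mapper would fetch one range, e.g. with a WHERE clause like this.
for start, end in integer_splits(1, 100, 4):
    print(f"id >= {start} AND id <= {end}")
```

With MIN = 1, MAX = 100 and 4 mappers, every part covers 25 consecutive ids, which is why the load is balanced when the ids are dense.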
Now take a string --split-by column. The table has a name column with values from "dev" to "sam", and you are using 4 mappers (-m 4 in your sqoop command). Sqoop gets the MIN and MAX values using:
SELECT MIN(name), MAX(name) FROM <table>
OUTPUT:
dev,sam
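Now, how will that range be divided into 4 parts? Sqoop has to derive intermediate split points between the two strings, and its code logs a warning when the split column is textual, encouraging you to choose an integral split column instead.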
In the integer example, all the mappers get a balanced load (each fetches 25 records from the RDBMS).
In the string case, the values are much less likely to be spread evenly between the MIN and MAX, so it is difficult to give similar loads to all the mappers.
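To see why string ranges tend to be unbalanced, here is a rough Python sketch. It is not how Sqoop computes text splits; it assumes a deliberately crude scheme that interpolates only on the first character between "dev" and "sam", and the list of names is made up for illustration.

```python
def bucket_counts(names, lo, hi, num_mappers):
    """Count how many names land in each of num_mappers equal-width ranges.

    Crude illustration: only the first character's code point is used to
    place a name between lo and hi. This is not Sqoop's algorithm.
    """
    lo_code, hi_code = ord(lo[0]), ord(hi[0])
    step = (hi_code - lo_code) / num_mappers
    counts = [0] * num_mappers
    for name in names:
        index = int((ord(name[0]) - lo_code) / step)
        # The maximum value falls exactly on the upper edge; keep it in the last range.
        counts[min(index, num_mappers - 1)] += 1
    return counts


# Hypothetical data: many names start with letters near "d".
names = ["dev", "dina", "don", "drew", "eli", "emma", "eric",
         "finn", "gus", "kim", "pat", "sam"]
counts = bucket_counts(names, "dev", "sam", 4)
print(counts)
```

For this made-up data the first range receives most of the rows while another receives none, so one mapper does nearly all the work.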
In a nutshell, go for an integer column as the --split-by column.