Sqoop Import Split by Column Data type

Should the datatype of Split by column in sqoop import always be a number datatype (integer, bignint, numeric)? Can't it be a string?

标签： hive sqoop

2条回答

冷血范

2楼-- · 2019-02-15 08:38

No, it must be numeric because according to the specs: "By default , sqoop will use query select min(), max() from to find out boundaries for creating splits." The alternative is to use --boundary-query which also requires numeric columns. Otherwise , the Sqoop job will fail. If you don't have such a column in your table the only workaround is to use only 1 mapper: "-m 1".

0人赞添加讨论(0) 举报

看我几分像从前

3楼-- · 2019-02-15 08:54

Yes you can split on any non numeric datatype.

But this is not recommended.

WHY?

For splitting data Sqoop fires

SELECT MIN(col1), MAX(col2) FROM TABLE

then divide it as per you number of mappers.

Now take an example of integer as --split-by column

Table has some id column having value 1 to 100 and you using 4 mappers (-m 4 in your sqoop command)

Sqoop get MIN and MAX value using:

SELECT MIN(id), MAX(id) FROM TABLE

OUTPUT:

1,100

Splitting on integer is easy. You will make 4 parts:

1-25
25-50
51-75
76-100

Now string as --split-by column

Table has some name column having value "dev" to "sam" and you using 4 mappers (-m 4 in your sqoop command)

Sqoop get MIN and MAX value using:

SELECT MIN(id), MAX(id) FROM TABLE

OUTPUT:

dev,sam

Now how will it be divided in 4 parts. As per sqoop docs,

/**
   * This method needs to determine the splits between two user-provided
   * strings.  In the case where the user's strings are 'A' and 'Z', this is
   * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
   * splits for strings beginning with each letter, etc.
   *
   * If a user has provided us with the strings "Ham" and "Haze", however, we
   * need to create splits that differ in the third letter.
   *
   * The algorithm used is as follows:
   * Since there are 2**16 unicode characters, we interpret characters as
   * digits in base 65536. Given a string 's' containing characters s_0, s_1
   * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
   * base 65536. Having mapped the low and high strings into floating-point
   * values, we then use the BigDecimalSplitter to establish the even split
   * points, then map the resulting floating point values back into strings.
   */

And you will see the warning in the code:

LOG.warn("Generating splits for a textual index column.");
LOG.warn("If your database sorts in a case-insensitive order, "
    + "this may result in a partial import or duplicate records.");
LOG.warn("You are strongly encouraged to choose an integral split column.");

In case of Integer example, all the mappers will get balanced load (all will fetch 25 records from RDBMS).

In case of string, there is less probability that data is sorted. So, it's difficult to give similar loads to all the mappers.

In a nutshell, Go for integer column as --split-by column.

0人赞添加讨论(0) 举报

Sqoop Import Split by Column Data type

WHY?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间