Sqoop Import Split by Column Data type

2019-02-15 08:19发布

Should the datatype of Split by column in sqoop import always be a number datatype (integer, bignint, numeric)? Can't it be a string?

标签: hive sqoop
2条回答
冷血范
2楼-- · 2019-02-15 08:38

No, it must be numeric because according to the specs: "By default , sqoop will use query select min(), max() from to find out boundaries for creating splits." The alternative is to use --boundary-query which also requires numeric columns. Otherwise , the Sqoop job will fail. If you don't have such a column in your table the only workaround is to use only 1 mapper: "-m 1".

查看更多
看我几分像从前
3楼-- · 2019-02-15 08:54

Yes you can split on any non numeric datatype.

But this is not recommended.

WHY?

For splitting data Sqoop fires

SELECT MIN(col1), MAX(col2) FROM TABLE

then divide it as per you number of mappers.

Now take an example of integer as --split-by column

Table has some id column having value 1 to 100 and you using 4 mappers (-m 4 in your sqoop command)

Sqoop get MIN and MAX value using:

SELECT MIN(id), MAX(id) FROM TABLE

OUTPUT:

1,100

Splitting on integer is easy. You will make 4 parts:

  • 1-25
  • 25-50
  • 51-75
  • 76-100

Now string as --split-by column

Table has some name column having value "dev" to "sam" and you using 4 mappers (-m 4 in your sqoop command)

Sqoop get MIN and MAX value using:

SELECT MIN(id), MAX(id) FROM TABLE

OUTPUT:

dev,sam

Now how will it be divided in 4 parts. As per sqoop docs,

/**
   * This method needs to determine the splits between two user-provided
   * strings.  In the case where the user's strings are 'A' and 'Z', this is
   * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
   * splits for strings beginning with each letter, etc.
   *
   * If a user has provided us with the strings "Ham" and "Haze", however, we
   * need to create splits that differ in the third letter.
   *
   * The algorithm used is as follows:
   * Since there are 2**16 unicode characters, we interpret characters as
   * digits in base 65536. Given a string 's' containing characters s_0, s_1
   * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
   * base 65536. Having mapped the low and high strings into floating-point
   * values, we then use the BigDecimalSplitter to establish the even split
   * points, then map the resulting floating point values back into strings.
   */

And you will see the warning in the code:

LOG.warn("Generating splits for a textual index column.");
LOG.warn("If your database sorts in a case-insensitive order, "
    + "this may result in a partial import or duplicate records.");
LOG.warn("You are strongly encouraged to choose an integral split column.");

In case of Integer example, all the mappers will get balanced load (all will fetch 25 records from RDBMS).

In case of string, there is less probability that data is sorted. So, it's difficult to give similar loads to all the mappers.


In a nutshell, Go for integer column as --split-by column.

查看更多
登录 后发表回答