Spark JDBC to DashDB (DB2) with CLOB errors

Posted 2019-05-15 19:26

Question:

I am working on connecting my Spark application to DashDB. Currently, I can load my data just fine.

However, I am unable to save a DataFrame back to DashDB.

Any insight would be helpful.

  // Load the source table over JDBC, register it for SQL queries, and cache it
  var jdbcSets = sqlContext.read.format("jdbc")
    .options(Map("url" -> url, "driver" -> driver, "dbtable" -> "setsrankval"))
    .load()
  jdbcSets.registerTempTable("setsOpponentRanked")
  jdbcSets = jdbcSets.coalesce(10)
  sqlContext.cacheTable("setsOpponentRanked")

However, when I try to save large DataFrames, I get the error:

DB2 SQL Error: SQLCODE=-1666, SQLSTATE=42613, SQLERRMC=CLOB, DRIVER=4.19.26

The code I use to save the data is as follows:

  import java.util.Properties
  import org.apache.spark.sql.SaveMode

  // Connection properties for the JDBC write
  val writeproperties = new Properties()
  writeproperties.setProperty("user", "dashXXXX")
  writeproperties.setProperty("password", "XXXXXX")
  writeproperties.setProperty("rowId", "false")
  writeproperties.setProperty("driver", "com.ibm.db2.jcc.DB2Driver")
  results.write.mode(SaveMode.Overwrite).jdbc(writeurl, "players_stat_temp", writeproperties)

An example test data set can be seen here:

println("Test set: "+results.first()) 
Test set: ['Damir DZUMHUR','test','test','test','test','test','test','test','test','test','test','test','test','test','test','test','test','test','test','test','test','test',null,null,null,null,null,null,null]

The DataFrame schema is as follows:

root
 |-- PLAYER: string (nullable = true)
 |-- set01: string (nullable = true)
 |-- set02: string (nullable = true)
 |-- set12: string (nullable = true)
 |-- set01weakseed: string (nullable = true)
 |-- set01medseed: string (nullable = true)
 |-- set01strongseed: string (nullable = true)
 |-- set02weakseed: string (nullable = true)
 |-- set02medseed: string (nullable = true)
 |-- set02strongseed: string (nullable = true)
 |-- set12weakseed: string (nullable = true)
 |-- set12medseed: string (nullable = true)
 |-- set12strongseed: string (nullable = true)
 |-- set01weakrank: string (nullable = true)
 |-- set01medrank: string (nullable = true)
 |-- set01strongrank: string (nullable = true)
 |-- set02weakrank: string (nullable = true)
 |-- set02medrank: string (nullable = true)
 |-- set02strongrank: string (nullable = true)
 |-- set12weakrank: string (nullable = true)
 |-- set12medrank: string (nullable = true)
 |-- set12strongrank: string (nullable = true)
 |-- minibreak: string (nullable = true)
 |-- minibreakweakseed: string (nullable = true)
 |-- minibreakmedseed: string (nullable = true)
 |-- minibreakstrongseed: string (nullable = true)
 |-- minibreakweakrank: string (nullable = true)
 |-- minibreakmedrank: string (nullable = true)
 |-- minibreakstrongrank: string (nullable = true)

I have looked at Spark's JDBC DB2Dialect and see that StringType gets mapped to CLOB. I wonder if something like the following would help:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types.{BooleanType, DataType, StringType}

private object DB2CustomDialect extends JdbcDialect {
    override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")
    override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
            // Map strings to VARCHAR instead of Spark's default CLOB
            case StringType => Option(JdbcType("VARCHAR(10000)", java.sql.Types.VARCHAR))
            // Map booleans to CHAR(1)
            case BooleanType => Option(JdbcType("CHAR(1)", java.sql.Types.CHAR))
            case _ => None
    }
}

Answer 1:

Yes, indeed, the current implementation of DB2Dialect is not optimal. We will take a look and will probably create a pull request to change the StringType mapping to VARCHAR.

I guess the thought behind using CLOB in the first place was that it is the safer choice when you want to be able to store strings of any length, while VARCHAR is capped at the DB2/dashDB page size. But it is unlikely that anyone will put very long strings into the columns of a data frame, and CLOB causes all kinds of operational trouble. For one, it is not supported in COLUMN ORGANIZED tables, which are the default table type in dashDB; that is why you hit this problem when trying to write your data frame to dashDB. CLOB is also problematic for I/O performance in DB2, since it is not necessarily cached in database memory like all other table data.
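
If you want to confirm this from the Spark side, you can query the DB2 catalog over the same JDBC connection. A minimal sketch, assuming the standard SYSCAT.TABLES catalog view, whose TABLEORG column reports 'C' for column-organized and 'R' for row-organized tables:

  // Diagnostic sketch: check how the target table is organized.
  // TABLEORG = 'C' (column-organized) means CLOB columns are not allowed.
  val tableOrg = sqlContext.read.format("jdbc")
    .options(Map(
      "url" -> writeurl,
      "user" -> "dashXXXX",
      "password" -> "XXXXXX",
      "driver" -> "com.ibm.db2.jcc.DB2Driver",
      "dbtable" -> "(SELECT TABNAME, TABLEORG FROM SYSCAT.TABLES WHERE TABNAME = 'PLAYERS_STAT_TEMP') AS t"))
    .load()
  tableOrg.show()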

What you can do as a workaround for now is indeed to register your own custom dialect, as you propose above, with JdbcDialects.registerDialect(), until the mentioned pull request is accepted.
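
As a side note, on Spark 2.2 or later there is also a per-write alternative that avoids registering a dialect: the JDBC writer's createTableColumnTypes option overrides the column DDL Spark generates. A minimal sketch; the column list below is illustrative, and every string column that must not become CLOB has to be listed:

  // Sketch, assuming Spark 2.2+: override the generated DDL for selected
  // columns on this one write (the two columns shown are illustrative)
  results.write
    .mode(SaveMode.Overwrite)
    .option("createTableColumnTypes", "PLAYER VARCHAR(128), set01 VARCHAR(64)")
    .jdbc(writeurl, "players_stat_temp", writeproperties)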



Answer 2:

Works well by adding the custom dialect and registering it:

  // DB2CustomDialect is defined as an object in the question, so it is passed directly
  JdbcDialects.registerDialect(DB2CustomDialect)
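
For completeness, a minimal end-to-end sketch reusing the names from the question (results, writeurl, writeproperties); the dialect must be registered before the first write:

  import org.apache.spark.sql.jdbc.JdbcDialects

  // Register the custom dialect once per application, before any writes
  JdbcDialects.registerDialect(DB2CustomDialect)

  // Strings are now created as VARCHAR(10000) instead of CLOB
  results.write.mode(SaveMode.Overwrite).jdbc(writeurl, "players_stat_temp", writeproperties)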


Answer 3:

Note, DSX (i.e. datascience.ibm.com) has now fixed this issue out of the box, so deploying the custom dialect is no longer required when using notebooks in DSX with dashDB.