I am using the Hive-JDBC-Storage-Handler to create a table in Hive:
Reference: https://github.com/qubole/Hive-JDBC-Storage-Handler
Hadoop Info:
- Hadoop 2.7.3
- Hive 1.2.1
Schema:
CREATE EXTERNAL TABLE tbl_google_stats_adgroups_summary
(
campaignid BIGINT COMMENT 'from deserializer',
adgroupid BIGINT COMMENT 'from deserializer',
localadgroupid BIGINT COMMENT 'from deserializer',
position FLOAT COMMENT 'from deserializer',
cost FLOAT COMMENT 'from deserializer',
impression INT COMMENT 'from deserializer',
clicks INT COMMENT 'from deserializer',
conversions INT COMMENT 'from deserializer',
conversionsbydate INT COMMENT 'from deserializer',
uniqueconversions INT COMMENT 'from deserializer',
uniqueconversionsbydate INT COMMENT 'from deserializer',
datestats TIMESTAMP COMMENT 'from deserializer',
quantity INT COMMENT 'from deserializer',
quantitybydate INT COMMENT 'from deserializer',
revenue FLOAT COMMENT 'from deserializer',
revenuebydate FLOAT COMMENT 'from deserializer',
uniquerevenue FLOAT COMMENT 'from deserializer',
uniquerevenuebydate FLOAT COMMENT 'from deserializer',
deviceid INT COMMENT 'from deserializer',
conv1perclick INT COMMENT 'from deserializer',
adwordstype INT COMMENT 'from deserializer'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcSerDe'
STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
WITH SERDEPROPERTIES ('serialization.format' = '1')
TBLPROPERTIES (
  'mapred.jdbc.driver.class' = 'com.mysql.jdbc.Driver',
  'mapred.jdbc.hive.lazy.split' = 'false',
  'mapred.jdbc.input.table.name' = 'tbl_adgroup',
  'mapred.jdbc.password' = '',
  'mapred.jdbc.url' = 'jdbc:mysql://localhost:3306/<database_name>?useUnicode=true&useJDBCCompliantTimezoneShift=true&useLegacyDatetimeCode=false&serverTimezone=UTC',
  'mapred.jdbc.username' = 'root'
);
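As a quick sanity check that the storage handler actually reads from MySQL, a minimal query like the following (the column choice and LIMIT are arbitrary) returns rows:

-- probe the JDBC-backed table; any small projection will do
SELECT campaignid, adgroupid, cost
FROM tbl_google_stats_adgroups_summary
LIMIT 10;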
The data syncs from MySQL correctly, and I then used Apache Kylin to build a demo cube with this table. However, the first stage of the cube build process, Create Intermediate Flat Hive Table, failed, and the failing query was:
INSERT OVERWRITE TABLE default.kylin_intermediate_test_cube_adgroup_mysql_164b0ca3_6050_49bb_838b_49ee49f6d1e5 SELECT
TBL_GOOGLE_STATS_ADGROUPS_SUMMARY.CAMPAIGNID
,TBL_GOOGLE_STATS_ADGROUPS_SUMMARY.ADGROUPID
,TBL_GOOGLE_STATS_ADGROUPS_SUMMARY.POSITION
,TBL_GOOGLE_STATS_ADGROUPS_SUMMARY.DATESTATS
,TBL_GOOGLE_STATS_ADGROUPS_SUMMARY.DEVICEID
,TBL_GOOGLE_STATS_ADGROUPS_SUMMARY.ADWORDSTYPE
,TBL_GOOGLE_STATS_ADGROUPS_SUMMARY.COST
FROM <database>.TBL_GOOGLE_STATS_ADGROUPS_SUMMARY as TBL_GOOGLE_STATS_ADGROUPS_SUMMARY ;
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1478248621961_0005_1_00, diagnostics=[Task failed, taskId=task_1478248621961_0005_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.io.IOException: InputFormatWrapper can not support RecordReaders that don't return same key & value objects. current reader class : class org.apache.hadoop.mapreduce.lib.db.MySQLDBRecordReader
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.io.IOException: InputFormatWrapper can not support RecordReaders that don't return same key & value objects. current reader class : class org.apache.hadoop.mapreduce.lib.db.MySQLDBRecordReader
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:71)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:325)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
... 14 more
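From the trace, the failure happens inside Tez's InputFormatWrapper while it wraps MySQLDBRecordReader. One untested guess on my part is that running the flat-table statement on plain MapReduce would avoid that wrapper:

-- untested workaround sketch: switch the session from Tez to MapReduce
SET hive.execution.engine=mr;
-- then re-run the INSERT OVERWRITE statement above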
Can anyone help me?

Besides this, I have a question about the structure of our Hadoop system. We need a transactional SQL engine like MySQL because some of our report data has to be updated in place. Hive does provide ACID tables, but they do not support UPDATE with INNER JOIN and similar statements (see the example below), and that is a real problem for our business; this is why I set up the JdbcStorageHandler. Can this structure handle billions of rows of data? Thanks a lot!
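For illustration, this is the kind of MySQL statement our reports depend on; the table and column names here are made up, but Hive's ACID UPDATE cannot express the join:

-- MySQL-style update driven by a join (hypothetical table/column names)
UPDATE tbl_report r
INNER JOIN tbl_adgroup a ON r.adgroupid = a.adgroupid
SET r.cost = a.cost
WHERE a.adwordstype = 1;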