I am trying to design a data pipeline to migrate my Hive tables into BigQuery. Hive is running on an on-premise Hadoop cluster. My current design is actually very simple; it is just a shell script:
for each source_hive_table {
  - INSERT OVERWRITE TABLE target_avro_hive_table SELECT * FROM source_hive_table;
  - Move the resulting Avro files into Google Cloud Storage using distcp
  - Create the first BQ table: bq load --source_format=AVRO your_dataset.something something.avro
  - Handle any casting issues from BigQuery itself, i.e. select from the table just written and handle any casting manually
}
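In concrete terms, the loop might look roughly like the sketch below. The table list, dataset name, bucket, and HDFS staging path are placeholders, the Avro staging tables are assumed to follow a ${table}_avro naming convention, and the distcp step assumes the GCS connector is configured on the cluster:

    #!/bin/bash
    # Rough sketch only -- table list, dataset, bucket, and HDFS paths are placeholders.
    TABLES="table_a table_b"                            # hypothetical list of Hive tables
    HDFS_EXPORT_DIR="/user/hive/warehouse/avro_export"  # assumed Avro staging location on HDFS
    GCS_BUCKET="gs://my-migration-bucket"               # assumed target bucket
    BQ_DATASET="your_dataset"

    for t in $TABLES; do
      # 1. Rewrite the source table as Avro (the target table is assumed to be STORED AS AVRO)
      hive -e "INSERT OVERWRITE TABLE ${t}_avro SELECT * FROM ${t};"

      # 2. Copy the Avro files from HDFS to Google Cloud Storage
      #    (assumes the cluster has the GCS connector configured for gs:// paths)
      hadoop distcp "${HDFS_EXPORT_DIR}/${t}_avro" "${GCS_BUCKET}/${t}/"

      # 3. Load the Avro files into BigQuery
      bq load --source_format=AVRO "${BQ_DATASET}.${t}" "${GCS_BUCKET}/${t}/*.avro"
    done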
Do you think this makes sense? Is there a better way, perhaps using Spark? I am not happy with the way I am handling the casting; I would like to avoid creating the BigQuery table twice.
Yes, your migration logic makes sense.
I personally prefer to do the CAST for the problematic types directly in the initial Hive query that generates your Avro data. For instance, the Hive decimal type maps to this Avro type: "type":"bytes","logicalType":"decimal","precision":10,"scale":2
BQ will just take the primitive type (here "bytes") instead of the logicalType. That is why I find it easier to cast directly in Hive (here to "double"). The same problem happens with the Hive date type.
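For example, the casting could be pushed into the query that populates the Avro staging table; the column names here are purely illustrative:

    -- Hypothetical columns: id, amount DECIMAL(10,2), created DATE
    INSERT OVERWRITE TABLE target_avro_hive_table
    SELECT
      id,
      CAST(amount  AS DOUBLE) AS amount,   -- otherwise the decimal arrives in BQ as BYTES
      CAST(created AS STRING) AS created   -- same workaround for the Hive date type
    FROM source_hive_table;

This way the Avro files already carry BigQuery-friendly primitive types, so the result of bq load should not need a second corrective table.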