Questions about Hive

I have this environment:

A Hadoop environment (1 master, 4 slaves) with several applications: Ambari, Hue, Hive, Sqoop, HDFS, and so on. A production server (separate from Hadoop) running a MySQL database.

My goal is:

Optimize the queries on this MySQL server that are currently slow to execute.

What I did:

I imported the MySQL data into HDFS using Sqoop.

My questions:

Can't I run SELECT queries directly on the data in HDFS using Hive?
Do I have to load the data into Hive before I can query it?
If new data is entered into the MySQL database, what is the best way to get this data into HDFS and then into Hive again? (Maybe in real time.)

Thank you in advance

Tags: hadoop hive hdfs sqoop

2 answers

仙女界的扛把子

#2 · 2019-09-14 23:11

You can try Impala, which is often much faster than Hive for interactive SQL queries. You will most likely need to define tables that specify the delimiter, the storage format, and where the data is stored on HDFS (I don't know what kind of data you are storing). Then you can write SQL queries that read the data from HDFS.

I have no experience with real-time data ingestion from relational databases; however, you can try scheduling Sqoop jobs with cron.
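As a minimal sketch of the cron idea, the snippet below builds a crontab entry that would run a Sqoop wrapper script every 15 minutes. The script path, the log path, and the schedule are all assumptions made for illustration; none of them come from the question.

```shell
#!/bin/sh
# Hypothetical wrapper script that calls Sqoop, and a hypothetical log file.
SCRIPT="/opt/etl/sqoop_import.sh"
LOG="/var/log/sqoop_import.log"

# Standard five-field cron schedule: every 15 minutes, every day.
CRON_LINE="*/15 * * * * ${SCRIPT} >> ${LOG} 2>&1"

# Print the entry so it can be reviewed before it is installed.
echo "${CRON_LINE}"

# To append it to the current user's crontab without dropping existing
# entries, uncomment the next line:
# ( crontab -l 2>/dev/null; echo "${CRON_LINE}" ) | crontab -
```

Note that cron gives periodic batch loads, not real-time ingestion; the interval only bounds how stale the HDFS copy can get.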


做自己的国王

#3 · 2019-09-14 23:22

Can't I run SELECT queries directly on the data in HDFS using Hive?

Yes, you can. Create an external table in Hive that points at your HDFS location. You can then run any HiveQL query over it.

Do I have to load the data into Hive before I can query it?

With an external table, you don't need to load the data into Hive; it stays in the same HDFS directory.
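The external-table approach could look roughly like the sketch below. The table name, the columns, and the HDFS path are hypothetical, since the question does not give the schema. The comma delimiter assumes the Sqoop import used its default text format.

```shell
#!/bin/sh
# Write a HiveQL script that defines an external table over the directory
# Sqoop imported into. All names and paths here are placeholders.
HQL_FILE="$(mktemp)"
cat > "${HQL_FILE}" <<'EOF'
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
  id BIGINT,
  customer_id BIGINT,
  amount DOUBLE,
  created_at STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hadoop/orders';

-- No LOAD DATA step: the query reads the files already in that directory.
SELECT COUNT(*) FROM orders;
EOF

# Run the script where the Hive CLI is installed; otherwise just print it.
if command -v hive >/dev/null 2>&1; then
  hive -f "${HQL_FILE}"
else
  cat "${HQL_FILE}"
fi
```

Because the table is external, dropping it removes only the Hive metadata and leaves the files in HDFS untouched.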

If new data is entered into the MySQL database, what is the best way to get this data?

You can use Sqoop's incremental import for this. It fetches only newly added or updated rows, depending on the incremental mode. You can create a Sqoop job and schedule it as needed.
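A saved Sqoop job using incremental import might be set up as in the sketch below. The MySQL host, database, table, check column, user, and paths are hypothetical placeholders. It assumes an append-only table with an increasing `id` column.

```shell
#!/bin/sh
# Arguments for a saved Sqoop job; every name and path is a placeholder.
# "set --" stores them as positional parameters so quoting is preserved.
set -- job --create orders_incremental \
  -- import \
  --connect "jdbc:mysql://mysql-host:3306/shop" \
  --username etl_user \
  --password-file /user/hadoop/.mysql.password \
  --table orders \
  --target-dir /user/hadoop/orders \
  --incremental append \
  --check-column id \
  --last-value 0

if command -v sqoop >/dev/null 2>&1; then
  # Define the job once; creating it again under the same name fails.
  sqoop "$@"
  # Each execution imports rows whose id is above the stored last value,
  # and the saved job records the new last value afterwards.
  sqoop job --exec orders_incremental
else
  # Sqoop is not installed here, so only show the command.
  echo "sqoop $*"
fi
```

Only the `sqoop job --exec orders_incremental` line would need to be scheduled. For tables whose existing rows are updated, `--incremental lastmodified` with a timestamp check column is the other mode; it usually also needs `--merge-key` so updated rows replace the old copies.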
