可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have this environment:

Haddop environment (1 master, 4 slaves) with several applications: ambari, hue, hive, sqoop, hdfs ... Server in production (separate from hadoop) with mysql database.

My goal is:

Optimize the queries made on this mysql server that are slow to execute today.

What did I do:

I imported the mysql data to HDFS using Sqoop.

My doubts:

I can not make selects direct in HDFS using Hive?
Do I have to load the data into Hive and make the queries?
If new data is entered into the mysql database, what is the best way to get this data and insert it into HDFS and then insert it into Hive again? (Maybe in real time)

Thank you in advance

回答1:

I can not make selects direct in HDFS using Hive?

You can. Create External Table in hive specifying your hdfs location. Then you can perform any HQL over it.

Do I have to load the data into Hive and make the queries?

In case of external table, you don't need to load data in hive; your data resides in the same HDFS directory.

If new data is entered into the mysql database, what is the best way to get this data.

You can use Sqoop Incremental Import for this. It will fetch only newly added/updated data (depending upon incremental mode). You can create a sqoop job and schedule it as per your need.

回答2:

You can try Impala which is much faster than Hive in case of SQL queries. You need to define tables most probably specifying some delimiter, storage format and where the data is stored on HDFS (I don't know what kind of data are you storing). Then you can write SQL queries which will take the data from HDFS.

I have no experience with real-time data ingestion from relational databases, however you can try scheduling Sqoop jobs with cron.