I am learning Hive and read an article about when to use Hive external tables. It included the statement below:
"To query data stored in an external system such as Amazon S3 - avoid bringing that data into HDFS"
Can anyone elaborate on the part "avoid bringing that data into HDFS"? As I understand it, the LOAD DATA LOCAL command loads a local file into HDFS, and Hive applies the format on top of it.
Is it possible to access data that is outside HDFS?
Hive can read data on any Hadoop-compatible filesystem, not only HDFS.
Taking S3 as an example, you can create an external table with a location of `s3a://bucket/path`; there is no need to bring the data into HDFS unless you really need the read speed of HDFS compared to S3. However, to persist a dataset in an ephemeral cloud cluster, results should be written back to whatever long-term storage is provided.

It is possible. You can try this yourself. On CDH, I have a file `extn\t.txt`. I can now create an external table to access this file as follows:
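A minimal sketch of what such a table definition might look like. The table name `extn_demo`, the two columns, the comma delimiter, and the HDFS directory `/user/demo/extn` are all hypothetical choices for illustration, not details from the answer. Note that `LOCATION` points at a directory, and Hive reads every file inside it.

```sql
-- Hypothetical layout: the data file sits in the HDFS directory
-- /user/demo/extn and holds comma-separated rows of (id, name).
CREATE EXTERNAL TABLE IF NOT EXISTS extn_demo (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/demo/extn';

-- For data kept in S3, only the location would change, for example:
--   LOCATION 's3a://bucket/path'
```

Because the table is external, dropping it removes only the metadata in the metastore; the files at the location are left in place.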
Describe table
Select
Describe formatted
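The three checks named above could be run roughly as follows, assuming a hypothetical external table called `extn_demo` (the name is illustrative only). In the output of `DESCRIBE FORMATTED`, the `Table Type` field should read `EXTERNAL_TABLE`, and the `Location` field shows where the data actually lives.

```sql
-- Column names and types of the (hypothetical) table
DESCRIBE extn_demo;

-- Read a few rows straight from the files at the table's location
SELECT * FROM extn_demo LIMIT 10;

-- Full metadata, including Table Type and Location
DESCRIBE FORMATTED extn_demo;
```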
LOAD DATA is different. Please check this: External Table vs Load Data
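To illustrate the difference, here is a small sketch of `LOAD DATA` against a managed table. The table name `managed_demo` and both file paths are hypothetical. Unlike an external table, which reads files where they already are, `LOAD DATA` places the file into the table's own directory.

```sql
-- Hypothetical managed table with the same two-column layout
CREATE TABLE IF NOT EXISTS managed_demo (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- With LOCAL, the file is copied from the local filesystem
-- into the table's directory
LOAD DATA LOCAL INPATH '/tmp/t.txt' INTO TABLE managed_demo;

-- Without LOCAL, the file is moved from its current HDFS path
-- into the table's directory
LOAD DATA INPATH '/user/demo/staging/t.txt' INTO TABLE managed_demo;
```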