I am learning hive and read an article about when to use HIVE external table and mentioned the statement below.
To query data stored in external system such as amazon s3
- Avoid brining in that data into HDFS
Can anyone elaborate above statement. "Avoid brining in that data into HDFS"? Load data local command will help to load local file into HDFS and HIVE is applying the format on the top.
Is it possible to access the data which is out of HDFS?
is it possible to access the data which is out of HDFS?
HIve can read data on any Hadoop Compatible filesystem, not only HDFS.
Can someone elaborate above statement. "Avoid brining in that data into HDFS "?
With the example of S3, you can create an external table with a location of s3a://bucket/path
, there's no need to bring it to HDFS unless you really needed the speed of reading HDFS compared to S3. However, to persist a dataset in an ephemeral cloud cluster, results should be written back to whatever long-term storage is provided.
It is possible. You can try this yourself. On CDH, I have a file extn\t.txt
[cloudera@quickstart ~]$ pwd
/home/cloudera
[cloudera@quickstart ~]$ cat extn/t.txt
something
[cloudera@quickstart ~]$
I can now create an external table to access this file as follows
create external table tbl(line string)
location 'file:///home/cloudera/extn'
Describe table
INFO : OK
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| line | string | |
+-----------+------------+----------+--+
1 row selected (0.152 seconds)
0: jdbc:hive2://localhost:10000>
Select
INFO : OK
+------------+--+
| tbl.line |
+------------+--+
| something |
+------------+--+
1 row selected (0.134 seconds)
0: jdbc:hive2://localhost:10000>
Describe formatted
+-------------------------------+----------------------------------------------------+-----------------------+--+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+-----------------------+--+
| # col_name | data_type | comment |
| | NULL | NULL |
| line | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| Owner: | cloudera | NULL |
| CreateTime: | Tue Feb 20 12:49:25 PST 2018 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | file:/home/cloudera/extn | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | false |
| | EXTERNAL | TRUE |
| | numFiles | 0 |
| | numRows | -1 |
| | rawDataSize | -1 |
| | totalSize | 0 |
| | transient_lastDdlTime | 1519159765 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+-----------------------+
Load data is different. Please check this External Table vs Load Data