When you create an external table in Hive (on Hadoop) with an Amazon S3 source location, when is the data transferred to the local Hadoop HDFS?
- on external table creation
- when queries (MR jobs) are run on the external table
- never (no data is ever transferred), and MR jobs read the S3 data directly
What are the costs incurred here for S3 reads? Is there a single cost for the transfer of data to HDFS, or is there no data transfer cost, with read costs instead incurred each time a MapReduce job created by Hive runs on this external table?
An example external table definition would be:
CREATE EXTERNAL TABLE mydata (key STRING, value INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '='
LOCATION 's3n://mys3bucket/';