Can a Hive table automatically update when the underlying directory is changed?

Posted 2019-07-14 18:41

Question:

If I build a Hive table on top of some S3 (or HDFS) directory like so:

CREATE EXTERNAL TABLE newtable (name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://location/subdir/';

When I add files to that S3 location, the Hive table doesn't automatically update. The new data is only included if I create a new Hive table on that location. Is there a way to build a Hive table (maybe using partitions) so that whenever new files are added to the underlying directory, the Hive table automatically shows that data (without having to recreate the Hive table)?

Answer 1:

On HDFS, each file is scanned every time the table is queried, as @Dudu Markovitz pointed out, and files in HDFS are immediately consistent. On S3, files are immediately consistent after create, but only eventually consistent after delete or overwrite. So when you add new files to the table folder in S3, they are immediately visible when you query the Hive table; a problem with eventual consistency only arises if you are rewriting (overwriting) existing files. See the S3 consistency model documentation: http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel

There are a few approaches to eliminating the eventual-consistency problem, such as writing each load into a newly created partition named after a timestamp, or dropping and recreating the table with a new location based on a timestamp or some run ID. The idea is to create new files each time rather than rewriting existing ones. Also have a look at this: https://github.com/andrewgaul/are-we-consistent-yet
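The timestamp-partition approach can be sketched in HiveQL like this (the partitioned table name, partition column, and timestamp value are illustrative, not from the original post):

```sql
-- Hypothetical partitioned variant of the table from the question.
CREATE EXTERNAL TABLE newtable_partitioned (name STRING)
PARTITIONED BY (run_ts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://location/subdir/';

-- Each load writes brand-new files under a fresh partition directory,
-- so no existing S3 object is ever overwritten; the new partition is
-- then registered explicitly:
ALTER TABLE newtable_partitioned
ADD PARTITION (run_ts = '2019-07-14-1841')
LOCATION 's3a://location/subdir/run_ts=2019-07-14-1841/';
```

Because every load targets a new key prefix, queries only ever read objects that were created once and never rewritten, which sidesteps the overwrite-consistency issue described above.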

There may also be a problem with Hive answering queries from stale table statistics after you add files; see here: https://stackoverflow.com/a/39914232/2700344
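As a sketch of the two usual workarounds for the stale-statistics issue (both the property and the statement are stock Hive, but verify the default behavior on your version):

```sql
-- Force Hive to scan the data instead of answering COUNT(*)-style
-- queries from possibly stale table statistics:
SET hive.compute.query.using.stats=false;

-- Alternatively, recompute the statistics after new files land:
ANALYZE TABLE newtable COMPUTE STATISTICS;
```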



Answer 2:

Everything @leftjoin says is correct, with one extra detail: S3 doesn't offer immediate consistency on listings. A new blob can be uploaded, and HEAD/GET will return it, but a list operation on the parent path may not see it yet. This means that Hive code which lists the directory may not see the new data. Using unique names doesn't fix this; the only fix is a consistent database such as DynamoDB that is updated as files are added and removed. Even then, you have added a new thing to keep in sync...
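The "consistent DB like Dynamo" approach is what the Hadoop S3A connector implements as S3Guard, which mirrors directory listings into a DynamoDB table. A minimal core-site.xml sketch (property names taken from the Hadoop S3A documentation; check them against your Hadoop version before relying on this):

```xml
<!-- Sketch: back S3A directory listings with a DynamoDB metadata
     store so newly written objects appear in listings immediately. -->
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <!-- Create the DynamoDB table on first use if it does not exist. -->
  <name>fs.s3a.s3guard.ddb.table.create</name>
  <value>true</value>
</property>
```

As noted above, this adds an extra component that must stay in sync with the bucket, so it trades one consistency problem for an operational one.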