How to make a table that is automatically updated

Posted 2019-06-11 14:27

Question:

I have created an external table in Hive that uses data from a Parquet store in HDFS.

When the data in HDFS is deleted, the table is empty, as expected. But when data is written to the same HDFS location again, the table does not get updated to contain the new data, and if I insert new records into a table that already contains data, none of the new data shows up when I run my Hive queries.

How I create the table in Hive:

CREATE EXTERNAL TABLE nodes (id string) STORED AS PARQUET LOCATION "/hdfs/nodes";

The relevant error:

Error: java.io.FileNotFoundException: File does not exist: /hdfs/nodes/part-r-00038-2149d17d-f890-48bc-a9dd-5ea07b0ec590.gz.parquet

I have seen several posts that explain that external tables should have the most up to date data in them, such as here. However, this is not the case for me, and I don't know what is happening.

I inserted the same data into the database again, and queried the table. It contained the same amount of data as before. I then created an identical table with a different name. It had twice as much data in it, which was the right amount.

The issue might be with the metastore database. I am using PostgreSQL instead of Derby for the database.

Relevant information:

  • Hive 0.13.0
  • Spark Streaming 1.4.1
  • PostgreSQL 9.3
  • CentOS 7

EDIT: After examining the Parquet files, I found that the part files have seemingly incompatible file names.

-rw-r--r-- 3 hdfs hdfs 18702811 2015-08-27 08:22 /hdfs/nodes/part-r-00000-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18703029 2015-08-26 15:43 /hdfs/nodes/part-r-00000-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18724320 2015-08-27 08:22 /hdfs/nodes/part-r-00001-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18723575 2015-08-26 15:43 /hdfs/nodes/part-r-00001-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet

These are the files that Hive errors on when it can't find them, as in the error described above. This means that the external table is not acting dynamically, picking up whatever files are in the directory, but instead is probably just keeping track of the list of Parquet files that were inside the directory when the table was created.

Sample Spark code:

nodes.foreachRDD(rdd => {
  if (!rdd.isEmpty())
    sqlContext.createDataFrame(rdd.map(n => Row(n.stuff)), ParquetStore.nodeSchema)
      .write.mode(SaveMode.Append).parquet(node_name)
})

where nodeSchema is the schema and node_name is "/hdfs/nodes".

See my other question about getting Hive external tables to detect new files.

Answer 1:

In order to get Hive to update its tables, I had to resort to using Hive's partitioning feature. By creating a new partition during each Spark run, I end up with a series of directories inside the /hdfs/nodes directory like this:

/hdfs/nodes/timestamp=<a-timestamp>/<parquet-files>
/hdfs/nodes/timestamp=<a-different-timestamp>/<parquet-files>
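A minimal sketch of how each Spark run could write into its own partition directory (the runTimestamp and nodesDataFrame names here are illustrative, not from the original code):

import org.apache.spark.sql.SaveMode

// Build a partition path for this run, e.g. /hdfs/nodes/timestamp=1440687722
val runTimestamp = System.currentTimeMillis() / 1000
val partitionPath = s"/hdfs/nodes/timestamp=$runTimestamp"

// Write this run's data under its own partition directory instead of the table root
nodesDataFrame.write.mode(SaveMode.Append).parquet(partitionPath)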

Then, after each Spark job completes, I run the Hive command MSCK REPAIR TABLE nodes using a HiveContext in my Spark job, which finds new partitions and updates the table.
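A sketch of that repair step, assuming sc is the SparkContext of the streaming job (the answer only states that MSCK REPAIR TABLE nodes is run through a HiveContext):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Scan the table's location for new partition directories and add them to the metastore
hiveContext.sql("MSCK REPAIR TABLE nodes")

Note that for MSCK REPAIR TABLE to pick up the timestamp=<...> directories, the external table has to be declared with a matching PARTITIONED BY (timestamp string) clause (or an equivalent type).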

I realize this isn't automatic, but it at least works.



Answer 2:

OK, so you probably need to put the file inside a folder. A Hive external table must be mapped to a folder, which may contain more than one file.

Try writing the file to /path/to/hdfs/nodes/file and then mapping the external table to /path/to/hdfs/nodes.

That way the nodes folder will contain only the Parquet file(s), and it should work.
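A rough sketch of that layout, assuming hiveContext is a HiveContext and someDataFrame is a placeholder DataFrame (the /path/to/hdfs/nodes path is the example path from this answer):

// Point the external table at the directory, not at a single file
hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS nodes (id string)
    |STORED AS PARQUET
    |LOCATION '/path/to/hdfs/nodes'""".stripMargin)

// Write the Parquet file(s) underneath that directory
someDataFrame.write.mode(SaveMode.Append).parquet("/path/to/hdfs/nodes")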



Tags: hive hdfs hiveql