I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly.
Where does Hive store its files in HDFS?
I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly.
Where does Hive store its files in HDFS?
The location they are stored on the HDFS is fairly easy to figure out once you know where to look. :)
If you go to http://NAMENODE_MACHINE_NAME:50070/
in your browser it should take you to a page with a Browse the filesystem
link.
In the $HIVE_HOME/conf
directory there is the hive-default.xml
and/or hive-site.xml
which has the hive.metastore.warehouse.dir
property. That value is where you will want to navigate to after clicking the Browse the filesystem
link.
In mine, it's /usr/hive/warehouse
. Once I navigate to that location, I see the names of my tables. Clicking on a table name (which is just a folder) will then expose the partitions of the table. In my case, I currently only have it partitioned on date
. When I click on the folder at this level, I will then see files (more partitioning will have more levels). These files are where the data is actually stored on the HDFS.
I have not attempted to access these files directly, I'm assuming it can be done. I would take GREAT care if you are thinking about editing them. :)
For me - I'd figure out a way to do what I need to without direct access to the Hive data on the disk. If you need access to raw data, you can use a Hive query and output the result to a file. These will have the exact same structure (divider between columns, ect) as the files on the HDFS
. I do queries like this all the time and convert them to CSVs.
The section about how to write data from queries to disk is https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
UPDATE
Since Hadoop 3.0.0 - Alpha 1 there is a change in the default port numbers. NAMENODE_MACHINE_NAME:50070 changes to NAMENODE_MACHINE_NAME:9870. Use the latter if you are running on Hadoop 3.x. The full list of port changes are described in HDFS-9427
Hive tables may not necessarily be stored in a warehouse (since you can create tables located anywhere on the HDFS).
You should use DESCRIBE FORMATTED <table_name>
command.
hive -S -e "describe formatted <table_name> ;" | grep 'Location' | awk '{ print $NF }'
Please note that partitions may be stored in different places and to get the location of the alpha=foo/beta=bar
partition you'd have to add partition(alpha='foo',beta='bar')
after <table_name>
.
In Hive terminal type:
hive> set hive.metastore.warehouse.dir;
(it will print the path)
It's also very possible that typing show create table <table_name>
in the hive cli will give you the exact location of your hive table.
describe formatted <table_name>;
inside hive shell.
Notice the "Location" value that shows the location of the table.
Summarize few points posted earlier, in hive-site.xml, property hive.metastore.warehouse.dir specifies where the files located under hadoop HDFS
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
To view files, use this command:
hadoop fs -ls /user/hive/warehouse
or
http://localhost:50070
Utilities > Browse the file system
or
http://localhost:50070/explorer.html#/
tested under hadoop-2.7.3, hive-2.1.1
If you look at the hive-site.xml file you will see something like this
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/usr/hive/warehouse </value>
<description>location of the warehouse directory</description>
</property>
/usr/hive/warehouse is the default location for all managed tables. External tables may be stored at a different location.
describe formatted <table_name>
is the hive shell command which can be use more generally to find the location of data pertaining to a hive table.
In Hive, tables are actually stored in a few places. Specifically, if you use partitions (which you should, if your tables are very large or growing) then each partition can have its own storage.
To show the default location where table data or partitions will be created if you create them through default HIVE commands: (insert overwrite ... partition ...
and such):
describe formatted dbname.tablename
To show the actual location of a particular partition within a HIVE table, instead do this:
describe formatted dbname.tablename partition (name=value)
If you look in your filesystem where a table "should" live, and you find no files there, it's very likely that the table is created (usually incrementally) by creating a new partition and pointing that partition at some other location. This is a great way of building tables from things like daily imports from third parties and such, which avoids having to copy the files around or storing them more than once in different places.
Another way to check where a specific table is stored would be execute this query on the hive interactive interface:
show create table table_name;
where table_name is the name of the subject table.
An example for the above query on 'customers' table would be something like this:
CREATE TABLE `customers`(
`id` string,
`name` string)
COMMENT 'Imported by sqoop on 2016/03/01 13:01:49'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://quickstart.cloudera:8020/user/hive/warehouse/
sqoop_workspace.db/customers'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='4',
'totalSize'='77',
'transient_lastDdlTime'='1456866115')
LOCATION in the example above is where you should focus on. That is your hdfs location for hive warehouse.
Don't forget to like if you like this solution. Cheers!
In sandbox , you need to go for /apps/hive/warehouse/ and normal cluster /user/hive/warehouse