I have a hive external tables that mapped to some directory. This directory includes a several files.
I want to run query like find file name where there is a user "abc"
select file_name , usr from usrs_tables where usr = "abc"
But of course the data doesn't includes file name inside.
In MapReduce I can do it by
FileSplit fileSplit = (FileSplit)context.getInputSplit();
String filename = fileSplit.getPath().getName();
System.out.println("File name "+filename);
System.out.println("Directory and File name"+fileSplit.getPath().toString());
How can I do it in Hive?
Every table in Hive has two virtual columns. They are
INPUT__FILE__NAME
BLOCK__OFFSET__INSIDE__FILE
INPUT__FILE__NAME
gives the name of the file.BLOCK__OFFSET__INSIDE__FILE
is the current global file position. Suppose if we want to find the name of the file corresponding to each record in a file. We can use theINPUT__FILE__NAME
column. This feature is available from Hive versions above 0.8. A small example is given below.Query
This will give us the file name corresponding to each record. If you want to get the file names corresponding to a hive table, the below query will help you.
Yes, you can retrieve the file the record was found in using the virtual column named
INPUT__FILE__NAME
, for example:yields something like:
If necessary, use the provided string functions to trim the host and directories from the uri.
You can find the documentation on virtual columns here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
Updated link on virtual columns in hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns