The common query-building pattern in HiveQL (and SQL in general) is to select either all columns (SELECT *) or an explicitly specified set of columns (SELECT A, B, C). SQL has no built-in mechanism for selecting all but a specified set of columns.
There are various mechanisms for excluding some columns, as outlined in this SO question, but none apply naturally to HiveQL. (For example, the idea of creating a temporary table with SELECT * and then using ALTER TABLE DROP to remove some of its columns would wreak havoc in a big data environment.)
Ignoring the ideological discussion about whether it is a good idea to select all but some columns, this question is about the possible ways to extend Hive with this capability.
Prior to Hive 0.13.0, SELECT could take regular-expression-based column specifications, e.g., property_.* inside a backtick-quoted string. @invoketheshell's answer below refers to this capability, but it comes at a cost: when this capability is on, Hive cannot accept column names with non-standard characters in them, e.g., $foo or x/y. That's why the Hive developers turned this behavior off by default in 0.13.0. I am looking for a generic solution that works for any column name.
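For reference, the regex-based form I am trying to avoid looks roughly like the sketch below. The table name sales and the excluded columns ds and hr are placeholders borrowed from the example in the Hive language manual, not from my actual schema. On Hive 0.13.0 and later, the property has to be set to none first, which is exactly what disables quoted identifiers such as $foo or x/y.

```sql
-- Needed on Hive 0.13.0+ only; earlier releases treat backtick-quoted
-- strings as regular expressions by default.
SET hive.support.quoted.identifiers=none;

-- Select every column of the (placeholder) table sales except ds and hr.
-- The backtick-quoted string is interpreted as a Java regex over column names.
SELECT `(ds|hr)?+.+` FROM sales;
```

This works only as long as every column name consists of standard characters, so it does not qualify as the generic solution I am after.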
A generic table-generating UDF (UDTF) could certainly do this because it can manipulate the schema. Since we are not going to generate new rows, is there a way to solve this problem using a simple row-based UDF?
This seems like a common problem with many posts around the Web showing how to solve it for various databases yet I haven't been able to find a solution for Hive. Is there code somewhere that does this?