Nested select in hiveQL

2019-09-14 18:18发布

问题:

In one of my use case, i have two tables namely flow and conf. The flow table contains list of all flight data. It has columns creationdate,datafilename,aircraftid. The conf table contains configuration information. It has columns configdate, aircraftid, configurationame. There are multiple versions of configurations created for one aircraft type. So, when we process a datafilename, we need to identify the aircraftid from the flow table, and pick up the configuration from conf table that was created just before the datafilename was created. So, i tried this,

FROM (
SELECT  
F_FILE_CREATION_DATE,    
F_FILE_ARCHIVED_RELATIVE_PATH,   
F_FILE_ARCHIVED_NAME,   
K_AIRCRAFT   
from T_FLOW f )x left join 
(
  select c.config_date, c.aircraft_id, c.configurationfrom t_conf c
) y on y.aircraft_id = x.K_AIRCRAFT 
select    
x.F_FILE_CREATION_DATE,  
x.F_FILE_ARCHIVED_RELATIVE_PATH,   
x.F_FILE_ARCHIVED_NAME,   
x.K_AIRCRAFT,   
y.config_date,
y.aircraft_id,
y.configuration;

This picks up all the configurations created for the aircraft which is obvious as there is no condition to check conf.config_date < flow.f_file_creation_date. I tried to include this condition like this,

 FROM (
SELECT  
F_FILE_CREATION_DATE,    
F_FILE_ARCHIVED_RELATIVE_PATH,   
F_FILE_ARCHIVED_NAME,   
K_AIRCRAFT   
from T_FLOW f )x join 
(
  select c.config_date, c.aircraft_id, c.FILEFILTER from t_conf c
) y on y.aircraft_id = x.K_AIRCRAFT  where y.config_date < x.f_file_creation_date    
select    
x.F_FILE_CREATION_DATE,  
x.F_FILE_ARCHIVED_RELATIVE_PATH,   
x.F_FILE_ARCHIVED_NAME,   
x.K_AIRCRAFT,   
y.config_date,
y.aircraft_id,
y.filefilter;

This time failed with the error

 required (...)+ loop did not match anything at input 'where' in statement

Can someone give me a hint or two where i am going wrong and on how to fix this?

回答1:

select  f.f_file_creation_date
       ,f.f_file_archived_relative_path   
       ,f.f_file_archived_name
       ,f.k_aircraft

       ,c.config_date
       ,c.aircraft_id
       ,c.filefilter

from            t_flow  as f

        join   (select  config_date
                       ,aircraft_id
                       ,filefilter

                       ,lead (config_date,1,date '3000-01-01') over 
                        (
                            partition by    aircraft_id
                            order by        config_date  
                        )   as next_config_date                        

                from    t_conf
                ) c

        on      c.aircraft_id = 
                f.k_aircraft

where   f.f_file_creation_date  >=  c.config_date 
    and f.f_file_creation_date  <   c.next_config_date   

Please read carefully

Posting a question

When you post a data related question -

  1. Supply a data sample: source data + required results.
    It is going to be more clear than any explanation you give. It will also supply a common background for further discussions and a way for you and others to verify the correctness of the given solutions.
  2. Supply the size properties (records/volume) of the tables.
    It is important for performance considerations ans might impact the given solution.

SQL

  1. Hive currently does not support any JOIN condition type other than equijoin (e.g. t1.X = t2.X and t1.Y = t2.Y). This is why you get an error.
    If you are doing an inner join (and not outer join) then you can move the non-equijoin conditions to the WHERE clause.
  2. Stick to ISO SQL standard. There is a conventional order for SQL clauses: SELECT-FROM-WHERE...
    You gain nothing from esoteric syntax except for esoteric error messages.
  3. There is no reason what so ever to use sub-queries in order to narrow the columns list.
    Just to make it perfectly clear - There isn't any performance gain doing that. More than that, if it would have work as you assume (and it does not) the performance would have been worse, not better.


回答2:

I can't reproduce your error. I guess your query is valid. What version do you use for Hive ? I tested this query with hive 2.1.1.

DROP TABLE IF EXISTS t_flow;
CREATE TABLE IF NOT EXISTS t_flow (
    f_file_creation_date            DATE
    , f_file_archived_relative_path STRING
    , f_file_archived_name          STRING
    , k_aircraft                    STRING
);

-- Conf table contains configuration information.
-- It has columns configdate, aircraftid, configurationame
DROP TABLE IF EXISTS t_conf;
CREATE TABLE IF NOT EXISTS t_conf (
    config_date   DATE
    , aircraft_id STRING
    , filefilter  STRING
);

SELECT
    x.f_file_creation_date,
    x.f_file_archived_relative_path,
    x.f_file_archived_name,
    x.k_aircraft,
    y.config_date,
    y.aircraft_id,
    y.filefilter
FROM
    (SELECT
        f_file_creation_date,
        f_file_archived_relative_path,
        f_file_archived_name,
        k_aircraft
    FROM t_flow f) x
    JOIN
    (SELECT
    c.config_date,
    c.aircraft_id,
    c.filefilter
    FROM t_conf c) y on y.aircraft_id = x.k_aircraft  where y.config_date < x.f_file_creation_date;