Reduction in the number of records using range joi

2019-09-16 18:28发布

问题:

Following the my question I have the following tables the first (Range) includes range of values and additional columns:

row  | From   |  To     | Country ....
-----|--------|---------|---------
1    | 1200   |   1500  |
2    | 2200   |   2700  |
3    | 1700   |   1900  |
4    | 2100   |   2150  |
... 

The From and To are bigint and are exclusive. The Range table includes 1.8M records. Additional table (Values) contains 2.7M records and looks like:

 row     | Value  | More columns....
 --------|--------|----------------
    1    | 1777   |    
    2    | 2122   |    
    3    | 1832   |    
    4    | 1340   |    
    ... 

I would like to create one table as followed:

row      | Value  | From   | To    | More columns....
 --------|--------|--------|-------|---
    1    | 1777   | 1700   | 1900  |
    2    | 2122   | 2100   | 2150  |   
    3    | 1832   | 1700   | 1900  |   
    4    | 1340   | 1200   | 1500  |   
    ... 

I used the left outer join in the following code:

set n=1000;

select      v.id
           ,v.val
           ,r.from_val
           ,r.to_val

from      val v
        left outer join    

 (select  r.*
                   ,floor(from_val/${hiveconf:n}) + pe.i    as match_val

            from    val_range r
                    lateral view    posexplode
                                    (
                                        split
                                        (
                                            space
                                            (
                                                cast
                                                (
                                                    floor(to_val/${hiveconf:n}) 
                                                  - floor(from_val/${hiveconf:n}) 

                                                    as int
                                                )
                                            )
                                           ,' '
                                        )
                                    ) pe as i,x
            ) r



            on      floor(v.val/${hiveconf:n})    =
                    r.match_val

where       v.val between r.from_val and r.to_val

order by    v.id       
;

However there is a vast reduction in the number of records of the new table ~31k records out of 2.7M. How can it be if I use the left outer join? How can I fix it?

回答1:

Assuming we have a v.id

set n=1000;

select      v.id
           ,r.from_val
           ,r.to_val

from                    val     v 

            left join  (select      v.id
                                   ,r.from_val
                                   ,r.to_val

                        from                val     v 

                                    join    (...)   r 

                                    on      floor(v.val/${hiveconf:n})    =
                                            r.match_val

                        where       v.val between r.from_val and r.to_val
                        ) r

            on          r.id    =
                        v.id

order by    v.id       

As for the OP request, here is the full query:

set n=1000;

select      v.id
           ,r.from_val
           ,r.to_val

from                    val     v 

            left join  (select      v.id
                                   ,r.from_val
                                   ,r.to_val

                        from                val     v 

                                    join   (select  r.*
                                                   ,floor(from_val/${hiveconf:n}) + pe.i    as match_val

                                            from    val_range r
                                                    lateral view    posexplode
                                                                    (
                                                                        split
                                                                        (
                                                                            space
                                                                            (
                                                                                cast
                                                                                (
                                                                                    floor(to_val/${hiveconf:n}) 
                                                                                  - floor(from_val/${hiveconf:n}) 

                                                                                    as int
                                                                                )
                                                                            )
                                                                           ,' '
                                                                        )
                                                                    ) pe as i,x
                                            ) r

                                    on      floor(v.val/${hiveconf:n})    =
                                            r.match_val

                        where       v.val between r.from_val and r.to_val
                        ) r

            on          r.id    =
                        v.id

order by    v.id