I have two tables. one includes 1 million records, the other includes 20 million records.
table 1 value (1, 1) (2, 2) (3, 3) (4, 4) (5, 4) .... table 2 value (55, 11) (33, 22) (44, 66) (22, 11) (11, 33) ....
I need to use the value in tables 1 to multiply by the value in table 2, get the rank of the result, and get top 5 in the rank. their result would be like:
value from table 1, top 5 for each value in table 1 (1, 1), 1*44 + 1*66 = 110 (1, 1), 1*55 + 1*11 = 66 (1, 1), 1*33 + 1*22 = 55 (1, 1), 1*11 + 1*33 = 44 (1, 1), 1*22 + 1* 11 = 33 .....
I tried to use cross join in hive. but I always get a failure due to the table is too large.
select top 5 from table 2 first, then do a cross join with first table. This will be the same as cross join two tables and taking top5 after cross join, but the number of rows joined in the first case will be much less. Cross join with small 5 rows dataset will be transformed to map-join and executed as fast as table1 full scan.
Look at the below demo. Cross join was transformed to map join. Note
"Map Join Operator"
in the plan and this warning:"Warning: Map Join MAPJOIN[19][bigTable=?] in task 'Map 1' is a cross product"
:Just replace stacks in my demo with your tables.