I use AWS-EMR to run my Hive queries and I have a performance issue while running hive version 0.13.1.
The newer version of hive took around 5 minutes for running 10 rows of data. But the same script for 230804 rows is taking 2 days and is still running. What should I do to analyze and fix the problem?
Sample Data:
Table 1:
hive> describe foo;
OK
orderno string
Time taken: 0.101 seconds, Fetched: 1 row(s)
Sample data for table1:
hive>select * from foo;
OK
1826203307
1826207803
1826179498
1826179657
Table 2:
hive> describe de_geo_ip_logs;
OK
id bigint
startorderno bigint
endorderno bigint
itemcode int
Time taken: 0.047 seconds, Fetched: 4 row(s)
Sample data for Table 2:
hive> select * from bar;
127698025 417880320 417880575 306
127698025 3038626048 3038626303 584
127698025 3038626304 3038626431 269
127698025 3038626560 3038626815 163
My Query:
SELECT b.itemcode
FROM foo a, bar b
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
In the very top of your Hive log output, it states "Warning: Shuffle Join JOIN[4][Tables a, b] in Stage 'Stage-1 Mapred' is a cross product."
EDIT:
A 'cross product' or Cartesian product is a join without conditions, which returns every row in the 'b' table, for every row in the 'a' table. So, if you take an example of 'a' is 5 rows, and 'b' is 10 rows, you get the product, or, 5 multiplied by 10 = 50 rows returned. There will be a lot of rows that are completely 'null' for one or the other tables.
Now, if you have a table 'a' of 20,000 rows and join it to another table 'b' of 500,000 rows, you are asking the SQL engine to return to you a data set 'a, b' of 10,000,000,000 rows, and then perform the BETWEEN operation on the 10-million rows.
So, if you drop the number of 'b' rows, you see you will get more benefit than the 'a' - in your example, if you can filter the ip_logs table, table 2, since I am making a guess that it has more rows than your order number table, it will cut down on the execution time.
END EDIT
You're forcing the execution engine to work through a Cartesian product by not specifying a condition for the join. It's having to scan all of table a over and over. With 10 rows, you will not have a problem. With 20k, you are running into dozens of map/reduce waves.
Try this query:
SELECT b.itemcode
FROM foo a JOIN bar b on <SomeKey>
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
But I'm having trouble figuring out what column your model will allow joining on. Maybe the data model for this expression could be improved? It may just be me not reading the sample clearly.
Either way, you need to filter the number of comparisons BEFORE the where clause. Other ways I have done this in Hive is to make a view with a smaller set of data, and join/match the view instead of the original table.