So if there is a table in the database shown as below:
Key2 DateTimeAge
AAA1 XXX XXX XXX
AAA2 XXX XXX XXX
AAA3 XXX XXX XXX
AAA4 XXX XXX XXX
AAA5 XXX XXX XXX
AAA6 XXX XXX XXX
AAA7 XXX XXX XXX
AAA8 XXX XXX XXX
BBB1 XXX XXX XXX
BBB2 XXX XXX XXX
BBB3 XXX XXX XXX
BBB4 XXX XXX XXX
BBB5 XXX XXX XXX
CCC1 XXX XXX XXX
CCC2 XXX XXX XXX
CCC3 XXX XXX XXX
CCC4 XXX XXX XXX
CCC5 XXX XXX XXX
CCC6 XXX XXX XXX
CCC7 XXX XXX XXX
DDD1 XXX XXX XXX
DDD2 XXX XXX XXX
DDD3 XXX XXX XXX
DDD4 XXX XXX XXX
DDD5 XXX XXX XXX
DDD6 XXX XXX XXX
DDD7 XXX XXX XXX
I have a 2nd table, given as
1 AAA
2 DDD
3 CCC
Since AAA,DDD and CCC are in table2, I want to scan all rows corresponding to these 3 values from table1, i.e, I want to get AAA1-AAA8, DDD1 -DDD7 and CCC1-CCC7.
The table1 is a Hadoop database table, and has millions of rows. I only want to scan it for rows which I get from table1.
Could anyone help me with an efficent way of doing this? I have table1 as an RDD, and table2 is in HBase.
The difficult part is actually to setup the HBase connector either from Hortonworks or from Huawei.
But anyway I think you are asking about the query itself, so I have quickly built a toy example using Hive (i.e. creating the HBase table using the shell and then adding a
create external table
in Hive).Then I create a SQL context using the Hive context.
The full toy table has 3 rows:
and to access a subset of the HBase
rowkeys
:For performance you should definitively go for one of the HBase connectors, but once you have it (at least for Hortonworks') the query should be the same.