Let's say my row key has two parts (NUM1~NUM2).
I would like to do a count grouped by the first part of the row key. Is there a way to do this in HBase?
I can always do it as a MapReduce job that reads all the rows, groups them, and counts, but I was wondering if there is a way to do it directly in HBase.
Option 1:
You can use a prefix filter, something like the examples below.
PrefixFilter: returns only the rows whose key starts with the given prefix.
The same filter can be used with the Java client as well.
Examples using the HBase shell:
Adapt the prefix to your requirement.
NOTE: the Java HBase Scan API has the same methods if you want to do this from Java.
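A prefix filter gives you the rows for one NUM1 value at a time, so a full "count group by" still needs a tally on the client side. Here is a minimal sketch of that tally in plain Java. It does not call HBase at all: the list of byte[] keys stands in for the row keys a scanner would return, and the class and method names (RowKeyGroupCount, firstPart, countByFirstPart) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: models the loop a client would run over scan results.
class RowKeyGroupCount {

    // Returns the part of the row key before the first '~' (the NUM1 part).
    // Assumption: keys are UTF-8 text in the form NUM1~NUM2.
    static String firstPart(byte[] rowKey) {
        String key = new String(rowKey, StandardCharsets.UTF_8);
        int sep = key.indexOf('~');
        return sep < 0 ? key : key.substring(0, sep);
    }

    // Counts row keys per first part, keeping first-seen order.
    static Map<String, Long> countByFirstPart(Iterable<byte[]> rowKeys) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (byte[] rowKey : rowKeys) {
            counts.merge(firstPart(rowKey), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for row keys coming back from a scan, already sorted.
        List<byte[]> keys = Arrays.asList(
            "100~1".getBytes(StandardCharsets.UTF_8),
            "100~2".getBytes(StandardCharsets.UTF_8),
            "200~1".getBytes(StandardCharsets.UTF_8),
            "300~1".getBytes(StandardCharsets.UTF_8),
            "300~2".getBytes(StandardCharsets.UTF_8),
            "300~3".getBytes(StandardCharsets.UTF_8));
        System.out.println(countByFirstPart(keys)); // prints {100=2, 200=1, 300=3}
    }
}
```

Because HBase returns rows sorted by key, all rows sharing a NUM1 arrive next to each other, so one pass is enough. If you only need the count for a single NUM1, a prefix scan on "NUM1~" plus a row counter does the same job with less data.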
Option 2: FuzzyRowFilter
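This filter acts on row keys, but in a fuzzy manner. It needs a list of row keys that should be returned, plus an accompanying byte[] array that signifies the importance of each byte in the row key. The constructor is: FuzzyRowFilter(List<Pair<byte[], byte[]>> fuzzyKeysData)
The fuzzyKeysData specifies the mentioned significance of a row key byte, by taking one of two values: 0 means the byte at that position in the row key must match as-is, and 1 means the byte at that position does not matter and is always accepted.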
Example: Partial Row Key Matching
A possible example is matching partial keys, but not from left to right, rather somewhere inside a compound key. Assume a row key format of userId_actionId_year_month, with fixed-length parts, where userId is 4, actionId is 2, year is 4, and month is 2 bytes long. The application now requests all users that performed a certain action (encoded as 99) in January of any year. Then the pair for row key and fuzzy data would be the following:
row key = "????_99_????_01", where the "?" is an arbitrary character, since it is ignored.
fuzzy data = "\x01\x01\x01\x01\x00\x00\x00\x00\x01\x01\x01\x01\x00\x00\x00"
In other words, the fuzzy data array instructs the filter to find all row keys matching "????_99_????_01", where the "?" will accept any character.
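To make the byte-by-byte mapping concrete, here is a small sketch in plain Java that derives the fuzzy data from a pattern written with "?" wildcards, for the 15-byte key ????_99_????_01 (4 + 1 + 2 + 1 + 4 + 1 + 2 bytes). It does not use the HBase classes; FuzzyMaskSketch, maskFor and escape are hypothetical names, and the sketch assumes an ASCII pattern so that one character is one byte.

```java
// Hypothetical helper: derives a fuzzy mask from a '?' pattern.
class FuzzyMaskSketch {

    // Builds the mask: 1 where the pattern has '?', 0 where the byte is fixed.
    // Assumption: the pattern is ASCII, so character index equals byte index.
    static byte[] maskFor(String pattern) {
        byte[] mask = new byte[pattern.length()];
        for (int i = 0; i < pattern.length(); i++) {
            mask[i] = (byte) (pattern.charAt(i) == '?' ? 1 : 0);
        }
        return mask;
    }

    // Renders bytes in the \xNN notation used in the text.
    static String escape(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\x%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // userId (4) _ actionId (2) _ year (4) _ month (2)
        System.out.println(escape(maskFor("????_99_????_01")));
    }
}
```

The separators and the fixed parts (_99_ and _01) end up as \x00, and the four userId bytes and four year bytes end up as \x01. In real code the pattern bytes and this mask would be the two halves of one pair passed to the filter.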
An advantage of this filter is that it can likely compute the next matching row key when it reaches the end of a matching one. It implements the getNextCellHint() method to help the servers fast-forward to the next range of rows that might match. This speeds up scanning, especially when the skipped ranges are quite large. Example 4-12 uses the filter to grab specific rows from a test data set.
Example: filtering rows with a fuzzy row filter
The example code also adds a filtering column to the scan, just to keep the output short:
Adding rows to table... Results of scan:
The test code wiring adds 20 rows to the table, named row-01 to row-20. We want to retrieve all the rows that match the pattern row-?5, in other words all rows that end in the number 5. The output above confirms the correct result.
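The matching rule behind that result can be reproduced without a cluster. The sketch below is plain Java, not the book's example and not the HBase implementation: it generates row-01 to row-20 in memory and applies the fuzzy key row-?5 with the mask 0,0,0,0,1,0. The names FuzzyRowMatchSketch, matches and scan are hypothetical, and requiring equal key lengths is an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of fuzzy row matching.
class FuzzyRowMatchSketch {

    // A row matches when every position with mask 0 equals the fuzzy key byte.
    // Positions with mask 1 are ignored.
    static boolean matches(byte[] rowKey, byte[] fuzzyKey, byte[] mask) {
        if (rowKey.length != fuzzyKey.length) {
            return false; // assumption: fixed-length keys only
        }
        for (int i = 0; i < fuzzyKey.length; i++) {
            if (mask[i] == 0 && rowKey[i] != fuzzyKey[i]) {
                return false;
            }
        }
        return true;
    }

    // Generates row-01..row-NN and returns the ones that match.
    static List<String> scan(int rowCount, String fuzzyKey, byte[] mask) {
        byte[] keyBytes = fuzzyKey.getBytes(StandardCharsets.US_ASCII);
        List<String> hits = new ArrayList<>();
        for (int i = 1; i <= rowCount; i++) {
            String row = String.format("row-%02d", i);
            if (matches(row.getBytes(StandardCharsets.US_ASCII), keyBytes, mask)) {
                hits.add(row);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        byte[] mask = {0, 0, 0, 0, 1, 0}; // only the fifth byte is a wildcard
        System.out.println(scan(20, "row-?5", mask)); // prints [row-05, row-15]
    }
}
```

This sketch checks every row, which is exactly what the real filter tries to avoid with its next-row hint; it only shows which rows qualify, not how the servers skip ahead.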
You can also use RegexStringComparator with a RowFilter in the HBase shell.
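Whatever regex you pass, it is worth checking it against a few sample keys first. The sketch below uses only java.util.regex on an in-memory list, not the HBase comparator itself; RegexRowKeySketch and countMatching are hypothetical names, and the sample keys are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: counts sample row keys that a regex would select.
class RegexRowKeySketch {

    // Counts keys in which the pattern is found; anchor with ^ to match from the start.
    static long countMatching(List<String> rowKeys, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return rowKeys.stream().filter(k -> pattern.matcher(k).find()).count();
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("123~1", "123~2", "1234~1", "456~1");
        System.out.println(countMatching(keys, "^123~")); // prints 2
        System.out.println(countMatching(keys, "^123"));  // prints 3, also picks up 1234~1
    }
}
```

The second call shows why the separator belongs in the pattern: without the "~", a NUM1 of 123 also matches keys whose NUM1 is 1234. The same care applies to the prefix you give a prefix filter.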