Is there a way in HBase to COUNT rows matching row

Posted 2019-07-17 16:11

Let's say my Rowkey has two parts (NUM1~NUM2).

I would like to do a count group by the first part of the Rowkey. Is there a way to do this in HBase?

I can always do it as an M/R job (read all the rows, group, count), but I was wondering whether there is a way to do it in HBase itself.

Tags: hadoop hbase
2 answers
Lonely孤独者°
#2 · 2019-07-17 16:50

Option 1:

You can use a prefix filter, something like the examples below.

PrefixFilter:

This filter takes one argument, a prefix of a row key. It returns only the key-values in rows whose key starts with the specified prefix.

Syntax:

PrefixFilter (<row_prefix>)

Examples using the HBase shell:

scan 'yourtable', {FILTER => "PrefixFilter('12345|abc|50|2016-05-05')"}

scan 'yourtable', {STARTROW => '12345', FILTER => "PrefixFilter('2016-05-05 08:10:10')"}

Choose the prefix (and start row) based on your requirement.

NOTE: the Java HBase scan API offers the same filter if you want to do this from Java.
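A prefix scan returns the matching rows, not a count per prefix, so the grouping still has to happen somewhere. Below is a minimal client-side sketch in plain Java, assuming the row keys have already been read from a scan (for example via Bytes.toString(result.getRow())) and that "~" separates the two parts as in the question. The class and method names are hypothetical and not part of HBase.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not an HBase class: groups row keys by their first part.
final class PrefixCount {

    /** Counts row keys grouped by the text before the first separator. */
    static Map<String, Long> countByPrefix(Iterable<String> rowKeys, char separator) {
        Map<String, Long> counts = new TreeMap<>();
        for (String key : rowKeys) {
            int cut = key.indexOf(separator);
            // A key without a separator is counted under the whole key.
            String prefix = cut < 0 ? key : key.substring(0, cut);
            counts.merge(prefix, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for row keys collected from a scan.
        List<String> keys = List.of("100~1", "100~2", "200~1");
        System.out.println(countByPrefix(keys, '~')); // prints {100=2, 200=1}
    }
}
```

This still reads every row key over the network, so for large tables it is closer to the M/R approach than to a server-side count.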

Option 2:

FuzzyRowFilter (see hbase-the-definitive). This is really useful in our case; we have used it from bulk clients such as MapReduce jobs as well as from standalone HBase clients.

This filter acts on row keys, but in a fuzzy manner. It needs a list of row keys that should be returned, plus an accompanying byte[] array that signifies the importance of each byte in the row key. The constructor is as follows:

FuzzyRowFilter(List<Pair<byte[], byte[]>> fuzzyKeysData)

The fuzzyKeysData specifies the significance of each row key byte by taking one of two values:

0: the byte at the same position in the row key must match as-is.
1: the corresponding row key byte does not matter and is always accepted.

Example: Partial Row Key Matching

A possible example is matching partial keys, but not from left to right, rather somewhere inside a compound key. Assume a row key format of <userId>_<actionId>_<year>_<month>, with fixed-length parts, where <userId> is 4, <actionId> is 2, <year> is 4, and <month> is 2 bytes long. The application now requests all users that performed a certain action (encoded as 99) in January of any year. Then the pair of row key and fuzzy data would be the following:

row key = "????_99_????_01", where the "?" is an arbitrary character, since it is ignored.
fuzzy data = "\x01\x01\x01\x01\x00\x00\x00\x00\x01\x01\x01\x01\x00\x00\x00"

In other words, the fuzzy data array instructs the filter to find all row keys matching "????_99_????_01", where the "?" will accept any character.
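To make the mask semantics concrete, here is a small standalone sketch in plain Java. It is not HBase's implementation, only an illustration of the rule "0 must match, 1 is ignored". It assumes ASCII keys laid out as userId_actionId_year_month with underscore separators (which is what the 15-byte fuzzy data above implies), and it only accepts keys of exactly the pattern's length. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of fuzzy row key matching; not an HBase class.
final class FuzzyMaskDemo {

    /** Builds the mask for a pattern: 1 for each '?' (ignored), 0 for every other byte (must match). */
    static byte[] maskFor(String pattern) {
        byte[] mask = new byte[pattern.length()];
        for (int i = 0; i < mask.length; i++) {
            mask[i] = (byte) (pattern.charAt(i) == '?' ? 1 : 0);
        }
        return mask;
    }

    /** Returns true if every byte with mask value 0 is equal in the row key and the pattern. */
    static boolean matches(String rowKey, String pattern) {
        byte[] row = rowKey.getBytes(StandardCharsets.US_ASCII);
        byte[] pat = pattern.getBytes(StandardCharsets.US_ASCII);
        byte[] mask = maskFor(pattern);
        // Simplification of this sketch: fixed-length keys only.
        if (row.length != pat.length) {
            return false;
        }
        for (int i = 0; i < pat.length; i++) {
            if (mask[i] == 0 && row[i] != pat[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String pattern = "????_99_????_01";
        System.out.println(matches("0001_99_2015_01", pattern)); // prints true
        System.out.println(matches("0001_98_2015_01", pattern)); // prints false
    }
}
```

With the real filter, the pattern bytes and the mask produced this way would form one Pair in the fuzzyKeysData list.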

An advantage of this filter is that it can likely compute the next matching row key when it comes to an end of a matching one. It implements the getNextCellHint() method to help the servers in fast-forwarding to the next range of rows that might match. This speeds up scanning, especially when the skipped ranges are quite large. Example 4-12 uses the filter to grab specific rows from a test data set.

Example: filtering rows with FuzzyRowFilter

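import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

// Match row keys of the form "row-?5": the mask byte 1 marks the ignored position.
List<Pair<byte[], byte[]>> keys = new ArrayList<>();
keys.add(new Pair<byte[], byte[]>(
  Bytes.toBytes("row-?5"), new byte[] { 0, 0, 0, 0, 1, 0 }));
Filter filter = new FuzzyRowFilter(keys);

Scan scan = new Scan()
  .addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("col-5"))
  .setFilter(filter);
// "table" is assumed to be an already opened table instance.
try (ResultScanner scanner = table.getScanner(scan)) {
  for (Result result : scanner) {
    System.out.println(result);
  }
}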

The example code also adds a filtering column to the scan, just to keep the output short:

Adding rows to table...
Results of scan:

keyvalues={row-05/colfam1:col-01/1/Put/vlen=9/seqid=0,
           row-05/colfam1:col-02/2/Put/vlen=9/seqid=0,
           ...
           row-05/colfam1:col-09/9/Put/vlen=9/seqid=0,
           row-05/colfam1:col-10/10/Put/vlen=9/seqid=0}
keyvalues={row-15/colfam1:col-01/1/Put/vlen=9/seqid=0,
           row-15/colfam1:col-02/2/Put/vlen=9/seqid=0,
           ...
           row-15/colfam1:col-09/9/Put/vlen=9/seqid=0,
           row-15/colfam1:col-10/10/Put/vlen=9/seqid=0}

The test code wiring adds 20 rows to the table, named row-01 to row-20. We want to retrieve all the rows that match the pattern row-?5, in other words all rows that end in the number 5. The output above confirms the correct result.

贪生不怕死
#3 · 2019-07-17 17:06

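You can use RegexStringComparator with a RowFilter in the HBase shell. Anchor the pattern with ^ and include the separator so that only row keys whose first part is exactly NUM1 match:

hbase(main):003:0> import org.apache.hadoop.hbase.filter.RegexStringComparator
hbase(main):004:0> import org.apache.hadoop.hbase.filter.CompareFilter
hbase(main):006:0> scan 'test', {FILTER => org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),RegexStringComparator.new("^NUM1~"))}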
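Whether the pattern is anchored matters here. The sketch below uses plain java.util.regex, under the assumption that the comparator applies find-style (substring) matching to the row key, which is my understanding of RegexStringComparator's default engine. In a regex, "NUM1*" means "NUM followed by zero or more 1s", so it also finds a match in keys whose first part is NUM2. The class name and the sample keys are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of anchored vs. unanchored row key patterns.
final class RegexAnchorDemo {

    /** Returns true if the regex matches anywhere in the row key (find semantics). */
    static boolean found(String regex, String rowKey) {
        return Pattern.compile(regex).matcher(rowKey).find();
    }

    public static void main(String[] args) {
        System.out.println(found("NUM1*", "NUM2~7"));   // prints true (unintended match)
        System.out.println(found("^NUM1~", "NUM2~7"));  // prints false
        System.out.println(found("^NUM1~", "NUM1~7"));  // prints true
    }
}
```

Including the "~" separator in the anchored pattern also keeps a key such as NUM10~7 from being counted under NUM1.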