SingleColumnValueFilter not returning proper number of rows

Published 2019-09-01 00:45

In our HBase table, each row has a column called crawl identifier. Using a MapReduce job, we want to process rows from only one crawl at a time. To run the job more efficiently, we gave our scan object a filter that (we hoped) would remove all rows except those with the given crawl identifier. However, we quickly discovered that our jobs were not processing the correct number of rows.

I wrote a test mapper to simply count the number of rows with the correct crawl identifier, without any filters. It iterated over all the rows in the table and counted the correct, expected number of rows (~15000). When we took that same job and added a filter to the scan object, the count dropped to ~3000. There was no manipulation of the table itself during or between these two jobs.

Since adding the scan filter changed the visible rows so dramatically, we suspect that we simply built the filter incorrectly.

Our MapReduce job features a single mapper:

public static class RowCountMapper extends TableMapper&lt;ImmutableBytesWritable, Put&gt; {

    public String crawlIdentifier;

    // counters
    private static enum CountRows {
        ROWS_WITH_MATCHED_CRAWL_IDENTIFIER
    }

    @Override
    public void setup(Context context) {
        Configuration configuration = context.getConfiguration();
        crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
    }

    @Override
    public void map(ImmutableBytesWritable legacykey, Result row, Context context) {
        String rowIdentifier = HBaseSchema.getValueFromRow(row, HBaseSchema.CRAWL_IDENTIFIER_COLUMN);
        if (StringUtils.equals(crawlIdentifier, rowIdentifier)) {
            context.getCounter(CountRows.ROWS_WITH_MATCHED_CRAWL_IDENTIFIER).increment(1L);
        }
    }
}

The filter setup is like this:

String crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
if (StringUtils.isBlank(crawlIdentifier)) {
    throw new IllegalArgumentException("Crawl Identifier not set.");
}

// build an HBase scanner
Scan scan = new Scan();
SingleColumnValueFilter filter = new SingleColumnValueFilter(
    HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
    HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
    CompareOp.EQUAL,
    Bytes.toBytes(crawlIdentifier));
filter.setFilterIfMissing(true);
scan.setFilter(filter);

Are we using the wrong filter, or have we configured it wrong?

EDIT: we're looking at manually adding all the column families as per https://issues.apache.org/jira/browse/HBASE-2198 but I'm pretty sure the Scan includes all the families by default.
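To rule out the HBASE-2198 scenario, the workaround would be to select the filtered column's family on the scan explicitly. This is only a sketch (it reuses the `HBaseSchema` constants from the question and assumes the same filter setup; it is not runnable outside an HBase client project):

```java
// Sketch only: HBASE-2198 concerns scans where the filtered column's
// family is not among the selected families. Adding it explicitly
// rules that out, at the cost of restricting the scan to this family.
Scan scan = new Scan();
scan.addFamily(HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily());

SingleColumnValueFilter filter = new SingleColumnValueFilter(
    HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
    HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
    CompareOp.EQUAL,
    Bytes.toBytes(crawlIdentifier));
filter.setFilterIfMissing(true);
scan.setFilter(filter);
```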

1 Answer
霸刀☆藐视天下
#2 · 2019-09-01 01:25

The filter looks correct, but one scenario that could cause this relates to character encodings. Your filter uses Bytes.toBytes(String), which encodes as UTF-8 [1], whereas the value may have been written with the platform's native character encoding, in HBaseSchema or wherever the record is written, if String.getBytes() was used [2]. Check that the crawlIdentifier was originally written to HBase using the following, to ensure the filter is comparing like for like in the filtered scan:

Bytes.toBytes(crawlIdentifier)
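To see why this matters, here is a small standalone demonstration in plain Java (no HBase needed; the identifier value is hypothetical). As soon as the value contains a non-ASCII character, UTF-8 and a non-UTF-8 platform default such as ISO-8859-1 produce different bytes, so a byte-wise EQUAL comparison fails even though the logical strings match:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingCheck {
    public static void main(String[] args) {
        // Hypothetical crawl identifier with one non-ASCII character.
        String crawlIdentifier = "crawl-2019-café";

        // What the filter compares against: Bytes.toBytes(String) uses UTF-8.
        byte[] utf8 = crawlIdentifier.getBytes(StandardCharsets.UTF_8);

        // What may have been stored, if the writer used String.getBytes()
        // on a platform whose default charset is ISO-8859-1.
        byte[] latin1 = crawlIdentifier.getBytes(StandardCharsets.ISO_8859_1);

        // 'é' is two bytes in UTF-8 but one byte in ISO-8859-1,
        // so the stored bytes and the filter's bytes do not match.
        System.out.println("UTF-8 bytes:      " + Arrays.toString(utf8));
        System.out.println("ISO-8859-1 bytes: " + Arrays.toString(latin1));
        System.out.println("equal: " + Arrays.equals(utf8, latin1)); // prints "equal: false"
    }
}
```

With ASCII-only identifiers the two encodings coincide, which is why such a mismatch can stay hidden until the data contains its first non-ASCII value.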

[1] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Bytes.html#toBytes(java.lang.String)
[2] http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#getBytes()
