Setup:

I have an HBase table with 100M+ rows and 1 million+ columns. Every row has data for only 2 to 5 columns. There is just 1 column family.

Problem:

I want to find out all the distinct qualifiers (columns) in this column family. Is there a quick way to do that?

I can think of scanning the whole table, getting the familyMap for each row, reading each qualifier and adding it to a Set<>. But that would be awfully slow, as there are 100M+ rows.

Can we do any better?

Tags: hadoop hbase

3 answers

傲

#2 · 2019-04-09 06:33

You can use a MapReduce job for this. Unlike a coprocessor, it does not require installing custom libraries on the HBase servers. Below is the code for creating the MapReduce job.

Job setup

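    Job job = Job.getInstance(config);
    job.setJobName("Distinct columns");
    job.setJarByClass(OnlyColumnNameMapper.class);  // ship the jar that contains the mapper and reducer

    Scan scan = new Scan();
    scan.setBatch(500);                       // at most 500 cells per Result
    scan.addFamily(YOUR_COLUMN_FAMILY_NAME);  // byte[] name of the column family
    scan.setFilter(new KeyOnlyFilter());      // return only the key part of each KeyValue (row, column family, qualifier), not the value
    scan.setCacheBlocks(false);               // don't set to true for MR jobs

    TableMapReduceUtil.initTableMapperJob(
            YOUR_TABLE_NAME,
            scan,
            OnlyColumnNameMapper.class,   // mapper
            Text.class,                   // mapper output key
            Text.class,                   // mapper output value
            job);

    job.setNumReduceTasks(1);             // one reducer, so one output file lists every qualifier
    job.setReducerClass(OnlyColumnNameReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // YOUR_OUTPUT_DIR is an assumed placeholder for the HDFS directory that receives the result.
    FileOutputFormat.setOutputPath(job, new Path(YOUR_OUTPUT_DIR));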

Mapper

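    import java.io.IOException;

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellScanner;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;

    public class OnlyColumnNameMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, final Context context)
                throws IOException, InterruptedException {
            CellScanner cellScanner = value.cellScanner();
            while (cellScanner.advance()) {
                Cell cell = cellScanner.current();
                // Copy the qualifier out of the cell's backing array.
                byte[] q = Bytes.copy(cell.getQualifierArray(),
                                      cell.getQualifierOffset(),
                                      cell.getQualifierLength());
                // Emit the qualifier as the key; the value is unused.
                // Text interprets the bytes as UTF-8.
                context.write(new Text(q), new Text());
            }
        }
    }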

Reducer

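    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class OnlyColumnNameReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // reduce() is called once per distinct qualifier, so write the key once.
            context.write(key, new Text());
        }
    }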


smile是对你的礼貌

#3 · 2019-04-09 06:35

HBase coprocessors fit this scenario. You can write a custom Endpoint implementation, which works like a stored procedure in an RDBMS: it executes your code on the server side and collects the distinct columns for each region. The client then merges the per-region results to get the distinct columns across all regions.

Performance benefit: the cell data never leaves the region servers. Only each region's set of distinct qualifiers is sent to the client, which reduces network traffic.
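
The data flow of that approach can be sketched in plain Java. This is only a simulation under stated assumptions: it does not use the HBase coprocessor API, the class `RegionMerge` and its methods are hypothetical names, and each region is modelled as a list of rows where a row is just its list of qualifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class RegionMerge {
    /** Server-side step: the distinct qualifiers seen in one region's rows. */
    static Set<String> distinctInRegion(List<List<String>> rowsInRegion) {
        Set<String> distinct = new TreeSet<>();
        for (List<String> rowQualifiers : rowsInRegion) {
            distinct.addAll(rowQualifiers);
        }
        return distinct;
    }

    /** Client-side step: union of the per-region sets. */
    static Set<String> merge(List<Set<String>> perRegion) {
        Set<String> all = new TreeSet<>();
        for (Set<String> region : perRegion) {
            all.addAll(region);
        }
        return all;
    }

    public static void main(String[] args) {
        // Two hypothetical regions with two rows each.
        List<List<String>> regionA = List.of(List.of("q1", "q2"), List.of("q2", "q3"));
        List<List<String>> regionB = List.of(List.of("q3", "q4"), List.of("q1"));

        List<Set<String>> perRegion = new ArrayList<>();
        perRegion.add(distinctInRegion(regionA));
        perRegion.add(distinctInRegion(regionB));

        System.out.println(merge(perRegion)); // prints [q1, q2, q3, q4]
    }
}
```

Only the per-region sets cross the network in this model, so the client handles at most one set per region instead of every row.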


Root（大扎）

#4 · 2019-04-09 06:47

HBase can be visualised as a distributed NavigableMap<byte[], NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>>>, where the nested keys are row key, column family, qualifier and timestamp, and the innermost value is the cell value.

There is no "metadata" (say, something stored centrally on the master node) that lists all the qualifiers present across the region servers.

So for a one-time use case, the only way is to scan the entire table and add the qualifier names to a Set<>, as you mentioned.
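
A small in-memory model of that nested map shows why a full pass is unavoidable: qualifiers sit below the row level, so every row has to be visited. This is a sketch, not HBase code. `QualifierWalk`, `Table`, `put` and `distinctQualifiers` are hypothetical names, and `String` stands in for `byte[]` to keep the example short.

```java
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class QualifierWalk {
    /** row -> family -> qualifier -> timestamp -> value; String stands in for byte[]. */
    static class Table extends
            TreeMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> {
    }

    /** Stores one cell, creating the intermediate maps on demand. */
    static void put(Table table, String row, String family, String qualifier, long ts, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .computeIfAbsent(qualifier, q -> new TreeMap<>())
             .put(ts, value);
    }

    /** Visits every row, because no level of the map indexes qualifiers across rows. */
    static NavigableSet<String> distinctQualifiers(Table table, String family) {
        NavigableSet<String> distinct = new TreeSet<>();
        for (NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> families
                : table.values()) {
            NavigableMap<String, NavigableMap<Long, String>> columns = families.get(family);
            if (columns != null) {
                distinct.addAll(columns.keySet());
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        Table table = new Table();
        put(table, "row1", "cf", "b", 1L, "v1");
        put(table, "row1", "cf", "a", 1L, "v2");
        put(table, "row2", "cf", "c", 2L, "v3");
        put(table, "row2", "cf", "a", 3L, "v4");
        System.out.println(distinctQualifiers(table, "cf")); // prints [a, b, c]
    }
}
```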

If this is a recurring use case (and you have the discretion to add components to your tech stack), consider adding Redis: the set of qualifiers can be maintained in a distributed fashion using a Redis Set.
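
The bookkeeping this implies can be sketched as follows. The example is an assumption-laden stand-in: it uses an in-memory set instead of a Redis client, and `QualifierRegistry`, `recordWrite` and `distinctQualifiers` are hypothetical names. In a real setup, `recordWrite` would issue a Redis `SADD` alongside each HBase write and `distinctQualifiers` would issue `SMEMBERS`.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

public class QualifierRegistry {
    // In-memory stand-in for one Redis Set; safe for concurrent writers.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Mirrors SADD: returns true only the first time a qualifier is recorded. */
    public boolean recordWrite(String qualifier) {
        return seen.add(qualifier);
    }

    /** Mirrors SMEMBERS: a copy of everything recorded so far (sorted only for stable printing). */
    public Set<String> distinctQualifiers() {
        return new TreeSet<>(seen);
    }

    public static void main(String[] args) {
        QualifierRegistry registry = new QualifierRegistry();
        registry.recordWrite("price");
        registry.recordWrite("colour");
        registry.recordWrite("price"); // duplicate, ignored by the set
        System.out.println(registry.distinctQualifiers()); // prints [colour, price]
    }
}
```

Because adding to a set is idempotent, repeated writes of the same qualifier cost nothing extra, and reading the distinct qualifiers no longer requires a table scan. The set only covers writes made after the bookkeeping was introduced, so an existing table would still need one initial full pass.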

