HBase : get(…) vs scan and in-memory table

Posted 2019-04-06 02:23

I'm running a MapReduce job over HBase.

The business logic in the reducer heavily accesses two tables, say T1 (40k rows) and T2 (90k rows). Currently, I'm executing the following steps:

1. In the constructor of the reducer class, I do something like this:

HBaseCRUD hbaseCRUD = new HBaseCRUD();

HTableInterface t1= hbaseCRUD.getTable("T1",
                            "CF1", null, "C1", "C2");
HTableInterface t2= hbaseCRUD.getTable("T2",
                            "CF1", null, "C1", "C2");

2. In reduce(...):

 String lowercase = ....;

/* Start : HBase code */
/*
 * TRY using get(...) on the table rather than a
 * Scan!
 */
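Scan scan = new Scan();
/* Bytes is org.apache.hadoop.hbase.util.Bytes; it always encodes as UTF-8 */
scan.setStartRow(Bytes.toBytes(lowercase));
scan.setStopRow(Bytes.toBytes(lowercase));

/* the scan will return a single row */
ResultScanner resultScanner = t1.getScanner(scan);
try {
    for (Result result : resultScanner) {
        /* business logic */
    }
} finally {
    /* always close the scanner so its resources are released */
    resultScanner.close();
}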

I'm not sure the above code is sensible in the first place, but my question is: would a get(...) provide any performance benefit over the scan?

Get get = new Get(lowercase.getBytes());
Result getResult = t1.get(get);

Since T1 and T2 will be (mostly) read-only, I think performance will improve if they are kept in memory. As per the HBase documentation, I will have to re-create the tables T1 and T2. Please verify that my understanding is correct:

public void createTables(String tableName, boolean readOnly,
            boolean blockCacheEnabled, boolean inMemory,
            String... columnFamilyNames) throws IOException {
        // Creates a table whose column families carry the given block-cache and in-memory flags.

        HTableDescriptor tableDesc = new HTableDescriptor(tableName);
        /* not sure !!! */
        tableDesc.setReadOnly(readOnly);

        HColumnDescriptor columnFamily = null;

        if (!(columnFamilyNames == null || columnFamilyNames.length == 0)) {

            for (String columnFamilyName : columnFamilyNames) {

                columnFamily = new HColumnDescriptor(columnFamilyName);
                /*
                 * Start : Do these steps ensure that the column
                 * family(actually, the column data) is in-memory???
                 */
                columnFamily.setBlockCacheEnabled(blockCacheEnabled);
                columnFamily.setInMemory(inMemory);
                /*
                 * End : Do these steps ensure that the column family(actually,
                 * the column data) is in-memory???
                 */

                tableDesc.addFamily(columnFamily);
            }
        }

        hbaseAdmin.createTable(tableDesc);
        hbaseAdmin.close();
    }

Once done:

  1. How to verify that the columns are in-memory (of course, the describe statement and the browser reflect it) and accessed from there and not the disk?
  2. Is the from-memory or from-disk read transparent to the client? In simple words, do I need to change the HTable access code in my reducer class? If yes, what are the changes?

2 answers
你好瞎i
Reply #2 · 2019-04-06 03:02

would a get(...) provide any performance benefit over the scan?

A Get operates directly on a particular row, identified by the rowkey passed as a parameter to the Get instance. A Scan, on the other hand, operates on all the rows unless you turn it into a range query by providing start and stop rowkeys to your Scan instance. Clearly, a Get is more efficient if you know beforehand which row to operate on: you can go directly to that row and perform the desired operation.
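To make the distinction concrete, here is a minimal sketch in plain Java. It is a toy model, not the HBase client API: RowLookupSketch and its methods are hypothetical names, and a TreeMap stands in for a table whose rows are sorted by rowkey. It also compares keys as Java strings rather than raw bytes, and it uses a plain half-open range, so it does not reproduce any special handling the HBase client may apply when the start and stop rows are equal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/* Toy model of a table sorted by rowkey. Hypothetical class, NOT the HBase API. */
final class RowLookupSketch {
    /* Rows are kept in key order, loosely mirroring rowkey ordering in a table. */
    private final NavigableMap<String, String> rows = new TreeMap<>();

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    /* Get-style access: one key in, at most one row out. */
    String get(String rowKey) {
        return rows.get(rowKey);
    }

    /* Scan-style access: startRow inclusive, stopRow exclusive; null means unbounded. */
    List<String> scan(String startRow, String stopRow) {
        NavigableMap<String, String> view = rows;
        if (startRow != null) {
            view = view.tailMap(startRow, true);
        }
        if (stopRow != null) {
            view = view.headMap(stopRow, false);
        }
        return new ArrayList<>(view.values());
    }
}
```

In this model, scan(null, null) walks every row, while get("banana") and the narrow range scan("banana", "banana\0") both touch only the one matching row. That is the sense in which a bounded scan and a get end up doing comparable work.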

How to verify that the columns are in-memory (of course, the describe statement and the browser reflect it) and accessed from there and not the disk?

You can use the isInMemory() method provided by HColumnDescriptor to verify whether a particular CF is flagged as in-memory. However, you cannot find out whether the entire table is in memory, or whether a given fetch is served from disk or from memory. Although in-memory blocks have the highest priority, there is no guarantee that everything stays in memory all the time. One important point here is that data is persisted to disk even for an in-memory CF.

Is the from-memory or from-disk read transparent to the client? In simple words, do I need to change the HTable access code in my reducer class? If yes, what are the changes?

Yes. It is totally transparent. You don't have to do anything extra.

再贱就再见
Reply #3 · 2019-04-06 03:04
  1. There is no substantial difference between the two as far as implementation is concerned. Both are identical to the client.