I am wondering whether HBase uses column-based or row-based storage.
- I read some technical documents that mention, as an advantage of HBase, that it uses column-based storage to keep similar data together and improve compression. That would mean the same column from different rows is stored together;
- But I also learned that HBase is a sorted key-value map. It uses the key to address all related columns for that key (row), so it seems to be row-based storage?
I would appreciate it if anyone could clear up my confusion.
Thanks in advance, George
In addition to Ian's excellent answer, I would opine that HBase is both a row-based key-value store and a column-based key-value store (the latter if you know the row key).
If you prefer to think of it in terms of data structures, here's what a simple HBase table could look like:
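A minimal sketch of that idea in Python. The row keys, the `info` family, and the column names are invented for illustration, and this models only the logical view, not HBase's client API or its on-disk format:

```python
# Logical view only: row key -> "family:qualifier" -> value.
# Every name and value below is hypothetical.
table = {
    "user2": {"info:name": "Bob", "info:email": "bob@example.com"},
    "user1": {"info:name": "Alice"},
}


def get_row(table, row_key):
    """Row-based access: the row key addresses every column in that row."""
    return table[row_key]


def get_cell(table, row_key, column):
    """Column-based access: row key plus column name addresses one value."""
    return table[row_key][column]


def scan(table):
    """Return rows in sorted row-key order, as a sorted map would."""
    return [(key, table[key]) for key in sorted(table)]


print(get_row(table, "user1"))
print(get_cell(table, "user2", "info:email"))
print([key for key, _ in scan(table)])
```

Note that the two rows do not have to share the same set of columns.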
Of course, you can also store more complicated data structures in it, as you can see from Ian's presentation.
George, here's a presentation I gave at HBaseCon 2012 about understanding HBase schemas:
http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/video-hbasecon-2012-hbasecon-2012.html
In short, each row in HBase is actually a key/value map, where you can have any number of columns (keys), each of which has a value. (Technically, each column can hold multiple values, distinguished by timestamp.)
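To make the timestamp remark concrete, here is a small Python sketch. The `Row` class is hypothetical (it is not part of any HBase client), and the column names and values are made up: each column maps to a set of timestamped values, and a plain read returns the newest one.

```python
class Row:
    """Hypothetical model of one row: column -> {timestamp: value}."""

    def __init__(self):
        self.cells = {}

    def put(self, column, value, timestamp):
        # Writing a column again with a new timestamp adds a version
        # instead of replacing the older one.
        self.cells.setdefault(column, {})[timestamp] = value

    def get(self, column):
        # A plain read returns the value with the highest timestamp.
        versions = self.cells[column]
        return versions[max(versions)]

    def get_versions(self, column):
        # All stored versions of the column, newest first.
        versions = self.cells[column]
        return [(ts, versions[ts]) for ts in sorted(versions, reverse=True)]


row = Row()
row.put("city", "Boston", timestamp=1)
row.put("city", "Denver", timestamp=2)
row.put("name", "Alice", timestamp=1)

print(row.get("city"))           # newest value of "city"
print(row.get_versions("city"))  # every version, newest first
```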
Additionally, "column families" allow you to host multiple key/value maps in the same row, in different physical (on disk) files. This helps in situations where you have sets of values that are usually accessed separately from other sets (so you have less stuff to read off disk). The trade-off, of course, is that it's more work to read all the values in a row if you split its columns across two column families, because the read then has to touch two sets of files, roughly doubling the number of disk accesses.
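That trade-off can be sketched with a toy model in Python. The `FamilyStore` and `Table` classes, the family names `profile` and `activity`, and the read counter are all assumptions made for illustration; real HBase reads involve store files, caches, and more.

```python
class FamilyStore:
    """Toy stand-in for the on-disk files of one column family."""

    def __init__(self):
        self.rows = {}
        self.reads = 0  # how many times this store has been consulted

    def put(self, row_key, qualifier, value):
        self.rows.setdefault(row_key, {})[qualifier] = value

    def get(self, row_key):
        self.reads += 1
        return dict(self.rows.get(row_key, {}))


class Table:
    """Toy table with one separate store per column family."""

    def __init__(self, families):
        self.stores = {name: FamilyStore() for name in families}

    def put(self, row_key, family, qualifier, value):
        self.stores[family].put(row_key, qualifier, value)

    def get(self, row_key, families=None):
        # Only the stores of the requested families are touched.
        wanted = families if families is not None else list(self.stores)
        return {name: self.stores[name].get(row_key) for name in wanted}

    def total_reads(self):
        return sum(store.reads for store in self.stores.values())


table = Table(["profile", "activity"])
table.put("user1", "profile", "name", "Alice")
table.put("user1", "activity", "last_login", "2012-05-22")

table.get("user1", families=["profile"])  # touches one store
reads_one_family = table.total_reads()

table.get("user1")                        # touches both stores
reads_full_row = table.total_reads() - reads_one_family

print(reads_one_family, reads_full_row)
```

In this model, reading a single family consults one store, while reading the whole row consults both.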
Unlike more standard "column oriented" databases, I've never heard of anyone creating an HBase table that had a column family for every logical column. There's overhead associated with column families, and the general advice is usually to have no more than 3 or 4 of them. Column families are "design time" information, meaning you must specify them at the time you create (or alter) the table.
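As a rough illustration of the "design time" point, the Python sketch below uses a hypothetical `SketchTable` class (not the HBase client API). The set of families is fixed when the table is created, while column names inside a family can be invented at write time.

```python
class SketchTable:
    """Hypothetical table: families are declared up front, columns are not."""

    def __init__(self, families):
        self.families = frozenset(families)
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            # Models the idea that a write to an undeclared family is rejected.
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[(family, qualifier)] = value


t = SketchTable(families=["d"])
t.put("row1", "d", "any_new_column", "ok")  # new column name: accepted

try:
    t.put("row1", "stats", "clicks", "3")   # undeclared family: rejected
    rejected = False
except KeyError:
    rejected = True

print(rejected)
```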
Generally, I find column families to be an advanced design option that you'd only use once you have a deep understanding of HBase's architecture and can show that it would be a net benefit.
So overall, while it's true that HBase can act in a "column oriented" way, that is neither the default nor the most common design pattern in HBase. It's better to think of it as a row store with key/value maps.