High-Level Question:
Does HBase impose a maximum size per row that is common to all distributions (and thus not an artifact of a particular implementation), either in bytes stored or in number of cells?
If so:
- What is the limit?
- What is the reason the limit exists?
- Where is the limit documented?
If not:
- Is documentation (or results of a test) available demonstrating the ability of HBase to handle rows in excess of 2GB? 4GB?
- Is there a practical or "best practice" maximum under which HBase API users should keep row sizes in order to avoid severe performance degradation? If so, what kind of performance degradation can occur if that guidance is discarded?
In either case:
- Does the answer depend on the HBase version in question?
Background:
- At least one implementation of the HBase API does appear to impose a limit: MapR Tables, which uses MapR's proprietary MapR-FS as the storage layer underlying the tables, appears to impose a hard limit of 2GB per row and a configurable soft limit that defaults to 32MB. Do other popular implementations of the HBase API also impose such a restriction?
- This Quora response from HBase committer Todd Lipcon in 2011 suggests the absence of a limit in terms of number of cells. However, it also indicates that "the unit of load balancing and distribution is the region, and a row will never be split across regions". Does the requirement that a row exist within a single region impose either a hard limit on the row size, or a practical limit, past which performance degradation becomes severe?
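
To make the scale of the question concrete, here is a rough sketch of how a row's stored size might be estimated and compared against the two MapR figures mentioned above. It assumes the classic HBase KeyValue layout without tags (about 20 bytes of fixed overhead per cell, plus the row key, family, qualifier, and value, with the row key repeated in every cell) and ignores compression, block encoding, and multiple versions. The function names, the strict `>` comparisons, and the sample row are hypothetical and for illustration only.

```python
# Rough per-row size estimate, assuming the classic KeyValue layout without tags.
# Fixed per-cell overhead: 4 (key length) + 4 (value length) + 2 (row length)
# + 1 (family length) + 8 (timestamp) + 1 (type) = 20 bytes.
KEYVALUE_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1

# Limits quoted in the background above for MapR Tables.
SOFT_LIMIT_BYTES = 32 * 1024**2  # 32MB configurable soft limit (default)
HARD_LIMIT_BYTES = 2 * 1024**3   # 2GB hard limit


def estimate_row_bytes(row_key, cells):
    """Estimate uncompressed bytes for one row.

    cells is an iterable of (family, qualifier, value) byte strings.
    The row key is counted once per cell because each cell stores it.
    """
    total = 0
    for family, qualifier, value in cells:
        total += (KEYVALUE_FIXED_OVERHEAD + len(row_key)
                  + len(family) + len(qualifier) + len(value))
    return total


def classify(row_bytes):
    """Compare an estimate against the soft and hard limits (assumed strict)."""
    if row_bytes > HARD_LIMIT_BYTES:
        return "exceeds hard limit"
    if row_bytes > SOFT_LIMIT_BYTES:
        return "exceeds soft limit"
    return "within limits"


if __name__ == "__main__":
    # Hypothetical wide row: one million cells with 100-byte values.
    row_key = b"user42"
    cells = ((b"d", b"%08d" % i, b"x" * 100) for i in range(1_000_000))
    size = estimate_row_bytes(row_key, cells)
    print(size, classify(size))
```

Under these assumptions, a row of one million 100-byte cells already lands well above the 32MB soft limit while remaining far below 2GB, which is the range where the practical-limit part of the question matters most.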