I'm investigating the different types of NoSQL database types and I'm trying to wrap my head around the data model of column-family stores, such as Bigtable, HBase and Cassandra.
First model
Some people describe a column family as a collection of rows, where each row contains columns [1], [2]. An example of this model (column families are uppercased):
{
"USER":
{
"codinghorror": { "name": "Jeff", "blog": "http://codinghorror.com/" },
"jonskeet": { "name": "Jon Skeet", "email": "jskeet@site.com" }
},
"BOOKMARK":
{
"codinghorror":
{
"http://codinghorror.com/": "My awesome blog",
"http://unicorns.com/": "Weaponized ponies"
},
"jonskeet":
{
"http://msmvps.com/blogs/jon_skeet/": "Coding Blog",
"http://manning.com/skeet2/": "C# in Depth, Second Edition"
}
}
}
Second model
Other sites describe a column family as a group of related columns within a row [3], [4]. Data from the previous example, modeled in this fashion:
{
"codinghorror":
{
"USER": { "name": "Jeff", "blog": "http://codinghorror.com/" },
"BOOKMARK":
{
"http://codinghorror.com/": "My awesome blog",
"http://unicorns.com/": "Weaponized ponies"
}
},
"jonskeet":
{
"USER": { "name": "Jon Skeet", "email": "jskeet@site.com" },
"BOOKMARK":
{
"http://msmvps.com/blogs/jon_skeet/": "Coding Blog",
"http://manning.com/skeet2/": "C# in Depth, Second Edition"
}
}
}
A possible rationale behind the first model is that not all column families have a relation like USER
and BOOKMARK
do. This implies that not all column families contain identical keys. Placing the column families at the outer level feels more natural from this point of view.
The name 'column family' implies a group of columns. This is exactly how column families are presented in the second model.
Both models are valid representations of the data. I realize that these representations are solely for communicating the data towards humans; applications don't 'think' of data in such a way.
Question
What is the 'standard' definition of a column family? Is it a collection of rows, or a group of related columns within a row?
I have to write a paper on the subject, so I'm also interested in how people usually explain the 'column family' concept to other people. Both of these models seem to contradict each other. I'd like to use the 'correct' or generally accepted model to describe column-family stores.
Update
I have settled with the second model for explaining the data model in my paper. I'm still interested in how you explain the data model of column-family stores to other people.
Both models you've described are the same.
Column family is:
Conceptually it becomes:
Think of it as a Map of Map of Key/Value pairs.
The above is an example of a column family. If you were to tabulate it, you'd get a Table called UserProfile which looks like:
The confusing part is that there's not really a column or a row as we're used to thinking of them. There's a bunch of "column families" which are queried by name (the key). Those families contain a bunch of sets of key/value pairs, which are also queried by name (the row key), and finally, each value in the set can be looked up by name also (the column key).
If you needed a tabular reference point, "column families" would be your "tables". Each "set of k/v pair" inside them would be your "rows". Each "pair of the set" would be the "column names and their values".
Internally, the data inside each column familly is going to be stored together, and it'll be stored such that the rows are one after the other, and in each row, the columns are one after the other. So you get
row1 -> col1/val1, col2/val2, ... , row2 -> col1/val1 ... , ... -> ...
. So in that sense, the data is stored much more like a row-store, and less so like a column-store.To finish, the choice of words here is just unfortunate and misleading. Columns in Column Families should have been called Attributes. Rows should have been called Attribute Sets. Column families should have been called Attributes families. The relation to classic tabular vocabulary is weak and misleading, since it's actually pretty different.
As per my understanding, Cassandra ColumnFamily is not a collection of rows, rather it is cluster of columns. Column are clustered together based on clustering key. for example, lets consider below columnfamily:
Fetched inserted above details by using cassandra-cli, it is pretty well clustered based on clustering key, in this example "name = John Smith" is clustering key.
The Cassandra database follows your first model, I think. A ColumnFamily is a collection of rows, which can contain any columns, in a sparse fashion (so each row can have different collection of column names, if desired). The number of columns allowed in a row is almost unlimited (2 billion in Cassandra v0.7).
A key point is that row keys must be unique within a column family, by definition - but can be re-used in other column families. So you can store unrelated data about the same key in different ColumnFamilies.
In Cassandra this matters because the data in a particular column family is stored in the same files on disk - so it is more efficient to place data items that are likely to be retrieved together, in the same ColumnFamily. This is partly a practical speed concern, but also a matter of organising your data into a clear schema. This touches upon your second definition - one might consider all the data about a particular key to be a "row", but partitioned by Column Family. However, in Cassandra it is not really a single row, because the data in one ColumnFamily can be changed independently of the data in other ColumnFamilies for the same row key.