We are currently investigating the influence of using multiple column families on the performance of our bigtable queries. We found that splitting the columns into multiple column families does not increase the performance. Does anyone have had similar experiences?
Some more details about our benchmark setup. At this moment each row in our production table contains around 5 columns, each containing between 0,1 to 1 KB of data. All columns are stored into one column family. When performing a row key range filter (which returns on average 340 rows) and apply a column regex fitler (which returns only 1 column for each row), the query takes on average 23,3ms. We created some test tables where we increased the amount of columns/data per row by a factor 5. In test table 1, we kept everything in one column family. As expected this increased the query time of that same query to 40,6ms. In test table 2 we kept the original data in one column family, but the extra data was put into another column family. When querying the column family containing the original data (thus containing the same amount of data as the original table), the query time was on average 44,3ms. So the performance even decreased when using more column families.
This is exactly the opposite of we would have expected. E.g. this is mentioned in the bigtable docs ( https://cloud.google.com/bigtable/docs/schema-design#column_families)
Grouping data into column families allows you to retrieve data from a single family, or multiple families, rather than retrieving all of the data in each row. Group data as closely as you can to get just the information that you need, but no more, in your most frequent API calls.
Anyone with an explanation for our findings?
benchmark results
(edit: added some more details)
The content of a single row:
Table 1:
cf1
- col1
- col2
- ...
- col25
Table 2:
- cf1
- col1
- col2
- ..
- col5
- cf2
- col6
- col7
- ..
- col25
The benchmark we are executing is using the go client. The code that calls the API looks basically as follows:
filter = bigtable.ChainFilters(bigtable.FamilyFilter(request.ColumnFamily),
bigtable.ColumnFilter(colPattern), bigtable.LatestNFilter(1))
tbl := bf.Client.Open(table)
rr := bigtable.NewRange(request.RowKeyStart, request.RowKeyEnd)
err = tbl.ReadRows(c, rr, func(row bigtable.Row) bool {return true}, bigtable.RowFilter(filter))