I've been using MySQL for an app for some time, and the more data I collect, the slower it gets, so I have been looking into NoSQL options. One of the things I have in MySQL is a view created from a bunch of joins. The app shows all the important info in a grid, and the user can select ranges, run searches, and so on against this data set. Standard query stuff.
Looking at Cassandra, everything is already sorted based on the parameters I provide in my storage-conf.xml. So I would have a certain string as my key in the SuperColumn and keep a bunch of the data in Columns below that. But I can only sort by one Column, and I can't do any real searching within the columns without pulling all the SuperColumns and looping through the data, right?
I don't want to duplicate data across different ColumnFamilies, so I want to make sure Cassandra is appropriate for me. Facebook, Digg, and Twitter all have plenty of search functions, so maybe I am just not seeing the solution.
Is there a way with Cassandra for me to search for or filter specific data values in a SuperColumn, or its associated Column(s)? If not, is there another NoSQL option?
In the example below, it seems I can only query for phatduckk, friend1, John, etc. But what if I wanted to find anyone in the ColumnFamily who lived in city == "Beverley Hills"? Can it be done without returning all records? If so, could I do a search for city == "Beverley Hills" AND state == "CA"? It doesn't seem like I can do either, but I want to make sure and see what my options are.
AddressBook = { // this is a ColumnFamily of type Super
    phatduckk: { // this is the key to this row inside the Super CF
        friend1: {street: "8th street", zip: "90210", city: "Beverley Hills", state: "CA"},
        John: {street: "Howard street", zip: "94404", city: "FC", state: "CA"},
        Kim: {street: "X street", zip: "87876", city: "Balls", state: "VA"},
        Tod: {street: "Jerry street", zip: "54556", city: "Cartoon", state: "CO"},
        Bob: {street: "Q Blvd", zip: "24252", city: "Nowhere", state: "MN"},
    }, // end row
    ieure: {
        joey: {street: "A ave", zip: "55485", city: "Hell", state: "NV"},
        William: {street: "Armpit Dr", zip: "93301", city: "Bakersfield", state: "CA"},
    },
}
Super column families don't support secondary indexes, but regular column families do. With a secondary index you can use the GetWhere statement.
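To make the idea concrete, here is a rough Python sketch of what moving from a super column family to a regular, indexed one could look like. It is an in-memory model only: the dicts stand in for column families, and flatten, build_index, and get_where are hypothetical helper names, not functions from Cassandra or from any client library. It assumes one regular row per friend, keyed as "owner:friend".

```python
from collections import defaultdict

# In-memory stand-in for part of the AddressBook super column family above.
address_book = {
    "phatduckk": {
        "friend1": {"street": "8th street", "zip": "90210",
                    "city": "Beverley Hills", "state": "CA"},
        "John": {"street": "Howard street", "zip": "94404",
                 "city": "FC", "state": "CA"},
    },
    "ieure": {
        "joey": {"street": "A ave", "zip": "55485",
                 "city": "Hell", "state": "NV"},
        "William": {"street": "Armpit Dr", "zip": "93301",
                    "city": "Bakersfield", "state": "CA"},
    },
}


def flatten(super_cf):
    """Turn owner -> friend -> columns into one regular row per friend."""
    rows = {}
    for owner, friends in super_cf.items():
        for friend, columns in friends.items():
            rows[f"{owner}:{friend}"] = dict(columns, owner=owner, name=friend)
    return rows


def build_index(rows, column):
    """Map each value of `column` to the set of row keys that hold it."""
    index = defaultdict(set)
    for key, columns in rows.items():
        index[columns[column]].add(key)
    return index


def get_where(rows, index, indexed_value, **other_equals):
    """Find candidates through the index, then filter on the other columns."""
    matches = []
    for key in sorted(index.get(indexed_value, ())):
        columns = rows[key]
        if all(columns.get(c) == v for c, v in other_equals.items()):
            matches.append(key)
    return matches


rows = flatten(address_book)
city_index = build_index(rows, "city")
print(get_where(rows, city_index, "Beverley Hills", state="CA"))
```

Only the rows found through the city index are touched; the state check is applied to that small candidate set rather than to every row.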
Here is one example taken from one of my PHP projects:
This code uses this Cassandra API: https://github.com/kallaspriit/Cassandra-PHP-Client-Library
Note that since the question was asked, Cassandra added support for indexes automatically managed by the Cassandra system (I think since 0.8). That can answer the question for some people instead of managing your own index.
http://www.datastax.com/docs/1.1/dml/using_cli#indexing-a-column
That being said, I also wanted to mention that when an SQL database creates an index, it duplicates a lot of your data to build that index. Duplicating data is still really cheap in Cassandra, especially because you can tune the index heavily for your own queries. The main problem is that you have to maintain coherency manually, which SQL does for you transparently. But both mechanisms rest on exactly the same theoretical concept.
This is a bit like re-programming your own std::string with specializations that pertain to your application... (think of QString and CString for example!)
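As a rough illustration of what "maintaining coherency manually" means, here is a small Python sketch. It is a toy in-memory model, not Cassandra code: IndexedAddresses and its methods are hypothetical names, and the two dicts stand in for a data column family and a hand-built index column family. The point is that every write has to touch both.

```python
class IndexedAddresses:
    """Toy model: one dict for data rows, one for a hand-maintained city index."""

    def __init__(self):
        self.rows = {}      # row key -> columns
        self.by_city = {}   # city -> set of row keys

    def put(self, key, columns):
        # Remove the stale index entry first if the row already exists.
        old = self.rows.get(key)
        if old is not None:
            self._unindex(key, old["city"])
        self.rows[key] = dict(columns)
        self.by_city.setdefault(columns["city"], set()).add(key)

    def delete(self, key):
        old = self.rows.pop(key, None)
        if old is not None:
            self._unindex(key, old["city"])

    def _unindex(self, key, city):
        keys = self.by_city.get(city)
        if keys is not None:
            keys.discard(key)
            if not keys:
                del self.by_city[city]

    def find_by_city(self, city):
        return sorted(self.by_city.get(city, ()))


store = IndexedAddresses()
store.put("phatduckk:friend1", {"city": "Beverley Hills", "state": "CA"})
store.put("ieure:William", {"city": "Bakersfield", "state": "CA"})
# friend1 moves: the old index entry must go, or searches return stale hits.
store.put("phatduckk:friend1", {"city": "Bakersfield", "state": "CA"})
print(store.find_by_city("Beverley Hills"))
print(store.find_by_city("Bakersfield"))
```

If put skipped the _unindex step, the index would keep pointing at a city the row no longer has, which is exactly the bookkeeping an SQL engine does for you behind the scenes.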
You cannot perform that kind of operation in Cassandra. There are certain kinds of selection predicates that can be set on column keys, but nothing on the values they hold. Look at the API and check the get_slice/get_superslice and get_range query types. Again, all of this concerns the keys in the ColumnFamily or SuperColumnFamily, not the values.
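The difference between selecting on names and selecting on values can be sketched in Python. This is an in-memory model only: the get_slice function below borrows the name of the API call for readability but is a hypothetical stand-in with its own signature, and scan_for_city is likewise made up for illustration.

```python
from bisect import bisect_left, bisect_right


def get_slice(row, start, finish):
    """Return the columns whose NAMES fall in [start, finish]."""
    # Sorted here for the sketch; a store that keeps names ordered
    # would not need to re-sort on every call.
    names = sorted(row)
    lo = bisect_left(names, start)
    hi = bisect_right(names, finish)
    return {name: row[name] for name in names[lo:hi]}


def scan_for_city(row, city):
    """Filtering on a VALUE has no ordering to exploit: visit every column."""
    return [name for name, cols in row.items() if cols["city"] == city]


row = {
    "Bob": {"city": "Nowhere", "state": "MN"},
    "John": {"city": "FC", "state": "CA"},
    "Kim": {"city": "Balls", "state": "VA"},
    "Tod": {"city": "Cartoon", "state": "CO"},
}

print(list(get_slice(row, "J", "L")))
print(scan_for_city(row, "FC"))
```

The name range is answered with two binary searches, while the city lookup has to walk the whole row, which is the "pull everything and loop" cost the question is worried about.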
If you want the kind of functionality that you have described, then your best bet is a SQL database. Build proper indexes on your tables, especially on the columns that are queried most, and you will see a big difference in query performance. Hope this helps.
You "don't want to duplicate data across different ColumnFamilies," but that is how you do this kind of query in Cassandra. See http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/
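A minimal Python sketch of that kind of duplication, assuming a second column family whose row key is the (state, city) pair and whose columns are full copies of the matching addresses. The dicts are in-memory stand-ins and build_by_location is a hypothetical helper, not part of any Cassandra client.

```python
def build_by_location(address_book):
    """Second 'column family': (state, city) -> {owner:friend -> address copy}."""
    by_location = {}
    for owner, friends in address_book.items():
        for friend, address in friends.items():
            key = (address["state"], address["city"])
            by_location.setdefault(key, {})[f"{owner}:{friend}"] = dict(address)
    return by_location


# In-memory stand-in for part of the AddressBook example from the question.
address_book = {
    "phatduckk": {
        "friend1": {"street": "8th street", "zip": "90210",
                    "city": "Beverley Hills", "state": "CA"},
        "John": {"street": "Howard street", "zip": "94404",
                 "city": "FC", "state": "CA"},
    },
    "ieure": {
        "joey": {"street": "A ave", "zip": "55485",
                 "city": "Hell", "state": "NV"},
    },
}

by_location = build_by_location(address_book)
# One key lookup answers: city == "Beverley Hills" AND state == "CA".
print(by_location[("CA", "Beverley Hills")])
```

Because the copy carries the whole address, the query is a single read by key; the price is that every write to AddressBook has to update the second structure as well.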