I'm writing a server application that has to perform a lot of MySQL queries per second, triggered by connected clients, and I'm wondering what the best way to improve performance would be, apart from a well-structured database.
My idea was to cache some of the data and just send the cached data instead of performing a new query. Just as a database has different tables, the cache would have to deal with these different "structures" (objects). Since the clients connected to the server are indirectly able to alter the data, I must be able to edit the cached data (and at some point update my database, but this shouldn't be too difficult). Here is a short list of what I need to be able to do:
- edit the cached data/objects
- delete old data/objects and add new data/objects
- order the data by some kind of priority (last used)
- identify the data by some sort of id
I thought either a vector or a queue/priority_queue would be a good idea. This way I could create a queue or vector for every table/object I want to cache (I haven't performed any tests yet, since I wanted to get more opinions before possibly wasting my time). The largest object stored in these caching structures would be around 1 kilobyte (most likely smaller) and the smallest maybe 96 bytes.
Per caching structure I would not have to store more than 50,000 objects, and I think I could work with 10 different structures (one for each object type).
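To make this concrete, here is roughly what I have in mind; the field and container names are just placeholders, not actual code from my project:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Rough sketch of one cached object; field names are placeholders.
struct CachedObject {
    std::uint64_t id;        // identifies the database row this object mirrors
    std::uint64_t lastUsed;  // some timestamp/counter for priority ordering
    std::string   payload;   // ~96 bytes to ~1 KB of row data
};

// One such container per table/object type I want to cache.
std::vector<CachedObject> playerCache;  // hypothetical table
```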
The most important part is speed; otherwise I could just perform the queries. It's not just that I would have to run the queries; I would also have to create a new object afterwards instead of just reusing or resending the old one.
So here is my question:
- What would be the best way to cache the data/objects based on the provided information AND WHY?
Edit: Oh, and when I say structures I don't mean struct; I just didn't know how to refer to vector, queue, map, etc. at the same time. Maybe container would have been better :).
There are many things to consider, but in general I would base the relational mapping in your case on the Row Data Gateway pattern (RDG). If you don't have too many different object types, this architectural approach should scale well enough. RDG should facilitate your caching implementation if you constrain the cache book-keeping to the Finder class.
If you have the time and will, check out Patterns of Enterprise Application Architecture by Martin Fowler. It's a wealth of good information.
Now to the specifics...
Identifying the data by some sort of id: typically you would use an auto-incremented integer column in the database for this. You can use an unordered_map to pull those objects from the cache quickly. Since you have all the objects in your cache, you could, for the sake of optimization, also implement some of the find* functions to search the cache first. You can use unordered_map/unordered_multimap to 'index' some of the data if your search time is highly constrained, or just stick to the good old map/multimap. However, this doubles the work, and the database already gives you these kinds of queries for free.
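A minimal sketch of that id-keyed lookup, with a hypothetical UserRow type standing in for a cached Gateway object (the names are illustrative):

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical cached row type; in RDG terms this would be a Gateway object.
struct UserRow {
    std::uint64_t id;    // auto-incremented primary key from the database
    std::string   name;
};

// Primary cache: O(1) average lookup by primary key.
std::unordered_map<std::uint64_t, std::shared_ptr<UserRow>> cacheById;

// Optional secondary 'index' for find-by-name style lookups.
std::unordered_multimap<std::string, std::shared_ptr<UserRow>> cacheByName;

std::shared_ptr<UserRow> findById(std::uint64_t id) {
    auto it = cacheById.find(id);
    if (it != cacheById.end())
        return it->second;   // cache hit
    return nullptr;          // cache miss: fall back to a database query
}
```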
Editing the cached data/objects: dirty data shouldn't be visible to the rest of the system until you actually write it to the database. Once you kick off the update, and if all goes as intended, you can either replace the object in the cache with the one you used for the update, or simply delete the object from the cache and let other readers pick it up from the database (which will cache the object again). You can implement this by cloning the original Gateway object, but the bottom line is that you should have some locking strategy in place.

Deleting and adding data/objects: here you simply delete the object from the cache and try to delete it from the database. If the deletion fails in the database, other readers will cache it again. Just make sure that no client can access the same record while you're in the process of deleting it. When adding new records, you simply instantiate a Gateway object, pass it to the domain-level object, and when you're done with the changes, call insert on the Gateway object. You can either put the new Gateway object into the cache or let the first reader put it there.
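Continuing the sketch above, the update/delete/insert flows could look roughly like this; the three database functions are assumed placeholders for the real SQL calls, and all locking is omitted:

```cpp
#include <cstdint>
#include <memory>
#include <utility>

// Placeholders for the real SQL statements; UserRow/cacheById are from above.
bool updateRowInDatabase(const UserRow& row);
bool deleteRowInDatabase(std::uint64_t id);
std::uint64_t insertRowInDatabase(const UserRow& row);

void updateUser(const UserRow& dirty) {
    if (updateRowInDatabase(dirty))   // write-through first...
        cacheById[dirty.id] = std::make_shared<UserRow>(dirty);  // ...then publish
    // On failure the cache still holds the old, clean object.
}

void removeUser(std::uint64_t id) {
    cacheById.erase(id);      // evict first so no reader sees a doomed record
    deleteRowInDatabase(id);  // if this fails, the next reader re-caches the row
}

void addUser(UserRow row) {
    row.id = insertRowInDatabase(row);  // the database assigns the new key
    // Either cache it now, or let the first reader pull it in later.
    cacheById.emplace(row.id, std::make_shared<UserRow>(std::move(row)));
}
```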
Ordering the data by some kind of priority: this is a matter of selecting the best caching algorithm. That's not an easy question to answer, but LRU should work just fine. Without actual metrics there is no right answer; LRU is simple to implement, though, and if it doesn't live up to your requirements, just gather the metrics and decide on a new algorithm. Make sure you can do that seamlessly by having a good interface to the cache. One other thing to keep in mind is that your domain-level objects should never depend on the limits of your cache. If you need 100k objects but only have a 50k cache, you still have all 100k objects in memory; 50k of them just happen to be cached. In other words, your objects should not depend on the state of your cache, and should not even care whether you have caching at all.
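For reference, a minimal LRU cache behind a small interface, so the algorithm can be swapped out later without touching the domain objects; the eviction policy is deliberately simple:

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <unordered_map>
#include <utility>

// Minimal LRU cache: a list keeps usage order, a map gives O(1) lookup.
template <typename Key, typename Value>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns the cached value and marks it most recently used, or nullptr.
    std::shared_ptr<Value> get(const Key& key) {
        auto it = index_.find(key);
        if (it == index_.end())
            return nullptr;                                // miss: hit the DB
        order_.splice(order_.begin(), order_, it->second); // move to front
        return it->second->second;
    }

    // Inserts or refreshes an entry, evicting the least recently used one.
    void put(const Key& key, std::shared_ptr<Value> value) {
        auto it = index_.find(key);
        if (it != index_.end()) {                          // refresh in place
            it->second->second = std::move(value);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (!order_.empty() && order_.size() >= capacity_) {
            index_.erase(order_.back().first);             // evict LRU entry
            order_.pop_back();
        }
        order_.emplace_front(key, std::move(value));
        index_[key] = order_.begin();
    }

private:
    using Entry = std::pair<Key, std::shared_ptr<Value>>;
    std::size_t capacity_;
    std::list<Entry> order_;  // front = most recently used
    std::unordered_map<Key, typename std::list<Entry>::iterator> index_;
};
```

With the numbers from your question, each table would get something like LruCache<std::uint64_t, UserRow> cache(50000);.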
Next, if you're still on board with the idea of RDG, then you are simply caching Gateway objects. You can keep instances of the Gateway objects in your cache by means of shared_ptr, but you should also settle on a locking strategy (optimistic vs. pessimistic) if you want to avoid dirty writes. Also, all your Gateways (one for every table) can implement the same interface, so you can generalize your save/load strategies and use a single pool while keeping things simple. (Check out boost::pool; maybe it can help you with the cache implementation.)
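A sketch of how that shared interface might look; the names are illustrative, not taken from any particular library:

```cpp
#include <memory>

// Common interface so save/load logic can be generalized across all tables.
class Gateway {
public:
    virtual ~Gateway() = default;
    virtual void insert() = 0;  // INSERT the row this object represents
    virtual void update() = 0;  // UPDATE it with the current field values
    virtual void remove() = 0;  // DELETE it
};

// One concrete Gateway per table, all sharing the interface above.
class UserGateway : public Gateway {
public:
    void insert() override { /* issue the INSERT for the users table */ }
    void update() override { /* issue the UPDATE */ }
    void remove() override { /* issue the DELETE */ }
};

// Cached instances are held by shared_ptr, as suggested above.
std::shared_ptr<Gateway> cached = std::make_shared<UserGateway>();
```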
One final point:
The cake is a lie! :D No matter what you decide to do, make sure it is based on a decent amount of performance metrics. If you improve performance by 20% and you spent two months doing it, maybe it's worthwhile to think instead about putting a few more gigs of RAM into your hardware. Build some easily verifiable proof of concept that will give you enough information on whether implementing your cache pays off, and if not, try some of the tested and reliable off-the-shelf solutions (memcached or the like, as @Layne already commented).