I've recently implemented memcache on my site, which had been under heavy MySQL load (MySQL was as optimized as I could make it). It solved all my load issues, and the site is running beautifully.
The problem I'm facing now is stale cached values. I have 1-hour auto-expiration times on most pages, and I'm also deleting the key when the value in the DB changes, but I'm having difficulty keeping track of, and efficiently clearing out, all the keys.
On some pages, it's trivial. I can make the key the item_id (item_4653, for example), and when the data for it is updated, or the item is deleted, the key is cleared out.
But on most pages, I take the script filename + query string, MD5 it, and use that as the key in memcache. This is especially useful for complex URLs (which are very common).
For example, I have the following page loaded:
index.php?search_keywords=good&search_section=1&sort=release&page=2
It will contain a list of items, which will be fetched from memcache. Another user then submits an item that has "good" in its title and happens to fall in the range of values where it would appear on page 2, except it will not appear there until the cache is refreshed. What makes this even more complicated is that the newly added item will also appear on index.php?sort=newest, as well as index.php?category=some_category&page=1, etc. Each one of those will have a unique key (MD5 of the script name + query string).
So the newly added item might appear on dozens of pages if they were fetched from the live DB, but it won't be visible on any of them until the stale cache is updated. The only option is to wait for the cached entries to expire automatically.
This problem becomes even more pronounced on my forum (custom coded), where the values HAVE to be updated on demand, for all possible cached page combinations. Let's say I have a 4-page thread, and I notice 3 spam posts on page 2. After deleting them, page 2 is rebuilt, but then pages 3 and 4 also have to be rebuilt; otherwise there will be duplicate posts on the newly rebuilt page 2 and the old page 3. That's just one example; there are dozens of these scenarios.
Any ideas?
Cache invalidation is a big problem. I'll give you a few ideas that will lead you toward a full solution, as there is no general solution for every use case.
You may benefit from a simpler naming scheme for your memcached keys - so they are easier to delete. Seems like with the MD5 solution, you might be creating too many keys for things which generally show the same data.
You might also consider a shorter cache time, like 20 minutes?
Also - how many items per page are you retrieving for each of these search result pages? If you have a paginated search - getting 50 items from the server shouldn't be too intensive.
You may have tuned the mysql server, but have you tuned the queries (improving them by examining the EXPLAIN output), or table structures (by adding useful indexes)?
I'm also wondering how intense the queries on those pages are. Do you join several tables? You may benefit from doing a simpler query - or a few queries (outlined below).
Alternatively - for each row in the result, do you run another query, or several? You may benefit from a slightly more complex search query that avoids having to do the nested queries. Or are you being bitten by an ORM library which does the same thing: runs a search, then queries for sub-items on each iteration?
The 'few simpler queries' solution: say, for example, you've got an item and want to know its category in the result set. Instead of one search query that JOINs the items table to the categories table (this is a simple example, but say there were categories and several other JOINs involved), you might go this route: fetch the categories once with a cheap query and keep them in memory, then do your search without all of the JOINs, just from the items table with your WHERE clauses, and resolve each row's category name from the in-memory list in your output.
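A minimal sketch of that pattern in Python, with a plain dict standing in for the small categories table and hypothetical item data (the original answer's queries ran against MySQL):

```python
# "Few simpler queries" idea: instead of one search query that JOINs
# items to categories, load the small categories table once and resolve
# names in the output loop. All data here is hypothetical.

# One cheap query: SELECT id, name FROM categories
categories = {1: "Books", 2: "Music"}

# The search query now touches only the items table (no JOINs):
# SELECT id, title, category_id FROM items WHERE title LIKE '%good%'
search_results = [
    {"id": 4653, "title": "A good read", "category_id": 1},
    {"id": 4700, "title": "Good vibes", "category_id": 2},
]

# In the output loop, the category name is a dictionary lookup,
# not a join the database had to perform for every row.
for row in search_results:
    row["category_name"] = categories[row["category_id"]]
```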
It's a little ghetto, but maybe this - and the other suggestions - will help.
Since you are caching entire pages in memcached, your pages can't share cached data from the database with each other. Say I have page1.php and page2.php, with page1 and page2 as keys in memcached. Both pages display items. I add a new item. Now I have to expire page1 and page2.
Instead, I could have an items key in memcached, that page1.php and page2.php both use to display items. When I add a new item, I expire the items key (or better, update its value), and both page1.php and page2.php are up-to-date.
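A sketch of this shared-key idea in Python, with a dict standing in for memcached (the page names and item data are hypothetical):

```python
# Both pages read the same cached "items" key, so one update on write
# refreshes every page that displays items. Dict stands in for memcached.
cache = {}

def get_items():
    if "items" not in cache:
        cache["items"] = ["item_1", "item_2"]  # pretend DB query
    return cache["items"]

def render_page1():
    return "page1: " + ", ".join(get_items())

def render_page2():
    return "page2: " + ", ".join(get_items())

def add_item(item):
    # On write, update the single shared key; both pages pick up
    # the new data on their next render, with no per-page expiry.
    cache["items"] = get_items() + [item]

add_item("item_3")
```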
If you still want to cache the entire page, you could add information to your keys that will change when data being cached changes (this wouldn't make sense if the data changes too often). For instance:
This way you can look up the timestamp of the newest item, an inexpensive query, and build your cache key with it. Once a newer item is added, the cache key will change, automatically expiring. This method means you still have to hit the database to see what the newest item's timestamp is, every time.
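A sketch of this timestamp-in-the-key idea in Python (the URL and timestamp values are hypothetical):

```python
import hashlib

# Build the cache key from the newest item's timestamp, so the key
# changes (and the old entry is silently orphaned) whenever a newer
# item appears.

def page_cache_key(page_url, newest_item_ts):
    # newest_item_ts comes from a cheap query such as
    # SELECT MAX(created_at) FROM items
    raw = f"{page_url}|{newest_item_ts}"
    return hashlib.md5(raw.encode()).hexdigest()

key_before = page_cache_key("index.php?sort=newest", 1700000000)
key_after = page_cache_key("index.php?sort=newest", 1700000042)
# key_before != key_after, so the old cached page is never served
```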
What you could do to make sure that your cache is always up to date, without making lots of changes to your code, is work with a "version cache". This does increase the number of memcache requests you will make, but it might be a solution for you.
Another good thing about this solution is that you can set the expiration time to never expire.
The idea is basically to have a version number stored in memcache for, in your case, a certain keyword (per keyword, not per combination). How to use this?
When someone submits a new item:
if (!Memcache::increment("version_" . $keyword)) { Memcache::set("version_" . $keyword, 1); }
When someone executes a query: first fetch the current value of "version_" . $keyword and make that version number part of the cache key under which the results are stored.
This ensures that as soon as a keyword has new results (or fewer, when deleting), the version will be bumped, and with it all related memcache keys.
The cache is always up to date, and entries can potentially stay in the cache longer than 1 hour.
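A runnable sketch of this version-key scheme in Python, with a dict standing in for memcached (the keyword, key names, and result data below are hypothetical):

```python
# Version-key scheme: the keyword's version number is part of every
# cache key, so bumping the version invalidates all cached pages for
# that keyword at once. Dict stands in for memcached.
cache = {}

def bump_version(keyword):
    # Memcached's increment() fails if the key doesn't exist yet,
    # hence the set() fallback shown in the answer above.
    key = "version_" + keyword
    cache[key] = cache.get(key, 0) + 1

def cached_search(keyword, query_string, run_query):
    version = cache.get("version_" + keyword, 0)
    key = f"search_{version}_{keyword}_{query_string}"
    if key not in cache:
        cache[key] = run_query()  # pretend DB query
    return cache[key]

first = cached_search("good", "page=2", lambda: ["old results"])
bump_version("good")  # a new "good" item was submitted
second = cached_search("good", "page=2", lambda: ["fresh results"])
```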
Memcached::set has an expire parameter. Perhaps you can let this default to an hour, but for the pages that return search results - or in your forum, you can set this to a shorter period of time.
A couple of simple things you can do:
First, if you really want to use the query string as a cache key, make it more deterministic and predictable. I'd do this by sorting the query string, e.g., ?zed=7&alpha=1 is transformed to ?alpha=1&zed=7. Also strip out variables that aren't relevant to the caching key.

To handle the problem of the ?page parameter, and items not showing up because the cache hasn't refreshed, I've got a couple of ideas:
Folke's idea of adding a 'version' to the cache key would work well. The same trick is used to make links look unvisited: changing the query string makes the browser treat the URL as new.
Another approach would be to store the number of pages in the cache value, and then, when the database is updated, iterate through the page keys and delete each one.
Whether this is a good idea or not depends on how many pages there are, and the chance of updates coming in while the loop is running.
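A sketch of that approach in Python, with a dict standing in for memcached and a hypothetical 4-page thread:

```python
# Store the page count alongside a thread's cached pages, so an update
# can delete every page key in one loop. Dict stands in for memcached;
# the thread id and page contents are hypothetical.
cache = {
    "thread_42_pages": 4,
    "thread_42_page_1": "...", "thread_42_page_2": "...",
    "thread_42_page_3": "...", "thread_42_page_4": "...",
}

def invalidate_thread(thread_id):
    pages = cache.get(f"thread_{thread_id}_pages", 0)
    for page in range(1, pages + 1):
        # delete() in memcached; pop() with a default here so a
        # missing page key is not an error
        cache.pop(f"thread_{thread_id}_page_{page}", None)

invalidate_thread(42)
```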
A third idea is to cache the entire result set instead of just that page of results. This may or may not be an option depending up on the size of the result set. When that result set is updated, you just flush the cache for that keyword.
A fourth idea is to change your caching backend and use something built to handle this situation. I dunno what other cache servers are out there, so you'll have to look around.
Finally, to supplement all this, you can try and be smarter about the expire time on cache entries. e.g., use the mean time between updates, or the number of queries per second for the keyword, etc.
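The query-string normalization suggested at the start of this answer can be sketched in Python (the parameter names treated as irrelevant are hypothetical):

```python
from urllib.parse import parse_qsl, urlencode

# Normalize a query string before hashing it into a cache key:
# sort the parameters and drop ones that don't affect the output,
# so equivalent URLs map to the same key.
IGNORED = {"utm_source", "session_id"}  # hypothetical examples

def normalize(query_string):
    pairs = sorted(
        (k, v) for k, v in parse_qsl(query_string) if k not in IGNORED
    )
    return urlencode(pairs)

normalize("zed=7&alpha=1")               # -> "alpha=1&zed=7"
normalize("alpha=1&session_id=x&zed=7")  # -> "alpha=1&zed=7"
```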