django querysets + memcached: best practices

Posted 2019-02-04 10:40

Question:

I'm trying to understand what happens during a Django low-level cache.set(), in particular which part of the queryset gets stored in memcached.

First, am I interpreting the django docs correctly?

  • a queryset (python object) has/maintains its own cache
  • access to the database is lazy; even if the queryset's count is 1000, if I do an objects.get() for 1 record, then the database will only be accessed once, for that 1 record.
  • when accessing a Django view via Apache prefork MPM, every time a particular daemon instance X ends up invoking a particular view that includes something like "tournres_qset = TournamentResult.objects.all()", a new tournres_qset object is created. That is, anything that may have been cached internally by a tournres_qset Python object from a previous (TCP/IP) visit is not used at all by a new request's tournres_qset.

Now the questions about saving things to memcached within the view. Let's say I add something like this at the top of the view:

tournres_qset = cache.get('tournres', None)
if tournres_qset is None:
    tournres_qset = TournamentResult.objects.all()
    cache.set('tournres', tournres_qset, timeout)
# now start accessing tournres_qset
# ...

What gets stored during the cache.set()?

  • Does the whole queryset (python object) get serialized and saved?

  • Since the queryset hasn't been used yet to get any records, is this just a waste of time, since no particular records' contents are actually being saved in memcached? (Any future requests will get the queryset object from memcached, which will always start fresh, with an empty local queryset cache; access to the database will always occur.)

  • If the above is true, then should I always re-save the queryset at the end of the view, after it's been used throughout the view to access some records? That would update the queryset's local cache, which would then get re-saved to memcached. But this would mean serializing the queryset object again on every request. So much for speeding things up.

  • Or, does cache.set() force the queryset object to iterate and fetch all the records from the database, which will then also get saved in memcached? Would everything get saved, even if the view only accesses a subset of the queryset?

I see pitfalls in all directions, which makes me think that I'm misunderstanding a whole bunch of things.

I hope this makes sense, and I'd appreciate clarifications or pointers to some "standard" guidelines. Thanks.

Answer 1:

Querysets are lazy, which means they don't hit the database until they're evaluated. One thing that evaluates them is serialization (pickling), which is what cache.set does behind the scenes. So no, this isn't a waste of time: the entire contents of your TournamentResult table will be cached, if that's what you want. It probably isn't, though: if you filter the queryset further, Django will just go back to the database, which would make the whole thing a bit pointless. You should just cache the model instances you actually need.
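To make the "serializing evaluates it" point concrete, here is a minimal sketch in plain Python. LazyQuery is a toy stand-in written for this illustration, not Django's QuerySet; it only imitates the one relevant behaviour, namely that the object fetches its rows inside __getstate__ before it is pickled. The class name, db_hits counter, and row values are all assumptions.

```python
import pickle


class LazyQuery:
    """Toy stand-in for a lazy queryset (hypothetical, not Django's class)."""

    db_hits = 0  # counts simulated database queries across all instances

    def __init__(self, rows):
        self._source = rows        # pretend this is the database table
        self._result_cache = None  # filled on first evaluation

    def _fetch_all(self):
        # Run the simulated query only if nothing has been fetched yet.
        if self._result_cache is None:
            LazyQuery.db_hits += 1
            self._result_cache = list(self._source)

    def __iter__(self):
        self._fetch_all()
        return iter(self._result_cache)

    def __getstate__(self):
        # Pickling forces evaluation, so the rows travel with the object.
        self._fetch_all()
        return self.__dict__.copy()


qs = LazyQuery(["row1", "row2", "row3"])
hits_before = LazyQuery.db_hits        # nothing evaluated yet
blob = pickle.dumps(qs)                # roughly what a cache backend stores
hits_after_pickle = LazyQuery.db_hits  # pickling ran the simulated query
restored = pickle.loads(blob)          # roughly what cache.get() hands back
rows = list(restored)                  # served from the restored result cache
hits_after_read = LazyQuery.db_hits    # no further query was needed
```

In this toy model the restored object arrives with its rows already loaded, which is why storing an unevaluated queryset is not a no-op, and also why it can end up storing far more than the view needs.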

Note that the third point in your initial set isn't quite right, in that this has nothing to do with Apache or preforking. It's simply that a view is a function like any other, and anything defined in a local variable inside a function goes out of scope when that function returns. So a queryset defined and evaluated inside a view goes out of scope when the view returns the response, and a new one will be created the next time the view is called, ie on the next request. This is the case whichever way you are serving Django.

However, and this is important, if you do something like set your queryset to a global (module-level) variable, it will persist between requests. Most of the ways that Django is served, and this definitely includes mod_wsgi, keep a process alive for many requests before recycling it, so the value of the queryset will be the same for all of those requests. This can be useful as a sort of bargain-basement cache, but is difficult to get right because you have no idea how long the process will last, plus other processes are likely to be running in parallel which have their own versions of that global variable.
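The module-level behaviour described above can be sketched without Django at all. In this illustration load_results() is a hypothetical stand-in for evaluating a queryset, and view() is an ordinary function called twice to imitate two requests handled by the same process.

```python
_RESULTS = None  # module-level: lives as long as this worker process does
load_count = 0   # how many times the simulated database load ran


def load_results():
    """Hypothetical stand-in for evaluating a queryset against the database."""
    global load_count
    load_count += 1
    return ["result-a", "result-b"]


def view():
    """Imitates a view that keeps its data in a module-level variable."""
    global _RESULTS
    if _RESULTS is None:
        _RESULTS = load_results()
    return _RESULTS


first = view()   # first "request": loads the data
second = view()  # second "request" in the same process: reuses it
```

Each worker process would hold its own copy of _RESULTS, and nothing here ever invalidates it, which is the main reason this approach is hard to get right compared with an external cache and a timeout.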

Updated to answer questions in the comment

Your questions show that you still haven't quite grokked how querysets work. It's all about when they are evaluated: if you call list() on a queryset, iterate over it, or slice it with a step, that evaluates it. It's at that point that the database call is made (I count serialization under iterating here) and the results are stored in the queryset's internal cache. So if you've already done one of those things to your queryset and then set it in the (external) cache, that won't cause another database hit.

But every filter() call on a queryset, even one that's already been evaluated, leads to another database hit. That's because it modifies the underlying SQL query, so Django returns a new queryset with its own empty internal cache, and goes back to the database when that new queryset is evaluated.
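The difference between re-reading an evaluated queryset and filtering it can be sketched with another toy model. Again, LazyQuery here is a hypothetical class for illustration only, and its filter() takes a Python predicate rather than Django's keyword lookups.

```python
class LazyQuery:
    """Toy lazy query with a per-object result cache (not Django's class)."""

    db_hits = 0  # counts simulated database queries across all instances

    def __init__(self, rows, predicate=None):
        self._rows = rows            # pretend this is the database table
        self._predicate = predicate  # pretend this is the WHERE clause
        self._result_cache = None

    def filter(self, predicate):
        # Build a fresh, unevaluated object; this object's cache is not reused.
        previous = self._predicate
        if previous is None:
            combined = predicate
        else:
            combined = lambda row: previous(row) and predicate(row)
        return LazyQuery(self._rows, combined)

    def __iter__(self):
        if self._result_cache is None:
            LazyQuery.db_hits += 1  # simulated round trip to the database
            keep = self._predicate or (lambda row: True)
            self._result_cache = [row for row in self._rows if keep(row)]
        return iter(self._result_cache)


scores = LazyQuery([3, 8, 12, 5])
all_rows = list(scores)                # first simulated query
again = list(scores)                   # answered from the internal cache
hits_after_reads = LazyQuery.db_hits
high = scores.filter(lambda s: s > 4)  # new object, nothing fetched yet
high_rows = list(high)                 # second simulated query
hits_after_filter = LazyQuery.db_hits
```

If the view only ever needs the filtered rows, caching list(high) rather than the unfiltered queryset is closer to the "cache the instances you actually need" advice above.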