Why is Django returning stale cache data?

2019-02-21 15:18发布

问题:

I have two Django models as shown below, MyModel1 & MyModel2:

class MyModel1(CachingMixin, MPTTModel):
    name = models.CharField(null=False, blank=False, max_length=255)
    objects = CachingManager()

    def __str__(self):
        return "; ".join(["ID: %s" % self.pk, "name: %s" % self.name, ] )

class MyModel2(CachingMixin, models.Model):
    name = models.CharField(null=False, blank=False, max_length=255)
    model1 = models.ManyToManyField(MyModel1, related_name="MyModel2_MyModel1")
    objects = CachingManager()

    def __str__(self):
        return "; ".join(["ID: %s" % self.pk, "name: %s" % self.name, ] )

MyModel2 has a ManyToMany field to MyModel1 entitled model1

Now look what happens when I add a new entry to this ManyToMany field. According to Django, it has no effect:

>>> m1 = MyModel1.objects.all()[0]
>>> m2 = MyModel2.objects.all()[0]
>>> m2.model1.all()
[]
>>> m2.model1.add(m1)
>>> m2.model1.all()
[]

Why? It seems definitely like a caching issue because I see that there is a new entry in Database table myapp_mymodel2_mymodel1 for this link between m2 & m1. How should I fix it??

回答1:

Is django-cache-machine really needed?

MyModel1.objects.all()[0]

Roughly translates to

SELECT * FROM app_mymodel LIMIT 1

Queries like this are always fast. There would not be a significant difference in speeds whether you fetch it from the cache or from the database.

When you use cache manager you actually add a bit of overhead here that might make things a bit slower. Most of the time this effort will be wasted because there may not be a cache hit as explained in the next section.

How django-cache-machine works

Whenever you run a query, CachingQuerySet will try to find that query in the cache. Queries are keyed by {prefix}:{sql}. If it’s there, we return the cached result set and everyone is happy. If the query isn’t in the cache, the normal codepath to run a database query is executed. As the objects in the result set are iterated over, they are added to a list that will get cached once iteration is done.

source: https://cache-machine.readthedocs.io/en/latest/

Accordingly, with the two queries executed in your question being identical, cache manager will fetch the second result set from memcache provided the cache hasn't been invalided.

The same link explains how cache keys are invalidated.

To support easy cache invalidation, we use “flush lists” to mark the cached queries an object belongs to. That way, all queries where an object was found will be invalidated when that object changes. Flush lists map an object key to a list of query keys.

When an object is saved or deleted, all query keys in its flush list will be deleted. In addition, the flush lists of its foreign key relations will be cleared. To avoid stale foreign key relations, any cached objects will be flushed when the object their foreign key points to is invalidated.

It's clear that saving or deleting an object would result in many objects in the cache having to be invalidated. So you are slowing down these operations by using cache manager. Also worth noting is that the invalidation documentation does not mention many to many fields at all. There is an open issue for this and from your comment on that issue it's clear that you have discovered it too.

Solution

Chuck cache machine. Caching all queries are almost never worth it. It leads to all kind of hard to find bugs and issues. The best approach is to optimize your tables and fine tune your queries. If you find a particular query that is too slow cache it manually.



回答2:

This was my workaround solution:

    >>> m1 = MyModel1.objects.all()[0]
    >>> m1
    <MyModel1: ID: 8887972990743179; name: my-name-blahblah>

    >>> m2 = MyModel2.objects.all()[0]
    >>> m2.model1.all()
    []
    >>> m2.model1.add(m1)
    >>> m2.model1.all()
    []

    >>> MyModel1.objects.invalidate(m1)
    >>> MyModel2.objects.invalidate(m2)
    >>> m2.save()
    >>> m2.model1.all()
    [<MyModel1: ID: 8887972990743179; name: my-name-blahblah>]


回答3:

Have you considered hooking into the model signals to invalidate the cache when an object is added? For your case you should look at M2M Changed Signal

Small example that doesn't solve your problem but relates the workaround you gave before to my signals solution approach (I don't know django-cache-machine):

def invalidate_m2m(sender, **kwargs):
    instance = kwargs.get('instance', None)
    action = kwargs.get('action', None)

    if action == 'post_add':
        Sender.objects.invalidate(instance)

m2m_changed.connect(invalidate_m2m, sender=MyModel2.model1.through)